Deep Learning for Spatial and Temporal Video Localization


Transcript of Deep Learning for Spatial and Temporal Video Localization


DOCTORAL THESIS

Deep Learning for Spatial and Temporal Video Localization

Author: Rui SU

Supervisor: Prof. Dong XU

Co-Supervisor: Dr. Wanli OUYANG

A thesis submitted in fulfillment of the requirements for the degree of Doctor of Philosophy

in the

School of Electrical and Information Engineering

Faculty of Engineering

June 17, 2021

Abstract of thesis entitled

Deep Learning for Spatial and Temporal Video

Localization

Submitted by

Rui SU

for the degree of Doctor of Philosophy

at The University of Sydney

in June 2021

In this thesis, we propose to develop novel deep learning algorithms for video localization tasks, including spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding, which require localizing the spatio-temporal or temporal locations of targets in videos.

First, we propose a new Progressive Cross-stream Cooperation (PCSC) framework for the spatio-temporal action localization task. The basic idea is to utilize both spatial regions (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). By first using our newly proposed PCSC framework for spatial localization and then applying our temporal PCSC framework for temporal localization, the action localization results are progressively improved.

Second, we propose a Progressive Cross-Granularity Cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion), for the temporal action localization task. The whole framework can be learned in an end-to-end fashion, while the temporal action localization performance can be gradually boosted in a progressive manner.

Finally, we propose a two-step visual-linguistic transformer based framework called STVGBert for the spatio-temporal visual grounding task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. For all our proposed approaches, we conduct extensive experiments on publicly available datasets to demonstrate their effectiveness.

Deep Learning for Spatial and Temporal Video Localization

by

Rui SU

B.E., South China University of Technology; M.Phil., University of Queensland

A Thesis Submitted in Partial Fulfilment of the Requirements for the Degree of

Doctor of Philosophy

at

University of Sydney
June 2021

COPYRIGHT © 2021 BY RUI SU

ALL RIGHTS RESERVED


Declaration

I, Rui SU, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation, or report submitted to this University or to any other institution for a degree, diploma, or other qualifications.

Signed

Date: June 17, 2021

For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which have helped me become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through useful discussions with them, I worked out many difficult problems and was also encouraged to explore the unknown in the future.

Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.

Rui SU

University of Sydney
June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization," IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu, "Improving action localization by progressive cross-stream cooperation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.


Contents

Abstract
Declaration
Acknowledgements
List of Publications
List of Figures
List of Tables

1 Introduction
  1.1 Spatial-temporal Action Localization
  1.2 Temporal Action Localization
  1.3 Spatio-temporal Visual Grounding
  1.4 Thesis Outline

2 Literature Review
  2.1 Background
    2.1.1 Image Classification
    2.1.2 Object Detection
    2.1.3 Action Recognition
  2.2 Spatio-temporal Action Localization
  2.3 Temporal Action Localization
  2.4 Spatio-temporal Visual Grounding
  2.5 Summary

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
  3.1 Background
    3.1.1 Spatial temporal localization methods
    3.1.2 Two-stream R-CNN
  3.2 Action Detection Model in Spatial Domain
    3.2.1 PCSC Overview
    3.2.2 Cross-stream Region Proposal Cooperation
    3.2.3 Cross-stream Feature Cooperation
    3.2.4 Training Details
  3.3 Temporal Boundary Refinement for Action Tubes
    3.3.1 Actionness Detector (AD)
    3.3.2 Action Segment Detector (ASD)
    3.3.3 Temporal PCSC (T-PCSC)
  3.4 Experimental Results
    3.4.1 Experimental Setup
    3.4.2 Comparison with the State-of-the-art Methods
      Results on the UCF-101-24 Dataset
      Results on the J-HMDB Dataset
    3.4.3 Ablation Study
    3.4.4 Analysis of Our PCSC at Different Stages
    3.4.5 Runtime Analysis
    3.4.6 Qualitative Results
  3.5 Summary

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
  4.1 Background
    4.1.1 Action Recognition
    4.1.2 Spatio-temporal Action Localization
    4.1.3 Temporal Action Localization
  4.2 Our Approach
    4.2.1 Overall Framework
    4.2.2 Progressive Cross-Granularity Cooperation
    4.2.3 Anchor-frame Cooperation Module
    4.2.4 Training Details
    4.2.5 Discussions
  4.3 Experiment
    4.3.1 Datasets and Setup
    4.3.2 Comparison with the State-of-the-art Methods
    4.3.3 Ablation Study
    4.3.4 Qualitative Results
  4.4 Summary

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  5.1 Background
    5.1.1 Vision-language Modelling
    5.1.2 Visual Grounding in Images/Videos
  5.2 Methodology
    5.2.1 Overview
    5.2.2 Spatial Visual Grounding
    5.2.3 Temporal Boundary Refinement
  5.3 Experiment
    5.3.1 Experiment Setup
    5.3.2 Comparison with the State-of-the-art Methods
    5.3.3 Ablation Study
  5.4 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography


List of Figures

1.1 Illustration of the spatio-temporal action localization task
1.2 Illustration of the temporal action localization task

3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).

3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.

3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.

4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 954-th second and the 977-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 977-th second to the 1042-th second of the video clip, is marked in the red box. The frames during the period from the 1042-th second to the 1046-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net.


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector while "ASD" is for the action segment detector

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset

3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages

3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages

4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds

4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used

4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied

4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged for all videos in the test set of the THUMOS14 dataset

5.1 Results of different methods on the VidSTG dataset (marked results are quoted from the original work)

5.2 Results of different methods on the HC-STVG dataset (marked results are quoted from the original work)

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset


Chapter 1

Introduction

With the massive use of digital cameras, smart phones, and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (frames that contain action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to capture only the starting and ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.


Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion, and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely challenging to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
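To make the feature-level cooperation idea concrete, the following Python (PyTorch) sketch shows one simple way to pass a residual "message" from the helper stream's RoI features to the target stream's RoI features. The module structure, layer sizes, and names are illustrative assumptions rather than the exact PCSC message passing design, which is detailed in Chapter 3.

import torch
import torch.nn as nn

class CrossStreamMessagePassing(nn.Module):
    """Sketch of feature-level cooperation: features from the helper stream
    (e.g., flow) are transformed and added to the features of the target
    stream (e.g., RGB) as a residual message."""

    def __init__(self, channels: int):
        super().__init__()
        # Transform the helper-stream features before passing the message.
        self.transform = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, target_feat: torch.Tensor, helper_feat: torch.Tensor) -> torch.Tensor:
        # The refined target-stream features keep their original content and
        # receive a residual message carrying motion (or appearance) cues.
        return target_feat + self.transform(helper_feat)

if __name__ == "__main__":
    rgb_feat = torch.randn(2, 256, 14, 14)   # RoI features from the RGB stream
    flow_feat = torch.randn(2, 256, 14, 14)  # RoI features from the flow stream
    refined_rgb = CrossStreamMessagePassing(256)(rgb_feat, flow_feat)
    print(refined_rgb.shape)  # torch.Size([2, 256, 14, 14])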

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve spatio-temporal localization results at the video level.
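As a rough illustration of the class-specific design (one binary actionness classifier per action class, applied along an action tube), consider the following Python sketch. The feature dimension, head structure, and names are hypothetical and only meant to convey the idea, not the actual detector architecture.

import torch
import torch.nn as nn

class ClassSpecificActionness(nn.Module):
    """Sketch: one binary actionness head per action class; the head of the
    tube's own class scores each frame-level feature along the tube."""

    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.heads = nn.ModuleList([nn.Linear(feat_dim, 1) for _ in range(num_classes)])

    def forward(self, tube_feats: torch.Tensor, class_id: int) -> torch.Tensor:
        # tube_feats: (T, feat_dim) features of the T frames along one action tube.
        logits = self.heads[class_id](tube_feats).squeeze(-1)
        return torch.sigmoid(logits)  # per-frame actionness probabilities

if __name__ == "__main__":
    detector = ClassSpecificActionness(feat_dim=512, num_classes=24)
    tube_feats = torch.randn(100, 512)          # a 100-frame action tube
    scores = detector(tube_feats, class_id=3)   # actionness for class 3
    print(scores.shape)  # torch.Size([100])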

Our contributions are briefly summarized as follows

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.

Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. (The original figure compares, for several example videos, the ground-truth segments with those detected by P-GCN and by our method, together with anchor-based proposal scores and frame-level actionness/starting/ending predictions.)

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. Prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3, and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatial-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in helping humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons often perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.
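For intuition, a frame-based temporal localizer can be reduced to thresholding the per-frame prediction scores and grouping consecutive frames above the threshold into segments, as in the minimal Python sketch below (the threshold and frame rate are illustrative values).

from typing import List, Tuple

def frames_to_segments(scores: List[float],
                       fps: float = 1.0,
                       thresh: float = 0.5) -> List[Tuple[float, float]]:
    """Group consecutive frames whose scores exceed `thresh` into
    (start, end) segments expressed in seconds."""
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= thresh and start is None:
            start = i
        elif s < thresh and start is not None:
            segments.append((start / fps, i / fps))
            start = None
    if start is not None:
        segments.append((start / fps, len(scores) / fps))
    return segments

if __name__ == "__main__":
    per_frame = [0.1, 0.2, 0.7, 0.8, 0.9, 0.6, 0.3, 0.2, 0.8, 0.9]
    print(frames_to_segments(per_frame))  # [(2.0, 6.0), (8.0, 10.0)]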

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input to the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
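For readers unfamiliar with ViLBERT-style co-attention, the Python sketch below shows the basic key/value exchange between a visual branch and a textual branch using standard PyTorch attention modules. It is a generic illustration under assumed token and hidden dimensions, not the exact ST-ViLBERT or TBR-net architecture described in Chapter 5.

import torch
import torch.nn as nn

class CoAttentionLayer(nn.Module):
    """Sketch of a co-attention block: the visual branch attends to the
    textual tokens and vice versa by exchanging key/value inputs."""

    def __init__(self, dim: int = 768, heads: int = 8):
        super().__init__()
        self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)  # vision attends to text
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)  # text attends to vision
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

if __name__ == "__main__":
    layer = CoAttentionLayer()
    vis = torch.randn(2, 50, 768)   # 50 visual tokens per clip
    txt = torch.randn(2, 20, 768)   # 20 textual tokens per query
    v, t = layer(vis, txt)
    print(v.shape, t.shape)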

Our contributions can be summarized as follows

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3], and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential directions for video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize related work on the spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers, with nonlinear activation functions in between each layer, to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named the Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work activated the research interest in using deep learning methods to solve computer vision problems.

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
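The receptive-field argument can be checked directly. The Python snippet below compares a stack of two 3×3 convolutions with a single 5×5 convolution on the same input (64 channels chosen purely for illustration): the output sizes match, while the stacked version uses fewer parameters and inserts an extra nonlinearity.

import torch
import torch.nn as nn

channels = 64
stacked_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
)
single_5x5 = nn.Conv2d(channels, channels, kernel_size=5, padding=2)

def num_params(m: nn.Module) -> int:
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, channels, 32, 32)
assert stacked_3x3(x).shape == single_5x5(x).shape  # same effective receptive field
print(num_params(stacked_3x3))  # 2 * (3*3*64*64 + 64) = 73,856
print(num_params(single_5x5))   # 5*5*64*64 + 64 = 102,464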

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
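A minimal Python sketch of such an inception-style module is given below; the branch channel numbers are illustrative rather than those of the original network.

import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    """Parallel 1x1, 3x3 and 5x5 convolutions plus a pooling branch run on
    the same input; their outputs are concatenated along the channels."""

    def __init__(self, in_ch: int):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)

if __name__ == "__main__":
    m = InceptionModule(in_ch=64)
    print(m(torch.randn(1, 64, 28, 28)).shape)  # torch.Size([1, 128, 28, 28])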

VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
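A minimal residual block illustrating the shortcut connection is sketched below in Python (layer sizes are illustrative).

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """The block learns a residual F(x) and outputs F(x) + x, so gradients
    can flow directly through the identity shortcut to shallow layers."""

    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # identity shortcut

if __name__ == "__main__":
    block = ResidualBlock(64)
    print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])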

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections with other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they fed the entire image to the network only once to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time were largely reduced.
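The ROI pooling step can be illustrated with the off-the-shelf torchvision operator; the feature map size, stride, and proposal coordinates below are made-up values used only for illustration.

import torch
from torchvision.ops import roi_pool

# Feature map of one image, e.g. the output of a backbone CNN.
feature_map = torch.randn(1, 256, 50, 50)

# Two region proposals in image coordinates (x1, y1, x2, y2); the feature map
# is assumed to be 16x smaller than the input image, hence spatial_scale=1/16.
proposals = [torch.tensor([[ 32.0,  32.0, 256.0, 256.0],
                           [128.0,  64.0, 400.0, 320.0]])]

# Every proposal is pooled to a fixed 7x7 grid so that a fully connected
# detection head can process proposals of arbitrary sizes.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])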

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors was used at each location of the generated feature map to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the pooled features based on the ROI pooling operation on the generated feature map.

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicted the class probability and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require region feature extraction using the ROI pooling operation, which is time-consuming.

Similarly, Liu et al. [44] proposed the Single Shot MultiBox Detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
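A simplified decoding step for such center-point predictions might look like the Python sketch below; the 3×3 max-pooling used as a stand-in for non-maximum suppression and the tensor shapes are illustrative simplifications rather than the exact CenterNet post-processing.

import torch
import torch.nn.functional as F

def decode_centers(heatmap: torch.Tensor, wh: torch.Tensor, k: int = 10):
    """heatmap: (C, H, W) per-class center-point probabilities.
    wh: (2, H, W) predicted box width and height at every location.
    Returns up to k boxes as (x1, y1, x2, y2, score, class) tuples."""
    c, h, w = heatmap.shape
    # Keep only local maxima (a simple stand-in for NMS on the heat map).
    pooled = F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0]
    heatmap = heatmap * (pooled == heatmap).float()

    scores, idx = heatmap.view(-1).topk(k)
    classes = idx // (h * w)
    ys = (idx % (h * w)) // w
    xs = idx % w
    boxes = []
    for score, cls, y, x in zip(scores, classes, ys, xs):
        bw, bh = wh[0, y, x], wh[1, y, x]
        boxes.append((float(x - bw / 2), float(y - bh / 2),
                      float(x + bw / 2), float(y + bh / 2),
                      float(score), int(cls)))
    return boxes

if __name__ == "__main__":
    boxes = decode_centers(torch.rand(80, 128, 128), torch.rand(2, 128, 128) * 10)
    print(len(boxes), boxes[0])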


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, the research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames from the given videos as input and outputted the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps action recognition. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict the action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks can further improve the performance of the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
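A common way to combine the two streams is late fusion of their class scores. The Python sketch below averages the softmax scores of the two networks, with the flow stream weighted higher; the weight value is illustrative.

import torch

def two_stream_fusion(rgb_logits: torch.Tensor,
                      flow_logits: torch.Tensor,
                      flow_weight: float = 1.5) -> torch.Tensor:
    """Fuse per-video class scores from the spatial (RGB) and temporal
    (optical flow) networks by a weighted average."""
    rgb_scores = torch.softmax(rgb_logits, dim=-1)
    flow_scores = torch.softmax(flow_logits, dim=-1)
    return (rgb_scores + flow_weight * flow_scores) / (1.0 + flow_weight)

if __name__ == "__main__":
    rgb = torch.randn(4, 101)   # class logits from the RGB stream
    flow = torch.randn(4, 101)  # class logits from the flow stream
    print(two_stream_fusion(rgb, flow).argmax(dim=-1))  # predicted classes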

In the work of [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and temporal networks proposed in [64] on action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose one frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the output of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
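The segment sampling and consensus ideas can be sketched in Python as follows; the number of segments and classes are illustrative values.

import random
import torch

def sample_segment_frames(num_frames: int, k: int = 3):
    """Divide a video of `num_frames` frames into k equal segments and
    randomly pick one frame index from each segment."""
    seg_len = num_frames // k
    return [random.randrange(i * seg_len, (i + 1) * seg_len) for i in range(k)]

def segment_consensus(frame_logits: torch.Tensor) -> torch.Tensor:
    """Average the per-frame class scores to obtain the video-level score."""
    return frame_logits.mean(dim=0)

if __name__ == "__main__":
    idx = sample_segment_frames(num_frames=300, k=3)  # e.g. [41, 137, 265]
    frame_logits = torch.randn(3, 101)  # scores of the 3 sampled frames
    print(idx, segment_consensus(frame_logits).argmax().item())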

2D convolutional neural networks have proven their effectiveness on the action recognition task. Tran et al. [77] also investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other, relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
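The inflation idea can be sketched as copying a pre-trained 2D kernel along the temporal axis and rescaling it, as in the Python snippet below; this is a simplified illustration rather than the full I3D construction.

import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_dim: int = 3) -> nn.Conv3d:
    """Turn a 2D convolution into a 3D one by repeating its kernel along the
    temporal axis and dividing by the temporal size, so the inflated filter
    initially responds to a static clip like the original 2D filter."""
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_dim, *conv2d.kernel_size),
        padding=(time_dim // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    with torch.no_grad():
        weight3d = conv2d.weight.unsqueeze(2).repeat(1, 1, time_dim, 1, 1) / time_dim
        conv3d.weight.copy_(weight3d)
        if conv2d.bias is not None:
            conv3d.bias.copy_(conv2d.bias)
    return conv3d

if __name__ == "__main__":
    conv3d = inflate_conv2d(nn.Conv2d(3, 64, kernel_size=3, padding=1))
    clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, time, H, W)
    print(conv3d(clip).shape)  # torch.Size([1, 64, 16, 112, 112])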


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations for each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information about human body parts can be added. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by enforcing label consistency.

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed the ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes forming a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation of the whole segment. This representation was then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than action tubes linked from frame-level detections because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.


23 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating this feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and a C3D-based [77] binary classifier to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. They then used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as inputs and was trained with a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolution layers were used to reduce the spatial and temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced an effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict a binary actionness score for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals (see the sketch below), which were then used to build a feature pyramid to evaluate their action completeness. Their method improved the temporal action localization performance compared with other state-of-the-art methods.
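The snippet-grouping idea described above can be illustrated with a short sketch. This is a hedged, simplified version of the grouping step only (not Zhao et al.'s exact TAG algorithm); the actionness threshold and the minimum proposal length are assumptions chosen for illustration.

    # Group consecutive high-actionness snippets into candidate temporal proposals.
    def group_high_actionness_snippets(scores, thresh=0.5, min_len=2):
        proposals, start = [], None
        for i, s in enumerate(scores):
            if s >= thresh and start is None:
                start = i                          # a run of "action" snippets begins
            elif s < thresh and start is not None:
                if i - start >= min_len:
                    proposals.append((start, i))   # half-open interval [start, i)
                start = None
        if start is not None and len(scores) - start >= min_len:
            proposals.append((start, len(scores)))
        return proposals

    # Example: group_high_actionness_snippets([0.1, 0.9, 0.8, 0.2, 0.7, 0.9, 0.9])
    # returns [(1, 3), (4, 7)].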

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as the boundaries are determined based on frame-wise or snippet-wise labels at a finer temporal granularity. However, these methods can easily predict incorrect low actionness scores for frames within the action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations of the anchors and predict their action classes.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method achieved results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN achieves a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. However, they either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], without fully exploring the complementarity in a more systematic way.


24 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision and lan-guage task The goal of spatio-temporal visual grounding is to localizethe target objects from the given videos by using the textual queries thatdescribe the target objects

Zhang et al [102] proposed STGRN to address the spatio-temporalvisual grounding task They first used a pre-trained object detector todetect object proposals from each frame of the input videos These ob-ject proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling The spatio-temporal graphconsisted of the implicit and explicit spatial subgraphs to reason the spa-tial relationship within each frame and the dynamic temporal subgraphto exploit the temporal relationship across consecutive frames The pro-posed graph was used to produce cross-modality spatio-temporal rep-resentations which were used to generate spatio-temporal tubes by aspatial localizer and a temporal localizer

In [101] Zhang et al proposed a spatial-temporal video groundingframework to model the relationship between the target objects and therelated objects while suppressing the relations of unnecessary objectsSpecifically multi-branch architecture was used to handle different typesof relations with each branch focused on its corresponding objects

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationship within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Features were then extracted from each human tube and fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

25 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage, and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works that combine these two types of methods to leverage their complementary information are also discussed. Finally, we give a review of the new vision and language task, spatio-temporal visual grounding.

In this thesis we develop new deep learning approaches on spatio-temporal action localization temporal action localization and spatio-temporal visual grounding Although there are many works related tothese vision problems in the literature as summarized above in this the-sis we propose new frameworks to address these tasks by using deepneural networks and correspondingly design effective algorithms

Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which further benefits the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.


Chapter 3

Progressive Cross-streamCooperation in Spatial andTemporal Domain for ActionLocalization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

31 Background

311 Spatial temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving the three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].

Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66 558] utilize recurrent neural networks to link individual frames or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion, and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization aims to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity, and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8].

Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

312 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

32 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54] we propose a Progres-sive Cross-stream Cooperation (PCSC) method for action detection atthe frame level In this method the RGB (appearance) stream and theflow (motion) stream iteratively help each other at both feature level andregion proposal level in order to achieve better localization results

321 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance for another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation


[Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).]


by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

322 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-levelaction localization Each stream has its own Region Proposal Network(RPN) [58] to generate the candidate action regions and these candidatesare then used as the training samples to train a bounding box regressionnetwork for action localization Based on our observation the regionproposals generated by each stream can only partially cover the true ac-tion regions which degrades the detection performance Therefore inour method we use the region proposals from one stream to help an-other stream

In this paper the bounding boxes from RPN are called region pro-posals while the bounding boxes from the detection head are called de-tection boxes

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.


The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes of another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN of the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
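For clarity, the proposal cooperation step above can be sketched as follows. This is a minimal, hedged illustration of the update rule rather than the actual implementation: the boxes are plain Python lists, the NMS routine is passed in as a function, and the value of the lower cross-stream NMS threshold is an assumption (the text does not specify a number).

    from typing import Callable, List

    Box = List[float]  # [x1, y1, x2, y2, score]

    def cooperate_proposals(
        own_prev_boxes: List[Box],        # detection boxes of this stream from an earlier stage
        other_latest_boxes: List[Box],    # latest detection boxes of the assistant stream
        nms: Callable[[List[Box], float], List[Box]],
        cross_stream_nms_thresh: float = 0.3,   # assumed value for the "lower NMS threshold"
    ) -> List[Box]:
        """Refined proposal set: P_t = B_own(t-2) U B_other(t-1).

        A lower NMS threshold is applied only to the boxes borrowed from the
        assistant stream, so that more redundant cross-stream boxes are removed
        before the union is formed.
        """
        filtered_other = nms(other_latest_boxes, cross_stream_nms_thresh)
        return own_prev_boxes + filtered_other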

With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from another stream. This helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of testing data in the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

323 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled in the top-down pathway and merged with the corresponding features from the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f, and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway and denote these feature maps as $\{C^{i}_{2}, C^{i}_{3}, C^{i}_{4}, C^{i}_{5}\}$, where $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $\{U^{i}_{2}, U^{i}_{3}, U^{i}_{4}, U^{i}_{5}\}$.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the two streams to exchange information with each other and benefit from such exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.

We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote by $l$ the index of the feature maps in $\{C^{i}_{2}, C^{i}_{3}, C^{i}_{4}, C^{i}_{5}\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{\mathrm{RGB}}_{l}$ with the help of the features $C^{\mathrm{Flow}}_{l}$ as follows:
$$C^{\mathrm{RGB}}_{l} = f_{\theta}(C^{\mathrm{Flow}}_{l}) \oplus C^{\mathrm{RGB}}_{l}, \qquad (3.1)$$
where $\oplus$ denotes the element-wise addition of the feature maps and $f_{\theta}(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_{\theta}(C^{\mathrm{Flow}}_{l})$ nonlinearly extracts the message from the feature $C^{\mathrm{Flow}}_{l}$ and then uses the extracted message to improve the features $C^{\mathrm{RGB}}_{l}$.

The output of $f_{\theta}(\cdot)$ has to produce feature maps with the same number of channels and the same resolution as $C^{\mathrm{RGB}}_{l}$ and $C^{\mathrm{Flow}}_{l}$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{\mathrm{RGB}}_{l}$ in the bottom-up pathway are refined, the corresponding feature maps $U^{\mathrm{RGB}}_{l}$ in the top-down pathway are generated accordingly.
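As a concrete illustration, a minimal PyTorch sketch of the message-passing block in Eqn. (3.1) is given below. This is a hedged sketch, not the actual implementation: the channel-reduction ratio and the exact placement of the ReLU are assumptions, since the thesis only specifies two 1×1 convolutional layers with ReLU activation followed by element-wise addition.

    import torch
    import torch.nn as nn

    class MessagePassing(nn.Module):
        """f_theta: two 1x1 convolutions (reduce, then restore channels) with ReLU.

        forward() returns c_receiver + f_theta(c_sender), i.e. the element-wise
        addition in Eqn. (3.1).  The sketch operates on 2D feature maps of shape
        (N, C, H, W); the same idea applies to the backbone feature maps.
        """
        def __init__(self, channels: int, reduction: int = 4):  # reduction ratio is assumed
            super().__init__()
            self.f_theta = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),  # reduce channels
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),  # restore channels
            )

        def forward(self, c_receiver: torch.Tensor, c_sender: torch.Tensor) -> torch.Tensor:
            # e.g. c_receiver = C_l^RGB and c_sender = C_l^Flow (or vice versa)
            return c_receiver + self.f_theta(c_sender)

    # Usage on one pyramid level:
    # mp = MessagePassing(channels=256)
    # refined_rgb = mp(c_rgb_l, c_flow_l)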

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.


Denote the image-level feature map sets for the RGB and flow streams by $I^{\mathrm{RGB}}$ and $I^{\mathrm{Flow}}$, respectively. They are used to extract the features $F^{\mathrm{RGB}}_{t}$ and $F^{\mathrm{Flow}}_{t}$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{\mathrm{Flow}}_{t}$ of the flow stream is used to help improve the ROI feature $F^{\mathrm{RGB}}_{t}$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $F^{\mathrm{RGB}}_{t}$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{\mathrm{RGB}}_{t}$ of the RGB stream is used to help improve the ROI feature $F^{\mathrm{Flow}}_{t}$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

324 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process (a small sketch of this rule is given below).
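The following is a minimal, self-contained sketch of the IoU-based label assignment rule described above; the box format and helper names are illustrative only and are not taken from the actual implementation.

    def iou(box_a, box_b):
        """IoU of two boxes given as [x1, y1, x2, y2]."""
        ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
        ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def assign_labels(proposals, gt_boxes, pos_thresh=0.5):
        """Label a proposal 1 if its best IoU with any ground-truth box exceeds 0.5, else 0."""
        labels = []
        for p in proposals:
            best = max((iou(p, g) for g in gt_boxes), default=0.0)
            labels.append(1 if best > pos_thresh else 0)
        return labels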

33 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we then build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this


[Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^{i}_{t}$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^{i}_{t}$, where the superscript $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.]


linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: the Pre-processing Module, the Initial Segment Generation Module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module, and the Post-processing Module. An overview of the two major modules, i.e., the Initial Segment Generation Module and the T-PCSC Module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7×7 feature maps for the detected region in each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC Module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


331 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These samples are therefore "confusing" samples and are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability that the actionness belongs to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it will be filtered out from this action tube, and thus we can refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage. A small sketch of this smoothing and trimming step is given below.
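This is a hedged, minimal sketch of the score smoothing and tube trimming step; it assumes the median-filter window size (17) and threshold (0.01) reported later in the implementation details, and the function and variable names are illustrative only.

    from statistics import median

    def smooth_scores(scores, window=17):
        """Median-filter the per-frame actionness scores of one action tube."""
        half = window // 2
        return [
            median(scores[max(0, i - half): i + half + 1])
            for i in range(len(scores))
        ]

    def trim_action_tube(tube_boxes, actionness_scores, window=17, thresh=0.01):
        """Drop frame-level boxes whose smoothed actionness score falls below the threshold."""
        smoothed = smooth_scores(actionness_scores, window)
        return [box for box, s in zip(tube_boxes, smoothed) if s >= thresh]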

332 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the fast-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture


[Figure 3.3: Comparison of the single-stream frameworks of the action detection method and our action segment detector. The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.]

but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundaries. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with a temporal span of s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
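A minimal sketch of how the three temporal pooling regions described above can be derived from a segment proposal is given below. The temporal ROI-pooling itself is assumed to be provided elsewhere; the helper name and return format are illustrative only.

    def segment_pooling_regions(t_start: float, t_end: float):
        """Return the (start, center, end) pooling regions for one segment proposal.

        The centre region is the proposal itself; the start/end regions are
        centred at the two boundaries with length 2 * l_s / 5.
        """
        length = t_end - t_start
        context = 2.0 * length / 5.0
        center_region = (t_start, t_end)
        start_region = (t_start - context / 2.0, t_start + context / 2.0)
        end_region = (t_end - context / 2.0, t_end + context / 2.0)
        return start_region, center_region, end_region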

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting actionness for each frame helps extract discriminative features, which eventually helps refine the boundaries of the segments.

333 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detector due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of all stages and applying NMS to remove redundancy and obtain the final results.


34 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

341 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on the three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and for each class we average the IoU scores of the bounding boxes with the largest overlap to the ground-truth boxes. The MABO is then calculated by averaging these per-class averaged IoU scores over all classes (a short sketch is given below).
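The following is a hedged sketch of this MABO computation; the data structures and the reuse of a generic box-IoU helper (passed in as iou_fn) are illustrative assumptions, not the actual evaluation code.

    def mean_average_best_overlap(detections_by_class, gt_by_class, iou_fn):
        """MABO: per class, average the best IoU for each ground-truth box, then average over classes."""
        per_class_abo = []
        for cls, gt_boxes in gt_by_class.items():
            if not gt_boxes:
                continue
            dets = detections_by_class.get(cls, [])
            best_overlaps = [
                max((iou_fn(g, d) for d in dets), default=0.0) for g in gt_boxes
            ]
            per_class_abo.append(sum(best_overlaps) / len(best_overlaps))
        return sum(per_class_abo) / len(per_class_abo) if per_class_abo else 0.0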


To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops 10% at the 5th epoch and another 10% at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths l_t for the length ranges are set to 15, 30, 60, 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS with an overlap threshold of 0.5 to remove redundancy and combine the detection results from different stages. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops 10% at the 4th epoch and another 10% at the 5th epoch.
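For reference, a minimal sketch of the 1-D temporal NMS used in this post-processing step (overlap threshold 0.5) is given below; the (start, end, score) segment representation is an assumption chosen for illustration.

    def temporal_iou(a, b):
        """Temporal IoU of two segments given as (start, end, score)."""
        inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
        union = (a[1] - a[0]) + (b[1] - b[0]) - inter
        return inter / union if union > 0 else 0.0

    def temporal_nms(segments, iou_thresh=0.5):
        """Greedy NMS: keep the highest-scoring segments and drop heavily overlapping ones."""
        segments = sorted(segments, key=lambda s: s[2], reverse=True)
        keep = []
        for seg in segments:
            if all(temporal_iou(seg, k) < iou_thresh for k in keep):
                keep.append(seg)
        return keep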

342 Comparison with the State-of-the-art methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ = 0.5.

Method | Streams | mAP (%)
Weinzaepfel, Harchaoui and Schmid [83] | RGB+Flow | 35.8
Peng and Schmid [54] | RGB+Flow | 65.7
Ye, Yang and Tian [94] | RGB+Flow | 67.0
Kalogeiton et al. [26] | RGB+Flow | 69.5
Gu et al. [19] | RGB+Flow | 76.3
Faster R-CNN + FPN (RGB) [38] | RGB | 73.2
Faster R-CNN + FPN (Flow) [38] | Flow | 64.7
Faster R-CNN + FPN [38] | RGB+Flow | 75.5
PCSC (Ours) | RGB+Flow | 79.2

Table 3.1 shows the mAPs from different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector.

Method | Streams | 0.2 | 0.5 | 0.75 | 0.5:0.95
Weinzaepfel et al. [83] | RGB+Flow | 46.8 | - | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 47.6 | - | - | -
Peng and Schmid [54] | RGB+Flow | 73.5 | 32.1 | 2.7 | 7.3
Saha et al. [60] | RGB+Flow | 66.6 | 36.4 | 7.9 | 14.4
Yang, Gao and Nevatia [93] | RGB+Flow | 73.5 | 37.8 | - | -
Singh et al. [67] | RGB+Flow | 73.5 | 46.3 | 15.0 | 20.4
Ye, Yang and Tian [94] | RGB+Flow | 76.2 | - | - | -
Kalogeiton et al. [26] | RGB+Flow | 76.5 | 49.2 | 19.7 | 23.4
Li et al. [31] | RGB+Flow | 77.9 | - | - | -
Gu et al. [19] | RGB+Flow | - | 59.9 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 77.5 | 51.3 | 14.2 | 21.9
Faster R-CNN + FPN (Flow) [38] | Flow | 72.7 | 44.2 | 7.4 | 17.2
Faster R-CNN + FPN [38] | RGB+Flow | 80.1 | 53.2 | 15.9 | 23.7
PCSC + AD [69] | RGB+Flow | 84.3 | 61.0 | 23.0 | 27.8
PCSC + ASD + AD + T-PCSC (Ours) | RGB+Flow | 84.7 | 63.1 | 23.6 | 28.9

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC module) is applied to refine the action tubes generated from our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less than that of the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than the other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ = 0.5.

Method | Streams | mAP (%)
Peng and Schmid [54] | RGB+Flow | 58.5
Ye, Yang and Tian [94] | RGB+Flow | 63.2
Kalogeiton et al. [26] | RGB+Flow | 65.7
Hou, Chen and Shah [21] | RGB | 61.3
Gu et al. [19] | RGB+Flow | 73.3
Sun et al. [72] | RGB+Flow | 77.9
Faster R-CNN + FPN (RGB) [38] | RGB | 64.7
Faster R-CNN + FPN (Flow) [38] | Flow | 68.4
Faster R-CNN + FPN [38] | RGB+Flow | 70.2
PCSC (Ours) | RGB+Flow | 74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method | Streams | 0.2 | 0.5 | 0.75 | 0.5:0.95
Gkioxari and Malik [18] | RGB+Flow | - | 53.3 | - | -
Wang et al. [79] | RGB+Flow | - | 56.4 | - | -
Weinzaepfel et al. [83] | RGB+Flow | 63.1 | 60.7 | - | -
Saha et al. [60] | RGB+Flow | 72.6 | 71.5 | 43.3 | 40.0
Peng and Schmid [54] | RGB+Flow | 74.1 | 73.1 | - | -
Singh et al. [67] | RGB+Flow | 73.8 | 72.0 | 44.5 | 41.6
Kalogeiton et al. [26] | RGB+Flow | 74.2 | 73.7 | 52.1 | 44.8
Ye, Yang and Tian [94] | RGB+Flow | 75.8 | 73.8 | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 78.2 | 73.4 | - | -
Hou, Chen and Shah [21] | RGB | 78.4 | 76.9 | - | -
Gu et al. [19] | RGB+Flow | - | 78.6 | - | -
Sun et al. [72] | RGB+Flow | - | 80.1 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 74.5 | 73.9 | 54.1 | 44.2
Faster R-CNN + FPN (Flow) [38] | Flow | 77.9 | 76.1 | 56.1 | 46.3
Faster R-CNN + FPN [38] | RGB+Flow | 79.1 | 78.5 | 57.2 | 47.6
PCSC (Ours) | RGB+Flow | 82.6 | 82.2 | 63.1 | 52.8

At the frame level (see Table 3.3), our PCSC method performs the second best, and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as its backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage | PCSC w/o feature cooperation | PCSC
0 | 75.5 | 78.1
1 | 76.1 | 78.6
2 | 76.4 | 78.8
3 | 76.7 | 79.1
4 | 76.7 | 79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector.

Method | Video mAP (%)
Faster R-CNN + FPN [38] | 53.2
PCSC | 56.4
PCSC + AD | 61.0
PCSC + ASD (single branch) | 58.4
PCSC + ASD | 60.7
PCSC + ASD + AD | 61.9
PCSC + ASD + AD + T-PCSC | 63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage | Video mAP (%)
0 | 61.9
1 | 62.8
2 | 63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and the Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, is improved as the number of stages increases. However, such improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both the region-proposal-level and the feature-level cooperation contribute to the performance improvement.
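The merging of stage outputs can be written as the short sketch below (PyTorch/torchvision; tensor names are illustrative and this is not the thesis code): the boxes from Stages 0..n are pooled and NMS with the 0.5 overlap threshold mentioned earlier is applied.

```python
import torch
from torchvision.ops import nms

def combine_stage_outputs(boxes_per_stage, scores_per_stage, iou_thresh=0.5):
    """boxes_per_stage: list of (M_i, 4) float tensors; scores_per_stage: list of (M_i,) tensors."""
    boxes = torch.cat(boxes_per_stage, dim=0)
    scores = torch.cat(scores_per_stage, dim=0)
    keep = nms(boxes, scores, iou_thresh)  # indices of the boxes surviving NMS
    return boxes[keep], scores[keep]
```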

Action tubes refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC is already able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage | LoR (%)
1 | 61.1
2 | 71.9
3 | 95.2
4 | 95.4

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 by using our temporal PCSC method at different stages, in order to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC + ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method is improved as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.

Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage | MABO (%)
1 | 84.1
2 | 84.3
3 | 84.5
4 | 84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ | Stage 1 | Stage 2 | Stage 3 | Stage 4
0.5 | 55.3 | 55.7 | 56.3 | 57.0
0.6 | 62.1 | 63.2 | 64.3 | 65.5
0.7 | 68.0 | 68.5 | 69.4 | 69.5
0.8 | 73.9 | 74.5 | 74.8 | 75.1
0.9 | 78.4 | 78.4 | 79.5 | 80.1

Table 3.11: Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2 | 3 | 4
Detection speed (FPS) | 31 | 29 | 28 | 26 | 24

Table 3.12: Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2
Detection speed (VPS) | 21 | 17 | 15


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the different streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
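The LoR statistic can be computed as in the following sketch (illustrative NumPy code; proposal arrays are assumed to be in [x1, y1, x2, y2] format).

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """boxes: (N, 4) arrays; returns an (N_a, N_b) IoU matrix."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

def large_overlap_ratio(rgb_proposals, flow_proposals, tau=0.7):
    """Fraction of RGB proposals whose best IoU with any flow proposal exceeds tau."""
    miou = pairwise_iou(rgb_proposals, flow_proposals).max(axis=1)
    return float((miou > tau).mean())
```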

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.


Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38] and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, one needs to trade off the localization performance against the speed when choosing the number of stages.


Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detections. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score of the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases where our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the detection cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., the two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections in the background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization, but each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, temporal action localization methods often simply employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to predict the action for each proposal. Most recent works can be roughly categorized into two types: (1) the works [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at both the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.

4.2 Our Approach

In the following section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H\times W\times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. This module is based on the base features $B^{RGB}$ and $B^{Flow}$, and each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $\mathcal{P}^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs of the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1\times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
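A minimal PyTorch sketch of this frame-based branch is shown below; the channel sizes, kernel sizes and class name are our own assumptions rather than the exact configuration, and the subsequent start/end pairing and NMS steps are omitted.

```python
import torch
import torch.nn as nn

class FrameBasedHead(nn.Module):
    def __init__(self, in_channels=1024, hidden=256):
        super().__init__()
        self.feature = nn.Sequential(  # frame-based features Q_0
            nn.Conv1d(in_channels, hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.start_head = nn.Conv1d(hidden, 1, kernel_size=1)  # starting probability
        self.end_head = nn.Conv1d(hidden, 1, kernel_size=1)    # ending probability

    def forward(self, base_feat):  # base_feat: (B, C, T)
        q = self.feature(base_feat)
        p_start = torch.sigmoid(self.start_head(q)).squeeze(1)  # (B, T)
        p_end = torch.sigmoid(self.end_head(q)).squeeze(1)      # (B, T)
        return q, p_start, p_end
```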

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that, while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and flow-stream AFC modules at stage n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1 the initial proposals $S^{RGB}_{n-2}$ and features $P^{RGB}_{n-2}$, $Q^{Flow}_{n-2}$ are denoted as $S^{RGB}_{0}$, $P^{RGB}_{0}$ and $Q^{Flow}_{0}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to "Detection Head" to generate the updated segments.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously from the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and from the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage n (n is an odd number) employs the preceding two-stream detected action segments $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ and the flow-stream frame-based feature $Q^{Flow}_{n-2}$ as the input features. This AFC stage outputs the RGB-stream action segments $S^{RGB}_{n}$ and the anchor-based feature $P^{RGB}_{n}$, and also updates the flow-stream frame-based features as $Q^{Flow}_{n}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since the rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages will be faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-Frame Cooperation Module

The Anchor-Frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P^{RGB}_{0}$, $P^{Flow}_{0}$ and $Q^{RGB}_{0}$, $Q^{Flow}_{0}$, as well as the proposals $S^{RGB}_{0}$, $S^{Flow}_{0}$, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P^{Flow}_{n-1}$ to the frame-based features $Q^{Flow}_{n-2}$, both from the flow stream. The message passing operation is defined as

$Q^{Flow}_{n} = \mathcal{M}_{anchor\rightarrow frame}(P^{Flow}_{n-1}) \oplus Q^{Flow}_{n-2}$,   (4.1)

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor\rightarrow frame}(\cdot)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1\times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q^{Flow}_{n}$ can generate more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can thus generate better frame-based proposals $\mathcal{Q}^{Flow}_{n}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q^{RGB}_{n} = \mathcal{M}_{anchor\rightarrow frame}(P^{RGB}_{n-1}) \oplus Q^{RGB}_{n-2}$, which flips the superscripts from Flow to RGB. $Q^{RGB}_{n}$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}^{RGB}_{n}$.
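The message passing operator $\mathcal{M}(\cdot)$ in Eq. (4.1), i.e., two stacked $1\times 1$ convolutions with ReLU whose output is added element-wise to the target features, can be sketched in PyTorch as follows (the channel size and class name are assumptions; this is an illustration, not the released implementation). The same structure can also serve the cross-stream updates introduced later in this section.

```python
import torch.nn as nn

class MessagePassing(nn.Module):
    """M_{anchor->frame} (and, with the same structure, M_{flow->rgb} / M_{rgb->flow})."""
    def __init__(self, channels=256):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size=1), nn.ReLU(),
        )

    def forward(self, source_feat, target_feat):
        # e.g. source_feat = P^{Flow}_{n-1}, target_feat = Q^{Flow}_{n-2}; both (B, C, T)
        return self.transform(source_feat) + target_feat  # element-wise addition
```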

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}^{Flow}_{n}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q^{Flow}_{n}$. We then combine $\mathcal{Q}^{Flow}_{n}$ with the two-stream anchor-based proposals $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ to form an updated set of anchor-based proposals at stage n, i.e., $\mathcal{P}^{RGB}_{n} = S^{RGB}_{n-2} \cup S^{Flow}_{n-1} \cup \mathcal{Q}^{Flow}_{n}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}^{Flow}_{n} = S^{Flow}_{n-2} \cup S^{RGB}_{n-1} \cup \mathcal{Q}^{RGB}_{n}$.
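As a simple illustration (hypothetical types; proposals represented as (t_start, t_end) tuples), the proposal combination step is just a union of the three proposal sets; near-duplicates are later handled by NMS in the detection head.

```python
def combine_proposals(s_rgb_prev2, s_flow_prev1, q_frame_new):
    """Each argument is a list of (t_start, t_end) tuples; exact duplicates are merged."""
    return list({seg for seg in (s_rgb_prev2 + s_flow_prev1 + q_frame_new)})
```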

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to Equation (4.1) by a message passing operation, i.e., $P^{RGB}_{n} = \mathcal{M}_{flow\rightarrow rgb}(P^{Flow}_{n-1}) \oplus P^{RGB}_{n-2}$ for the RGB stream and $P^{Flow}_{n} = \mathcal{M}_{rgb\rightarrow flow}(P^{RGB}_{n-1}) \oplus P^{Flow}_{n-2}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}^{RGB}_{n}$, $\mathcal{P}^{Flow}_{n}$ and the updated anchor-based features $P^{RGB}_{n}$, $P^{Flow}_{n}$. To increase the perceptual aperture around action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this work), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
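The following sketch (PyTorch; SOI pooling approximated here by adaptive average pooling over the segment span, with illustrative sizes) shows how the proposal feature and the two boundary-context features around the starting and ending indices can be pooled and concatenated before being fed to the regressor and classifier.

```python
import torch
import torch.nn.functional as F

def soi_pool(feat, t_start, t_end, out_len=8):
    """feat: (C, T) anchor-based features; pool the [t_start, t_end) span into out_len bins."""
    span = feat[:, max(t_start, 0):max(t_end, t_start + 1)]
    return F.adaptive_avg_pool1d(span.unsqueeze(0), out_len).squeeze(0)  # (C, out_len)

def detection_head_input(feat, t_start, t_end, interval=10):
    center = soi_pool(feat, t_start, t_end)                       # proposal feature
    start_ctx = soi_pool(feat, t_start - interval, t_start + interval)  # boundary context
    end_ctx = soi_pool(feat, t_end - interval, t_end + interval)
    return torch.cat([center, start_ctx, end_ctx], dim=1)         # fed to regressor/classifier
```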

4.2.4 Training Details

At each AFC stage, the training losses are applied at both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: one is the cross-entropy loss for the action classification task, and the other is the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which indicates the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
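A minimal sketch of the joint objective is given below (PyTorch; the tensor names and the binary cross-entropy form used for the frame-wise starting/ending supervision are our own assumptions), with the per-stage losses summed for end-to-end training.

```python
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels, reg_pred, reg_target,
                   start_logits, start_labels, end_logits, end_labels):
    # Anchor-based branch: classification cross-entropy + smooth-L1 boundary regression.
    anchor_loss = F.cross_entropy(cls_logits, cls_labels) + \
                  F.smooth_l1_loss(reg_pred, reg_target)
    # Frame-based branch: frame-wise starting/ending supervision.
    frame_loss = F.binary_cross_entropy_with_logits(start_logits, start_labels) + \
                 F.binary_cross_entropy_with_logits(end_logits, end_labels)
    return anchor_loss + frame_loss

def total_loss(per_stage_inputs):
    # per_stage_inputs: list of argument tuples, one per AFC stage.
    return sum(afc_stage_loss(*args) for args in per_stage_inputs)
```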

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features as lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features thanks to the gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. The number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size as 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and the UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlapping area, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
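For reference, a minimal sketch of the tIoU used in this matching rule is given below (segments represented as (t_start, t_end) pairs; this is illustrative code, not the evaluation toolkit).

```python
def temporal_iou(seg_a, seg_b):
    """seg = (t_start, t_end) in seconds or frames."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0
```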

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. The results on THUMOS14 are reported in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

Method | 0.1 | 0.2 | 0.3 | 0.4 | 0.5
SCNN [63] | 47.7 | 43.5 | 36.3 | 28.7 | 19.0
REINFORCE [95] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1
CDC [62] | - | - | 40.1 | 29.4 | 23.3
SMS [98] | 51.0 | 45.2 | 36.5 | 27.8 | 17.8
TURN [15] | 54.0 | 50.9 | 44.1 | 34.9 | 25.6
R-C3D [85] | 54.5 | 51.5 | 44.8 | 35.6 | 28.9
SSN [103] | 66.0 | 59.4 | 51.9 | 41.0 | 29.8
BSN [37] | - | - | 53.5 | 45.0 | 36.9
MGG [46] | - | - | 53.9 | 46.8 | 37.4
BMN [35] | - | - | 56.0 | 47.4 | 38.8
TAL-Net [5] | 59.8 | 57.1 | 53.2 | 48.5 | 42.8
P-GCN [99] | 69.5 | 67.8 | 63.6 | 57.8 | 49.1
Ours | 71.2 | 68.9 | 65.1 | 59.5 | 51.2

ActivityNet v1.3. We also compare the mAPs at the tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging all mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and its average mAP (34.91%) is better than those of BSN and BMN.

Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

Method | 0.5 | 0.75 | 0.95 | Average
CDC [62] | 45.30 | 26.00 | 0.20 | 23.80
TCN [9] | 36.44 | 21.15 | 3.90 | -
R-C3D [85] | 26.80 | - | - | -
SSN [103] | 39.12 | 23.48 | 5.49 | 23.98
TAL-Net [5] | 38.23 | 18.30 | 1.30 | 20.22
P-GCN [99] | 42.90 | 28.14 | 2.47 | 26.99
Ours | 44.31 | 29.85 | 5.47 | 28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

Method | 0.5 | 0.75 | 0.95 | Average
BSN [37] | 46.45 | 29.96 | 8.02 | 30.03
P-GCN [99] | 48.26 | 33.16 | 3.27 | 31.11
BMN [35] | 50.07 | 34.78 | 8.29 | 33.85
Ours | 52.04 | 35.92 | 7.97 | 34.91

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU thresh. δ    0.2   0.5   0.75  0.5:0.95
LTT [83]            46.8  -     -     -
MRTSR [54]          73.5  32.1  2.7   7.3
MSAT [60]           66.6  36.4  7.9   14.4
ROAD [67]           73.5  46.3  15.0  20.4
ACT [26]            76.5  49.2  19.7  23.4
TAD [19]            -     59.9  -     -
PCSC [69]           84.3  61.0  23.0  27.8
Ours                84.9  63.5  23.7  29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which Q^Flow_n and Q^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features as lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n-1} and P^RGB_{n-2} (for the RGB-stream AFC module) and


Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage        0      1      2      3      4
AFC          44.75  46.92  48.14  48.37  48.22
w/o CS       44.21  44.96  45.47  45.48  45.37
w/o CG       43.55  45.95  46.71  46.65  46.67
w/o CS-MP    44.75  45.34  45.51  45.53  45.49
w/o CG-MP    44.75  46.32  46.97  46.85  46.79
w/o PC       42.14  43.21  45.14  45.74  45.77

P^RGB_{n-1} and P^Flow_{n-2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". Also, the discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are P^RGB_0 ∪ P^Flow_0 for "w/o CG" and Q^RGB_0 ∪ Q^Flow_0 for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved


Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage    Comparisons               LoR (%)
1        P^RGB_1 vs. P^Flow_0      67.1
2        P^Flow_2 vs. P^RGB_1      79.2
3        P^RGB_3 vs. P^Flow_2      89.5
4        P^Flow_4 vs. P^RGB_3      97.1

when the number of stages increases, and the results become stable after 2 or 3 stages.

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets P^RGB_1 at Stage 1 and P^Flow_2 at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in P^Flow_2 among all proposals in P^Flow_2. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals P^RGB_1 is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say P^Flow_2) is calculated by comparing it with all proposals in the other proposal set (say P^RGB_1) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
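To make the metric concrete, the following is a minimal sketch (not the thesis code) of how the LoR score could be computed for the proposals of a single video, assuming each proposal is represented by its (start, end) boundaries in seconds; the function names are illustrative.

```python
def temporal_iou(seg_a, seg_b):
    """Temporal IoU between two segments given as (start, end)."""
    inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
    union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current_props, preceding_props, thresh=0.7):
    """Large-over-Ratio: fraction of proposals in `current_props` whose max-tIoU
    against `preceding_props` exceeds `thresh` (per video; the table reports
    the average over all test videos)."""
    if not current_props:
        return 0.0
    similar = sum(
        1 for p in current_props
        if max(temporal_iou(p, q) for q in preceding_props) > thresh
    )
    return similar / len(current_props)
```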

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in P^Flow_4 and those in P^RGB_3 are very similar, even though they are generated from different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal


action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
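Below is a minimal per-video sketch of this evaluation protocol, under the simplifying assumption that proposals are already sorted by confidence; in the full protocol, AN is the average proposal budget over the whole test set rather than a fixed per-video count.

```python
import numpy as np

def tiou(a, b):
    """Temporal IoU of two (start, end) segments."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def ar_at_an(proposals, gt_segments, an=100):
    """AR@AN for one video: keep the top `an` proposals and average the recall
    over the eleven tIoU thresholds 0.5, 0.55, ..., 1.0 used on THUMOS14."""
    kept = proposals[:an]
    recalls = []
    for thr in np.arange(0.5, 1.001, 0.05):
        hit = sum(1 for g in gt_segments if any(tiou(p, g) >= thr for p in kept))
        recalls.append(hit / max(len(gt_segments), 1))
    return float(np.mean(recalls))
```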

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown as the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, and (d) AR@500.


Figure 4.3: Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37] and by our


PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting or ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component of our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our


newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in either images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals, and the proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks that do not use a pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et


al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals in each frame, since the output bounding boxes are retrieved from these candidate bounding boxes. In the concurrent work STGVT [76], similar to our proposed framework, a visual-linguistic transformer is also adopted to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely V = {V^clip_k}_{k=1}^K, where V^clip_k indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as S = {s_n}_{n=1}^N, where s_n indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube B = {b_t}_{t=t_s}^{t_e} containing the object of interest (i.e., the target object) between the t_s-th frame and the t_e-th frame, where b_t is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and t_s and t_e are the temporal starting and ending frame indexes of the object tube B.


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes b_t of the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of HW × C, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, and is employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature F^clip_k ∈ R^{T×HW×C}, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
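A minimal sketch of this encoding step is given below, assuming a standard torchvision ResNet-101 truncated after its 4th residual block; the exact input resolution and pre-processing are not specified here and are assumptions.

```python
import torch
import torchvision

backbone = torchvision.models.resnet101(pretrained=True)
# keep everything up to (and including) the 4th residual block
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

def clip_feature(frames):
    """frames: tensor of shape (T, 3, H_img, W_img) holding one video clip.
    Returns the clip feature of shape (T, HW, C)."""
    with torch.no_grad():
        fmap = feature_extractor(frames)                      # (T, C, H, W)
    t, c, h, w = fmap.shape
    return fmap.permute(0, 2, 3, 1).reshape(t, h * w, c)      # (T, HW, C)
```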

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two


special tokens, ['CLS'] and ['SEP'], before and after the textual input tokens of the description to construct the complete textual input tokens E = {e_n}_{n=1}^{N+2}, where e_n is the n-th textual input token.
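For clarity, a small sketch of this token construction is shown below; `word_embedding` is an assumed callable mapping a token string to its embedding vector and is not part of the original text.

```python
def build_text_tokens(sentence, word_embedding):
    """Wrap the query with the [CLS]/[SEP] tokens and embed every word,
    giving E = {e_n}, n = 1..N+2."""
    words = ["[CLS]"] + sentence.split() + ["[SEP]"]
    return [word_embedding(w) for w in words]
```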

Multi-modal Feature Learning. Given the visual input feature F^clip_k and the textual input tokens E, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of T × C and the initial spatial feature with the size of HW × C, which are then concatenated to construct the combined visual feature with the size of (T + HW) × C. We then pass the combined visual feature to the textual branch, where it is used as the key and value in


Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs of the visual and textual branches in the last co-attention layer as the text-guided visual feature F_tv ∈ R^{T×HW×C} and the visual-guided textual feature, respectively.
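The combine/decompose logic of the STCD module can be sketched as follows; this is a simplified illustration rather than the thesis implementation, and the feature dimension and number of attention heads are assumptions.

```python
import torch
import torch.nn as nn

class STCD(nn.Module):
    """Simplified sketch of the Spatio-temporal Combination and Decomposition module."""
    def __init__(self, dim=1024, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, textual):
        # visual: (B, T, HW, C); textual: (B, N, C) output of the textual branch
        b, t, hw, c = visual.shape
        temporal = visual.mean(dim=2)                 # (B, T, C)  spatial average pooling
        spatial = visual.mean(dim=1)                  # (B, HW, C) temporal average pooling
        combined = torch.cat([temporal, spatial], 1)  # (B, T+HW, C) combined visual feature
        # text-guided visual feature: queries are the combined visual tokens,
        # keys/values come from the textual branch
        guided, _ = self.attn(combined, textual, textual)       # (B, T+HW, C)
        g_temporal, g_spatial = guided[:, :t], guided[:, t:]     # (B, T, C), (B, HW, C)
        # replicate back to (B, T, HW, C), add to the input and normalize
        out = visual + g_temporal.unsqueeze(2) + g_spatial.unsqueeze(1)
        # `combined` is also passed to the textual branch, where it serves as key and value
        return self.norm(out), combined
```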

Spatial Localization. Given the text-guided visual feature F_tv, our SVG-net predicts the bounding box b_t in each frame. We first reshape F_tv to the size of T × H × W × C. Taking the feature of each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap Â ∈ R^{8H×8W}, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and bottom-right coordinates of the predicted bounding box.
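A minimal sketch of the per-frame detection heads and the box decoding step is given below; the intermediate channel size is an assumption, and the code illustrates the general CenterNet-style design rather than the exact architecture used in the thesis.

```python
import torch
import torch.nn as nn

class SpatialHead(nn.Module):
    """Per-frame CenterNet-style heads: 3 deconvolutions (x8 upsampling),
    then a centre-heatmap head and a size-regression head."""
    def __init__(self, c_in=1024, c_mid=256):
        super().__init__()
        ups = []
        for _ in range(3):
            ups += [nn.ConvTranspose2d(c_in, c_mid, 4, stride=2, padding=1), nn.ReLU()]
            c_in = c_mid
        self.upsample = nn.Sequential(*ups)
        self.center = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(),
                                    nn.Conv2d(c_mid, 1, 1))   # heatmap logits
        self.size = nn.Sequential(nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(),
                                  nn.Conv2d(c_mid, 2, 1))     # (w, h) at every location

    def forward(self, frame_feat):              # frame_feat: (B, C, H, W)
        x = self.upsample(frame_feat)           # (B, c_mid, 8H, 8W)
        return torch.sigmoid(self.center(x)), self.size(x)

def decode_box(heatmap, size_map):
    """Pick the highest-scoring centre and convert it to corner coordinates."""
    b, _, h, w = heatmap.shape
    idx = heatmap.view(b, -1).argmax(dim=1)
    ys = torch.div(idx, w, rounding_mode="floor").float()
    xs = (idx % w).float()
    wh = size_map.view(b, 2, -1)[torch.arange(b), :, idx]      # (B, 2)
    return torch.stack([xs - wh[:, 0] / 2, ys - wh[:, 1] / 2,
                        xs + wh[:, 0] / 2, ys + wh[:, 1] / 2], dim=1)
```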

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending


frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F_tv to produce the global text-guided visual feature F_gtv ∈ R^{T×C} for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score of each frame between the corresponding visual feature from F_gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score ô_obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes with smoothed relevance scores below the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t^0_s, t^0_e). The initial temporal boundaries (t^0_s, t^0_e) and the predicted bounding boxes b_t form the initial tube prediction result B_init = {b_t}_{t=t^0_s}^{t^0_e}.
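The temporal part of this step can be sketched as below, using the median-filter window (9) and threshold (0.3) reported in the implementation details; the longest above-threshold run is kept as the initial segment.

```python
import numpy as np
from scipy.signal import medfilt

def initial_temporal_boundaries(rel_scores, window=9, thresh=0.3):
    """Smooth the per-frame relevance scores with a median filter, threshold them,
    and keep the longest run of frames whose smoothed score passes the threshold.
    Returns (t_s, t_e) frame indices, or None if no frame passes the threshold."""
    smoothed = medfilt(np.asarray(rel_scores, dtype=np.float32), kernel_size=window)
    keep = smoothed >= thresh
    best, start = None, None
    for t, k in enumerate(keep):
        if k:
            start = t if start is None else start
            if best is None or (t - start) > (best[1] - best[0]):
                best = (start, t)
        else:
            start = None
    return best
```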

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame with a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as (x̄, ȳ), w̄, h̄ and ō, respectively. When the frame contains a ground-truth bounding box, we have ō = 1; otherwise, we have ō = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap A for each frame by using the Gaussian kernel


a_{xy} = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where a_{xy} is the value of A ∈ R^{8H×8W} at the spatial location (x, y) and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:

    L_SVG = λ₁ L_c(Â, A) + λ₂ L_rel(ô_obj, ō) + λ₃ L_size(ŵ_{x̄ȳ}, ĥ_{x̄ȳ}, w̄, h̄),    (5.1)

where Â ∈ R^{8H×8W} is the predicted heatmap, ô_obj is the predicted relevance score, and ŵ_{x̄ȳ} and ĥ_{x̄ȳ} are the predicted width and height of the bounding box centered at the position (x̄, ȳ). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ₁ = λ₂ = 1 and λ₃ = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary Material for further details.
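A simplified per-frame sketch of the heatmap target and the combined objective is given below; for brevity the focal loss of [39] is replaced here by a plain binary cross-entropy, so this is an illustration of Eq. (5.1) rather than the exact training loss.

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap(h, w, cx, cy, sigma):
    """Ground-truth centre heatmap A for one frame; (cx, cy) is the ground-truth
    box centre on the 8H x 8W grid and sigma is set from the object size."""
    ys = torch.arange(h).view(-1, 1).float()
    xs = torch.arange(w).view(1, -1).float()
    return torch.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))

def svg_loss(pred_heat, gt_heat, pred_rel, gt_rel, pred_wh, gt_wh,
             l1=1.0, l2=1.0, l3=0.1):
    """Per-frame L_SVG sketch: centre loss + relevance loss + size loss,
    with the weights lambda_1 = lambda_2 = 1 and lambda_3 = 0.1."""
    l_c = F.binary_cross_entropy(pred_heat, gt_heat)      # stand-in for the focal loss
    l_rel = F.binary_cross_entropy(pred_rel, gt_rel)      # relevance score prediction
    l_size = F.l1_loss(pred_wh, gt_wh)                    # width/height regression
    return l1 * l_c + l2 * l_rel + l3 * l_size
```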

5.2.3 Temporal Boundary Refinement

Accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output of the SVG-net (i.e., B_init) is used as an anchor, and its temporal boundaries are first refined by the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


t̃_s and t̃_e. The frame-based branch then continues to refine the boundaries t̃_s and t̃_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., t_s and t_e) can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries (t^0_s, t^0_e) and length l = t^0_e − t^0_s, we temporally extend the anchor before the starting frame and after the ending frame by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t^0_s − l/4, t^0_e + l/4] and generate the anchor feature F by stacking, along the temporal dimension, the corresponding features of the 20 sampled frames from the pooled video feature F_pool, where the pooled video feature F_pool is generated by stacking the spatially pooled features of all frames within the video along the temporal dimension.
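The anchor feature construction can be sketched as follows, assuming the pooled video feature is available as a (KT, C) tensor of spatially pooled per-frame features; boundary clamping is an added detail not stated in the text.

```python
import torch

def anchor_feature(pooled_video_feat, t_s, t_e, num_samples=20):
    """Extend the anchor by l/4 on both sides and uniformly sample `num_samples`
    frames from the pooled video feature (shape (KT, C))."""
    length = float(t_e - t_s)
    ext_s, ext_e = t_s - length / 4.0, t_e + length / 4.0
    idx = torch.linspace(ext_s, ext_e, num_samples).round().long()
    idx = idx.clamp(0, pooled_video_feat.shape[0] - 1)   # stay inside the video
    return pooled_video_feat[idx]                        # (num_samples, C)
```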

The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, on which the average pooling operation is applied to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t^0_s and t^0_e and generate the new starting and ending frame positions t̃^1_s and t̃^1_e.

Frame-based branch. For the frame-based branch, we extract the feature of each frame within the starting/ending duration and predict the probability of this frame being the starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t̃^1_s generated by the previous anchor-based branch, we define the starting duration as D_s = [t̃^1_s − D/2 + 1, t̃^1_s + D/2], where D is the length


of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t^1_s. Accordingly, we can produce the ending frame position t^1_e. Note that the kernel size of the 1D convolution operations is 1, and the parameters of the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t^1_s and t^1_e are generated, they can be used as the input for the anchor-based branch at the next stage.
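The per-boundary prediction of the frame-based branch can be sketched as below; the feature dimension is an assumption, and separate (non-shared) heads would be instantiated for the starting and ending positions.

```python
import torch
import torch.nn as nn

class BoundaryHead(nn.Module):
    """1x1 1D convolution over the text-guided frame-level features inside the
    starting (or ending) duration, followed by an argmax over the D candidates."""
    def __init__(self, c=1024):
        super().__init__()
        self.conv = nn.Conv1d(c, 1, kernel_size=1)

    def forward(self, frame_feats, duration_start):
        # frame_feats: (B, D, C) text-guided features of the D frames in the duration
        logits = self.conv(frame_feats.transpose(1, 2)).squeeze(1)   # (B, D)
        probs = torch.sigmoid(logits)
        offset = probs.argmax(dim=1)            # index of the most likely boundary frame
        return duration_start + offset          # refined frame position in the video
```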

Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

    L_TBR = λ₄ L_anc(Δt̂, Δt) + λ₅ L_s(p̂_s, p_s) + λ₆ L_e(p̂_e, p_e),    (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and ending frames, respectively. In L_anc, Δt̂ and Δt are the predicted offsets and the ground-truth offsets, respectively. In L_s, p̂_s is the predicted probability of being the starting frame for the frames within the starting duration, while p_s is the ground-truth label for starting frame position prediction. Similarly, we use p̂_e and p_e for the loss L_e. In this work, we empirically set the loss weights λ₄ = λ₅ = λ₆ = 1.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p and the temporal length T of each clip are set to 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented using PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames between the detected tube and the ground-truth tube (i.e., the intersection of the frame sets of the two tubes), and S_U is the set of frames from either the detected tube or the ground-truth tube. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
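These metrics can be computed with a short sketch like the one below, where a tube is assumed to be represented as a mapping from frame index to bounding box and `box_iou` is an assumed helper returning the spatial IoU of two boxes.

```python
def viou(pred_tube, gt_tube, box_iou):
    """vIoU = (1/|S_U|) * sum over t in S_I of r_t, with S_I/S_U the
    intersection/union of the frame sets of the two tubes."""
    inter_frames = set(pred_tube) & set(gt_tube)
    union_frames = set(pred_tube) | set(gt_tube)
    if not union_frames:
        return 0.0
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in inter_frames) / len(union_frames)

def m_viou_and_recall(pred_tubes, gt_tubes, box_iou, r=0.5):
    """m_vIoU and vIoU@R over a list of (prediction, ground-truth) tube pairs."""
    scores = [viou(p, g, box_iou) for p, g in zip(pred_tubes, gt_tubes)]
    return sum(scores) / len(scores), sum(s > r for s in scores) / len(scores)
```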

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of ST-ViLBERT, we introduce this baseline by replacing ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components of our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results of our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement of our SVG-net over our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating


Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates that the results are quoted from the original work.

                              Declarative Sentence Grounding             Interrogative Sentence Grounding
Method                        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*                  19.75      25.77        14.60              18.32      21.10        12.83
Vanilla (ours)                21.49      27.69        15.77              20.22      23.17        13.82
SVG-net (ours)                22.87      28.31        16.31              21.57      24.09        14.33
SVG-net + TBR-net (ours)      25.21      30.99        18.21              23.67      26.26        16.01


Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates that the results are quoted from the original work.

Method                        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGVT [76]*                   18.15      26.81        9.48
Vanilla (ours)                17.94      25.59        8.75
SVG-net (ours)                19.45      27.78        10.12
SVG-net + TBR-net (ours)      22.41      30.31        12.07

the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information from the input visual feature is preserved in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments using different variants of the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved by increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4)


Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                       Declarative Sentence Grounding      Interrogative Sentence Grounding
Methods              Stages   m_vIoU  vIoU@0.3  vIoU@0.5                   m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net              -        22.87   28.31     16.31                      21.57   24.09     14.33
SVG-net + IFB-net    1        23.41   28.84     16.69                      22.12   24.48     14.57
                     2        23.82   28.97     16.85                      22.55   24.69     14.71
                     3        23.77   28.91     16.87                      22.61   24.72     14.79
SVG-net + IAB-net    1        23.27   29.57     16.39                      21.78   24.93     14.31
                     2        23.74   29.98     16.52                      22.21   25.42     14.59
                     3        23.85   30.15     16.57                      22.43   25.71     14.47
SVG-net + TBR-net    1        24.32   29.92     17.22                      22.69   25.37     15.29
                     2        24.95   30.75     17.02                      23.34   26.03     15.83
                     3        25.21   30.99     18.21                      23.67   26.26     16.01


Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

5.4 Summary

In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization, namely spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our works and discuss potential directions for video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows.

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level.


The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential direction for video localization in the future is to unify spatial video localization and temporal video localization into


one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results with two separate networks. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress in handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.

[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT '98. Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.

[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.

[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.

[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.

[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.

[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.


[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.

[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.

[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.

[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.

[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.

[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.

[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.

[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.


[18] G. Gkioxari and J. Malik. "Finding action tubes". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.

[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In The IEEE International Conference on Computer Vision (ICCV). 2017.

[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.

[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.

[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.

[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.

[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.


[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.

[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.

[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In arXiv preprint arXiv:1904.11574 (2019).

[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.

[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In AAAI. 2020, pp. 11336–11344.

[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In ArXiv. 2019.

[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.

[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.

[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.

[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.

[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.

[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.

[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press, 2018, pp. 849–855.

[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In European Conference on Computer Vision. Springer, 2016, pp. 21–37.

[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.

[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.


[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In Advances in Neural Information Processing Systems. 2019, pp. 13–23.

[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.

[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.

[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.

[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.

[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.

[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.

[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In European Conference on Computer Vision. Springer, 2016, pp. 744–759.

[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.


[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.

[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.

[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In Advances in Neural Information Processing Systems. 2015, pp. 91–99.

[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.

[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.

[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.

[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.

[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.


[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.

[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In arXiv preprint arXiv:1409.1556 (2014).

[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.

[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.

[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In arXiv preprint arXiv:1212.0402 (2012).

[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.

[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In International Conference on Learning Representations. 2020.

[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.

[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In The European Conference on Computer Vision (ECCV). 2018.

[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region and pixel level prediction". In Advances in Neural Information Processing Systems. 2018, pp. 762–772.


[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.

[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In arXiv preprint arXiv:2011.05049 (2020).

[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.

[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.

[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.

[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.

[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In European Conference on Computer Vision. Springer, 2016, pp. 20–36.

[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.

[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.

[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.

[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.

[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.

[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.

[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.

[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.

[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In arXiv preprint arXiv:2008.01059 (2020).


[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.

[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In arXiv preprint arXiv:1708.00042 (2017).

[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.

[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.

[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.

[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.

[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.

[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.

[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.


[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.

[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.

[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In arXiv preprint arXiv:1904.07850 (2019).

[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion and appearance for action classification and detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.


For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high quality research papers, which helped me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through useful discussions with them, I worked out many difficult problems and was also encouraged to keep exploring the unknown in the future.

Last but not least, I sincerely thank my parents and my wife, from the bottom of my heart, for everything.

Rui SU

University of Sydney
June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.


Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
  1.1 Spatial-temporal Action Localization 2
  1.2 Temporal Action Localization 5
  1.3 Spatio-temporal Visual Grounding 8
  1.4 Thesis Outline 11

2 Literature Review 13
  2.1 Background 13
    2.1.1 Image Classification 13
    2.1.2 Object Detection 15
    2.1.3 Action Recognition 18
  2.2 Spatio-temporal Action Localization 20
  2.3 Temporal Action Localization 22
  2.4 Spatio-temporal Visual Grounding 25
  2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
  3.1 Background 30
    3.1.1 Spatial temporal localization methods 30
    3.1.2 Two-stream R-CNN 31
  3.2 Action Detection Model in Spatial Domain 32
    3.2.1 PCSC Overview 32
    3.2.2 Cross-stream Region Proposal Cooperation 34
    3.2.3 Cross-stream Feature Cooperation 36
    3.2.4 Training Details 38
  3.3 Temporal Boundary Refinement for Action Tubes 38
    3.3.1 Actionness Detector (AD) 41
    3.3.2 Action Segment Detector (ASD) 42
    3.3.3 Temporal PCSC (T-PCSC) 44
  3.4 Experimental results 46
    3.4.1 Experimental Setup 46
    3.4.2 Comparison with the State-of-the-art methods 47
      Results on the UCF-101-24 Dataset 48
      Results on the J-HMDB Dataset 50
    3.4.3 Ablation Study 52
    3.4.4 Analysis of Our PCSC at Different Stages 56
    3.4.5 Runtime Analysis 57
    3.4.6 Qualitative Results 58
  3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
  4.1 Background 62
    4.1.1 Action Recognition 62
    4.1.2 Spatio-temporal Action Localization 62
    4.1.3 Temporal Action Localization 63
  4.2 Our Approach 64
    4.2.1 Overall Framework 64
    4.2.2 Progressive Cross-Granularity Cooperation 67
    4.2.3 Anchor-frame Cooperation Module 68
    4.2.4 Training Details 70
    4.2.5 Discussions 71
  4.3 Experiment 72
    4.3.1 Datasets and Setup 72
    4.3.2 Comparison with the State-of-the-art Methods 73
    4.3.3 Ablation study 76
    4.3.4 Qualitative Results 81
  4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
  5.1 Background 84
    5.1.1 Vision-language Modelling 84
    5.1.2 Visual Grounding in Images/Videos 84
  5.2 Methodology 85
    5.2.1 Overview 85
    5.2.2 Spatial Visual Grounding 87
    5.2.3 Temporal Boundary Refinement 92
  5.3 Experiment 95
    5.3.1 Experiment Setup 95
    5.3.2 Comparison with the State-of-the-art Methods 98
    5.3.3 Ablation Study 98
  5.4 Summary 102

6 Conclusions and Future Work 103
  6.1 Conclusions 103
  6.2 Future Work 104

Bibliography 107


List of Figures

1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6

3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33

3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39

3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43

3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57

3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b)-(d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0, and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments. 66

4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. 80

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5. 48

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector. 49

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5. 50

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds. 51

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset. 52

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector, while "ASD" is for the action segment detector. 52

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset. 52

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset. 54

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset. 55

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset. 55

3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages. 55

3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages. 55

4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds. 74

4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75

4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75

4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds. 76

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset. 77

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset. 78

5.1 Results of different methods on the VidSTG dataset. Marked results are quoted from the original work. 99

5.2 Results of different methods on the HC-STVG dataset. Marked results are quoted from the original work. 100

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset. 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones, and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.


Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, this task has attracted increasing research interest. The cluttered background, occlusion, and large intra-class variance make spatio-temporal action localization a very challenging task.
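To make the output format concrete, the short sketch below shows one plausible way to represent an action tube in code; the field names are our own illustration and are not taken from any released implementation.

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class ActionTube:
        # a per-frame bounding box sequence plus the action label and a score;
        # this layout is an assumption made for illustration only
        label: str                                      # e.g. an action class name
        frames: List[int]                               # indices of the covered frames
        boxes: List[Tuple[float, float, float, float]]  # (x1, y1, x2, y2) per frame
        score: float                                    # confidence of the whole tube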

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve the action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
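As a rough illustration of the data association step mentioned above, the following sketch links per-frame detections into a tube by greedily balancing the class score against the spatial overlap (IoU) with the previously selected box; it is a simplified stand-in, not the exact linking procedure used in [18, 54, 60, 67].

    def iou(a, b):
        # intersection-over-union of two (x1, y1, x2, y2) boxes
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-6)

    def link_detections(detections_per_frame):
        # detections_per_frame: list over frames of [(box, class_score), ...]
        tube, prev_box = [], None
        for dets in detections_per_frame:
            if not dets:
                continue
            box, score = max(
                dets,
                key=lambda d: d[1] + (iou(d[0], prev_box) if prev_box else 0.0))
            tube.append((box, score))
            prev_box = box
        return tube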

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
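The two ideas can be summarized as follows; this is only a minimal sketch assuming PyTorch-style feature maps, with layer sizes chosen arbitrarily, and it is not the exact PCSC implementation.

    import torch
    import torch.nn as nn

    def combine_proposals(rgb_boxes, flow_boxes):
        # proposal-level cooperation: enlarge the training proposal set by
        # pooling the latest region proposals from both streams
        return torch.cat([rgb_boxes, flow_boxes], dim=0)

    class CrossStreamMessagePassing(nn.Module):
        # feature-level cooperation: inject the helper stream's features into
        # the current stream through a small learnable transform
        def __init__(self, channels=256):
            super().__init__()
            self.transform = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True))

        def forward(self, current_feat, helper_feat):
            return current_feat + self.transform(helper_feat)

    # at the next stage the roles of the two streams are swapped, so each
    # stream in turn benefits from the other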

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries, and to further improve spatio-temporal localization results at the video level.
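For intuition only, the snippet below shows how per-frame scores from a class-specific actionness detector could be turned into the temporal extent of a linked tube; the threshold is an arbitrary assumption rather than the setting used in this thesis.

    def trim_tube(actionness_scores, threshold=0.5):
        # actionness_scores: per-frame probabilities from the binary detector
        # of the tube's own action class; returns (start_index, end_index) or None
        keep = [i for i, s in enumerate(actionness_scores) if s >= threshold]
        if not keep:
            return None
        return keep[0], keep[-1]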

Our contributions are briefly summarized as follows:

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to object detection methods like [58], the prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.

Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
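A toy illustration of the two granularities is given below: the anchor-based branch turns pre-defined anchors plus regressed offsets into segments, while the frame-based branch groups consecutive frames whose actionness exceeds a threshold. Both functions are simplified assumptions, not the actual proposal generators used later in this thesis.

    def anchor_based_proposals(anchors, offsets):
        # anchors: list of (center, length); offsets: list of (d_center, d_length)
        return [(c + dc - 0.5 * (l + dl), c + dc + 0.5 * (l + dl))
                for (c, l), (dc, dl) in zip(anchors, offsets)]

    def frame_based_proposals(actionness, threshold=0.5):
        # group consecutive frames with high actionness into candidate segments
        segments, start = [], None
        for t, score in enumerate(list(actionness) + [0.0]):  # sentinel closes runs
            if score >= threshold and start is None:
                start = t
            elif score < threshold and start is not None:
                segments.append((start, t))
                start = None
        return segments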

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals that have higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework, named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation that passes complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream and flow-stream AFC modules are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa; a schematic of this alternating update scheme is sketched after the contribution list below. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3, and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
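The alternating update scheme referred to above can be sketched as a simple loop; afc_rgb and afc_flow are placeholders standing in for the RGB-stream and flow-stream AFC modules, so this is a schematic rather than the actual training code.

    def progressive_cooperation(afc_rgb, afc_flow, proposals_and_features, num_stages=4):
        # odd stages refresh the RGB-stream AFC module with the latest flow-stream
        # outputs; even stages swap the roles, so complementary information is
        # exchanged progressively over the stages
        state = proposals_and_features
        for stage in range(1, num_stages + 1):
            state = afc_rgb(state) if stage % 2 == 1 else afc_flow(state)
        return state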

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of


bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, in which significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produce more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the


STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover,


we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential directions for video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters


(weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work sparked the research interest of using deep learning methods to solve computer vision problems.

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that convolutional layers with large kernel sizes were used in AlexNet, which can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without reducing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
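To make the parameter saving concrete, the following sketch (assuming PyTorch is available; the channel number C = 64 is an arbitrary example) compares the weight counts of a single 5×5 convolution and a stack of two 3×3 convolutions, both covering a 5×5 effective receptive field:

import torch.nn as nn

C = 64  # number of input/output channels (arbitrary example)

single_5x5 = nn.Conv2d(C, C, kernel_size=5, padding=2, bias=False)
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
    nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(single_5x5))   # 5*5*C*C   = 102400
print(count(stacked_3x3))  # 2*3*3*C*C =  73728

The stacked variant uses fewer weights while also inserting an extra nonlinearity between the two layers.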

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional


neural network is built by stacking such inception modules to achieve high image classification performance.
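The following simplified sketch illustrates the idea of an inception-style module with parallel 1×1, 3×3 and 5×5 convolutions plus a pooling branch whose outputs are concatenated along the channel dimension; it is only a rough illustration of the design described above, not the exact GoogLeNet configuration:

import torch
import torch.nn as nn

class SimpleInceptionBlock(nn.Module):
    # Parallel branches with different kernel sizes, concatenated along channels.
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
        )

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

x = torch.randn(1, 64, 28, 28)
out = SimpleInceptionBlock(64, 32)(x)   # shape: (1, 128, 28, 28)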

VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to a very large value, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
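A minimal sketch of such a residual block with an identity shortcut (assuming the input and output dimensions are equal, so no projection layer is needed) is given below:

import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # y = F(x) + x: the identity shortcut lets gradients flow directly
    # from deep layers back to shallow layers.
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
            nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(ch),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)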

Instead of only increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections with the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
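The dense connectivity pattern can be sketched at the layer level as follows (a simplified illustration; the actual DenseNet design also includes batch normalization, bottleneck layers and transition layers):

import torch
import torch.nn as nn

class SimpleDenseBlock(nn.Module):
    # Each layer takes the concatenation of all previous feature maps
    # as its input, so earlier features are reused by later layers.
    def __init__(self, in_ch, growth, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)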

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head, to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost for detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference time are largely reduced.
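For illustration, the ROI pooling operation is available in torchvision; the following sketch (with arbitrary example boxes and an assumed spatial scale) shows how fixed-size features are pooled for each region proposal from a single shared feature map:

import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)        # features of the entire image
# Each row: (batch_index, x1, y1, x2, y2) in the original image coordinates.
proposals = torch.tensor([[0, 10., 20., 200., 180.],
                          [0, 50., 60., 150., 160.]])
# spatial_scale maps image coordinates to feature-map coordinates,
# e.g. 50/400 = 0.125 if the input image were 400x400 (an assumption here).
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=0.125)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])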

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined


anchors were used at each location of the generated feature map to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicted the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require region feature extraction by using the ROI pooling operation, which is time-consuming.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios are predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicted the locations of the center points of the objects. Specifically, a dense heat map was predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicted the bounding box size (height and width) for each point in the heat map.
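A common way to decode such a center-point heat map is to keep only the local maxima (via max pooling) and take the top-scoring locations; the sketch below illustrates this decoding step only, not the full CenterNet post-processing:

import torch
import torch.nn.functional as F

def decode_centers(heatmap, k=10):
    # heatmap: (num_classes, H, W) with values in [0, 1].
    # A point survives only if it equals the maximum in its 3x3 neighbourhood.
    pooled = F.max_pool2d(heatmap.unsqueeze(0), kernel_size=3, stride=1, padding=1).squeeze(0)
    peaks = heatmap * (heatmap == pooled).float()
    scores, idx = peaks.flatten().topk(k)
    num_classes, H, W = heatmap.shape
    cls = idx // (H * W)          # class index of each peak
    ys = (idx % (H * W)) // W     # row of each peak
    xs = idx % W                  # column of each peak
    return scores, cls, ys, xs

scores, cls, ys, xs = decode_centers(torch.rand(80, 128, 128), k=5)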


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames from the given videos as input and outputted the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance of the action recognition task, which demonstrates that appearance and motion information are complementary to each other.
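A minimal sketch of this late fusion step, assuming the spatial and temporal networks each output class logits for a video (the equal fusion weights are an assumption for illustration):

import torch

def two_stream_fusion(rgb_logits, flow_logits, w_rgb=0.5, w_flow=0.5):
    # Weight and sum the softmax scores of the spatial (RGB) and
    # temporal (optical flow) networks to obtain the final prediction.
    rgb_prob = torch.softmax(rgb_logits, dim=-1)
    flow_prob = torch.softmax(flow_logits, dim=-1)
    fused = w_rgb * rgb_prob + w_flow * flow_prob
    return fused.argmax(dim=-1), fused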

In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights on how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and


randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and then a consensus function was applied to the output of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from the warped flow and the RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization help prevent the framework from overfitting the training samples.
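The segment-based sampling scheme can be sketched as follows (a simplified training-time version that samples one frame index per segment and assumes the video has at least as many frames as segments):

import random

def sample_segment_frames(num_frames, num_segments=3):
    # Divide the video into equal-length segments and randomly pick one
    # frame index from each segment to represent that segment.
    # Assumes num_frames >= num_segments.
    seg_len = num_frames // num_segments
    indices = []
    for s in range(num_segments):
        start = s * seg_len
        end = num_frames if s == num_segments - 1 else (s + 1) * seg_len
        indices.append(random.randrange(start, end))
    return indices

print(sample_segment_frames(90, num_segments=3))  # e.g. [12, 47, 85]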

2D convolutional neural networks have proven their effectiveness on the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame, and then tracked the proposals with high scores by using a tracking-by-detection method. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from the human body parts can be exploited. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these


two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes that form a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than the action tubes linked from frame-level detections because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP is able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions in each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, which took the proposals and their overlap scores as inputs, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce


the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared to other state-of-the-art methods.
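The grouping step can be illustrated with a simple sketch that merges consecutive snippets whose actionness scores exceed a threshold into candidate proposals; the actual method additionally tolerates short gaps and uses multiple thresholds, which are omitted here:

def group_actionness(scores, threshold=0.5):
    # scores: per-snippet actionness scores in temporal order.
    # Returns (start, end) snippet indices of consecutive high-score runs.
    proposals, start = [], None
    for t, s in enumerate(scores):
        if s >= threshold and start is None:
            start = t
        elif s < threshold and start is not None:
            proposals.append((start, t - 1))
            start = None
    if start is not None:
        proposals.append((start, len(scores) - 1))
    return proposals

print(group_actionness([0.1, 0.8, 0.9, 0.7, 0.2, 0.6, 0.9], 0.5))
# [(1, 3), (5, 6)]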

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors, and then detection heads consisting of boundary regressors and action classifiers are used to refine the locations and predict the action classes for the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot MultiBox Detector [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the


prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision and language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors


to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision and language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows


the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on the observation of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal


PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatial temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving the three tasks from different perspectives. First, for spatial localization, the state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot MultiBox Detector (SSD) in [67, 26].

Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization is to form action tubes from per-frame


detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.

Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve the action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.

Please note that most appearance and motion fusion approaches in


the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance for another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation


[Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises of two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).]


by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundary is further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate the candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.


The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
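The proposal set update described above can be sketched as follows, where standard NMS (torchvision's implementation) with a lower threshold is applied to the boxes coming from the other stream; the box format and the threshold value are illustrative assumptions:

import torch
from torchvision.ops import nms

def update_proposals(own_boxes_prev, own_scores_prev,
                     other_boxes_cur, other_scores_cur,
                     nms_thresh_other=0.3):
    # Keep the detection boxes of the current stream from the previous stage,
    # and add the detection boxes from the other stream at the current stage
    # after removing redundant ones with a lower NMS threshold.
    # Boxes are (N, 4) tensors in (x1, y1, x2, y2) format with per-box scores.
    keep = nms(other_boxes_cur, other_scores_cur, nms_thresh_other)
    return torch.cat([own_boxes_prev, other_boxes_cur[keep]], dim=0)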

With our approach, the diversity of region proposals for one stream is enhanced by using the complementary boxes from another stream, which helps reduce the number of missing bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the


new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2, C^i_3, C^i_4, C^i_5$, where $i \in \{RGB, Flow\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2, U^i_3, U^i_4, U^i_5$.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which are


insufficient for the features from the two streams to exchange information from one stream to another and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge these two streams, so that they help each other for feature refinement.

We pass messages between the feature maps in the bottom-up pathway of the two streams. Denote $l$ as the index for the set of feature maps $\{C^i_2, C^i_3, C^i_4, C^i_5\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$C^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$ and then uses the extracted message to improve the features $C^{RGB}_l$.

The output of $f_{\theta}(\cdot)$ has to produce feature maps with the same number of channels and the same resolution as $C^{\text{RGB}}_l$ and $C^{\text{Flow}}_l$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{\text{RGB}}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{\text{RGB}}_l$ in the top-down pathway are generated accordingly.
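To make the design concrete, the following is a minimal PyTorch sketch of such a two-layer message-passing block; the channel-reduction ratio and the 2D feature-map layout are our own assumptions for illustration, not values specified in this thesis.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """Two-layer message-passing block f_theta: a 1x1 conv reduces the channel
    dimension, ReLU adds non-linearity, and a second 1x1 conv restores the
    original number of channels; the message is added element-wise (Eqn. (3.1))."""
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, c_flow, c_rgb):
        # Improved RGB feature: message extracted from the flow feature is
        # added element-wise to the original RGB feature.
        return self.f(c_flow) + c_rgb

# Usage: refine the RGB feature map C^RGB_l with the flow feature map C^Flow_l.
mp = MessagePassing(channels=256)
c_rgb = torch.randn(1, 256, 28, 28)
c_flow = torch.randn(1, 256, 28, 28)
c_rgb_refined = mp(c_flow, c_rgb)
```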

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.


Denote the image-level feature map sets for the RGB and flow streams by $I^{\text{RGB}}$ and $I^{\text{Flow}}$, respectively. They are used to extract the features $F^{\text{RGB}}_t$ and $F^{\text{Flow}}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI-feature $F^{\text{Flow}}_t$ of the flow stream is used to help improve the ROI-feature $F^{\text{RGB}}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $F^{\text{RGB}}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI-feature $F^{\text{RGB}}_t$ of the RGB stream is used to help improve the ROI-feature $F^{\text{Flow}}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI-features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample used to train the RPN in our PCSC method is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region proposal-level cooperation process.
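As an illustration, the label assignment rule described above could be implemented as follows; the box format (x1, y1, x2, y2) and the helper names are assumptions made for this sketch.

```python
import numpy as np

def box_iou(box, gt_boxes):
    """IoU between one proposal and an array of ground-truth boxes (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], gt_boxes[:, 0])
    y1 = np.maximum(box[1], gt_boxes[:, 1])
    x2 = np.minimum(box[2], gt_boxes[:, 2])
    y2 = np.minimum(box[3], gt_boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (box[2] - box[0]) * (box[3] - box[1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    return inter / (area_p + area_g - inter)

def assign_labels(proposals, gt_boxes, thresh=0.5):
    """Label a proposal positive (1) if its IoU with any ground-truth box exceeds
    the threshold, and negative (0) otherwise; the same rule is applied to the
    additional proposals coming from the assistant stream."""
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    labels = []
    for p in proposals:
        p = np.asarray(p, dtype=float)
        max_iou = box_iou(p, gt_boxes).max() if len(gt_boxes) else 0.0
        labels.append(1 if max_iou > thresh else 0)
    return np.array(labels)
```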

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features $F^i_t$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization; it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. Therefore, these samples are "confusing samples" and are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and we can thereby refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
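A minimal sketch of this test-time refinement step is given below, assuming the per-frame actionness scores of one tube are available as a list; the window size and threshold shown here are only illustrative defaults.

```python
import numpy as np
from scipy.signal import medfilt

def refine_tube(actionness_scores, window=17, thresh=0.01):
    """Smooth the per-frame actionness scores of one action tube with a median
    filter and keep only the frames whose smoothed score passes the threshold.
    Returns a boolean mask over the frames of the tube."""
    smoothed = medfilt(np.asarray(actionness_scores, dtype=float), kernel_size=window)
    return smoothed >= thresh

# Usage: keep_mask[i] == False means the i-th box is dropped from the tube,
# which tightens the temporal boundaries of the tube.
scores = [0.02, 0.01, 0.6, 0.8, 0.9, 0.85, 0.7, 0.03, 0.0]
keep_mask = refine_tube(scores, window=5, thresh=0.05)
```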

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the framework in [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction, and detection head are introduced below.

Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

(a) SPN. In Faster-RCNN, the anchors for different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary within a relatively small range, but the temporal span of an action can change dramatically, ranging from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5], so that each anchor prediction is based on anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
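The following PyTorch sketch illustrates one possible anchor-specific SPN branch built from dilated temporal convolutions; the channel sizes and dilation rate are illustrative assumptions rather than the exact configuration used in this thesis.

```python
import torch
import torch.nn as nn

class SPNBranch(nn.Module):
    """One anchor-scale branch of the segment proposal network: two dilated
    temporal convolutions enlarge the receptive field (so that an anchor of
    scale s roughly sees s/2 context on each side), followed by two parallel
    kernel-size-1 convolutions for actionness and boundary regression."""
    def __init__(self, in_channels=512, hidden=256, dilation=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=3,
                      dilation=dilation, padding=dilation),
            nn.ReLU(inplace=True),
        )
        self.actionness = nn.Conv1d(hidden, 1, kernel_size=1)   # proposal score
        self.regression = nn.Conv1d(hidden, 2, kernel_size=1)   # start/end offsets

    def forward(self, tube_feat):          # tube_feat: (N, C, T)
        h = self.features(tube_feat)
        return torch.sigmoid(self.actionness(h)), self.regression(h)

# The full SPN stacks K such branches (one per anchor scale), each with its own
# dilation rate, so that every anchor scale sees an appropriate receptive field.
branch = SPNBranch()
score, offsets = branch(torch.randn(1, 512, 120))
```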

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with the length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
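A possible implementation sketch of this segment feature construction is shown below, where adaptive max pooling is used as a simple stand-in for the temporal ROI-Pooling operator; the fixed output length is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def temporal_roi_pool(tube_feat, t_start, t_end, out_len):
    """Pool the features of one temporal interval [t_start, t_end) of an action
    tube (shape (C, T)) to a fixed temporal length via adaptive max pooling,
    a simple stand-in for temporal ROI-Pooling."""
    T = tube_feat.shape[1]
    t_start = min(max(int(t_start), 0), T - 1)
    t_end = min(int(t_end), T)
    segment = tube_feat[:, t_start:max(t_end, t_start + 1)]
    return F.adaptive_max_pool1d(segment.unsqueeze(0), out_len).squeeze(0)

def segment_features(tube_feat, t_start, t_end, out_len=8):
    """Build F_seg = concat(F_start, F_center, F_end) for one segment proposal:
    the center feature covers the proposal itself, while the start/end features
    cover windows of length 2*l_s/5 centered at the two boundaries."""
    ls = t_end - t_start
    ctx = ls * 2.0 / 5.0
    f_center = temporal_roi_pool(tube_feat, t_start, t_end, out_len)
    f_start = temporal_roi_pool(tube_feat, t_start - ctx / 2, t_start + ctx / 2, out_len)
    f_end = temporal_roi_pool(tube_feat, t_end - ctx / 2, t_end + ctx / 2, out_len)
    return torch.cat([f_start, f_center, f_end], dim=1)   # (C, 3 * out_len)

# Usage on a dummy tube feature of 200 frames with 256 channels.
f_seg = segment_features(torch.randn(256, 200), t_start=40, t_end=120)
```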

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of the segments.

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1 × 1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
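The two-stage procedure can be summarized by the following sketch, in which the segment feature extraction, message passing, detection head and NMS modules are passed in as placeholder callables; they are not concrete implementations from this thesis.

```python
def temporal_pcsc(init_rgb_segs, init_flow_segs, rgb_feat, flow_feat,
                  pool_features, message_passing, detection_head, nms):
    """Two-stage T-PCSC sketch. Stage 1 refines the RGB stream with the help of
    the flow stream, Stage 2 refines the flow stream with the help of the RGB
    stream, and the per-stage outputs are merged by NMS. The callable arguments
    are hypothetical placeholders for the modules described in this chapter."""
    # Stage 1: proposal cooperation (union of the two initial segment sets),
    # then flow -> RGB feature cooperation, then the RGB detection head.
    proposals_1 = init_rgb_segs + init_flow_segs
    f_rgb = pool_features(rgb_feat, proposals_1)
    f_flow = pool_features(flow_feat, proposals_1)
    f_rgb = message_passing(src=f_flow, dst=f_rgb)
    segs_stage1 = detection_head(f_rgb, proposals_1, stream="RGB")

    # Stage 2: combine the flow initial segments with the Stage-1 outputs,
    # then RGB -> flow feature cooperation, then the flow detection head.
    proposals_2 = init_flow_segs + segs_stage1
    f_rgb = pool_features(rgb_feat, proposals_2)
    f_flow = pool_features(flow_feat, proposals_2)
    f_flow = message_passing(src=f_rgb, dst=f_flow)
    segs_stage2 = detection_head(f_flow, proposals_2, stream="Flow")

    # Final action segments: merge the per-stage outputs and remove redundancy.
    return nms(segs_stage1 + segs_stage2, iou_threshold=0.5)
```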


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with the spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on the three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap to the ground-truth boxes. The MABO is calculated by averaging these per-class average IoU scores over all classes.


To evaluate the classification performance of our method, we report the average confidence score under different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops by 10% at the 5th epoch and by another 10% at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths l_t for these length ranges are set to 15, 30, 60, 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops by 10% at the 4th epoch and by another 10% at the 5th epoch.
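For reference, the hyper-parameters listed above could be grouped into a single configuration structure such as the following; the key names are ours and only illustrative.

```python
# Illustrative grouping of the hyper-parameters listed above; key names are ours.
TEMPORAL_REFINEMENT_CFG = {
    "anchor_scales": [6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228],          # K = 11
    "length_ranges": [(20, 80), (80, 160), (160, 320), (320, float("inf"))],  # M = 4
    "feature_lengths": [15, 30, 60, 120],   # l_t for each length range
    "nms_threshold": 0.5,
    "actionness": {"median_filter_window": 17, "score_threshold": 0.01},
    "spn_batch_size": 128,
    "detection_head_batch_size": 32,
    "epochs": 5,
    "initial_lr": 0.001,
}
```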

3.4.2 Comparison with the State-of-the-art Methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method | Streams | mAPs
Weinzaepfel, Harchaoui and Schmid [83] | RGB+Flow | 35.8
Peng and Schmid [54] | RGB+Flow | 65.7
Ye, Yang and Tian [94] | RGB+Flow | 67.0
Kalogeiton et al. [26] | RGB+Flow | 69.5
Gu et al. [19] | RGB+Flow | 76.3
Faster R-CNN + FPN (RGB) [38] | RGB | 73.2
Faster R-CNN + FPN (Flow) [38] | Flow | 64.7
Faster R-CNN + FPN [38] | RGB+Flow | 75.5
PCSC (Ours) | RGB+Flow | 79.2

Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for the actionness detector and "ASD" is for the action segment detector.

Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | δ=0.5:0.95
Weinzaepfel et al. [83] | RGB+Flow | 46.8 | - | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 47.6 | - | - | -
Peng and Schmid [54] | RGB+Flow | 73.5 | 32.1 | 2.7 | 7.3
Saha et al. [60] | RGB+Flow | 66.6 | 36.4 | 0.8 | 14.4
Yang, Gao and Nevatia [93] | RGB+Flow | 73.5 | 37.8 | - | -
Singh et al. [67] | RGB+Flow | 73.5 | 46.3 | 15.0 | 20.4
Ye, Yang and Tian [94] | RGB+Flow | 76.2 | - | - | -
Kalogeiton et al. [26] | RGB+Flow | 76.5 | 49.2 | 19.7 | 23.4
Li et al. [31] | RGB+Flow | 77.9 | - | - | -
Gu et al. [19] | RGB+Flow | - | 59.9 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 77.5 | 51.3 | 14.2 | 21.9
Faster R-CNN + FPN (Flow) [38] | Flow | 72.7 | 44.2 | 7.4 | 17.2
Faster R-CNN + FPN [38] | RGB+Flow | 80.1 | 53.2 | 15.9 | 23.7
PCSC + AD [69] | RGB+Flow | 84.3 | 61.0 | 23.0 | 27.8
PCSC + ASD + AD + T-PCSC (Ours) | RGB+Flow | 84.7 | 63.1 | 23.6 | 28.9

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated by our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method | Streams | mAPs
Peng and Schmid [54] | RGB+Flow | 58.5
Ye, Yang and Tian [94] | RGB+Flow | 63.2
Kalogeiton et al. [26] | RGB+Flow | 65.7
Hou, Chen and Shah [21] | RGB | 61.3
Gu et al. [19] | RGB+Flow | 73.3
Sun et al. [72] | RGB+Flow | 77.9
Faster R-CNN + FPN (RGB) [38] | RGB | 64.7
Faster R-CNN + FPN (Flow) [38] | Flow | 68.4
Faster R-CNN + FPN [38] | RGB+Flow | 70.2
PCSC (Ours) | RGB+Flow | 74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | δ=0.5:0.95
Gkioxari and Malik [18] | RGB+Flow | - | 53.3 | - | -
Wang et al. [79] | RGB+Flow | - | 56.4 | - | -
Weinzaepfel et al. [83] | RGB+Flow | 63.1 | 60.7 | - | -
Saha et al. [60] | RGB+Flow | 72.6 | 71.5 | 43.3 | 40.0
Peng and Schmid [54] | RGB+Flow | 74.1 | 73.1 | - | -
Singh et al. [67] | RGB+Flow | 73.8 | 72.0 | 44.5 | 41.6
Kalogeiton et al. [26] | RGB+Flow | 74.2 | 73.7 | 52.1 | 44.8
Ye, Yang and Tian [94] | RGB+Flow | 75.8 | 73.8 | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 78.2 | 73.4 | - | -
Hou, Chen and Shah [21] | RGB | 78.4 | 76.9 | - | -
Gu et al. [19] | RGB+Flow | - | 78.6 | - | -
Sun et al. [72] | RGB+Flow | - | 80.1 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 74.5 | 73.9 | 54.1 | 44.2
Faster R-CNN + FPN (Flow) [38] | Flow | 77.9 | 76.1 | 56.1 | 46.3
Faster R-CNN + FPN [38] | RGB+Flow | 79.1 | 78.5 | 57.2 | 47.6
PCSC (Ours) | RGB+Flow | 82.6 | 82.2 | 63.1 | 52.8

At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than the very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage | PCSC w/o feature cooperation | PCSC
0 | 75.5 | 78.1
1 | 76.1 | 78.6
2 | 76.4 | 78.8
3 | 76.7 | 79.1
4 | 76.7 | 79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector, while "ASD" is for the action segment detector.

Method | Video mAP (%)
Faster R-CNN + FPN [38] | 53.2
PCSC | 56.4
PCSC + AD | 61.0
PCSC + ASD (single branch) | 58.4
PCSC + ASD | 60.7
PCSC + ASD + AD | 61.9
PCSC + ASD + AD + T-PCSC | 63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage | Video mAP (%)
0 | 61.9
1 | 62.8
2 | 63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from both the RGB and the flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, this improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
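The stage-wise combination described above can be sketched with torchvision's NMS as follows; the tensor layout is an assumption made for illustration.

```python
import torch
from torchvision.ops import nms

def merge_stage_outputs(stage_boxes, stage_scores, iou_threshold=0.5):
    """Output at stage t = NMS over the union of the detected boxes from
    stages 0, 1, ..., t. `stage_boxes` is a list of (N_i, 4) tensors and
    `stage_scores` a list of (N_i,) tensors, one pair per stage."""
    boxes = torch.cat(stage_boxes, dim=0)
    scores = torch.cat(stage_scores, dim=0)
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]
```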

Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off its key components; the video-level mAPs at the IoU threshold δ = 0.5 are reported. Specifically, we investigate five configurations for temporal boundary refinement. In "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization and the action segment detector for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detector with a single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC is already able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results can be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage | LoR (%)
1 | 61.1
2 | 71.9
3 | 95.2
4 | 95.4

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 when using our temporal PCSC method with different numbers of stages, in order to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC + ASD + AD" method; Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.

Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector on the UCF-101-24 dataset.

Stage | MABO (%)
1 | 84.1
2 | 84.3
3 | 84.5
4 | 84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ | Stage 1 | Stage 2 | Stage 3 | Stage 4
0.5 | 55.3 | 55.7 | 56.3 | 57.0
0.6 | 62.1 | 63.2 | 64.3 | 65.5
0.7 | 68 | 68.5 | 69.4 | 69.5
0.8 | 73.9 | 74.5 | 74.8 | 75.1
0.9 | 78.4 | 78.4 | 79.5 | 80.1

Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2 | 3 | 4
Detection speed (FPS) | 3.1 | 2.9 | 2.8 | 2.6 | 2.4

Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2
Detection speed (VPS) | 2.1 | 1.7 | 1.5


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

To investigate how the similarity of the proposals generated by the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from the two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
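For clarity, the LoR statistic can be computed as in the following numpy sketch, assuming the proposals of each stream are given as (x1, y1, x2, y2) arrays.

```python
import numpy as np

def pairwise_iou(a, b):
    """Pairwise IoU matrix between two sets of boxes a (N, 4) and b (M, 4)."""
    x1 = np.maximum(a[:, None, 0], b[None, :, 0])
    y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2])
    y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def large_overlap_ratio(rgb_proposals, flow_proposals, thresh=0.7):
    """LoR: fraction of RGB-stream proposals whose maximum IoU (mIoU) over all
    flow-stream proposals is larger than the threshold."""
    if len(flow_proposals) == 0:
        return 0.0
    miou = pairwise_iou(rgb_proposals, flow_proposals).max(axis=1)
    return float((miou > thresh).mean())
```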

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, a clear improvement after each stage can be observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.

Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38] and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.

Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c); however, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missing detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missing detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5. We compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases where our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections in background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of work, i.e., anchor-based and frame-based approaches, in the field of temporal action localization, but each line of work is inherently limited to a certain detection granularity and cannot simultaneously achieve a high recall rate and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well-known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information between the two modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, temporal action localization methods can simply employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories together with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods from the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.

4.2 Our Approach

In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2), as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as V = It isin RHtimesWtimes3Tt=1 where It is

an RGB frame at time t with the height of H and the width of W Weseek to detect the action categories yiN

i=1 and their temporal durations(tis tie)N

i=1 in this untrimmed video where N is the total number ofdetected action instances tis and tie indicate the starting and endingtime of each detected action yi

The proposed progressive cross-granularity cooperation framework builds upon the effective cooperation strategy between the anchor-based localization methods and the frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage, and adopts the same structure as that in TAL-Net [5]. This module is based on the base features $B^{RGB}$ and $B^{Flow}$, and each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $\mathcal{P}^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$, $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1\times 1$ convolutions together with the sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
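To make the pairing and filtering step concrete, the short sketch below pairs confident starting and ending positions into candidate segments and removes redundant pairs with a simple temporal NMS. The probability threshold, maximum duration, scoring rule and function names are illustrative assumptions rather than the exact implementation used in this chapter.

```python
import numpy as np

def pair_start_end(start_prob, end_prob, thr=0.5, max_len=64):
    """Pair confident starting/ending positions into candidate segments.
    start_prob, end_prob: 1D arrays of per-frame probabilities (length T)."""
    starts = np.where(start_prob >= thr)[0]
    ends = np.where(end_prob >= thr)[0]
    proposals = []
    for s in starts:
        for e in ends:
            if 0 < e - s <= max_len:                 # keep valid durations only
                score = start_prob[s] * end_prob[e]  # simple confidence score
                proposals.append((s, e, score))
    return proposals

def temporal_nms(proposals, tiou_thr=0.7):
    """Greedy non-maximum suppression on temporal segments."""
    proposals = sorted(proposals, key=lambda p: p[2], reverse=True)
    keep = []
    for s, e, score in proposals:
        redundant = False
        for ks, ke, _ in keep:
            inter = max(0, min(e, ke) - max(s, ks))
            union = max(e, ke) - min(s, ks)
            if union > 0 and inter / union > tiou_thr:
                redundant = True
                break
        if not redundant:
            keep.append((s, e, score))
    return keep

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    start_prob, end_prob = rng.random(100), rng.random(100)
    print(len(temporal_nms(pair_start_end(start_prob, end_prob))))
```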

Progressive Cross-granularity Cooperation Framework.


Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better representation, when $n = 1$, the initial proposals $\mathcal{S}_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $\mathcal{S}_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


Given the two-stream features and proposals produced by the preceding two proposal generation methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously from the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and from the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage will output the RGB-stream action segments $\mathcal{S}_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, as well as update the flow-stream frame-based features as $Q_n^{Flow}$. These outputs will be used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels as well as between the two temporal action localization schemes.

Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages will be faithfully propagated to the subsequent AFC stages; thus the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-Frame Cooperation Module

The Anchor-Frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we will present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$, as the input, as shown in Sec. 4.2.1 and Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; the network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as
$$Q_n^{Flow} = \mathcal{M}_{anchor\rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$
where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor\rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1\times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can produce more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can therefore generate better frame-based proposals $\mathcal{Q}_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor\rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}_n^{RGB}$.
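A minimal PyTorch sketch of the message passing operation in Eq. (4.1) is given below: the anchor-based feature is transformed by two stacked $1\times 1$ convolutions with ReLU and added element-wise to the frame-based feature. The channel size and the module name are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class AnchorToFrameMessagePassing(nn.Module):
    """Sketch of M_{anchor->frame} in Eq. (4.1): two stacked 1x1 convolutions
    with ReLU, followed by element-wise addition with the frame-based feature."""
    def __init__(self, channels=512):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, anchor_feat, frame_feat):
        # anchor_feat, frame_feat: (batch, channels, T)
        return self.transform(anchor_feat) + frame_feat  # Q_n = M(P) (+) Q_{n-2}

# usage sketch with random tensors
mp = AnchorToFrameMessagePassing(channels=512)
p_flow = torch.randn(1, 512, 128)   # anchor-based flow-stream feature
q_flow = torch.randn(1, 512, 128)   # frame-based flow-stream feature
q_flow_new = mp(p_flow, q_flow)     # improved frame-based feature
print(q_flow_new.shape)
```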

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q_n^{Flow}$. We then combine $\mathcal{Q}_n^{Flow}$ together with the two-stream anchor-based proposals $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $\mathcal{P}_n^{RGB} = \mathcal{S}_{n-2}^{RGB} \cup \mathcal{S}_{n-1}^{Flow} \cup \mathcal{Q}_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}_n^{Flow} = \mathcal{S}_{n-2}^{Flow} \cup \mathcal{S}_{n-1}^{RGB} \cup \mathcal{Q}_n^{RGB}$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to Equation (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{flow\rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{rgb\rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action categories for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SoI) pooling [5] with respect to the updated proposals $\mathcal{P}_n^{RGB}$/$\mathcal{P}_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$/$P_n^{Flow}$. To increase the perceptual aperture around the action boundaries, we add two more SoI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work) and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.

4.2.4 Training Details

At each AFC stage, the training losses are used at both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: one is the cross-entropy loss for the action classification task, and the other is the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which indicates the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
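A hedged sketch of how the per-stage losses could be summed is given below: cross-entropy for action classification and smooth-L1 for boundary regression in the anchor-based branch, plus cross-entropy terms for the frame-wise starting/ending predictions in the frame-based branch. The tensor shapes and the use of a plain sum follow the description above; everything else (e.g., binary targets per frame) is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def stage_loss(cls_logits, cls_labels,
               reg_pred, reg_target,
               start_prob, start_label, end_prob, end_label):
    """Sum of the anchor-based and frame-based losses for one AFC stage."""
    # anchor-based branch: action classification + boundary regression
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_target)
    # frame-based branch: frame-wise starting/ending prediction
    loss_start = F.binary_cross_entropy(start_prob, start_label)
    loss_end = F.binary_cross_entropy(end_prob, end_label)
    return loss_cls + loss_reg + loss_start + loss_end

# toy shapes: 16 proposals, 21 classes, 100 frames
loss = stage_loss(torch.randn(16, 21), torch.randint(0, 21, (16,)),
                  torch.randn(16, 2), torch.randn(16, 2),
                  torch.rand(100), torch.randint(0, 2, (100,)).float(),
                  torch.rand(100), torch.randint(0, 2, (100,)).float())
print(float(loss))   # losses from all stages would be summed for joint training
```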

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results can capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features as lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals will not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. On the other hand, the number of proposals will not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1,010 and 1,574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.

- UCF-101-24 consists of 3,207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pre-trained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlapping area, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
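For clarity, the short sketch below shows how the temporal IoU can be computed and how a detection could be matched to a ground-truth instance under a threshold δ; the mAP computation itself follows the standard interpolated-precision protocol and is omitted here. Segment formats and function names are illustrative.

```python
def temporal_iou(seg_a, seg_b):
    """tIoU between two segments given as (start, end) in seconds."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def is_correct(detection, ground_truths, delta=0.5):
    """A detection (start, end, label) counts as correct if its tIoU with some
    ground-truth instance of the same label exceeds the threshold delta."""
    s, e, label = detection
    return any(gt_label == label and temporal_iou((s, e), (gs, ge)) > delta
               for gs, ge, gt_label in ground_truths)

print(temporal_iou((10.0, 20.0), (12.0, 22.0)))            # 8 / 12 = 0.667
print(is_correct((10.0, 20.0, "ThrowDiscus"),
                 [(12.0, 22.0, "ThrowDiscus")], delta=0.5))  # True
```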

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues.


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ    0.1    0.2    0.3    0.4    0.5
SCNN [63]          47.7   43.5   36.3   28.7   19.0
REINFORCE [95]     48.9   44.0   36.0   26.4   17.1
CDC [62]             -      -    40.1   29.4   23.3
SMS [98]           51.0   45.2   36.5   27.8   17.8
TURN [15]          54.0   50.9   44.1   34.9   25.6
R-C3D [85]         54.5   51.5   44.8   35.6   28.9
SSN [103]          66.0   59.4   51.9   41.0   29.8
BSN [37]             -      -    53.5   45.0   36.9
MGG [46]             -      -    53.9   46.8   37.4
BMN [35]             -      -    56.0   47.4   38.8
TAL-Net [5]        59.8   57.1   53.2   48.5   42.8
P-GCN [99]         69.5   67.8   63.6   57.8   49.1
Ours               71.2   68.9   65.1   59.5   51.2

When using the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework that fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two types of methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method can still achieve an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively.


Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]           45.30   26.00    0.20    23.80
TCN [9]            36.44   21.15    3.90      -
R-C3D [85]         26.80     -       -        -
SSN [103]          39.12   23.48    5.49    23.98
TAL-Net [5]        38.23   18.30    1.30    20.22
P-GCN [99]         42.90   28.14    2.47    26.99
Ours               44.31   29.85    5.47    28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]           46.45   29.96    8.02    30.03
P-GCN [99]         48.26   33.16    3.27    31.11
BMN [35]           50.07   34.78    8.29    33.85
Ours               52.04   35.92    7.97    34.91

In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ   0.2    0.5    0.75   0.5:0.95
LTT [83]            46.8     -      -       -
MRTSR [54]          73.5   32.1    2.7     7.3
MSAT [60]           66.6   36.4    7.9    14.4
ROAD [67]           73.5   46.3   15.0    20.4
ACT [26]            76.5   49.2   19.7    23.4
TAD [19]              -    59.9     -       -
PCSC [69]           84.3   61.0   23.0    27.8
Ours                84.9   63.5   23.7    29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performances among five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which $\mathcal{Q}_n^{Flow}$ and $\mathcal{Q}_n^{RGB}$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features as lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ (for the RGB-stream AFC module) and $P_{n-1}^{RGB}$ and $P_{n-2}^{Flow}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions for the alternative methods "w/o CS" and "w/o CG" are also provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage          0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $\mathcal{P}_0^{RGB} \cup \mathcal{P}_0^{Flow}$ for "w/o CG" and $\mathcal{Q}_0^{RGB} \cup \mathcal{Q}_0^{Flow}$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved when the number of stages increases, and the results become stable after 2 or 3 stages.

Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                                     LoR (%)
1       $\mathcal{P}_1^{RGB}$ vs. $\mathcal{S}_0^{Flow}$    67.1
2       $\mathcal{P}_2^{Flow}$ vs. $\mathcal{P}_1^{RGB}$    79.2
3       $\mathcal{P}_3^{RGB}$ vs. $\mathcal{P}_2^{Flow}$    89.5
4       $\mathcal{P}_4^{Flow}$ vs. $\mathcal{P}_3^{RGB}$    97.1

Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets $\mathcal{P}_1^{RGB}$ at Stage 1 and $\mathcal{P}_2^{Flow}$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $\mathcal{P}_2^{Flow}$ among all proposals in $\mathcal{P}_2^{Flow}$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $\mathcal{P}_1^{RGB}$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $\mathcal{P}_2^{Flow}$) is calculated by comparing it with all proposals in the other proposal set (say $\mathcal{P}_1^{RGB}$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
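The LoR score defined above can be computed as in the following sketch: for every proposal in the later-stage set, take its maximum tIoU against all proposals in the earlier-stage set, count it as "similar" if that max-tIoU exceeds 0.7, and report the fraction of similar proposals. The function names and the (start, end) segment format are assumptions.

```python
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current_props, previous_props, thr=0.7):
    """Large-over-Ratio: fraction of proposals in current_props whose
    max-tIoU against previous_props is higher than thr."""
    if not current_props:
        return 0.0
    similar = sum(
        1 for p in current_props
        if max((temporal_iou(p, q) for q in previous_props), default=0.0) > thr
    )
    return similar / len(current_props)

# e.g. flow-stream proposals at Stage 2 vs. RGB-stream proposals at Stage 1
print(lor_score([(10, 30), (50, 80)], [(12, 31), (200, 240)]))  # 0.5
```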

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $\mathcal{P}_4^{Flow}$ and those in $\mathcal{P}_3^{RGB}$ are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal


action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) with different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
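The AR@AN evaluation described above can be sketched as follows: keep the top AN proposals per video, compute the recall of ground-truth instances at each tIoU threshold from 0.5 to 1.0 (step 0.05), and average the eleven recall rates. This is a simplified single-video illustration under assumed segment formats.

```python
import numpy as np

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall_at_an(proposals, ground_truths, an=100):
    """AR@AN for one video: recall averaged over tIoU thresholds 0.5:0.05:1.0,
    using at most the top-`an` proposals (assumed already sorted by score)."""
    props = proposals[:an]
    thresholds = np.arange(0.5, 1.0001, 0.05)      # eleven thresholds
    recalls = []
    for thr in thresholds:
        matched = sum(
            1 for gt in ground_truths
            if any(tiou(p, gt) >= thr for p in props)
        )
        recalls.append(matched / max(len(ground_truths), 1))
    return float(np.mean(recalls))

print(average_recall_at_an([(9, 21), (40, 55)], [(10, 20), (50, 70)], an=100))
```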

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown to be the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as used in the "SC" method can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of the cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.


Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by the anchor-based method [5], the frame-based method [37], and our PCG-TAL method at different numbers of stages.


Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting or ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector.


Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or whole video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model the spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using any pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations.


For the video grounding task, Zhang et al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detector. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely $V = \{V_k^{clip}\}_{k=1}^{K}$, where $V_k^{clip}$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. (b) The SVG-net generates the initial object tubes. (c) The TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding box $b_t$ of the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from all frames in this video clip to construct the clip feature $F_k^{clip} \in \mathbb{R}^{T\times HW\times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.

For the textual description, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.

Multi-modal Feature Learning. Given the visual input feature $F_k^{clip}$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules of the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as the key and value in the multi-head attention block of the textual branch.


Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F^{tv} \in \mathbb{R}^{T\times HW\times C}$ and the visual-guided textual feature, respectively.
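The following PyTorch sketch illustrates the combination-and-decomposition idea under simplified assumptions: spatial and temporal average pooling produce a combined $(T+HW)\times C$ feature, which attends to the textual tokens; the result is decomposed, replicated, added back to the input visual feature and normalized. The channel size, head count and single-layer layout are assumptions, and the pass of the combined feature to the textual branch is omitted.

```python
import torch
import torch.nn as nn

class STCDSketch(nn.Module):
    """Simplified sketch of the Spatio-temporal Combination and Decomposition
    module. Input visual feature: (T, HW, C); textual feature: (N, C)."""
    def __init__(self, channels=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, visual, textual):
        T, HW, C = visual.shape
        temporal = visual.mean(dim=1)              # spatial pooling  -> (T, C)
        spatial = visual.mean(dim=0)               # temporal pooling -> (HW, C)
        combined = torch.cat([temporal, spatial], dim=0).unsqueeze(0)  # (1, T+HW, C)
        text = textual.unsqueeze(0)                                    # (1, N, C)
        # visual branch: the combined visual feature attends to the textual tokens
        guided, _ = self.attn(query=combined, key=text, value=text)
        guided = guided.squeeze(0)
        t_feat, s_feat = guided[:T], guided[T:]                        # decomposition
        # replicate, add back to the input visual feature, then normalize
        out = visual + t_feat.unsqueeze(1).expand(T, HW, C) \
                     + s_feat.unsqueeze(0).expand(T, HW, C)
        return self.norm(out)                                          # (T, HW, C)

stcd = STCDSketch(channels=768)
print(stcd(torch.randn(8, 49, 768), torch.randn(12, 768)).shape)
```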

Spatial Localization. Given the text-guided visual feature $F^{tv}$, our SVG-net predicts the bounding box $b_t$ for each frame. We first reshape $F^{tv}$ to the size of $T \times H \times W \times C$. Taking the feature of each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H\times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions.


As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F^{tv}$ to produce the global text-guided visual feature $F^{gtv} \in \mathbb{R}^{T\times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F^{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}^{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes whose smoothed relevance scores are less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t_s^0, t_e^0)$. The initial temporal boundaries $(t_s^0, t_e^0)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B^{init} = \{b_t\}_{t=t_s^0}^{t_e^0}$.

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips that contain at least one frame with a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\tilde{x}, \tilde{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, we have $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A$ for each frame by using the Gaussian kernel


$a_{xy} = \exp\!\left(-\frac{(x-\tilde{x})^2+(y-\tilde{y})^2}{2\sigma^2}\right)$, where $a_{xy}$ is the value of $A \in \mathbb{R}^{8H\times 8W}$ at the spatial location $(x, y)$ and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:
$$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(\hat{o}^{obj}, o) + \lambda_3 L_{size}(\hat{w}_{\tilde{x}\tilde{y}}, \hat{h}_{\tilde{x}\tilde{y}}, w, h), \qquad (5.1)$$
where $\hat{A} \in \mathbb{R}^{8H\times 8W}$ is the predicted heatmap, $\hat{o}^{obj}$ is the predicted relevance score, and $\hat{w}_{\tilde{x}\tilde{y}}$ and $\hat{h}_{\tilde{x}\tilde{y}}$ are the predicted width and height of the bounding box centered at the position $(\tilde{x}, \tilde{y})$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.

5.2.3 Temporal Boundary Refinement

Accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because the overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B^{init}$) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries.


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


The frame-based branch then continues to refine the boundaries $t_s$ and $t_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., $t_s$ and $t_e$) can then be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
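The alternation between the two branches across stages can be summarized by the following short sketch (not the thesis code); anchor_branch and frame_branch are hypothetical callables standing in for the two ViLBERT-based branches described below.

# Schematic of the multi-stage refinement in TBR-net (illustrative only).
def refine_boundaries(t_s, t_e, pooled_video_feat, text_tokens,
                      anchor_branch, frame_branch, num_stages=3):
    """Alternately refine (t_s, t_e), starting from the SVG-net tube boundaries."""
    for _ in range(num_stages):
        # Anchor-based branch: regress offsets from the whole (extended) anchor.
        t_s, t_e = anchor_branch(t_s, t_e, pooled_video_feat, text_tokens)
        # Frame-based branch: re-localize the boundaries inside local windows.
        t_s, t_e = frame_branch(t_s, t_e, pooled_video_feat, text_tokens)
    return t_s, t_e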

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before and after the starting and the ending frames by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features along the temporal dimension from the pooled video feature $F_{pool}$ at the 20 sampled frames, where the pooled video feature $F_{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature $F$, together with the textual input token $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $t^1_s$ and $t^1_e$.
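A minimal sketch of this anchor feature construction is given below, assuming the pooled video feature is stored as a (T, C) tensor of per-frame spatially pooled features; the function name and the rounding of sampled frame indices are illustrative assumptions.

# Sketch (not the thesis code) of building the anchor feature F from F_pool.
import torch

def build_anchor_feature(f_pool, t_s, t_e, num_samples=20):
    """f_pool: (T, C) spatially pooled per-frame features for the whole video.
    Returns a (num_samples, C) feature for the anchor [t_s, t_e], after
    extending it by l/4 on both sides and uniformly sampling frames."""
    T = f_pool.shape[0]
    length = t_e - t_s
    start = max(0.0, t_s - length / 4.0)          # extend before the start
    end = min(float(T - 1), t_e + length / 4.0)   # extend after the end
    # Uniformly sample `num_samples` frame indices in the extended duration.
    idx = torch.linspace(start, end, steps=num_samples).round().long()
    return f_pool[idx]                            # stack along the temporal axis

# The stacked feature is then paired with the textual tokens, passed through the
# ViLBERT co-attention module, average-pooled, and fed to an MLP that regresses
# the offsets relative to (t_s, t_e).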

Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below we take the starting position as an example. Given the refined starting position $t^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [t^1_s - D/2 + 1, t^1_s + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F_{pool}$. We then take the frame-level feature $F_s$ and the textual input token $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $t^1_s$. Accordingly, we can produce the ending frame position $t^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t^1_s$ and $t^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.
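The frame-based boundary selection could be sketched as follows (starting side only); the ViLBERT module is abstracted as a text_guided_feat callable and the variable names are assumptions.

# Sketch (not the thesis code) of the frame-based starting-boundary selection.
import torch

def select_start_frame(f_pool, t_s, D, conv1d, text_guided_feat):
    """f_pool: (T, C) pooled per-frame features; D: window length (frames).
    Returns the frame inside [t_s - D/2 + 1, t_s + D/2] with the highest
    predicted probability of being the starting frame."""
    T, C = f_pool.shape
    lo = max(0, int(t_s - D // 2 + 1))
    hi = min(T, int(t_s + D // 2) + 1)
    window = torch.arange(lo, hi)
    feat = f_pool[window]                        # (D', C) frame-level feature F_s
    feat = text_guided_feat(feat)                # ViLBERT-refined, still (D', C)
    logits = conv1d(feat.t().unsqueeze(0))       # 1D conv with kernel size 1
    probs = logits.squeeze(0).squeeze(0).sigmoid()
    return window[probs.argmax()].item()         # refined starting position

# conv1d could be torch.nn.Conv1d(in_channels=C, out_channels=1, kernel_size=1);
# a separate, unshared convolution is used for the ending frame.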

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$L_{TBR} = \lambda_4 L_{anc}(\hat{\Delta t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e)$    (5.2)

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\hat{\Delta t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.
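A compact sketch of this objective is shown below; the smooth-L1 form for the offset regression and the per-frame binary cross-entropy form for $L_s$ and $L_e$ are assumptions made for illustration.

# Illustrative sketch of the TBR-net training loss in Eq. (5.2).
import torch.nn.functional as F

def tbr_loss(pred_offsets, gt_offsets, pred_start_probs, gt_start_labels,
             pred_end_probs, gt_end_labels, lambdas=(1.0, 1.0, 1.0)):
    l4, l5, l6 = lambdas
    loss_anc = F.smooth_l1_loss(pred_offsets, gt_offsets)            # offset regression
    loss_s = F.binary_cross_entropy(pred_start_probs, gt_start_labels)  # starting frame
    loss_e = F.binary_cross_entropy(pred_end_probs, gt_end_labels)      # ending frame
    return l4 * loss_anc + l5 * loss_s + l6 * loss_e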

53 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to be 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θp and the temporal length of each clip K are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
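As a rough illustration of the training schedule described above (batch size 6, initial learning rate 0.00001, decay by a factor of 10 after 50 epochs, then 10 more epochs), a PyTorch-style setup might look like the following; the optimizer choice and the names model and train_loader are assumptions.

# Illustrative training-schedule setup (not the thesis code); Adam is an assumption.
import torch

def build_optimizer_and_scheduler(model):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # initial lr 0.00001
    # Decrease the learning rate by a factor of 10 after 50 epochs,
    # then train for another 10 epochs (60 epochs in total).
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)
    return optimizer, scheduler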

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $\mathrm{vIoU} = \frac{1}{|S_U|}\sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
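For clarity, a small helper (not the official evaluation code) that computes vIoU for a single test sample under this definition could look as follows.

# vIoU for one test sample; box_iou is any function returning the IoU of two boxes.
def viou(pred_tube, gt_tube, box_iou):
    """pred_tube / gt_tube: dicts mapping frame index -> bounding box."""
    s_i = set(pred_tube) & set(gt_tube)     # intersected frames S_I
    s_u = set(pred_tube) | set(gt_tube)     # union of frames S_U
    if not s_u:
        return 0.0
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / len(s_u)

# m_vIoU is the mean of viou(...) over all test samples, and vIoU@R is the
# fraction of test samples with viou(...) > R.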

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube, so we treat it as a baseline method. By comparing the performance of this baseline method with that of our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much better than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement of our SVG-net over the baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset.


Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

Method | Declarative Sentence Grounding: m_vIoU(%), vIoU@0.3(%), vIoU@0.5(%) | Interrogative Sentence Grounding: m_vIoU(%), vIoU@0.3(%), vIoU@0.5(%)
STGRN [102]* | 19.75, 25.77, 14.60 | 18.32, 21.10, 12.83
Vanilla (ours) | 21.49, 27.69, 15.77 | 20.22, 23.17, 13.82
SVG-net (ours) | 22.87, 28.31, 16.31 | 21.57, 24.09, 14.33
SVG-net + TBR-net (ours) | 25.21, 30.99, 18.21 | 23.67, 26.26, 16.01


Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method | m_vIoU(%) | vIoU@0.3(%) | vIoU@0.5(%)
STGVT [76]* | 18.15 | 26.81 | 9.48
Vanilla (ours) | 17.94 | 25.59 | 8.75
SVG-net (ours) | 19.45 | 27.78 | 10.12
SVG-net + TBR-net (ours) | 22.41 | 30.31 | 12.07

Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved by increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3.


Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

Methods | Stages | Declarative Sentence Grounding: m_vIoU, vIoU@0.3, vIoU@0.5 | Interrogative Sentence Grounding: m_vIoU, vIoU@0.3, vIoU@0.5
SVG-net | - | 22.87, 28.31, 16.31 | 21.57, 24.09, 14.33
SVG-net + IFB-net | 1 | 23.41, 28.84, 16.69 | 22.12, 24.48, 14.57
SVG-net + IFB-net | 2 | 23.82, 28.97, 16.85 | 22.55, 24.69, 14.71
SVG-net + IFB-net | 3 | 23.77, 28.91, 16.87 | 22.61, 24.72, 14.79
SVG-net + IAB-net | 1 | 23.27, 29.57, 16.39 | 21.78, 24.93, 14.31
SVG-net + IAB-net | 2 | 23.74, 29.98, 16.52 | 22.21, 25.42, 14.59
SVG-net + IAB-net | 3 | 23.85, 30.15, 16.57 | 22.43, 25.71, 14.47
SVG-net + TBR-net | 1 | 24.32, 29.92, 17.22 | 22.69, 25.37, 15.29
SVG-net + TBR-net | 2 | 24.95, 30.75, 17.02 | 23.34, 26.03, 15.83
SVG-net + TBR-net | 3 | 25.21, 30.99, 18.21 | 23.67, 26.26, 16.01


(4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

5.4 Summary

In this chapter, we have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our work and discuss potential directions for video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level.


The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed STVGBert, a novel two-step spatio-temporal video grounding framework based on the visual-linguistic transformer, which consists of the SVG-net and the TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential future direction for video localization is to unify spatial video localization and temporal video localization into one framework.


As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more efforts are still needed in the future to combine these two branches and generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812

[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100

[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970

[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308

[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139

[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206

[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894


[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008

[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802

[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990

[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255

[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941

[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83

[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275

[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636

[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448

[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587


[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015

[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056

[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778

[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017

[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708

[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6

[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199

[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014

[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105


[28] H Law and J Deng ldquoCornernet Detecting objects as paired key-pointsrdquo In Proceedings of the European Conference on Computer Vi-sion (ECCV) 2018 pp 734ndash750

[29] C Lea M D Flynn R Vidal A Reiter and G D Hager ldquoTem-poral convolutional networks for action segmentation and detec-tionrdquo In proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2017 pp 156ndash165

[30] J Lei L Yu T L Berg and M Bansal ldquoTvqa+ Spatio-temporalgrounding for video question answeringrdquo In arXiv preprint arXiv190411574(2019)

[31] D Li Z Qiu Q Dai T Yao and T Mei ldquoRecurrent tubelet pro-posal and recognition networks for action detectionrdquo In Proceed-ings of the European conference on computer vision (ECCV) 2018pp 303ndash318

[32] G Li N Duan Y Fang M Gong D Jiang and M Zhou ldquoUnicoder-VL A Universal Encoder for Vision and Language by Cross-ModalPre-Trainingrdquo In AAAI 2020 pp 11336ndash11344

[33] L H Li M Yatskar D Yin C-J Hsieh and K-W Chang ldquoVisu-alBERT A Simple and Performant Baseline for Vision and Lan-guagerdquo In Arxiv 2019

[34] Y Liao S Liu G Li F Wang Y Chen C Qian and B Li ldquoA Real-Time Cross-modality Correlation Filtering Method for ReferringExpression Comprehensionrdquo In Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognition 2020 pp 10880ndash10889

[35] T Lin X Liu X Li E Ding and S Wen ldquoBMN Boundary-MatchingNetwork for Temporal Action Proposal Generationrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 3889ndash3898

[36] T Lin X Zhao and Z Shou ldquoSingle shot temporal action de-tectionrdquo In Proceedings of the 25th ACM international conference onMultimedia 2017 pp 988ndash996

[37] T Lin X Zhao H Su C Wang and M Yang ldquoBsn Boundarysensitive network for temporal action proposal generationrdquo In


Proceedings of the European Conference on Computer Vision (ECCV)2018 pp 3ndash19

[38] T-Y Lin P Dollar R Girshick K He B Hariharan and S Be-longie ldquoFeature Pyramid Networks for Object Detectionrdquo In TheIEEE Conference on Computer Vision and Pattern Recognition (CVPR)2017

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In European Conference on Computer Vision. Springer, 2014, pp. 740–755

[41] D Liu H Zhang F Wu and Z-J Zha ldquoLearning to assembleneural module tree networks for visual groundingrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 4673ndash4682

[42] J Liu L Wang and M-H Yang ldquoReferring expression genera-tion and comprehension via attributesrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 4856ndash4864

[43] L Liu H Wang G Li W Ouyang and L Lin ldquoCrowd countingusing deep recurrent spatial-aware networkrdquo In Proceedings ofthe 27th International Joint Conference on Artificial Intelligence AAAIPress 2018 pp 849ndash855

[44] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu andA C Berg ldquoSsd Single shot multibox detectorrdquo In European con-ference on computer vision Springer 2016 pp 21ndash37

[45] X Liu Z Wang J Shao X Wang and H Li ldquoImproving re-ferring expression grounding with cross-modal attention-guidederasingrdquo In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition 2019 pp 1950ndash1959

[46] Y Liu L Ma Y Zhang W Liu and S-F Chang ldquoMulti-granularityGenerator for Temporal Action Proposalrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2019pp 3604ndash3613


[47] J Lu D Batra D Parikh and S Lee ldquoVilbert Pretraining task-agnostic visiolinguistic representations for vision-and-languagetasksrdquo In Advances in Neural Information Processing Systems 2019pp 13ndash23

[48] G Luo Y Zhou X Sun L Cao C Wu C Deng and R Ji ldquoMulti-task Collaborative Network for Joint Referring Expression Com-prehension and Segmentationrdquo In Proceedings of the IEEECVFConference on Computer Vision and Pattern Recognition 2020 pp 10034ndash10043

[49] S Ma L Sigal and S Sclaroff ldquoLearning activity progression inlstms for activity detection and early detectionrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2016 pp 1942ndash1950

[50] B Ni X Yang and S Gao ldquoProgressively parsing interactionalobjects for fine grained action detectionrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1020ndash1028

[51] W Ouyang K Wang X Zhu and X Wang ldquoChained cascadenetwork for object detectionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2017 pp 1938ndash1946

[52] W Ouyang and X Wang ldquoSingle-pedestrian detection aided bymulti-pedestrian detectionrdquo In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition 2013 pp 3198ndash3205

[53] W Ouyang X Zeng X Wang S Qiu P Luo Y Tian H Li SYang Z Wang H Li et al ldquoDeepID-Net Object detection withdeformable part based convolutional neural networksrdquo In IEEETransactions on Pattern Analysis and Machine Intelligence 397 (2017)pp 1320ndash1334

[54] X Peng and C Schmid ldquoMulti-region two-stream R-CNN for ac-tion detectionrdquo In European Conference on Computer Vision Springer2016 pp 744ndash759

[55] L Pigou A van den Oord S Dieleman M V Herreweghe andJ Dambre ldquoBeyond Temporal Pooling Recurrence and TemporalConvolutions for Gesture Recognition in Videordquo In InternationalJournal of Computer Vision 1262-4 (2018) pp 430ndash439


[56] Z Qiu T Yao and T Mei ldquoLearning spatio-temporal represen-tation with pseudo-3d residual networksrdquo In proceedings of theIEEE International Conference on Computer Vision 2017 pp 5533ndash5541

[57] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once Unified real-time object detectionrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2016pp 779ndash788

[58] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towardsreal-time object detection with region proposal networksrdquo In Ad-vances in neural information processing systems 2015 pp 91ndash99

[59] A Sadhu K Chen and R Nevatia ldquoVideo Object Grounding us-ing Semantic Roles in Language Descriptionrdquo In Proceedings ofthe IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2020 pp 10417ndash10427

[60] S Saha G Singh M Sapienza P H S Torr and F CuzzolinldquoDeep Learning for Detecting Multiple Space-Time Action Tubesin Videosrdquo In Proceedings of the British Machine Vision Conference2016 BMVC 2016 York UK September 19-22 2016 2016

[61] P Sharma N Ding S Goodman and R Soricut ldquoConceptualcaptions A cleaned hypernymed image alt-text dataset for auto-matic image captioningrdquo In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics (Volume 1 LongPapers) 2018 pp 2556ndash2565

[62] Z Shou J Chan A Zareian K Miyazawa and S-F Chang ldquoCdcConvolutional-de-convolutional networks for precise temporal ac-tion localization in untrimmed videosrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 5734ndash5743

[63] Z Shou D Wang and S-F Chang ldquoTemporal action localizationin untrimmed videos via multi-stage cnnsrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1049ndash1058


[64] K Simonyan and A Zisserman ldquoTwo-Stream Convolutional Net-works for Action Recognition in Videosrdquo In Advances in NeuralInformation Processing Systems 27 2014 pp 568ndash576

[65] K Simonyan and A Zisserman ldquoVery deep convolutional net-works for large-scale image recognitionrdquo In arXiv preprint arXiv14091556(2014)

[66] B Singh T K Marks M Jones O Tuzel and M Shao ldquoA multi-stream bi-directional recurrent neural network for fine-grainedaction detectionrdquo In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2016 pp 1961ndash1970

[67] G Singh S Saha M Sapienza P H Torr and F Cuzzolin ldquoOn-line real-time multiple spatiotemporal action localisation and pre-dictionrdquo In Proceedings of the IEEE International Conference on Com-puter Vision 2017 pp 3637ndash3646

[68] K Soomro A R Zamir and M Shah ldquoUCF101 A dataset of 101human actions classes from videos in the wildrdquo In arXiv preprintarXiv12120402 (2012)

[69] R Su W Ouyang L Zhou and D Xu ldquoImproving Action Local-ization by Progressive Cross-stream Cooperationrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2019 pp 12016ndash12025

[70] W Su X Zhu Y Cao B Li L Lu F Wei and J Dai ldquoVL-BERTPre-training of Generic Visual-Linguistic Representationsrdquo In In-ternational Conference on Learning Representations 2020

[71] C Sun A Myers C Vondrick K Murphy and C Schmid ldquoVideobertA joint model for video and language representation learningrdquoIn Proceedings of the IEEE International Conference on Computer Vi-sion 2019 pp 7464ndash7473

[72] C Sun A Shrivastava C Vondrick K Murphy R Sukthankarand C Schmid ldquoActor-centric Relation Networkrdquo In The Euro-pean Conference on Computer Vision (ECCV) 2018

[73] S Sun J Pang J Shi S Yi and W Ouyang ldquoFishnet A versatilebackbone for image region and pixel level predictionrdquo In Ad-vances in Neural Information Processing Systems 2018 pp 762ndash772


[74] C Szegedy W Liu Y Jia P Sermanet S Reed D AnguelovD Erhan V Vanhoucke and A Rabinovich ldquoGoing deeper withconvolutionsrdquo In Proceedings of the IEEE conference on computervision and pattern recognition 2015 pp 1ndash9

[75] H Tan and M Bansal ldquoLXMERT Learning Cross-Modality En-coder Representations from Transformersrdquo In Proceedings of the2019 Conference on Empirical Methods in Natural Language Process-ing 2019

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In arXiv preprint arXiv:2011.05049 (2020)

[77] D Tran L Bourdev R Fergus L Torresani and M Paluri ldquoLearn-ing spatiotemporal features with 3d convolutional networksrdquo InProceedings of the IEEE international conference on computer vision2015 pp 4489ndash4497

[78] A Vaswani N Shazeer N Parmar J Uszkoreit L Jones A NGomez Ł Kaiser and I Polosukhin ldquoAttention is all you needrdquoIn Advances in neural information processing systems 2017 pp 5998ndash6008

[79] L Wang Y Qiao X Tang and L Van Gool ldquoActionness Estima-tion Using Hybrid Fully Convolutional Networksrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2016pp 2708ndash2717

[80] L Wang Y Xiong D Lin and L Van Gool ldquoUntrimmednets forweakly supervised action recognition and detectionrdquo In Proceed-ings of the IEEE conference on Computer Vision and Pattern Recogni-tion 2017 pp 4325ndash4334

[81] L Wang Y Xiong Z Wang Y Qiao D Lin X Tang and L VanGool ldquoTemporal segment networks Towards good practices fordeep action recognitionrdquo In European conference on computer vi-sion Springer 2016 pp 20ndash36

[82] P Wang Q Wu J Cao C Shen L Gao and A v d HengelldquoNeighbourhood watch Referring expression comprehension vialanguage-guided graph attention networksrdquo In Proceedings of the


IEEE Conference on Computer Vision and Pattern Recognition 2019pp 1960ndash1968

[83] P Weinzaepfel Z Harchaoui and C Schmid ldquoLearning to trackfor spatio-temporal action localizationrdquo In Proceedings of the IEEEinternational conference on computer vision 2015 pp 3164ndash3172

[84] S Xie C Sun J Huang Z Tu and K Murphy ldquoRethinking spa-tiotemporal feature learning Speed-accuracy trade-offs in videoclassificationrdquo In Proceedings of the European Conference on Com-puter Vision (ECCV) 2018 pp 305ndash321

[85] H Xu A Das and K Saenko ldquoR-c3d Region convolutional 3dnetwork for temporal activity detectionrdquo In Proceedings of the IEEEinternational conference on computer vision 2017 pp 5783ndash5792

[86] M Yamaguchi K Saito Y Ushiku and T Harada ldquoSpatio-temporalperson retrieval via natural language queriesrdquo In Proceedings ofthe IEEE International Conference on Computer Vision 2017 pp 1453ndash1462

[87] S Yang G Li and Y Yu ldquoCross-modal relationship inference forgrounding referring expressionsrdquo In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition 2019 pp 4145ndash4154

[88] S Yang G Li and Y Yu ldquoDynamic graph attention for referringexpression comprehensionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2019 pp 4644ndash4653

[89] X Yang X Yang M-Y Liu F Xiao L S Davis and J KautzldquoSTEP Spatio-Temporal Progressive Learning for Video ActionDetectionrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition (CVPR) 2019

[90] X Yang X Liu M Jian X Gao and M Wang ldquoWeakly-SupervisedVideo Object Grounding by Exploring Spatio-Temporal ContextsrdquoIn Proceedings of the 28th ACM International Conference on Multime-dia 2020 pp 1939ndash1947

[91] Z Yang T Chen L Wang and J Luo ldquoImproving One-stage Vi-sual Grounding by Recursive Sub-query Constructionrdquo In arXivpreprint arXiv200801059 (2020)


[92] Z Yang B Gong L Wang W Huang D Yu and J Luo ldquoAfast and accurate one-stage approach to visual groundingrdquo InProceedings of the IEEE International Conference on Computer Vision2019 pp 4683ndash4693

[93] Z Yang J Gao and R Nevatia ldquoSpatio-temporal action detec-tion with cascade proposal and location anticipationrdquo In arXivpreprint arXiv170800042 (2017)

[94] Y Ye X Yang and Y Tian ldquoDiscovering spatio-temporal actiontubesrdquo In Journal of Visual Communication and Image Representa-tion 58 (2019) pp 515ndash524

[95] S Yeung O Russakovsky G Mori and L Fei-Fei ldquoEnd-to-endlearning of action detection from frame glimpses in videosrdquo InProceedings of the IEEE Conference on Computer Vision and PatternRecognition 2016 pp 2678ndash2687

[96] L Yu Z Lin X Shen J Yang X Lu M Bansal and T L BergldquoMattnet Modular attention network for referring expression com-prehensionrdquo In Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition 2018 pp 1307ndash1315

[97] J Yuan B Ni X Yang and A A Kassim ldquoTemporal action local-ization with pyramid of score distribution featuresrdquo In Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recogni-tion 2016 pp 3093ndash3102

[98] Z Yuan J C Stroud T Lu and J Deng ldquoTemporal action local-ization by structured maximal sumsrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 3684ndash3692

[99] R Zeng W Huang M Tan Y Rong P Zhao J Huang and CGan ldquoGraph Convolutional Networks for Temporal Action Lo-calizationrdquo In Proceedings of the IEEE International Conference onComputer Vision 2019 pp 7094ndash7103

[100] W Zhang W Ouyang W Li and D Xu ldquoCollaborative and Ad-versarial Network for Unsupervised domain adaptationrdquo In Pro-ceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition 2018 pp 3801ndash3809


[101] Z Zhang Z Zhao Z Lin B Huai and J Yuan ldquoObject-AwareMulti-Branch Relation Networks for Spatio-Temporal Video Ground-ingrdquo In Proceedings of the Twenty-Ninth International Joint Confer-ence on Artificial Intelligence 2020 pp 1069ndash1075

[102] Z Zhang Z Zhao Y Zhao Q Wang H Liu and L Gao ldquoWhereDoes It Exist Spatio-Temporal Video Grounding for Multi-FormSentencesrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition 2020 pp 10668ndash10677

[103] Y Zhao Y Xiong L Wang Z Wu X Tang and D Lin ldquoTemporalaction detection with structured segment networksrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2017pp 2914ndash2923

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In arXiv preprint arXiv:1904.07850 (2019)

[105] M Zolfaghari G L Oliveira N Sedaghat and T Brox ldquoChainedmulti-stream networks exploiting pose motion and appearancefor action classification and detectionrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 2904ndash2913


For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor Prof. Dong Xu for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborated works. Through useful discussions with them, I worked out many difficult problems and am also encouraged to explore the unknown in the future.

Last but not least I sincerely thank my parents and my wife fromthe bottom of my heart for everything

Rui SU

University of Sydney, June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.

Contents

Abstract
Declaration
Acknowledgements
List of Publications
List of Figures
List of Tables

1 Introduction
  1.1 Spatial-temporal Action Localization
  1.2 Temporal Action Localization
  1.3 Spatio-temporal Visual Grounding
  1.4 Thesis Outline

2 Literature Review
  2.1 Background
    2.1.1 Image Classification
    2.1.2 Object Detection
    2.1.3 Action Recognition
  2.2 Spatio-temporal Action Localization
  2.3 Temporal Action Localization
  2.4 Spatio-temporal Visual Grounding
  2.5 Summary

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization
  3.1 Background
    3.1.1 Spatial temporal localization methods
    3.1.2 Two-stream R-CNN
  3.2 Action Detection Model in Spatial Domain
    3.2.1 PCSC Overview
    3.2.2 Cross-stream Region Proposal Cooperation
    3.2.3 Cross-stream Feature Cooperation
    3.2.4 Training Details
  3.3 Temporal Boundary Refinement for Action Tubes
    3.3.1 Actionness Detector (AD)
    3.3.2 Action Segment Detector (ASD)
    3.3.3 Temporal PCSC (T-PCSC)
  3.4 Experimental results
    3.4.1 Experimental Setup
    3.4.2 Comparison with the State-of-the-art methods
      Results on the UCF-101-24 Dataset
      Results on the J-HMDB Dataset
    3.4.3 Ablation Study
    3.4.4 Analysis of Our PCSC at Different Stages
    3.4.5 Runtime Analysis
    3.4.6 Qualitative Results
  3.5 Summary

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization
  4.1 Background
    4.1.1 Action Recognition
    4.1.2 Spatio-temporal Action Localization
    4.1.3 Temporal Action Localization
  4.2 Our Approach
    4.2.1 Overall Framework
    4.2.2 Progressive Cross-Granularity Cooperation
    4.2.3 Anchor-frame Cooperation Module
    4.2.4 Training Details
    4.2.5 Discussions
  4.3 Experiment
    4.3.1 Datasets and Setup
    4.3.2 Comparison with the State-of-the-art Methods
    4.3.3 Ablation study
    4.3.4 Qualitative Results
  4.4 Summary

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding
  5.1 Background
    5.1.1 Vision-language Modelling
    5.1.2 Visual Grounding in Images/Videos
  5.2 Methodology
    5.2.1 Overview
    5.2.2 Spatial Visual Grounding
    5.2.3 Temporal Boundary Refinement
  5.3 Experiment
    5.3.1 Experiment Setup
    5.3.2 Comparison with the State-of-the-art Methods
    5.3.3 Ablation Study
  5.4 Summary

6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work

Bibliography

List of Figures

1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6

31 (a) Overview of our Cross-stream Cooperation framework(four stages are used as an example for better illustration)The region proposals and features from the flow streamhelp improve action detection results for the RGB streamat Stage 1 and Stage 3 while the region proposals and fea-tures from the RGB stream help improve action detectionresults for the flow stream at Stage 2 and Stage 4 Eachstage comprises of two cooperation modules and the de-tection head (b) (c) The details of our feature-level coop-eration modules through message passing from one streamto another stream The flow features are used to help theRGB features in (b) while the RGB features are used tohelp the flow features in (c) 33

32 Overview of two modules in our temporal boundary re-finement framework (a) the Initial Segment GenerationModule and (b) the Temporal Progressive Cross-streamCooperation (T-PCSC) Module The Initial Segment Gen-eration module combines the outputs from Actionness De-tector and Action Segment Detector to generate the initialsegments for both RGB and flow streams Our T-PCSCframework takes these initial segments as the segment pro-posals and iteratively improves the temporal action de-tection results by leveraging complementary informationof two streams at both feature and segment proposal lev-els (two stages are used as an example for better illustra-tion) The segment proposals and features from the flowstream help improve action segment detection results forthe RGB stream at Stage 1 while the segment proposalsand features from the RGB stream help improve action de-tection results for the flow stream at Stage 2 Each stagecomprises of the cooperation module and the detectionhead The segment proposal level cooperation module re-fines the segment proposals P i

t by combining the most re-cent segment proposals from the two streams while thesegment feature level cooperation module improves thefeatures Fi

t where the superscript i isin RGB Flow de-notes the RGBFlow stream and the subscript t denotesthe stage number The detection head is used to estimatethe action segment location and the actionness probability 39

33 Comparison of the single-stream frameworks of the actiondetection method (see Section 3) and our action segmentdetector (see Section 42) The modules within each bluedashed box share similar functions at the spatial domainand the temporal domain respectively 43

34 An example of visualization results from different meth-ods for one frame (a) single-stream Faster-RCNN (RGBstream) (b) single-stream Faster-RCNN (flow stream) (c)a simple combination of two-stream results from Faster-RCNN [38] and (d) our PCSC (Best viewed on screen) 57

35 Example results from our PCSC with temporal boundaryrefinement (TBR) and the two-stream Faster-RCNN methodThere is one ground-truth instance in (a) and no ground-truth instance in (b-d) In (a) both of our PCSC with TBRmethod and the two-stream Faster-RCNN method [38] suc-cessfully capture the ground-truth action instance ldquoTen-nis Swing In (b) and (c) both of our PCSC method andthe two-stream Faster-RCNN method falsely detect thebounding boxes with the class label ldquoTennis Swing for (b)and ldquoSoccer Juggling for (c) However our TBR methodcan remove the falsely detected bounding boxes from ourPCSC in both (b) and (c) In (d) both of our PCSC methodand the two-stream Faster-RCNN method falsely detectthe action bounding boxes with the class label ldquoBasket-ball and our TBR method cannot remove the falsely de-tected bounding box from our PCSC in this failure case(Best viewed on screen) 58

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments. 66

4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. 80

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5. 48

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector. 49

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5. 50

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds. 51

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset. 52

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector while "ASD" is for the action segment detector. 52

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset. 52

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset. 54

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset. 55

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset. 55

3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages. 55

3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages. 55

4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds. 74

4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75

4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75

4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds. 76

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset. 77

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset. 78

5.1 Results of different methods on the VidSTG dataset; marked entries are quoted from the original work. 99

5.2 Results of different methods on the HC-STVG dataset; marked entries are quoted from the original work. 100

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset. 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems to help users analyze the contents of the videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.

Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or a cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segment them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely challenging to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
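To make the feature-level cooperation more concrete, the snippet below gives a minimal PyTorch sketch of a cross-stream message passing block. The module name, layer sizes and bottleneck design are illustrative assumptions rather than the exact architecture used in this thesis; the sketch only shows the general idea of refining the pooled RGB region features with a transformed version of the corresponding flow region features (and vice versa).

```python
import torch
import torch.nn as nn

class CrossStreamMessagePassing(nn.Module):
    """Refine ROI features of one stream using ROI features from the other.

    A minimal sketch: the helper stream's region features are transformed by
    a small bottleneck and added to the target stream's region features.
    Layer sizes here are illustrative assumptions.
    """

    def __init__(self, channels=1024, bottleneck=256):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv2d(channels, bottleneck, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(bottleneck, channels, kernel_size=1),
        )

    def forward(self, target_feat, helper_feat):
        # Both inputs: (num_rois, C, H, W) pooled region features extracted
        # from the same set of region proposals in the two streams.
        message = self.transform(helper_feat)
        return target_feat + message


# Example: flow features help the RGB features (as at Stage 1), then the
# roles can be swapped at the next stage.
mp_rgb = CrossStreamMessagePassing()
rgb_rois = torch.randn(8, 1024, 7, 7)
flow_rois = torch.randn(8, 1024, 7, 7)
refined_rgb = mp_rgb(rgb_rois, flow_rois)  # (8, 1024, 7, 7)
```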

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and therefore can learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.

Our contributions are briefly summarized as follows:

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.
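Since temporal action localization is evaluated by how well a predicted segment overlaps with a ground-truth segment, a temporal Intersection-over-Union (tIoU) measure is commonly used (for example, in the mAP@tIoU comparisons reported later in this thesis). The helper below is a small, generic sketch of that computation; the function name is ours.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter_start = max(pred[0], gt[0])
    inter_end = min(pred[1], gt[1])
    intersection = max(0.0, inter_end - inter_start)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Example: a detected segment vs. a ground-truth action instance.
print(temporal_iou((20.4, 31.5), (21.0, 31.8)))  # ~0.92
```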

Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate the proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.

Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. (The original figure shows, for an input video, anchor-based proposals with confidence scores and frame-level actionness/starting/ending curves, together with example ground-truth, P-GCN and our detections for the action classes "Frisbee Catch", "Golf Swing" and "Throw Discus".)

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produce coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity-level complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlapping areas with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
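As a small illustration of the proposal-level cooperation idea, the sketch below pools anchor-based and frame-based segment proposals into a single candidate set. This is only a simplified stand-in under our own assumptions (proposals as (start, end, score) tuples, a plain score-based pruning rule); the exact combination and filtering strategy of our framework is described in Chapter 4.

```python
def combine_proposals(anchor_props, frame_props, max_keep=200):
    """Merge anchor-based and frame-based segment proposals.

    Each proposal is a (start, end, score) tuple. Pooling the two sets and
    keeping the top-scoring ones is a simplified stand-in for the
    proposal-level cooperation strategy described in the text.
    """
    merged = sorted(anchor_props + frame_props, key=lambda p: p[2], reverse=True)
    return merged[:max_keep]


anchor_props = [(10.2, 24.8, 0.82), (30.0, 55.0, 0.47)]  # coarse but high recall
frame_props = [(11.5, 23.9, 0.75)]                        # sharper boundaries
print(combine_proposals(anchor_props, frame_props, max_keep=2))
```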

To this end, building upon a two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to be well generalized to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, in which significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows:

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential directions for video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named the Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work has activated the research interest in using deep learning methods to solve computer vision problems.

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
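As a quick sanity check of the receptive-field argument above, the short script below (a generic illustration, not code from this thesis) computes the receptive field of a stack of stride-1 3×3 convolutions and compares the parameter counts for an arbitrary channel count C.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked convolutions with stride 1."""
    return 1 + num_layers * (kernel_size - 1)

C = 64  # an arbitrary channel count, same for input and output
print(stacked_receptive_field(2))  # 5  -> two 3x3 layers cover a 5x5 region
print(stacked_receptive_field(3))  # 7  -> three 3x3 layers cover a 7x7 region

# Three 3x3 layers vs. one 7x7 layer (weights only): fewer parameters,
# same receptive field, and two extra nonlinearities in between.
print(3 * (3 * 3 * C * C), 7 * 7 * C * C)  # 110592 vs 200704
```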

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
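The snippet below is a simplified sketch of such a multi-branch, inception-style module in PyTorch; the branch widths are illustrative assumptions and much smaller than in the actual Inception network.

```python
import torch
import torch.nn as nn

class SimpleInceptionModule(nn.Module):
    """Simplified inception-style module: parallel branches with different
    kernel sizes whose outputs are concatenated along the channel dimension.
    Branch widths are illustrative, not those of the original network."""

    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 32, kernel_size=1)
        self.branch3 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1),
        )

    def forward(self, x):
        outs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outs, dim=1)  # (N, 128, H, W)

x = torch.randn(2, 64, 28, 28)
print(SimpleInceptionModule(64)(x).shape)  # torch.Size([2, 128, 28, 28])
```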

VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
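A minimal sketch of such a shortcut (residual) block is shown below; it omits the downsampling path and other details of the full ResNet design and uses illustrative channel sizes.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the input is added back to the output of two
    convolutions, so gradients can flow through the identity shortcut."""

    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut connection

x = torch.randn(2, 64, 56, 56)
print(ResidualBlock(64)(x).shape)  # torch.Size([2, 64, 56, 56])
```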

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
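The sketch below illustrates this dense connectivity pattern within one block; the number of layers and the growth rate are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Sketch of dense connectivity: each layer receives the concatenation of
    all previous feature maps, so the channel width grows with depth."""

    def __init__(self, in_ch, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(ch),
                nn.ReLU(inplace=True),
                nn.Conv2d(ch, growth_rate, 3, padding=1, bias=False),
            ))
            ch += growth_rate

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

x = torch.randn(2, 64, 28, 28)
print(DenseBlock(64)(x).shape)  # torch.Size([2, 192, 28, 28])
```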

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time were largely reduced.
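For reference, an ROI pooling operation of this kind is available in torchvision; the toy feature map, downsampling factor and boxes below are made-up values just to show the call.

```python
import torch
from torchvision.ops import roi_pool

# A toy backbone feature map: batch of 1, 256 channels, 50x50 spatial size,
# assumed here to come from an image downsampled by a factor of 16.
feature_map = torch.randn(1, 256, 50, 50)

# Two region proposals in image coordinates: (batch_idx, x1, y1, x2, y2).
proposals = torch.tensor([[0, 32.0, 32.0, 256.0, 256.0],
                          [0, 100.0, 80.0, 400.0, 300.0]])

# Pool each proposal into a fixed 7x7 grid; spatial_scale maps image
# coordinates onto the downsampled feature map.
pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7])
```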

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors was used at each location of the generated feature map to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the pooled features based on the ROI pooling operation on the generated feature map.

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probabilities and the offsets to the anchor. YOLO is faster than Faster RCNN as it does not require region feature extraction through the time-consuming ROI pooling operation.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations on the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
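A common way to decode such a center-point heat map is to keep only local maxima (via max pooling) and read off the predicted box size at each kept location. The snippet below is a generic sketch of this decoding step under our own simplifications, not the official CenterNet code.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, k=10):
    """heatmap: (1, num_classes, H, W) center probabilities after sigmoid.
    wh: (1, 2, H, W) predicted box width/height at every location.
    Returns the top-k detections as (class, x, y, w, h, score)."""
    # Keep only local maxima: a point survives if it equals the max of its 3x3 window.
    peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)) * heatmap
    scores, idx = peaks.flatten().topk(k)
    num_classes, H, W = heatmap.shape[1:]
    cls = idx // (H * W)
    y = (idx % (H * W)) // W
    x = idx % W
    return [(int(c), int(xx), int(yy),
             float(wh[0, 0, yy, xx]), float(wh[0, 1, yy, xx]), float(s))
            for c, xx, yy, s in zip(cls, x, y, scores)]

hm = torch.rand(1, 3, 64, 64)        # random scores for 3 classes
wh = torch.rand(1, 2, 64, 64) * 20   # random box sizes
print(decode_centers(hm, wh, k=3))
```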


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames from the given videos as input and outputted the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks can further improve the performance on the action recognition task, which demonstrated that appearance and motion information are complementary to each other.

In the work of [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.
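The snippet below sketches this kind of convolutional fusion: the last-layer feature maps of the two streams are concatenated along the channel dimension and fused by a convolution. The channel sizes and the 1×1 kernel are illustrative choices, not the exact configuration studied in [12].

```python
import torch
import torch.nn as nn

class ConvFusion(nn.Module):
    """Fuse last-layer feature maps from the RGB and flow streams with a conv."""

    def __init__(self, channels=512):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, rgb_feat, flow_feat):
        stacked = torch.cat([rgb_feat, flow_feat], dim=1)  # (N, 2C, H, W)
        return self.fuse(stacked)                          # (N, C, H, W)

rgb = torch.randn(4, 512, 7, 7)
flow = torch.randn(4, 512, 7, 7)
print(ConvFusion()(rgb, flow).shape)  # torch.Size([4, 512, 7, 7])
```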

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks (TSN) to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus function was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB differences, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helped keep the framework from overfitting the training samples.
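A minimal sketch of this sparse, segment-wise sampling strategy is shown below; the real TSN implementation differs in details such as the number of segments and the test-time sampling rule.

```python
import random

def sample_segment_frames(num_frames, num_segments=3, training=True):
    """Pick one frame index per equal-length segment of a video (TSN-style
    sparse sampling): random frame during training, center frame at test time."""
    seg_len = num_frames / num_segments
    indices = []
    for s in range(num_segments):
        start, end = int(s * seg_len), int((s + 1) * seg_len)
        end = max(end, start + 1)
        indices.append(random.randrange(start, end) if training
                       else (start + end - 1) // 2)
    return indices

print(sample_segment_frames(90, num_segments=3))  # e.g. [12, 47, 85]
```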

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
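For illustration, the snippet below applies a single 3×3×3 3D convolution to a short clip tensor; the channel counts are arbitrary, and the 16×112×112 clip size simply mirrors the commonly used C3D input resolution.

```python
import torch
import torch.nn as nn

# A 3x3x3 convolution operates jointly over time, height and width.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=(3, 3, 3), padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, H, W)
features = conv3d(clip)
print(features.shape)  # torch.Size([1, 64, 16, 112, 112])
```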

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be added. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
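To make the linking step concrete, the function below is a simplified greedy sketch of frame-by-frame association based on class score and spatial overlap; the linking criteria used by the methods above are more elaborate (e.g., dynamic programming over whole videos), so this is only an illustration.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    return inter / (area(a) + area(b) - inter + 1e-6)

def link_detections(per_frame_dets, iou_thresh=0.3):
    """per_frame_dets: list over frames of lists of (box, score) for one class.
    Greedily extend a single tube by picking, in each frame, the detection with
    the best combination of score and overlap with the previous box."""
    tube, prev_box = [], None
    for dets in per_frame_dets:
        if not dets:
            continue
        if prev_box is None:
            box, score = max(dets, key=lambda d: d[1])
        else:
            box, score = max(dets, key=lambda d: d[1] + box_iou(prev_box, d[0]))
            if box_iou(prev_box, box) < iou_thresh:
                break  # stop the tube when spatial continuity is lost
        tube.append((box, score))
        prev_box = box
    return tube

frames = [[((10, 10, 50, 80), 0.9)],
          [((12, 11, 52, 82), 0.8), ((200, 5, 240, 60), 0.6)]]
print(len(link_detections(frames)))  # 2 linked detections
```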

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN frameworks used in [54] and [60], in order to remove the additional time cost brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form the representation of the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than action tubes linked from frame-level detections because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed into a classification network based on C3D to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as inputs of this localization model, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolution layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.
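A minimal sketch of this kind of actionness-based grouping is shown below: snippets whose scores exceed a threshold are merged into contiguous proposals. The actual temporal actionness grouping algorithm is more robust (e.g., a watershed-style scheme over multiple thresholds), so this is only an illustration.

```python
def group_actionness(scores, threshold=0.5, snippet_len=1.0):
    """Group consecutive snippets with high actionness scores into proposals.

    scores: per-snippet actionness in [0, 1]; snippet_len: duration in seconds.
    Returns a list of (start_time, end_time) proposals.
    """
    proposals, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            proposals.append((start * snippet_len, i * snippet_len))
            start = None
    if start is not None:
        proposals.append((start * snippet_len, len(scores) * snippet_len))
    return proposals

scores = [0.1, 0.2, 0.8, 0.9, 0.85, 0.3, 0.1, 0.7, 0.75, 0.2]
print(group_actionness(scores))  # [(2.0, 5.0), (7.0, 9.0)]
```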

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods easily predict wrongly low actionness scores for frames within action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors, and then detection heads consisting of boundary regressors and action classifiers are used to refine the locations and predict the action classes of the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea comes from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated by the pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision-and-language task. Its goal is to localize the target objects from the given videos by using the textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of works to leverage the complementary information are also discussed. Finally, we give a review of the new vision and language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3 we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on the observation of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework can remove the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from frames in the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatio-temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot MultiBox Detector (SSD) in [67, 26].

Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatial-temporal action localization.

Third, temporal localization is to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracies, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelets linking [26] and thresholding-based refinement [8], etc.

Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy: they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance for another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.

With our approach, the diversity of region proposals for one stream is enhanced by using the complementary boxes from another stream, which helps reduce the number of missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled in the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2, C^i_3, C^i_4, C^i_5$, where $i \in \{\text{RGB}, \text{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2, U^i_3, U^i_4, U^i_5$.
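For reference, the top-down pathway can be sketched along the lines of standard FPN implementations. The snippet below is an illustrative sketch, not the thesis code: it builds U2..U5 from C2..C5 for a single stream, and the output channel number, the nearest-neighbour upsampling and the treatment of the I3D maps as 2D per-key-frame feature maps are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TopDownPathway(nn.Module):
        """Builds the top-down maps [U2, U3, U4, U5] from the bottom-up maps
        [C2, C3, C4, C5] of one stream with 1x1 lateral convolutions."""

        def __init__(self, in_channels, out_channels=256):
            super().__init__()
            self.laterals = nn.ModuleList(
                nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)

        def forward(self, c_feats):                      # [C2, C3, C4, C5]
            laterals = [conv(c) for conv, c in zip(self.laterals, c_feats)]
            u_feats = [laterals[-1]]                     # start from the top (U5)
            for lat in reversed(laterals[:-1]):
                up = F.interpolate(u_feats[0], size=lat.shape[-2:], mode="nearest")
                u_feats.insert(0, lat + up)              # merge lateral + upsampled map
            return u_feats                               # [U2, U3, U4, U5]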

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.

We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote by $l \in \{2, 3, 4, 5\}$ the index for the set of feature maps $\{C^i_2, C^i_3, C^i_4, C^i_5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{\text{RGB}}_l$ with the help of the features $C^{\text{Flow}}_l$ as follows:

$$\hat{C}^{\text{RGB}}_l = f_\theta(C^{\text{Flow}}_l) \oplus C^{\text{RGB}}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, $\hat{C}^{\text{RGB}}_l$ denotes the refined RGB feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{\text{Flow}}_l)$ nonlinearly extracts the message from the feature $C^{\text{Flow}}_l$ and then uses the extracted message to improve the features $C^{\text{RGB}}_l$.

The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and the same resolution as $C^{\text{RGB}}_l$ and $C^{\text{Flow}}_l$. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{\text{RGB}}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{\text{RGB}}_l$ in the top-down pathway are generated accordingly.

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.
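A minimal PyTorch-style sketch of the message-passing function f_theta is given below (not the thesis code). It treats the features as 2D maps of the key frame; the channel reduction ratio and the exact placement of the ReLU activations are assumptions, and the same structure can also serve the ROI-feature message passing described next.

    import torch.nn as nn

    class MessagePassing(nn.Module):
        """f_theta in Eqn. (3.1): two stacked 1x1 convolutions with ReLU."""

        def __init__(self, channels, reduction=4):       # reduction ratio assumed
            super().__init__()
            self.extract = nn.Sequential(
                nn.Conv2d(channels, channels // reduction, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels // reduction, channels, kernel_size=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, c_flow, c_rgb):
            # Refined RGB features: f_theta(C_l^Flow) added element-wise to C_l^RGB.
            return self.extract(c_flow) + c_rgb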


Denote the image-level feature map sets for the RGB and flow streams by $I^{\text{RGB}}$ and $I^{\text{Flow}}$, respectively. They are used to extract the features $F^{\text{RGB}}_t$ and $F^{\text{Flow}}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{\text{Flow}}_t$ of the flow stream is used to help improve the ROI feature $F^{\text{RGB}}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{\text{RGB}}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{\text{RGB}}_t$ of the RGB stream is also used to help improve the ROI feature $F^{\text{Flow}}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
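The label assignment described above can be written as a small helper. The sketch below is illustrative only (not the thesis code) and assumes boxes in (x1, y1, x2, y2) format on the key frame.

    import numpy as np

    def pairwise_iou(boxes_a, boxes_b):
        """IoU between every pair of boxes in (x1, y1, x2, y2) format."""
        boxes_a, boxes_b = np.asarray(boxes_a, float), np.asarray(boxes_b, float)
        x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
        y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
        x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
        y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
        area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
        return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

    def assign_proposal_labels(proposals, gt_boxes, iou_thr=0.5):
        """Label each proposal 1 (positive) or 0 (negative) by its best IoU
        with any ground-truth box, following the 0.5 threshold above."""
        if len(gt_boxes) == 0:
            return np.zeros(len(proposals), dtype=np.int64)
        ious = pairwise_iou(proposals, gt_boxes)      # (num_proposals, num_gt)
        return (ious.max(axis=1) >= iou_thr).astype(np.int64)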

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we then build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation modules and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. Therefore, these samples are the "confusing samples" and are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and then we can refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
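A small sketch of this testing-stage trimming step is given below (not the thesis code): the per-frame actionness scores of one tube are smoothed with a median filter, and frames whose smoothed scores fall below the threshold are dropped. The window size (17) and threshold (0.01) follow the implementation details in Section 3.4.1; returning only the first and last surviving frames is a simplification.

    import numpy as np
    from scipy.signal import medfilt

    def refine_tube_boundaries(actionness_scores, window=17, threshold=0.01):
        """Smooth the actionness scores of one action tube and return the
        refined (start, end) frame indices, or None if nothing survives."""
        smoothed = medfilt(np.asarray(actionness_scores, dtype=float),
                           kernel_size=window)
        kept = np.where(smoothed >= threshold)[0]
        if kept.size == 0:
            return None                      # the whole tube is discarded
        return int(kept[0]), int(kept[-1])   # new temporal boundaries of the tube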

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input, and follows the Faster R-CNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.

Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

(a) SPN. In Faster R-CNN, the anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can dramatically change in a range from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress temporal boundaries and predict actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
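One anchor-scale branch of the SPN can be sketched as below (illustrative only, not the thesis code): dilated temporal convolutions produce anchor-specific features, followed by two parallel kernel-size-1 temporal convolutions that predict the actionness probability and regress the anchor boundaries. The hidden width and the exact number of convolution layers per branch are assumptions; the dilation rate would be chosen per anchor scale so that the receptive field covers the anchor plus s/2 of context on each side.

    import torch.nn as nn

    class AnchorScaleBranch(nn.Module):
        """One sub-network of the SPN for a single anchor scale."""

        def __init__(self, in_channels, hidden=256, dilation=1):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv1d(in_channels, hidden, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv1d(hidden, hidden, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            )
            self.actionness = nn.Conv1d(hidden, 1, kernel_size=1)   # anchor score
            self.regression = nn.Conv1d(hidden, 2, kernel_size=1)   # start/end offsets

        def forward(self, tube_features):        # (batch, channels, time)
            x = self.features(tube_features)
            return self.actionness(x), self.regression(x)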

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal region to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented in our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with the length of $2l_s/5$, to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
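The segment feature construction can be sketched as follows (illustrative, not the thesis code). Average pooling over equal temporal bins stands in for temporal ROI-Pooling, the tube features are assumed to be a (D, T) tensor, and the default output length per part (15) is borrowed from the shortest length group in Section 3.4.1.

    import torch

    def temporal_roi_pool(tube_feats, t_begin, t_end, out_len):
        """Pool tube features (D, T) into out_len temporal bins over
        [t_begin, t_end) using average pooling."""
        _, T = tube_feats.shape
        t_begin = min(max(int(round(t_begin)), 0), T - 1)
        t_end = min(max(int(round(t_end)), t_begin + 1), T)
        bins = torch.linspace(t_begin, t_end, out_len + 1)
        pooled = []
        for k in range(out_len):
            lo = int(bins[k])
            hi = max(lo + 1, int(bins[k + 1]))
            pooled.append(tube_feats[:, lo:hi].mean(dim=1))
        return torch.stack(pooled, dim=1)                 # (D, out_len)

    def build_segment_feature(tube_feats, t_start, t_end, out_len=15):
        """Concatenate F_start, F_center and F_end for one segment proposal."""
        half_ctx = (t_end - t_start) / 5.0                # windows of length 2*l_s/5
        f_center = temporal_roi_pool(tube_feats, t_start, t_end, out_len)
        f_start = temporal_roi_pool(tube_feats, t_start - half_ctx,
                                    t_start + half_ctx, out_len)
        f_end = temporal_roi_pool(tube_feats, t_end - half_ctx,
                                  t_end + half_ctx, out_len)
        return torch.cat([f_start, f_center, f_end], dim=1)   # F_seg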

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal lengths of the extracted temporal features are not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of the segments.
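The sub-network structure of one detection head can be sketched as below (illustrative, not the thesis code). The hidden width, the temporal average pooling before the segment-level outputs and the exact output parameterization are assumptions; the sketch only shows the spatial fully-connected layer, the two kernel-size-3 temporal convolutions, and the joint frame-level and segment-level outputs.

    import torch.nn as nn

    class SegmentDetectionHead(nn.Module):
        """One sub-network of the detection head for one proposal-length group."""

        def __init__(self, spatial_dim, channels, hidden=256):
            super().__init__()
            self.spatial_fc = nn.Linear(spatial_dim, 1)          # collapse the 7x7 grid
            self.temporal = nn.Sequential(
                nn.Conv1d(channels, hidden, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
            )
            self.frame_actionness = nn.Conv1d(hidden, 1, kernel_size=1)
            self.segment_outputs = nn.Linear(hidden, 3)          # score + (start, end)

        def forward(self, seg_feats):            # (batch, channels, time, spatial_dim)
            x = self.spatial_fc(seg_feats).squeeze(-1)           # (batch, channels, time)
            x = self.temporal(x)
            frame_scores = self.frame_actionness(x)              # frame-level actionness
            pooled = x.mean(dim=2)                               # temporal average pooling
            out = self.segment_outputs(pooled)
            return out[:, 0], out[:, 1:], frame_scores           # score, boundaries, frames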

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these averaged IoU scores over all classes.
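For reference, the MABO metric described above can be computed as in the following sketch (illustrative only; it assumes per-class dictionaries of detected and ground-truth boxes in (x1, y1, x2, y2) format).

    import numpy as np

    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union > 0 else 0.0

    def average_best_overlap(det_boxes, gt_boxes):
        """ABO for one class: for each ground-truth box, take the IoU of the
        detected box that overlaps it most, then average over ground truths."""
        best = [max((iou(d, g) for d in det_boxes), default=0.0) for g in gt_boxes]
        return float(np.mean(best)) if best else 0.0

    def mabo(per_class_dets, per_class_gts):
        """Average the per-class ABO scores over all classes."""
        return float(np.mean([average_best_overlap(per_class_dets[c], per_class_gts[c])
                              for c in per_class_gts]))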


To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations for action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞), i.e., we have M = 4, and the corresponding feature lengths l_t for these length ranges are set to 15, 30, 60, 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.

3.4.2 Comparison with the State-of-the-art methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the two-stream bounding boxes. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                     Streams     mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]     RGB+Flow    35.8
Peng and Schmid [54]                       RGB+Flow    65.7
Ye, Yang and Tian [94]                     RGB+Flow    67.0
Kalogeiton et al. [26]                     RGB+Flow    69.5
Gu et al. [19]                             RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]              RGB         73.2
Faster R-CNN + FPN (Flow) [38]             Flow        64.7
Faster R-CNN + FPN [38]                    RGB+Flow    75.5
PCSC (Ours)                                RGB+Flow    79.2

Table 3.1 shows the frame-level mAPs of different methods on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for per-frame action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector.

Method                             Streams          δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Weinzaepfel et al. [83]            RGB+Flow         46.8    -       -        -
Zolfaghari et al. [105]            RGB+Flow+Pose    47.6    -       -        -
Peng and Schmid [54]               RGB+Flow         73.5    32.1    2.7      7.3
Saha et al. [60]                   RGB+Flow         66.6    36.4    0.8      14.4
Yang, Gao and Nevatia [93]         RGB+Flow         73.5    37.8    -        -
Singh et al. [67]                  RGB+Flow         73.5    46.3    15.0     20.4
Ye, Yang and Tian [94]             RGB+Flow         76.2    -       -        -
Kalogeiton et al. [26]             RGB+Flow         76.5    49.2    19.7     23.4
Li et al. [31]                     RGB+Flow         77.9    -       -        -
Gu et al. [19]                     RGB+Flow         -       59.9    -        -
Faster R-CNN + FPN (RGB) [38]      RGB              77.5    51.3    14.2     21.9
Faster R-CNN + FPN (Flow) [38]     Flow             72.7    44.2    7.4      17.2
Faster R-CNN + FPN [38]            RGB+Flow         80.1    53.2    15.9     23.7
PCSC + AD [69]                     RGB+Flow         84.3    61.0    23.0     27.8
PCSC + ASD + AD + T-PCSC (Ours)    RGB+Flow         84.7    63.1    23.6     28.9

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated by our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than the other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                            Streams     mAP (%)
Peng and Schmid [54]              RGB+Flow    58.5
Ye, Yang and Tian [94]            RGB+Flow    63.2
Kalogeiton et al. [26]            RGB+Flow    65.7
Hou, Chen and Shah [21]           RGB         61.3
Gu et al. [19]                    RGB+Flow    73.3
Sun et al. [72]                   RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]     RGB         64.7
Faster R-CNN + FPN (Flow) [38]    Flow        68.4
Faster R-CNN + FPN [38]           RGB+Flow    70.2
PCSC (Ours)                       RGB+Flow    74.8

Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method                            Streams          δ=0.2   δ=0.5   δ=0.75   0.5:0.95
Gkioxari and Malik [18]           RGB+Flow         -       53.3    -        -
Wang et al. [79]                  RGB+Flow         -       56.4    -        -
Weinzaepfel et al. [83]           RGB+Flow         63.1    60.7    -        -
Saha et al. [60]                  RGB+Flow         72.6    71.5    43.3     40.0
Peng and Schmid [54]              RGB+Flow         74.1    73.1    -        -
Singh et al. [67]                 RGB+Flow         73.8    72.0    44.5     41.6
Kalogeiton et al. [26]            RGB+Flow         74.2    73.7    52.1     44.8
Ye, Yang and Tian [94]            RGB+Flow         75.8    73.8    -        -
Zolfaghari et al. [105]           RGB+Flow+Pose    78.2    73.4    -        -
Hou, Chen and Shah [21]           RGB              78.4    76.9    -        -
Gu et al. [19]                    RGB+Flow         -       78.6    -        -
Sun et al. [72]                   RGB+Flow         -       80.1    -        -
Faster R-CNN + FPN (RGB) [38]     RGB              74.5    73.9    54.1     44.2
Faster R-CNN + FPN (Flow) [38]    Flow             77.9    76.1    56.1     46.3
Faster R-CNN + FPN [38]           RGB+Flow         79.1    78.5    57.2     47.6
PCSC (Ours)                       RGB+Flow         82.6    82.2    63.1     52.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

At the frame level (see Table 3.3), our PCSC method performs the second best, and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage    PCSC w/o feature cooperation    PCSC
0        75.5                            78.1
1        76.1                            78.6
2        76.4                            78.8
3        76.7                            79.1
4        76.7                            79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector while "ASD" is for the action segment detector.

Method                        Video mAP (%)
Faster R-CNN + FPN [38]       53.2
PCSC                          56.4
PCSC + AD                     61.0
PCSC + ASD (single branch)    58.4
PCSC + ASD                    60.7
PCSC + ASD + AD               61.9
PCSC + ASD + AD + T-PCSC      63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage    Video mAP (%)
0        61.9
1        62.8
2        63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method with or without the feature-level cooperation module improves as the number of stages increases. However, such improvement seems to become saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.

Action tubes refinement In Table 36 we investigate the effective-ness of our temporal boundary refinement method by switching on andoff the key components in our temporal boundary refinement modulein which video-level mAPs at the IoU threshold δ = 05 are reportedIn Table 36 we investigate five configurations for temporal boundaryrefinement Specifically in lsquoPCSC + AD we use the PCSC method forspatial localization together with the actionness detector (AD) for tempo-ral localization In ldquoPCSC + ASD we use the PCSC method for spatiallocalization while we use the action segment detection method for tem-poral localization In ldquoPCSC + ASD (single branch) we use the ldquoPCSC+ ASD configuration but replace the multi-branch architecture in theaction segment detection method with the single-branch architecture InldquoPCSC + ASD + AD the PCSC method is used for spatial localizationtogether with both AD and ASD for temporal localization In ldquoPCSC +ASD + AD + T-PCSC the PCSC method is used for spatial localizationand both actionness detector and action segment detector are used togenerate the initial segments and our T-PCSC is finally used for tempo-ral localization By generating higher quality detection results at eachframe our PCSC method in the spatial domain outperforms the work

54Chapter 3 Progressive Cross-stream Cooperation in Spatial and

Temporal Domain for Action Localization

Table 38 The large overlap ratio (LoR) of the region proposals from thelatest RGB stream with respective to the region proposals from the latestflow stream at each stage in our PCSC action detector method on theUCF-101-24 dataset

Stage    LoR (%)
1        61.1
2        71.9
3        95.2
4        95.4

By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, in order to verify the benefit of this progressive strategy in the temporal domain.


Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage    MABO (%)
1        84.1
2        84.3
3        84.5
4        84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ      Stage 1    Stage 2    Stage 3    Stage 4
0.5    55.3       55.7       56.3       57.0
0.6    62.1       63.2       64.3       65.5
0.7    68.0       68.5       69.4       69.5
0.8    73.9       74.5       74.8       75.1
0.9    78.4       78.4       79.5       80.1

Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                    0     1     2     3     4
Detection speed (FPS)    31    29    28    26    24

Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                    0     1     2
Detection speed (VPS)    21    17    15

Stage 0 refers to our "PCSC + ASD + AD" method, while Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.


Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and the temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, one can trade off between the localization performance and the speed by choosing a suitable number of stages.


Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using the single RGB stream, the single flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detection (see Fig. 3.4(a)) or false detection (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score of the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases in which our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from the background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to instantaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos. For example, temporal action localization methods often employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance along the time axis of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to predict the action for each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details. However, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.

4.2 Our Approach

In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $\mathcal{P}^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
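A minimal PyTorch sketch of this frame-based branch is given below; the hidden channel width, the kernel size of the two temporal convolutions, and the confidence threshold used for pairing start/end indices are assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class FrameBasedBranch(nn.Module):
    """Sketch: base features (B, C, T) -> frame-based features ->
    per-frame starting/ending probabilities."""
    def __init__(self, in_channels=1024, hidden_channels=256):
        super().__init__()
        # Two 1D convolutions along the temporal dimension (kernel size assumed).
        self.feature_convs = nn.Sequential(
            nn.Conv1d(in_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden_channels, hidden_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        # Two parallel 1x1 convolutions followed by sigmoid activations.
        self.start_head = nn.Conv1d(hidden_channels, 1, kernel_size=1)
        self.end_head = nn.Conv1d(hidden_channels, 1, kernel_size=1)

    def forward(self, base_features):                           # (B, C, T)
        q = self.feature_convs(base_features)                   # frame-based features Q
        p_start = torch.sigmoid(self.start_head(q)).squeeze(1)  # (B, T)
        p_end = torch.sigmoid(self.end_head(q)).squeeze(1)      # (B, T)
        return q, p_start, p_end

def pair_start_end(p_start, p_end, thresh=0.5, max_duration=100):
    """Pair confident start/end indices of one video (1D tensors) into
    candidate segments; redundant pairs would then be removed by NMS."""
    starts = (p_start >= thresh).nonzero(as_tuple=True)[0].tolist()
    ends = (p_end >= thresh).nonzero(as_tuple=True)[0].tolist()
    return [(s, e) for s in starts for e in ends if 0 < e - s <= max_duration]
```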

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the two preceding proposal generation methods,


Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated at the previous stages. (b), (c) Details of our RGB-stream and flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $\mathcal{S}_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $\mathcal{S}_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that, while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $\mathcal{S}_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and also updates the flow-stream frame-based feature as $Q_n^{Flow}$. These outputs will be used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that, within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.
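The control flow of this progressive scheme can be summarized with the schematic sketch below; the AFC modules are represented by placeholder callables with an assumed signature, so this is only an illustration of how features, proposals and segments are carried across stages, not the actual implementation.

```python
def progressive_cooperation(rgb_afc, flow_afc, init, num_stages=4, nms_fn=None):
    """Schematic loop over AFC stages that alternate between the RGB and
    flow streams. `init` holds the stage-0 anchor features (P), frame
    features (Q) and detected segments (S) for both streams."""
    P = dict(init["P"])      # anchor-based features per stream
    Q = dict(init["Q"])      # frame-based features per stream
    S = dict(init["S"])      # detected segments per stream
    all_segments = list(S["RGB"]) + list(S["Flow"])

    for n in range(1, num_stages + 1):
        main, other = ("RGB", "Flow") if n % 2 == 1 else ("Flow", "RGB")
        afc = rgb_afc if main == "RGB" else flow_afc
        # Each stage consumes the latest two-stream inputs and updates the
        # main-stream anchor feature and segments as well as the
        # other-stream frame-based feature (cf. the description above).
        P[main], Q[other], S[main] = afc(
            anchor_feat_main=P[main], anchor_feat_other=P[other],
            frame_feat_other=Q[other],
            segments_main=S[main], segments_other=S[other],
        )
        all_segments.extend(S[main])

    # The final localization results are obtained by NMS over the union
    # of the detected segments from all stages.
    return nms_fn(all_segments) if nms_fn is not None else all_segments
```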

4.2.3 Anchor-frame Cooperation Module

The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods can not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$, as the input, as described in Sec. 4.2.1 and Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; its network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as

$$Q_n^{Flow} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can produce more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and it can therefore generate better frame-based proposals $\mathcal{Q}_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}_n^{RGB}$.
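The message passing function and the element-wise fusion in Eq. (4.1) can be sketched in PyTorch as follows; the channel width and temporal length are arbitrary illustrative values, and whether a ReLU follows each of the two 1x1 convolutions is an assumption.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """M(.): two stacked 1x1 temporal convolutions with ReLU activations."""
    def __init__(self, channels=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, src_feat):
        return self.net(src_feat)

# Eq. (4.1): the frame-based flow feature absorbs the anchor-based flow feature.
anchor_to_frame = MessagePassing(channels=256)
P_flow_prev = torch.randn(1, 256, 128)   # P^Flow_{n-1}: anchor-based flow feature (B, C, T)
Q_flow_prev = torch.randn(1, 256, 128)   # Q^Flow_{n-2}: frame-based flow feature (B, C, T)
Q_flow_new = anchor_to_frame(P_flow_prev) + Q_flow_prev   # element-wise addition (⊕)
```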

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q_n^{Flow}$. We then combine $\mathcal{Q}_n^{Flow}$ with the two-stream anchor-based proposals $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $\mathcal{P}_n^{RGB} = \mathcal{S}_{n-2}^{RGB} \cup \mathcal{S}_{n-1}^{Flow} \cup \mathcal{Q}_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}_n^{Flow} = \mathcal{S}_{n-2}^{Flow} \cup \mathcal{S}_{n-1}^{RGB} \cup \mathcal{Q}_n^{RGB}$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to equation (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{flow \rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{rgb \rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}_n^{RGB}$, $\mathcal{P}_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$, $P_n^{Flow}$. To increase the perceptual aperture around the action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (in this work, it is fixed to 10), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.
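The following is a rough sketch of how the SOI pooling with the two extra boundary poolings could be realized for a 1D temporal feature map; the fixed output length and the use of adaptive average pooling are assumptions, while the boundary interval of 10 follows the description above.

```python
import torch
import torch.nn.functional as F

def soi_pool(features, t_start, t_end, out_size=16):
    """Pool a (C, T) temporal feature map over [t_start, t_end) to a fixed length."""
    t_start, t_end = max(0, int(t_start)), min(features.shape[1], int(t_end))
    if t_end <= t_start:                                         # degenerate span
        return features.new_zeros(features.shape[0], out_size)
    segment = features[:, t_start:t_end].unsqueeze(0)            # (1, C, L)
    return F.adaptive_avg_pool1d(segment, out_size).squeeze(0)   # (C, out_size)

def proposal_feature(features, proposal, boundary_interval=10, out_size=16):
    """Concatenate the segment feature with two boundary-context features."""
    t_s, t_e = proposal
    center = soi_pool(features, t_s, t_e, out_size)
    start_ctx = soi_pool(features, t_s - boundary_interval, t_s + boundary_interval, out_size)
    end_ctx = soi_pool(features, t_e - boundary_interval, t_e + boundary_interval, out_size)
    return torch.cat([center, start_ctx, end_ctx], dim=1)        # (C, 3 * out_size)

# Example: pool one proposal from a C=256, T=128 anchor-based feature map.
feat = torch.randn(256, 128)
pooled = proposal_feature(feat, (20, 60))
print(pooled.shape)   # torch.Size([256, 48])
```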

4.2.4 Training Details

At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is a cross-entropy loss that supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
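A minimal sketch of how the per-stage losses could be assembled is shown below; the use of binary cross-entropy for the frame-wise start/end supervision and the unit loss weights are assumptions, since the text only specifies the loss types and that all losses are summed.

```python
import torch
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels, reg_pred, reg_targets,
                   start_prob, end_prob, start_labels, end_labels):
    """Sum of the anchor-based and frame-based losses for one AFC stage."""
    # Anchor-based branch: action classification + boundary regression.
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_targets)
    # Frame-based branch: frame-wise starting/ending supervision
    # (binary cross-entropy is one plausible instantiation).
    loss_frame = (F.binary_cross_entropy(start_prob, start_labels) +
                  F.binary_cross_entropy(end_prob, end_labels))
    return loss_cls + loss_reg + loss_frame

# Toy example with random tensors (8 proposals, 21 classes, 100 frames).
loss = afc_stage_loss(
    cls_logits=torch.randn(8, 21, requires_grad=True),
    cls_labels=torch.randint(0, 21, (8,)),
    reg_pred=torch.randn(8, 2, requires_grad=True),
    reg_targets=torch.randn(8, 2),
    start_prob=torch.rand(8, 100), end_prob=torch.rand(8, 100),
    start_labels=torch.randint(0, 2, (8, 100)).float(),
    end_labels=torch.randint(0, 2, (8, 100)).float(),
)
loss.backward()  # in training, the losses of all stages are summed first
```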

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant amounts to incorporating the progressive feature and proposal cooperation strategy within a two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and higher recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and it is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that, even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals become similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. On the other hand, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation of the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D model [4] pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the minibatch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
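These optimization settings can be expressed with a standard PyTorch schedule as sketched below; the choice of SGD with momentum and the total number of epochs are assumptions, whereas the initial learning rate of 0.001 (THUMOS14/UCF-101-24) and the decay by a factor of 10 every 12 epochs follow the description above.

```python
import torch

# Placeholder model standing in for the trainable PCG-TAL parameters.
model = torch.nn.Linear(10, 2)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
# Divide the learning rate by 10 every 12 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=12, gamma=0.1)

for epoch in range(36):           # total epoch count is an assumption
    # ... iterate over minibatches, compute the losses and call backward() ...
    optimizer.step()              # placeholder for the per-iteration update
    scheduler.step()              # decay the learning rate per epoch
```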

Evaluation Metrics. We evaluate our model using the mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
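For clarity, the tIoU criterion can be written as a small self-contained function; the segment format (start, end) in seconds is an assumption.

```python
def temporal_iou(seg_a, seg_b):
    """tIoU of two segments given as (start, end) in seconds."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_seg, pred_label, gt_seg, gt_label, delta=0.5):
    """A detection is correct if its tIoU with a ground truth exceeds delta
    and the predicted action label matches."""
    return pred_label == gt_label and temporal_iou(pred_seg, gt_seg) > delta

print(temporal_iou((2.0, 8.0), (4.0, 10.0)))  # 0.5
```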

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues.


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU threshold δ    0.1    0.2    0.3    0.4    0.5
SCNN [63]           47.7   43.5   36.3   28.7   19.0
REINFORCE [95]      48.9   44.0   36.0   26.4   17.1
CDC [62]            -      -      40.1   29.4   23.3
SMS [98]            51.0   45.2   36.5   27.8   17.8
TURN [15]           54.0   50.9   44.1   34.9   25.6
R-C3D [85]          54.5   51.5   44.8   35.6   28.9
SSN [103]           66.0   59.4   51.9   41.0   29.8
BSN [37]            -      -      53.5   45.0   36.9
MGG [46]            -      -      53.9   46.8   37.4
BMN [35]            -      -      56.0   47.4   38.8
TAL-Net [5]         59.8   57.1   53.2   48.5   42.8
P-GCN [99]          69.5   67.8   63.6   57.8   49.1
Ours                71.2   68.9   65.1   59.5   51.2

At the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. The mAP of our method is also 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two types of methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively.


Table 4.2: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]            45.30   26.00   0.20    23.80
TCN [9]             36.44   21.15   3.90    -
R-C3D [85]          26.80   -       -       -
SSN [103]           39.12   23.48   5.49    23.98
TAL-Net [5]         38.23   18.30   1.30    20.22
P-GCN [99]          42.90   28.14   2.47    26.99
Ours                44.31   29.85   5.47    28.85

Table 4.3: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]            46.45   29.96   8.02    30.03
P-GCN [99]          48.26   33.16   3.27    31.11
BMN [35]            50.07   34.78   8.29    33.85
Ours                52.04   35.92   7.97    34.91

In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAP %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU threshold δ    0.2    0.5    0.75   0.5:0.95
LTT [83]              46.8   -      -      -
MRTSR [54]            73.5   32.1   2.7    7.3
MSAT [60]             66.6   36.4   7.9    14.4
ROAD [67]             73.5   46.3   15.0   20.4
ACT [26]              76.5   49.2   19.7   23.4
TAD [19]              -      59.9   -      -
PCSC [69]             84.3   61.0   23.0   27.8
Ours                  84.9   63.5   23.7   29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which case $\mathcal{Q}_n^{Flow}$ and $\mathcal{Q}_n^{RGB}$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which case the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method.


Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage         0       1       2       3       4
AFC           44.75   46.92   48.14   48.37   48.22
w/o CS        44.21   44.96   45.47   45.48   45.37
w/o CG        43.55   45.95   46.71   46.65   46.67
w/o CS-MP     44.75   45.34   45.51   45.53   45.49
w/o CG-MP     44.75   46.32   46.97   46.85   46.79
w/o PC        42.14   43.21   45.14   45.74   45.77

Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ (for the RGB-stream AFC module) and respectively $P_{n-1}^{RGB}$ and $P_{n-2}^{Flow}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $\mathcal{P}_0^{RGB} \cup \mathcal{P}_0^{Flow}$ for "w/o CG" and $\mathcal{Q}_0^{RGB} \cup \mathcal{Q}_0^{Flow}$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously feeding the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach.


Table 4.6: LoR (%) scores of the proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage    Comparison                                           LoR (%)
1        $\mathcal{P}_1^{RGB}$ vs. $\mathcal{S}_0^{Flow}$     67.1
2        $\mathcal{P}_2^{Flow}$ vs. $\mathcal{P}_1^{RGB}$     79.2
3        $\mathcal{P}_3^{RGB}$ vs. $\mathcal{P}_2^{Flow}$     89.5
4        $\mathcal{P}_4^{Flow}$ vs. $\mathcal{P}_3^{RGB}$     97.1

(4) At the first few stages, the results of all methods are generally improved as the number of stages increases, and the results become stable after 2 or 3 stages.

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-overlap-Ratio (LoR) score. Take the proposal sets $\mathcal{P}_1^{RGB}$ at Stage 1 and $\mathcal{P}_2^{Flow}$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $\mathcal{P}_2^{Flow}$ among all proposals in $\mathcal{P}_2^{Flow}$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposal set $\mathcal{P}_1^{RGB}$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $\mathcal{P}_2^{Flow}$) is calculated by comparing it with all proposals in the other proposal set (say $\mathcal{P}_1^{RGB}$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
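A small sketch of the LoR computation for one video is given below, directly following the definition above; the proposal format (start, end) is an assumption.

```python
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current_proposals, previous_proposals, thresh=0.7):
    """Fraction of current proposals whose max-tIoU with the previous
    proposal set exceeds `thresh` (the LoR score, in [0, 1])."""
    if not current_proposals or not previous_proposals:
        return 0.0
    similar = 0
    for p in current_proposals:
        max_tiou = max(temporal_iou(p, q) for q in previous_proposals)
        if max_tiou > thresh:
            similar += 1
    return similar / len(current_proposals)

# Toy example with (start, end) segments in seconds.
flow_stage2 = [(1.0, 5.0), (10.0, 20.0), (30.0, 31.0)]
rgb_stage1 = [(1.2, 5.1), (11.0, 19.0)]
print(lor_score(flow_stage2, rgb_stage1))  # 0.666... -> two of the three are similar
```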

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. This large LoR score at Stage 4 indicates that the proposals in $\mathcal{P}_4^{Flow}$ and those in $\mathcal{P}_3^{RGB}$ are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
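The AR@AN protocol described above can be sketched as follows; the input structure (per-video proposal lists sorted by confidence) is an assumption made for illustration.

```python
import numpy as np

def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def average_recall_at_an(videos, an=100):
    """videos: list of dicts with 'proposals' (sorted by confidence, as (start, end))
    and 'ground_truths' (list of (start, end)). Returns AR@AN in [0, 1]."""
    thresholds = np.arange(0.5, 1.01, 0.05)      # eleven tIoU thresholds
    recalls = []
    for thr in thresholds:
        matched, total = 0, 0
        for video in videos:
            proposals = video["proposals"][:an]  # keep the top-AN proposals
            for gt in video["ground_truths"]:
                total += 1
                if any(temporal_iou(p, gt) >= thr for p in proposals):
                    matched += 1
        recalls.append(matched / max(total, 1))
    return float(np.mean(recalls))

# Toy example with a single video.
videos = [{"proposals": [(0.0, 10.0), (12.0, 20.0)],
           "ground_truths": [(0.5, 9.5), (50.0, 60.0)]}]
print(average_recall_at_an(videos, an=100))
```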

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown as the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, and (d) AR@500.


Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period from the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by the anchor-based method [5], the frame-based method [37], and our PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates an action segment with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and it eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets: VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, and thus it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations.


For the video grounding task, Zhang et al. [102] proposed a new method (referred to as STGRN) that does not rely on the pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. In the concurrent work STGVT [76], similar to our proposed framework, a visual-linguistic transformer is also adopted to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely, we have $V = \{V^{clip}_k\}_{k=1}^{K}$, where $V^{clip}_k$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
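As an illustration of this encoding step, the sketch below (an assumption about the implementation that uses the standard torchvision ResNet-101, not the exact thesis configuration) extracts per-frame feature maps from the 4th residual block and reshapes them into a clip feature of size $T \times HW \times C$.

import torch
import torchvision

# Keep the ResNet-101 layers up to the 4th residual block (drop pooling and fc).
backbone = torchvision.models.resnet101(pretrained=True)
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

def encode_clip(clip_frames):
    # clip_frames: tensor of shape (T, 3, H_img, W_img) for one video clip
    with torch.no_grad():
        fmap = feature_extractor(clip_frames)   # (T, C, H, W)
    return fmap.flatten(2).permute(0, 2, 1)     # clip feature of shape (T, H*W, C)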

For the textual description, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two


special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.

Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT as it assumes the bounding boxes are generated by using the pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as key and value in


Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F_{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
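The following PyTorch sketch summarizes the STCD computation described above. It is our own simplified illustration (a single attention layer and a shared feature dimension C for both branches are assumptions), not the exact thesis implementation.

import torch
import torch.nn as nn

class STCD(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        # vis: (B, T, HW, C) visual input; txt: (B, N, C) textual output of the last layer
        B, T, HW, C = vis.shape
        f_t = vis.mean(dim=2)                      # initial temporal feature, (B, T, C)
        f_s = vis.mean(dim=1)                      # initial spatial feature,  (B, HW, C)
        combined = torch.cat([f_t, f_s], dim=1)    # combined visual feature, (B, T+HW, C)
        # text-guided visual feature: queries are the combined visual tokens,
        # keys/values come from the textual branch
        guided, _ = self.attn(combined, txt, txt)  # (B, T+HW, C)
        g_t, g_s = guided[:, :T], guided[:, T:]    # decompose into (B, T, C) and (B, HW, C)
        # replication via broadcasting, then add to the input and normalize
        out = self.norm(vis + g_t.unsqueeze(2) + g_s.unsqueeze(1))  # (B, T, HW, C)
        # `combined` would also be passed to the textual branch as key/value
        return out, combined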

Spatial Localization. Given the text-guided visual feature $F_{tv}$, our SVG-net predicts the bounding box $b_t$ at each frame. We first reshape $F_{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar as in CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $A \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center, and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
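The box decoding step can be illustrated with the short sketch below (our assumption of the decoding logic; the predicted sizes are assumed to be given in heatmap coordinates).

import torch

def decode_box(heatmap, size_map):
    # heatmap: (8H, 8W) centre probabilities; size_map: (2, 8H, 8W) with (w, h) per location
    Hh, Wh = heatmap.shape
    idx = torch.argmax(heatmap)
    cy, cx = idx // Wh, idx % Wh               # most confident centre location
    w, h = size_map[0, cy, cx], size_map[1, cy, cx]
    x1, y1 = cx - w / 2, cy - h / 2            # top-left corner
    x2, y2 = cx + w / 2, cy + h / 2            # bottom-right corner
    return torch.stack([x1, y1, x2, y2]), heatmap[cy, cx]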

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending


frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F_{tv}$ to produce the global text-guided visual feature $F_{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F_{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $o_{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B_{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
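A minimal sketch of this temporal boundary selection is given below, assuming the window size of 9 and the threshold of 0.3 reported later in the implementation details (Section 5.3.1); it is an illustration rather than the thesis code.

import numpy as np
from scipy.ndimage import median_filter

def initial_boundaries(scores, window=9, thresh=0.3):
    # scores: per-frame relevance scores o_obj for all K*T frames of the video
    smoothed = median_filter(np.asarray(scores), size=window)
    keep = smoothed >= thresh
    best, t0 = (0, 0), None
    for t, k in enumerate(keep):
        if k and t0 is None:
            t0 = t                                   # a run above the threshold starts
        if (not k or t == len(keep) - 1) and t0 is not None:
            t1 = t if k else t - 1                   # the run ends here
            if t1 - t0 > best[1] - best[0]:
                best = (t0, t1)                      # keep the longest duration
            t0 = None
    return best                                      # (t_s^0, t_e^0)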

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $\bar{w}$, $\bar{h}$ and $\bar{o}$, respectively. When the frame contains a ground-truth bounding box, we have $\bar{o} = 1$; otherwise, we have $\bar{o} = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $\bar{A}$ for each frame by using the Gaussian kernel


$\bar{a}_{xy} = \exp\left(-\frac{(x-\bar{x})^2+(y-\bar{y})^2}{2\sigma^2}\right)$, where $\bar{a}_{xy}$ is the value of $\bar{A} \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$, and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$L_{SVG} = \lambda_1 L_c(A, \bar{A}) + \lambda_2 L_{rel}(o_{obj}, \bar{o}) + \lambda_3 L_{size}(w_{xy}, h_{xy}, \bar{w}, \bar{h})$,     (5.1)

where $A \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $o_{obj}$ is the predicted relevance score, and $w_{xy}$ and $h_{xy}$ are the predicted width and height of the bounding box centered at the position $(x, y)$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
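For illustration, the per-frame objective in Eq. (5.1) can be sketched as follows; the penalty-reduced focal-loss form follows CornerNet/CenterNet [28, 104], and the exact formulation used in the thesis may differ.

import torch
import torch.nn.functional as F

def svg_loss(pred_heat, gt_heat, pred_rel, gt_rel, pred_wh, gt_wh,
             lam1=1.0, lam2=1.0, lam3=0.1, alpha=2, beta=4):
    # pred_heat / gt_heat: (8H, 8W); pred_rel / gt_rel: scalars in [0, 1];
    # pred_wh / gt_wh: (2,) predicted and ground-truth box width and height
    pos = gt_heat.eq(1).float()
    neg = 1.0 - pos
    p = pred_heat.clamp(1e-6, 1 - 1e-6)
    l_c = -(pos * (1 - p) ** alpha * torch.log(p)
            + neg * (1 - gt_heat) ** beta * p ** alpha * torch.log(1 - p)).sum()
    l_c = l_c / pos.sum().clamp(min=1)                 # focal loss for the centre heatmap
    l_rel = F.binary_cross_entropy(pred_rel, gt_rel)   # relevance score loss
    l_size = gt_rel * F.l1_loss(pred_wh, gt_wh)        # size loss only for frames with a box
    return lam1 * l_c + lam2 * l_rel + lam3 * l_size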

5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect the temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B_{init}$) is used as an anchor and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


$t_s$ and $t_e$. The frame-based branch then continues to refine the boundaries $t_s$ and $t_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. The iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch, and the details in other stages are similar to those in the first stage.

Anchor-based branch For the anchor-based branch the core idea is totake advantage of the feature extracted from the whole anchor in orderto regress the target boundaries and predict the offsets

Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before and after the starting and the ending frames by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features along the temporal dimension from the pooled video feature $F_{pool}$ at the 20 sampled frames, where the pooled video feature $F_{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.
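The anchor feature construction can be sketched as follows (our own illustration; $F_{pool}$ is assumed to be a tensor of shape num_frames x C).

import torch

def anchor_feature(f_pool, t_s, t_e, num_samples=20):
    # f_pool: (num_frames, C) pooled video feature; t_s, t_e: anchor boundaries (frame indexes)
    l = t_e - t_s
    lo, hi = t_s - l / 4.0, t_e + l / 4.0            # temporally extended anchor
    idx = torch.linspace(lo, hi, num_samples).round().long()
    idx = idx.clamp(0, f_pool.shape[0] - 1)          # stay inside the video
    return f_pool[idx]                               # anchor feature F of shape (20, C)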

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $t^1_s$ and $t^1_e$.

Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below we take the starting position as an example. Given the refined starting position $t^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [t^1_s - D/2 + 1, t^1_s + D/2]$, where $D$ is the length


of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F_{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $t^1_s$. Accordingly, we can produce the ending frame position $t^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t^1_s$ and $t^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$L_{TBR} = \lambda_4 L_{anc}(\Delta t, \bar{\Delta t}) + \lambda_5 L_s(p_s, \bar{p}_s) + \lambda_6 L_e(p_e, \bar{p}_e)$,     (5.2)

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\Delta t$ and $\bar{\Delta t}$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $p_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $\bar{p}_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $p_e$ and $\bar{p}_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to be 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold $\theta_p$ and the temporal length $T$ of each video clip are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|} \sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames


between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
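For reference, the vIoU criterion defined above can be computed as in the following sketch (our own illustration, with tubes represented as dictionaries mapping frame indexes to boxes).

def box_iou(a, b):
    # a, b: boxes as [x1, y1, x2, y2]
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda x: max(0.0, x[2] - x[0]) * max(0.0, x[3] - x[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def video_iou(pred_tube, gt_tube):
    # pred_tube / gt_tube: dicts mapping frame index -> [x1, y1, x2, y2]
    s_i = set(pred_tube) & set(gt_tube)       # intersected frames S_I
    s_u = set(pred_tube) | set(gt_tube)       # union of frames S_U
    if not s_u:
        return 0.0
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / len(s_u)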

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset.


Table 5.1: Results of different methods on the VidSTG dataset (the results of STGRN [102] are quoted from its original work).

Method | Declarative Sentence Grounding: m_vIoU (%) / vIoU@0.3 (%) / vIoU@0.5 (%) | Interrogative Sentence Grounding: m_vIoU (%) / vIoU@0.3 (%) / vIoU@0.5 (%)
STGRN [102] | 19.75 / 25.77 / 14.60 | 18.32 / 21.10 / 12.83
Vanilla (ours) | 21.49 / 27.69 / 15.77 | 20.22 / 23.17 / 13.82
SVG-net (ours) | 22.87 / 28.31 / 16.31 | 21.57 / 24.09 / 14.33
SVG-net + TBR-net (ours) | 25.21 / 30.99 / 18.21 | 23.67 / 26.26 / 16.01


Table 5.2: Results of different methods on the HC-STVG dataset (the results of STGVT [76] are quoted from its original work).

Method | m_vIoU (%) | vIoU@0.3 (%) | vIoU@0.5 (%)
STGVT [76] | 18.15 | 26.81 | 9.48
Vanilla (ours) | 17.94 | 25.59 | 8.75
SVG-net (ours) | 19.45 | 27.78 | 10.12
SVG-net + TBR-net (ours) | 22.41 | 30.31 | 12.07

Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

Through the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) Also, the results show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3.


Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

Methods | Stages | Declarative Sentence Grounding: m_vIoU / vIoU@0.3 / vIoU@0.5 | Interrogative Sentence Grounding: m_vIoU / vIoU@0.3 / vIoU@0.5
SVG-net | - | 22.87 / 28.31 / 16.31 | 21.57 / 24.09 / 14.33
SVG-net + IFB-net | 1 | 23.41 / 28.84 / 16.69 | 22.12 / 24.48 / 14.57
SVG-net + IFB-net | 2 | 23.82 / 28.97 / 16.85 | 22.55 / 24.69 / 14.71
SVG-net + IFB-net | 3 | 23.77 / 28.91 / 16.87 | 22.61 / 24.72 / 14.79
SVG-net + IAB-net | 1 | 23.27 / 29.57 / 16.39 | 21.78 / 24.93 / 14.31
SVG-net + IAB-net | 2 | 23.74 / 29.98 / 16.52 | 22.21 / 25.42 / 14.59
SVG-net + IAB-net | 3 | 23.85 / 30.15 / 16.57 | 22.43 / 25.71 / 14.47
SVG-net + TBR-net | 1 | 24.32 / 29.92 / 17.22 | 22.69 / 25.37 / 15.29
SVG-net + TBR-net | 2 | 24.95 / 30.75 / 17.02 | 23.34 / 26.03 / 15.83
SVG-net + TBR-net | 3 | 25.21 / 30.99 / 18.21 | 23.67 / 26.26 / 16.01


(4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continually grow in the future. The learning techniques for video localization will therefore attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have also developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our works and discuss the potential directions of video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level.


The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential direction of video localization in the future could be unifying the spatial video localization and the temporal video localization into


one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5803–5812.

[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, COLT' 98, Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.

[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "Activitynet: A large-scale video benchmark for human activity understanding". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 961–970.

[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the kinetics dataset". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.

[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the faster r-cnn architecture for temporal action localization". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1130–1139.

[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, 2019, pp. 8199–8206.

[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 1884–1894.


[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.

[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5793–5802.

[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 981–990.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "Imagenet: A large-scale hierarchical image database". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.

[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1933–1941.

[13] J. Gao, K. Chen, and R. Nevatia. "Ctap: Complementary temporal action proposal generation". In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 68–83.

[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "Tall: Temporal activity localization via language query". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5267–5275.

[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "Turn tap: Temporal unit regression network for temporal action proposals". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3628–3636.

[16] R. Girshick. "Fast R-CNN". In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 580–587.


[18] G. Gkioxari and J. Malik. "Finding action tubes". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.

[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6047–6056.

[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In The IEEE International Conference on Computer Vision (ICCV), 2017.

[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700–4708.

[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "Flownet 2.0: Evolution of optical flow estimation with deep networks". In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, 2017, p. 6.

[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In Proceedings of the IEEE International Conference on Computer Vision, 2013, pp. 3192–3199.

[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.

[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4405–4413.

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "Imagenet classification with deep convolutional neural networks". In Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.


[28] H. Law and J. Deng. "Cornernet: Detecting objects as paired keypoints". In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 734–750.

[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 156–165.

[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "Tvqa+: Spatio-temporal grounding for video question answering". In arXiv preprint arXiv:1904.11574 (2019).

[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 303–318.

[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In AAAI, 2020, pp. 11336–11344.

[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In Arxiv, 2019.

[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10880–10889.

[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 3889–3898.

[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In Proceedings of the 25th ACM International Conference on Multimedia, 2017, pp. 988–996.

[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "Bsn: Boundary sensitive network for temporal action proposal generation". In Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 3–19.

[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft coco: Common objects in context". In European Conference on Computer Vision, Springer, 2014, pp. 740–755.

[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 4673–4682.

[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 4856–4864.

[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In Proceedings of the 27th International Joint Conference on Artificial Intelligence, AAAI Press, 2018, pp. 849–855.

[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "Ssd: Single shot multibox detector". In European Conference on Computer Vision, Springer, 2016, pp. 21–37.

[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1950–1959.

[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3604–3613.


[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In Advances in Neural Information Processing Systems, 2019, pp. 13–23.

[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10034–10043.

[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in lstms for activity detection and early detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1942–1950.

[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1020–1028.

[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1938–1946.

[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3198–3205.

[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.

[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In European Conference on Computer Vision, Springer, 2016, pp. 744–759.

[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.


[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3d residual networks". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5533–5541.

[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 779–788.

[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster r-cnn: Towards real-time object detection with region proposal networks". In Advances in Neural Information Processing Systems, 2015, pp. 91–99.

[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10417–10427.

[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016, 2016.

[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 2556–2565.

[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5734–5743.

[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage cnns". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1049–1058.


[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In Advances in Neural Information Processing Systems 27, 2014, pp. 568–576.

[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In arXiv preprint arXiv:1409.1556 (2014).

[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1961–1970.

[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 3637–3646.

[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In arXiv preprint arXiv:1212.0402 (2012).

[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 12016–12025.

[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In International Conference on Learning Representations, 2020.

[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "Videobert: A joint model for video and language representation learning". In Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 7464–7473.

[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In The European Conference on Computer Vision (ECCV), 2018.

[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "Fishnet: A versatile backbone for image, region and pixel level prediction". In Advances in Neural Information Processing Systems, 2018, pp. 762–772.


[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.

[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).

[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.

[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.

[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.

[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.

[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer, 2016, pp. 20–36.

[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.

[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.

[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.

[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.

[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.

[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.

[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.

[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.

[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).


[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.

[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).

[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.

[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.

[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.

[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.

[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.

[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.

[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.


[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.

[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.

[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).

[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.


Declaration

I, Rui SU, declare that this thesis titled "Deep Learning for Spatial and Temporal Video Localization", which is submitted in fulfillment of the requirements for the Degree of Doctor of Philosophy, represents my own work except where due acknowledgement has been made. I further declare that it has not been previously included in a thesis, dissertation or report submitted to this University or to any other institution for a degree, diploma or other qualifications.

Signed:

Date: June 17, 2021

For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborative works. Through the useful discussions with them, I have worked out many difficult problems and am also encouraged to explore the unknown in the future.

Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.

Rui SU

University of Sydney
June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization", IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, "Improving action localization by progressive cross-stream cooperation", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016–12025, 2019.


Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
  1.1 Spatial-temporal Action Localization 2
  1.2 Temporal Action Localization 5
  1.3 Spatio-temporal Visual Grounding 8
  1.4 Thesis Outline 11

2 Literature Review 13
  2.1 Background 13
    2.1.1 Image Classification 13
    2.1.2 Object Detection 15
    2.1.3 Action Recognition 18
  2.2 Spatio-temporal Action Localization 20
  2.3 Temporal Action Localization 22
  2.4 Spatio-temporal Visual Grounding 25
  2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
  3.1 Background 30
    3.1.1 Spatial temporal localization methods 30
    3.1.2 Two-stream R-CNN 31
  3.2 Action Detection Model in Spatial Domain 32
    3.2.1 PCSC Overview 32
    3.2.2 Cross-stream Region Proposal Cooperation 34
    3.2.3 Cross-stream Feature Cooperation 36
    3.2.4 Training Details 38
  3.3 Temporal Boundary Refinement for Action Tubes 38
    3.3.1 Actionness Detector (AD) 41
    3.3.2 Action Segment Detector (ASD) 42
    3.3.3 Temporal PCSC (T-PCSC) 44
  3.4 Experimental results 46
    3.4.1 Experimental Setup 46
    3.4.2 Comparison with the State-of-the-art methods 47
      Results on the UCF-101-24 Dataset 48
      Results on the J-HMDB Dataset 50
    3.4.3 Ablation Study 52
    3.4.4 Analysis of Our PCSC at Different Stages 56
    3.4.5 Runtime Analysis 57
    3.4.6 Qualitative Results 58
  3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
  4.1 Background 62
    4.1.1 Action Recognition 62
    4.1.2 Spatio-temporal Action Localization 62
    4.1.3 Temporal Action Localization 63
  4.2 Our Approach 64
    4.2.1 Overall Framework 64
    4.2.2 Progressive Cross-Granularity Cooperation 67
    4.2.3 Anchor-frame Cooperation Module 68
    4.2.4 Training Details 70
    4.2.5 Discussions 71
  4.3 Experiment 72
    4.3.1 Datasets and Setup 72
    4.3.2 Comparison with the State-of-the-art Methods 73
    4.3.3 Ablation study 76
    4.3.4 Qualitative Results 81
  4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
  5.1 Background 84
    5.1.1 Vision-language Modelling 84
    5.1.2 Visual Grounding in Images/Videos 84
  5.2 Methodology 85
    5.2.1 Overview 85
    5.2.2 Spatial Visual Grounding 87
    5.2.3 Temporal Boundary Refinement 92
  5.3 Experiment 95
    5.3.1 Experiment Setup 95
    5.3.2 Comparison with the State-of-the-art Methods 98
    5.3.3 Ablation Study 98
  5.4 Summary 102

6 Conclusions and Future Work 103
  6.1 Conclusions 103
  6.2 Future Work 104

Bibliography 107


List of Figures

1.1 Illustration of the spatio-temporal action localization task. 2

1.2 Illustration of the temporal action localization task. 6

3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33

3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39

3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43

3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57

3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect the bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments. 66

4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset. 80

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net. 93


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5. 48

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" stands for the actionness detector and "ASD" for the action segment detector. 49

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5. 50

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds. 51

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset. 52

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" stands for the actionness detector, while "ASD" stands for the action segment detector. 52

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset. 52

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector on the UCF-101-24 dataset. 54

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector on the UCF-101-24 dataset. 55

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset. 55

3.11 Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages. 55

3.12 Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages. 55

4.1 Comparison (mAP, %) on the THUMOS14 dataset using different tIoU thresholds. 74

4.2 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75

4.3 Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75

4.4 Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds. 76

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset. 77

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset. 78

5.1 Results of different methods on the VidSTG dataset. Marked results are quoted from the original work. 99

5.2 Results of different methods on the HC-STVG dataset. Marked results are quoted from the original work. 100

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset. 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., frames that contain action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to only capture the starting and the ending time points of action instances within untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.


Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., a set of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture human actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large amount of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segment them out from the entire video clip. Data association methods based on the spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains an extremely hard challenge to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
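For intuition, the sketch below illustrates one simple form of feature-level cross-stream message passing, in which the RGB feature map absorbs a message computed from the flow feature map. It is a minimal, assumed implementation for illustration only, not the exact PCSC module described in Chapter 3; the layer layout and channel sizes are arbitrary.

    import torch
    import torch.nn as nn

    class CrossStreamMessagePassing(nn.Module):
        """Minimal sketch: refine RGB features with a message computed from flow features."""
        def __init__(self, channels: int):
            super().__init__()
            # 1x1 convolutions embed the source (flow) features into a message
            self.message = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, kernel_size=1),
            )

        def forward(self, rgb_feat: torch.Tensor, flow_feat: torch.Tensor) -> torch.Tensor:
            # The refined RGB feature keeps its own appearance information and
            # additionally absorbs motion information from the flow stream.
            return rgb_feat + self.message(flow_feat)

    # toy usage: N x C x H x W feature maps from the two streams
    rgb = torch.randn(2, 256, 14, 14)
    flow = torch.randn(2, 256, 14, 14)
    refined_rgb = CrossStreamMessagePassing(256)(rgb, flow)

The same module can be applied in the opposite direction (flow refined by RGB), which is what makes an iterative, stage-wise cooperation scheme possible.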

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and therefore can learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.

Our contributions are briefly summarized as follows:

bull For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

bull We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

bull Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to the object detection methods like [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate the proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.


Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered as complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.
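To make the frame-based scheme concrete, the snippet below shows one common way to turn per-frame actionness scores into candidate segments by thresholding and grouping consecutive frames. The threshold and minimum length are illustrative assumptions, not the exact grouping rule used by the cited methods or in Chapter 4.

    def actionness_to_segments(scores, threshold=0.5, min_len=3):
        """Group consecutive frames whose actionness exceeds `threshold` into
        (start, end) segment proposals; drop segments shorter than `min_len`."""
        segments, start = [], None
        for t, s in enumerate(scores):
            if s >= threshold and start is None:
                start = t                      # a candidate segment begins
            elif s < threshold and start is not None:
                if t - start >= min_len:
                    segments.append((start, t - 1))
                start = None
        if start is not None and len(scores) - start >= min_len:
            segments.append((start, len(scores) - 1))
        return segments

    # toy example: a short run of per-frame actionness scores
    print(actionness_to_segments([0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.7, 0.7, 0.8]))
    # -> [(1, 3), (5, 8)]

Such segments tend to hug the true boundaries tightly but can miss instances whose scores dip below the threshold, which is exactly the weakness the anchor-based proposals compensate for.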

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlapping areas with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework, named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in helping humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to be well generalized to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, for which significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [1, 41], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.
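For intuition only, the sketch below shows the kind of cross-attention exchange that a ViLBERT-style co-attention layer performs, where the visual stream attends to the textual tokens and vice versa. The dimensions, residual wiring and missing normalization/feed-forward sub-layers are simplifying assumptions; this is not the actual ST-ViLBERT design introduced in Chapter 5.

    import torch
    import torch.nn as nn

    class CoAttentionLayer(nn.Module):
        """Minimal two-stream co-attention: each stream uses its own queries and the
        other stream's keys/values, in the spirit of ViLBERT-style cross-modal layers."""
        def __init__(self, dim=768, heads=8):
            super().__init__()
            self.vis_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.txt_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, vis_tokens, txt_tokens):
            # visual queries attend over textual keys/values, and vice versa
            vis_out, _ = self.vis_attn(vis_tokens, txt_tokens, txt_tokens)
            txt_out, _ = self.txt_attn(txt_tokens, vis_tokens, vis_tokens)
            return vis_tokens + vis_out, txt_tokens + txt_out

    vis = torch.randn(1, 50, 768)   # e.g. 50 visual tokens from a video clip
    txt = torch.randn(1, 12, 768)   # e.g. 12 word tokens from the query
    v, t = CoAttentionLayer()(vis, txt)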

Our contributions can be summarized as follows:

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential directions of video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers, with nonlinear activation functions in between each layer, to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout are introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work has activated the research interest in using deep learning methods to solve computer vision problems.
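As a minimal illustration of the ingredients described above (convolutional layers, ReLU activations, dropout, and gradient-descent training via back-propagation), the following PyTorch sketch defines a small toy classifier and runs one training step. It is an assumed toy example, not the original AlexNet configuration.

    import torch
    import torch.nn as nn

    # A toy convolutional classifier with ReLU activations and dropout.
    model = nn.Sequential(
        nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Dropout(p=0.5),                    # regularization against overfitting
        nn.Linear(64 * 8 * 8, 10),            # 10 image classes
    )

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    images = torch.randn(8, 3, 32, 32)        # a dummy mini-batch of 32x32 images
    labels = torch.randint(0, 10, (8,))

    loss = criterion(model(images), labels)   # classification loss for the batch
    optimizer.zero_grad()
    loss.backward()                           # back-propagate the gradients
    optimizer.step()                          # gradient-descent weight update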

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network, named VGG-Net, to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
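A quick back-of-the-envelope check of this parameter-count claim (ignoring biases, and assuming C input and C output channels for every layer):

    C = 256
    # one 5x5 layer vs. a stack of two 3x3 layers (same 5x5 effective receptive field)
    print(5 * 5 * C * C, "vs", 2 * 3 * 3 * C * C)      # 25*C^2 vs 18*C^2 weights
    # one 7x7 layer vs. a stack of three 3x3 layers (same 7x7 effective receptive field)
    print(7 * 7 * C * C, "vs", 3 * 3 * 3 * C * C)      # 49*C^2 vs 27*C^2 weights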

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network, named Inception, for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
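A simplified Inception-style block is sketched below: parallel branches with different kernel sizes whose outputs are concatenated along the channel dimension. The channel counts are arbitrary assumptions, and the real Inception modules additionally use 1×1 reduction layers before the larger convolutions.

    import torch
    import torch.nn as nn

    class InceptionLikeBlock(nn.Module):
        """Parallel branches with different kernel sizes; outputs are concatenated."""
        def __init__(self, in_ch):
            super().__init__()
            self.b1 = nn.Conv2d(in_ch, 32, kernel_size=1)
            self.b3 = nn.Conv2d(in_ch, 32, kernel_size=3, padding=1)
            self.b5 = nn.Conv2d(in_ch, 32, kernel_size=5, padding=2)
            self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                      nn.Conv2d(in_ch, 32, kernel_size=1))

        def forward(self, x):
            # every branch preserves the spatial size, so the outputs can be concatenated
            return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

    y = InceptionLikeBlock(64)(torch.randn(1, 64, 28, 28))   # -> 1 x 128 x 28 x 28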

VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
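The sketch below shows the core idea of such a shortcut connection: the block's input is added back to the output of a few convolutions, so the identity path lets gradients bypass the convolutions. It is a simplified residual block; the actual ResNet blocks also include batch normalization and, for dimension changes, projection shortcuts.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Basic residual block: the input is added back to the output of two
        convolutions, so gradients can flow through the identity shortcut."""
        def __init__(self, ch):
            super().__init__()
            self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
            self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.conv2(self.relu(self.conv1(x)))
            return self.relu(out + x)          # shortcut connection

    y = ResidualBlock(64)(torch.randn(1, 64, 56, 56))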

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections with the other blocks. Specifically, each layer in a dense block takes the combination of the outputs from all the previous layers as its input, so that the features from earlier layers are able to be reused in the following layers. Note that the network becomes wider as it goes deeper.
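A simplified DenseNet-style block is sketched below to illustrate this dense connectivity pattern: every layer receives the concatenation of all preceding feature maps. The growth rate and layer count are assumptions, and the real design also uses BN/ReLU/1×1 bottleneck layers and transition layers between blocks.

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        """Each layer receives the concatenation of all preceding feature maps,
        so earlier features are reused by later layers."""
        def __init__(self, in_ch, growth=32, num_layers=4):
            super().__init__()
            self.layers = nn.ModuleList([
                nn.Conv2d(in_ch + i * growth, growth, kernel_size=3, padding=1)
                for i in range(num_layers)
            ])

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                out = torch.relu(layer(torch.cat(features, dim=1)))
                features.append(out)           # make this output available to later layers
            return torch.cat(features, dim=1)

    y = DenseBlock(64)(torch.randn(1, 64, 28, 28))   # -> 1 x (64 + 4*32) x 28 x 28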

2.1.2 Object Detection

The object detection task aims to detect the locations and categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image and obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference time is largely reduced.

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors at each location of the generated feature map was used to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the features pooled by the ROI pooling operation on the generated feature map.

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S×S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require the time-consuming region feature extraction based on the ROI pooling operation.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
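
A minimal decoding sketch in the spirit of this anchor-free, center-point formulation (PyTorch; the tensor shapes, the pooling-based peak selection and the top-k value are illustrative assumptions, not the exact procedure of [104]):

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, topk=10):
    # heatmap: (C, H, W) per-class center-point probabilities; wh: (2, H, W) box sizes
    c, h, w = heatmap.shape
    pooled = F.max_pool2d(heatmap[None], kernel_size=3, stride=1, padding=1)[0]
    peaks = heatmap * (pooled == heatmap)            # keep local maxima only
    scores, idx = peaks.flatten().topk(topk)
    cls = torch.div(idx, h * w, rounding_mode="floor")
    ys = torch.div(idx % (h * w), w, rounding_mode="floor")
    xs = idx % w
    bw, bh = wh[0, ys, xs], wh[1, ys, xs]
    boxes = torch.stack([xs - bw / 2, ys - bh / 2, xs + bw / 2, ys + bh / 2], dim=1)
    return cls, scores, boxes

cls, scores, boxes = decode_centers(torch.rand(80, 128, 128), torch.rand(2, 128, 128) * 20)
```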


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also been raised. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the action recognition performance, which demonstrates that appearance and motion information are complementary to each other.
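
A minimal late-fusion sketch of the two-stream idea (PyTorch; the stand-in backbones, the class count and the number of stacked flow maps are assumptions): each stream predicts class probabilities and the two predictions are averaged.

```python
import torch
import torch.nn as nn

def make_stream(in_channels, num_classes=101):
    # a stand-in backbone: one convolution, global average pooling, linear classifier
    return nn.Sequential(
        nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3),
        nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(64, num_classes),
    )

spatial_net = make_stream(3)        # takes a single RGB frame
temporal_net = make_stream(2 * 10)  # takes 10 stacked optical-flow maps (x and y channels)

rgb = torch.randn(1, 3, 224, 224)
flow = torch.randn(1, 20, 224, 224)
probs = (spatial_net(rgb).softmax(1) + temporal_net(flow).softmax(1)) / 2  # late fusion
predicted_class = probs.argmax(dim=1)
```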

In the work of [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks (TSN) to improve the action recognition performance. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as the input to the two-stream network, and then a consensus function was applied to the outputs of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.
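
A minimal sketch of the segment-based sampling and consensus described above (plain Python; the recognizer is a placeholder callable and averaging is only one possible consensus function):

```python
import random

def tsn_predict(num_frames, num_segments, score_fn):
    # divide the video into equal segments and sample one frame per segment
    seg_len = num_frames // num_segments
    sampled = [random.randrange(i * seg_len, (i + 1) * seg_len) for i in range(num_segments)]
    scores = [score_fn(t) for t in sampled]                  # per-frame class scores
    return [sum(s) / len(scores) for s in zip(*scores)]      # average consensus

# e.g. tsn_predict(250, 3, score_fn=lambda t: [0.2, 0.8]) averages three snippet predictions
```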

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be incorporated. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.

As computing optical flow from consecutive RGB images is very time-consuming, it is hard to achieve real-time performance for the two-stream based spatio-temporal action localization methods. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes forming a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than the action tubes linked from frame-level detections, because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization, named STEP [89], was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution features, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. They then used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.
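
For reference, a minimal sketch of the multi-scale sliding-window proposal generation used by these early pipelines (plain Python; the window lengths and overlap ratio are illustrative assumptions):

```python
def sliding_window_proposals(num_frames, window_lengths=(16, 32, 64, 128), overlap=0.5):
    proposals = []
    for length in window_lengths:
        stride = max(1, int(length * (1 - overlap)))         # 50% overlap between windows
        for start in range(0, max(1, num_frames - length + 1), stride):
            proposals.append((start, start + length))        # (start frame, end frame)
    return proposals

# e.g. sliding_window_proposals(300) yields candidate segments at four temporal scales
```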

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolutional layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC can have the same length as the input in the temporal domain, so that the proposed framework was able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were then used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared to other state-of-the-art methods.
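
A minimal sketch of grouping consecutive high-actionness snippets into segment proposals, in the spirit of the actionness-grouping idea described above (plain Python; the threshold is an assumption):

```python
def group_snippets(actionness_scores, thresh=0.5):
    proposals, start = [], None
    for t, score in enumerate(actionness_scores):
        if score >= thresh and start is None:
            start = t                                        # a proposal begins
        elif score < thresh and start is not None:
            proposals.append((start, t))                     # the proposal ends
            start = None
    if start is not None:
        proposals.append((start, len(actionness_scores)))
    return proposals

# e.g. group_snippets([0.1, 0.7, 0.8, 0.9, 0.2, 0.6, 0.7]) -> [(1, 4), (5, 7)]
```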

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict the action classes of the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated by the pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision and language task. The goal of spatio-temporal visual grounding is to localize the target objects from the given videos by using the textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were then used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames and generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision and language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3 we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on the observation of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework can remove the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from the frames in the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region (resp. temporal segment) proposals and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatial-temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A great amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].

Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack the neighbouring frames near one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization aims to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, using cues such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracies, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.

Finally, several methods have also been proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve the action detection performance. For example, in [60], the softmax score of each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.

Please note that most appearance and motion fusion approaches in the existing works discussed above are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level, in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation.


Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).


Specifically, we first extract RGB/flow features from the ROIs corresponding to the refined region proposals and then refine these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate the candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN of the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
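
A schematic sketch of this proposal-level cooperation loop (plain Python; the `detect` callable, which stands for the detection-head mapping $G(\cdot)$, the list-based union and the number of stages are placeholders rather than the actual implementation):

```python
def pcsc_proposal_cooperation(rpn_boxes, detect, num_stages=4):
    # rpn_boxes: {"rgb": [...], "flow": [...]}, the initial RPN proposals P_0 of each stream
    latest = {s: detect(s, rpn_boxes[s]) for s in ("rgb", "flow")}   # B_0 for each stream
    for t in range(1, num_stages + 1):
        cur, other = ("rgb", "flow") if t % 2 else ("flow", "rgb")   # streams alternate
        # P_t of the current stream combines the most recent boxes of both streams,
        # i.e. B_{t-2} of its own stream and B_{t-1} of the other stream
        proposals_t = latest[cur] + latest[other]
        latest[cur] = detect(cur, proposals_t)                       # B_t = G(P_t)
    return latest
```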

With our approach, the diversity of the region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views of the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by using the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $C^i_2, C^i_3, C^i_4, C^i_5$, where $i \in \{RGB, Flow\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $U^i_2, U^i_3, U^i_4, U^i_5$.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information from one stream to another and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge these two streams so that they help each other for feature refinement.

We pass the messages between the feature maps in the bottom-up pathways of the two streams. Denote by $l$ the index for the set of feature maps $\{C^i_2, C^i_3, C^i_4, C^i_5\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$\hat{C}^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module, and $\hat{C}^{RGB}_l$ denotes the improved RGB feature map. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$, and the extracted message is then used to improve the features $C^{RGB}_l$.

The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and resolution as $C^{RGB}_l$ and $C^{Flow}_l$. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{RGB}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_l$ in the top-down pathway are generated accordingly.

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.
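
A minimal sketch of such a message-passing module in the spirit of Eq. (3.1) (PyTorch; the channel count and the channel-reduction ratio are assumptions): two 1×1 convolutions with a ReLU in between extract a message from the flow feature map, which is then added element-wise to the RGB feature map.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # reduce channels
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # restore channels
        )

    def forward(self, c_rgb, c_flow):
        return self.f(c_flow) + c_rgb   # improved RGB feature map, as in Eq. (3.1)

mp = MessagePassing(256)
improved_rgb = mp(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```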


Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the features $F^{RGB}_t$ and $F^{Flow}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{Flow}_t$ of the flow stream is used to help improve the ROI feature $F^{RGB}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{RGB}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{RGB}_t$ of the RGB stream is also used to help improve the ROI feature $F^{Flow}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve the action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated from the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
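
A minimal sketch of this IoU-based label assignment (plain Python; the (x1, y1, x2, y2) box format and the helper names are assumptions):

```python
def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-8)

def assign_labels(proposals, gt_boxes, thresh=0.5):
    # positive if the IoU with any ground-truth box reaches the threshold, negative otherwise
    return [1 if any(iou(p, g) >= thresh for g in gt_boxes) else 0 for p in proposals]
```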

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling.


Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{RGB, Flow\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.


Although this linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7×7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These "confusing samples" are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of an action. The output of the actionness classifier is the probability of the actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and thus we can refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundaries to obtain better action tubes at the testing stage.
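
A minimal sketch of this test-time refinement (Python with NumPy/SciPy; the median-filter kernel size and the score threshold are assumptions, not values reported in this thesis):

```python
import numpy as np
from scipy.signal import medfilt

def refine_tube(tube_boxes, actionness_scores, kernel_size=9, thresh=0.4):
    # smooth the per-box actionness scores along the tube, then drop low-scoring boxes
    smoothed = medfilt(np.asarray(actionness_scores, dtype=np.float32), kernel_size)
    return [box for box, s in zip(tube_boxes, smoothed) if s >= thresh]
```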

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the framework in [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.

(a) SPN. In Faster R-CNN, the anchors for different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features.


Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

but different dilation rate is employed to generate features with differentreceptive fields and decide the temporal boundary In this way our SPNconsists of K sub-networks targeting at different time lengths In orderto consider the context information for each anchor with the scale s werequire the receptive field to cover the context information with tempo-ral span s2 before and after the start and the end of the anchor We thenregress temporal boundaries and predict actionness probabilities for theanchors by applying two parallel temporal convolutional layers with thekernel size of 1 The output segment proposals are ranked according totheir proposal actionness probabilities and NMS is applied to remove theredundant proposals Please refer to [5] for more details
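The multi-branch SPN described above can be sketched as follows; the channel width, the number of convolution layers per branch, and the concrete dilation rate per anchor scale are assumptions, since only the overall design (one sub-network per anchor scale, dilated temporal convolutions, and two parallel kernel-size-1 prediction layers) is specified here.

import torch.nn as nn

class AnchorBranch(nn.Module):
    # Sub-network for one anchor scale: dilated temporal convolutions enlarge
    # the receptive field to cover the anchor plus roughly s/2 context on each
    # side, followed by two parallel kernel-size-1 convolutions for boundary
    # regression and actionness prediction.
    def __init__(self, channels, dilation):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=3,
                      padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True))
        self.boundary_reg = nn.Conv1d(channels, 2, kernel_size=1)  # start/end offsets
        self.actionness = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):               # x: (B, C, T) tube features
        h = self.features(x)
        return self.boundary_reg(h), self.actionness(h)

class SegmentProposalNetwork(nn.Module):
    # K anchor scales -> K sub-networks with scale-specific dilation rates.
    def __init__(self, channels, dilations):
        super().__init__()
        self.branches = nn.ModuleList(
            [AnchorBranch(channels, d) for d in dilations])

    def forward(self, x):
        return [branch(x) for branch in self.branches]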

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D\times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
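The construction of $F_{seg}$ can be sketched as below, where the temporal ROI-pooling is approximated with adaptive max pooling over the corresponding temporal regions; the helper names, the use of max pooling, and the clamping at tube boundaries are assumptions.

import torch
import torch.nn.functional as F

def temporal_roi_pool(tube_feats, t0, t1, out_len):
    # tube_feats: (C, T) float tensor of action-tube features;
    # pool the frames in [t0, t1) into out_len temporal bins.
    T = tube_feats.shape[1]
    t0, t1 = max(int(t0), 0), min(int(t1), T)
    region = tube_feats[:, t0:max(t1, t0 + 1)]
    return F.adaptive_max_pool1d(region.unsqueeze(0), out_len).squeeze(0)

def extract_segment_feature(tube_feats, t_start, t_end, out_len):
    # F_center over the proposal itself, plus F_start / F_end over segments of
    # length 2*l_s/5 centered at the start and end points, concatenated in time.
    l_s = t_end - t_start
    half_ctx = l_s / 5.0
    f_center = temporal_roi_pool(tube_feats, t_start, t_end, out_len)
    f_start = temporal_roi_pool(tube_feats, t_start - half_ctx, t_start + half_ctx, out_len)
    f_end = temporal_roi_pool(tube_feats, t_end - half_ctx, t_end + half_ctx, out_len)
    return torch.cat([f_start, f_center, f_end], dim=1)   # (C, 3*out_len)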

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting actionness for each frame, which eventually helps refine the boundaries of segments.

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of a feature cooperation module and a proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
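As a rough illustration of the feature cooperation module used in T-PCSC, the sketch below passes a message from one stream's segment features to the other's through two temporal convolutions with kernel size 1, as described above; the residual-style addition used to merge the message into the target stream and the layer width are assumptions.

import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    # The two 1x1 convolutions of the spatial PCSC message-passing block become
    # two temporal convolutions with kernel size 1 in T-PCSC.
    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1))

    def forward(self, target_feats, source_feats):
        # target_feats, source_feats: (B, C, T) segment features of two streams
        return target_feats + self.transform(source_feats)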


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and for each class we average the IoU scores of the bounding boxes with the largest overlap with each ground-truth box. The MABO is then calculated by averaging these per-class average IoU scores over all classes.
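A simple way to compute MABO as defined above is sketched below; the per-class data layout and the IoU function are assumed to be supplied by the surrounding evaluation code.

import numpy as np

def mean_average_best_overlap(gt_boxes_per_class, det_boxes_per_class, iou_fn):
    # For every ground-truth box, take the best IoU over all detected boxes,
    # average these best IoUs within each class, then average over classes.
    per_class_abo = []
    for cls, gt_boxes in gt_boxes_per_class.items():
        if not gt_boxes:
            continue
        dets = det_boxes_per_class.get(cls, [])
        best = [max((iou_fn(gt, d) for d in dets), default=0.0) for gt in gt_boxes]
        per_class_abo.append(np.mean(best))
    return float(np.mean(per_class_abo)) if per_class_abo else 0.0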


To evaluate the classification performance of our method, we report the average confidence score under different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops 10% at the 5th epoch and another 10% at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths $l_t$ for the length ranges are set to 15, 30, 60, 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops 10% at the 4th epoch and another 10% at the 5th epoch.
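For reference, the hyper-parameters listed above for the temporal boundary refinement method can be collected into a single configuration dictionary; the key names are illustrative and are not taken from any released code.

TEMPORAL_REFINEMENT_CONFIG = {
    "anchor_scales": [6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228],          # K = 11
    "length_ranges": [(20, 80), (80, 160), (160, 320), (320, float("inf"))],  # M = 4
    "feature_lengths": [15, 30, 60, 120],   # l_t for each length range
    "nms_threshold": 0.5,
    "median_filter_window": 17,
    "actionness_threshold": 0.01,
    "spn_batch_size": 128,
    "detection_head_batch_size": 32,
    "epochs": 5,
    "initial_lr": 1e-3,
}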

3.4.2 Comparison with the State-of-the-art Methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features as well as the result obtained by applying NMS to the union set of the two-stream bounding boxes. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method | Streams | mAP (%)
Weinzaepfel, Harchaoui and Schmid [83] | RGB+Flow | 35.8
Peng and Schmid [54] | RGB+Flow | 65.7
Ye, Yang and Tian [94] | RGB+Flow | 67.0
Kalogeiton et al. [26] | RGB+Flow | 69.5
Gu et al. [19] | RGB+Flow | 76.3
Faster R-CNN + FPN (RGB) [38] | RGB | 73.2
Faster R-CNN + FPN (Flow) [38] | Flow | 64.7
Faster R-CNN + FPN [38] | RGB+Flow | 75.5
PCSC (Ours) | RGB+Flow | 79.2

Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for per-frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector.

Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | 0.5:0.95
Weinzaepfel et al. [83] | RGB+Flow | 46.8 | - | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 47.6 | - | - | -
Peng and Schmid [54] | RGB+Flow | 73.5 | 32.1 | 2.7 | 7.3
Saha et al. [60] | RGB+Flow | 66.6 | 36.4 | 0.8 | 14.4
Yang, Gao and Nevatia [93] | RGB+Flow | 73.5 | 37.8 | - | -
Singh et al. [67] | RGB+Flow | 73.5 | 46.3 | 15.0 | 20.4
Ye, Yang and Tian [94] | RGB+Flow | 76.2 | - | - | -
Kalogeiton et al. [26] | RGB+Flow | 76.5 | 49.2 | 19.7 | 23.4
Li et al. [31] | RGB+Flow | 77.9 | - | - | -
Gu et al. [19] | RGB+Flow | - | 59.9 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 77.5 | 51.3 | 14.2 | 21.9
Faster R-CNN + FPN (Flow) [38] | Flow | 72.7 | 44.2 | 7.4 | 17.2
Faster R-CNN + FPN [38] | RGB+Flow | 80.1 | 53.2 | 15.9 | 23.7
PCSC + AD [69] | RGB+Flow | 84.3 | 61.0 | 23.0 | 27.8
PCSC + ASD + AD + T-PCSC (Ours) | RGB+Flow | 84.7 | 63.1 | 23.6 | 28.9

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated by our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with the other state-of-the-art methods. These results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method | Streams | mAP (%)
Peng and Schmid [54] | RGB+Flow | 58.5
Ye, Yang and Tian [94] | RGB+Flow | 63.2
Kalogeiton et al. [26] | RGB+Flow | 65.7
Hou, Chen and Shah [21] | RGB | 61.3
Gu et al. [19] | RGB+Flow | 73.3
Sun et al. [72] | RGB+Flow | 77.9
Faster R-CNN + FPN (RGB) [38] | RGB | 64.7
Faster R-CNN + FPN (Flow) [38] | Flow | 68.4
Faster R-CNN + FPN [38] | RGB+Flow | 70.2
PCSC (Ours) | RGB+Flow | 74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method | Streams | δ=0.2 | δ=0.5 | δ=0.75 | 0.5:0.95
Gkioxari and Malik [18] | RGB+Flow | - | 53.3 | - | -
Wang et al. [79] | RGB+Flow | - | 56.4 | - | -
Weinzaepfel et al. [83] | RGB+Flow | 63.1 | 60.7 | - | -
Saha et al. [60] | RGB+Flow | 72.6 | 71.5 | 43.3 | 40.0
Peng and Schmid [54] | RGB+Flow | 74.1 | 73.1 | - | -
Singh et al. [67] | RGB+Flow | 73.8 | 72.0 | 44.5 | 41.6
Kalogeiton et al. [26] | RGB+Flow | 74.2 | 73.7 | 52.1 | 44.8
Ye, Yang and Tian [94] | RGB+Flow | 75.8 | 73.8 | - | -
Zolfaghari et al. [105] | RGB+Flow+Pose | 78.2 | 73.4 | - | -
Hou, Chen and Shah [21] | RGB | 78.4 | 76.9 | - | -
Gu et al. [19] | RGB+Flow | - | 78.6 | - | -
Sun et al. [72] | RGB+Flow | - | 80.1 | - | -
Faster R-CNN + FPN (RGB) [38] | RGB | 74.5 | 73.9 | 54.1 | 44.2
Faster R-CNN + FPN (Flow) [38] | Flow | 77.9 | 76.1 | 56.1 | 46.3
Faster R-CNN + FPN [38] | RGB+Flow | 79.1 | 78.5 | 57.2 | 47.6
PCSC (Ours) | RGB+Flow | 82.6 | 82.2 | 63.1 | 52.8

At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features than the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage | PCSC w/o feature cooperation | PCSC
0 | 75.5 | 78.1
1 | 76.1 | 78.6
2 | 76.4 | 78.8
3 | 76.7 | 79.1
4 | 76.7 | 79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector, while "ASD" is for the action segment detector.

Method | Video mAP (%)
Faster R-CNN + FPN [38] | 53.2
PCSC | 56.4
PCSC + AD | 61.0
PCSC + ASD (single branch) | 58.4
PCSC + ASD | 60.7
PCSC + ASD + AD | 61.9
PCSC + ASD + AD + T-PCSC | 63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage | Video mAP (%)
0 | 61.9
1 | 62.8
2 | 63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, such improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.

Action tubes refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with a single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC is already able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage | LoR (%)
1 | 61.1
2 | 71.9
3 | 95.2
4 | 95.4

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC + ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.

Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage | MABO (%)
1 | 84.1
2 | 84.3
3 | 84.5
4 | 84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ | Stage 1 | Stage 2 | Stage 3 | Stage 4
0.5 | 55.3 | 55.7 | 56.3 | 57.0
0.6 | 62.1 | 63.2 | 64.3 | 65.5
0.7 | 68.0 | 68.5 | 69.4 | 69.5
0.8 | 73.9 | 74.5 | 74.8 | 75.1
0.9 | 78.4 | 78.4 | 79.5 | 80.1

Table 3.11: Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2 | 3 | 4
Detection speed (FPS) | 31 | 29 | 28 | 26 | 24

Table 3.12: Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage | 0 | 1 | 2
Detection speed (VPS) | 21 | 17 | 15


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity between the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
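The LoR statistic described above can be computed in a few lines; the proposal representation and the IoU function are assumed to be provided by the detection pipeline.

def large_overlap_ratio(rgb_proposals, flow_proposals, iou_fn, iou_thresh=0.7):
    # Fraction of RGB-stream proposals whose maximum IoU (mIoU) with any
    # flow-stream proposal is larger than the threshold.
    if not rgb_proposals:
        return 0.0
    hits = 0
    for p in rgb_proposals:
        m_iou = max((iou_fn(p, q) for q in flow_proposals), default=0.0)
        if m_iou > iou_thresh:
            hits += 1
    return hits / len(rgb_proposals)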

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method at each stage. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.


Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods decreases slightly as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.



Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c); however, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame, although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases in which our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance can be reduced by the temporal refinement process of our TBR module, the instance cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., the two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections in background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, as well as within each AFC module and across adjacent AFC modules. Specifically, the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the video analysis community. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined these two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information in videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods simply employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize actions for each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.

4.2 Our Approach

In the following, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H\times W\times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of the detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. It is based on the base features $B^{RGB}$ and $B^{Flow}$, and each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p_i^{RGB} \,|\, p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p_i^{Flow} \,|\, p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$, $S_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1\times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
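A minimal sketch of this pairing step is shown below; the confidence threshold, the duration constraints, and the product scoring rule are assumptions, since the text only specifies that highly confident starting and ending indices are paired into valid durations and the redundant pairs are removed by NMS afterwards.

def frame_based_proposals(start_prob, end_prob, score_thresh=0.5,
                          min_len=1, max_len=None):
    # start_prob / end_prob: per-position starting and ending probabilities.
    T = len(start_prob)
    max_len = max_len or T
    starts = [t for t in range(T) if start_prob[t] >= score_thresh]
    ends = [t for t in range(T) if end_prob[t] >= score_thresh]
    proposals = []
    for s in starts:
        for e in ends:
            if min_len <= e - s <= max_len:
                proposals.append((s, e, start_prob[s] * end_prob[e]))
    return proposals   # NMS is applied to these candidates afterwards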

Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at stage n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better presentation, when n = 1 the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously across the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and across the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage n (n is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $S_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages; thus, the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-frame Cooperation Module

The Anchor-frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $S_0^{RGB}$, $S_0^{Flow}$, as the input, as shown in Sec. 4.2.1 and Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; its network structure is shown in Fig. 4.1(b). For feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as

$$Q_n^{Flow} = \mathcal{M}_{anchor\rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor\rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1\times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, which leads to better frame-based proposals $Q_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor\rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ also leads to better frame-based RGB-stream proposals $Q_n^{RGB}$.
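A PyTorch-style sketch of the message passing function in Eq. (4.1) is given below; the channel width and the placement of the second ReLU are assumptions, since the text only specifies two stacked kernel-size-1 convolutions with ReLU activations followed by element-wise addition.

import torch.nn as nn

class CrossGranularityMessagePassing(nn.Module):
    # M_{anchor->frame}: two stacked kernel-size-1 temporal convolutions with
    # ReLU, whose output is added element-wise to the frame-based features.
    def __init__(self, channels):
        super().__init__()
        self.message = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True))

    def forward(self, anchor_feats, frame_feats):
        # anchor_feats: (B, C, T) flow-stream anchor-based features
        # frame_feats:  (B, C, T) flow-stream frame-based features
        return self.message(anchor_feats) + frame_feats   # improved frame features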

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $Q_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1, based on the improved frame-based features $Q_n^{Flow}$. We then combine $Q_n^{Flow}$ together with the two-stream anchor-based proposals $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage n, i.e., $P_n^{RGB} = S_{n-2}^{RGB} \cup S_{n-1}^{Flow} \cup Q_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $P_n^{Flow} = S_{n-2}^{Flow} \cup S_{n-1}^{RGB} \cup Q_n^{RGB}$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to equation (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{flow\rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{rgb\rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] on the updated anchor-based features $P_n^{RGB}$, $P_n^{Flow}$ with respect to the updated proposals $P_n^{RGB}$, $P_n^{Flow}$. To increase the perceptual aperture around the action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work) and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.
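As a rough illustration of the boundary-aware pooling described above, the sketch below pools a proposal's features by average pooling over the proposal segment and over two windows around its starting and ending indices, then concatenates the three vectors. Average pooling, the function name and the interval handling are assumptions made for illustration; the actual SOI pooling follows [5].

```python
import torch

def soi_pool_with_boundaries(features, start, end, boundary_interval=10):
    """Pool features for one proposal plus its boundary context.

    features: (channels, temporal_length) tensor of anchor-based features.
    start, end: integer frame indices of the proposal.
    Returns a (3 * channels,) vector: segment, start-context, end-context.
    """
    t = features.shape[1]

    def avg(lo, hi):
        lo, hi = max(0, lo), min(t, hi)
        hi = max(hi, lo + 1)  # guard against empty windows at the video borders
        return features[:, lo:hi].mean(dim=1)

    segment = avg(start, end + 1)
    start_ctx = avg(start - boundary_interval, start + boundary_interval)
    end_ctx = avg(end - boundary_interval, end + boundary_interval)
    return torch.cat([segment, start_ctx, end_ctx], dim=0)

pooled = soi_pool_with_boundaries(torch.randn(256, 200), start=50, end=90)  # shape (768,)
```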

4.2.4 Training Details

At each AFC stage, the training losses are applied to both the anchor-based and frame-based branches. Specifically, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. The training loss in each frame-based branch is a cross-entropy loss over the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
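A minimal sketch of how the per-stage losses described above could be combined is given below, using standard PyTorch loss functions. The tensor names and label formats are hypothetical, and the ground-truth assignment strategy of [46] is omitted.

```python
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels, reg_pred, reg_target,
                   start_logits, start_labels, end_logits, end_labels):
    """Sum of the anchor-based and frame-based losses at one AFC stage (sketch)."""
    # Anchor-based branch: action classification + boundary regression
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_target)
    # Frame-based branch: frame-wise starting/ending classification
    loss_start = F.cross_entropy(start_logits, start_labels)
    loss_end = F.cross_entropy(end_logits, end_labels)
    return loss_cls + loss_reg + loss_start + loss_end
```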

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant: it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and higher recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by concatenating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC module. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, which significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase as the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. The number of proposals does not decrease either, as the proposed proposal cooperation strategy continues to absorb a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1,010 and 1,574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into the training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotations of the testing set are not publicly available.

- UCF-101-24 consists of 3,207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and to 64 for the detection head. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for the ActivityNet v1.3 dataset. During training, we divide the learning rate by 10 every 12 epochs.

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
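For reference, the temporal IoU used to match a detected segment against a ground-truth segment can be computed as in the short sketch below (a straightforward implementation written for illustration, not taken from the thesis code).

```python
def temporal_iou(seg_a, seg_b):
    """Temporal intersection-over-union between two segments given as (start, end)."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

print(temporal_iou((10.0, 20.0), (15.0, 30.0)))  # 0.25
```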

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues.


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

| tIoU threshold δ | 0.1 | 0.2 | 0.3 | 0.4 | 0.5 |
| SCNN [63] | 47.7 | 43.5 | 36.3 | 28.7 | 19.0 |
| REINFORCE [95] | 48.9 | 44.0 | 36.0 | 26.4 | 17.1 |
| CDC [62] | - | - | 40.1 | 29.4 | 23.3 |
| SMS [98] | 51.0 | 45.2 | 36.5 | 27.8 | 17.8 |
| TURN [15] | 54.0 | 50.9 | 44.1 | 34.9 | 25.6 |
| R-C3D [85] | 54.5 | 51.5 | 44.8 | 35.6 | 28.9 |
| SSN [103] | 66.0 | 59.4 | 51.9 | 41.0 | 29.8 |
| BSN [37] | - | - | 53.5 | 45.0 | 36.9 |
| MGG [46] | - | - | 53.9 | 46.8 | 37.4 |
| BMN [35] | - | - | 56.0 | 47.4 | 38.8 |
| TAL-Net [5] | 59.8 | 57.1 | 53.2 | 48.5 | 42.8 |
| P-GCN [99] | 69.5 | 67.8 | 63.6 | 57.8 | 49.1 |
| ours | 71.2 | 68.9 | 65.1 | 59.5 | 51.2 |

When using the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two paradigms (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at the tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively.


Table 4.2: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

| tIoU threshold δ | 0.5 | 0.75 | 0.95 | Average |
| CDC [62] | 45.30 | 26.00 | 0.20 | 23.80 |
| TCN [9] | 36.44 | 21.15 | 3.90 | - |
| R-C3D [85] | 26.80 | - | - | - |
| SSN [103] | 39.12 | 23.48 | 5.49 | 23.98 |
| TAL-Net [5] | 38.23 | 18.30 | 1.30 | 20.22 |
| P-GCN [99] | 42.90 | 28.14 | 2.47 | 26.99 |
| ours | 44.31 | 29.85 | 5.47 | 28.85 |

Table 4.3: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

| tIoU threshold δ | 0.5 | 0.75 | 0.95 | Average |
| BSN [37] | 46.45 | 29.96 | 8.02 | 30.03 |
| P-GCN [99] | 48.26 | 33.16 | 3.27 | 31.11 |
| BMN [35] | 50.07 | 34.78 | 8.29 | 33.85 |
| ours | 52.04 | 35.92 | 7.97 | 34.91 |

In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAP %) on the UCF-101-24 dataset using different st-IoU thresholds.

| st-IoU threshold δ | 0.2 | 0.5 | 0.75 | 0.5:0.95 |
| LTT [83] | 46.8 | - | - | - |
| MRTSR [54] | 73.5 | 32.1 | 2.7 | 7.3 |
| MSAT [60] | 66.6 | 36.4 | 7.9 | 14.4 |
| ROAD [67] | 73.5 | 46.3 | 15.0 | 20.4 |
| ACT [26] | 76.5 | 49.2 | 19.7 | 23.4 |
| TAD [19] | - | 59.9 | - | - |
| PCSC [69] | 84.3 | 61.0 | 23.0 | 27.8 |
| ours | 84.9 | 63.5 | 23.7 | 29.2 |

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which case $Q_n^{Flow}$ and $Q_n^{RGB}$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, and the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, in which case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module; they differ only in the input anchor-based features to the "Cross-granularity Message Passing" block and "Detection Head".


Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

| Stage | 0 | 1 | 2 | 3 | 4 |
| AFC | 44.75 | 46.92 | 48.14 | 48.37 | 48.22 |
| w/o CS | 44.21 | 44.96 | 45.47 | 45.48 | 45.37 |
| w/o CG | 43.55 | 45.95 | 46.71 | 46.65 | 46.67 |
| w/o CS-MP | 44.75 | 45.34 | 45.51 | 45.53 | 45.49 |
| w/o CG-MP | 44.75 | 46.32 | 46.97 | 46.85 | 46.79 |
| w/o PC | 42.14 | 43.21 | 45.14 | 45.74 | 45.77 |

In "w/o CS-MP", these inputs are $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ (for the RGB-stream AFC module) and $P_{n-1}^{RGB}$ and $P_{n-2}^{Flow}$ (for the flow-stream AFC module), while in "w/o CS" the inputs are the same aggregated features. The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $P_0^{RGB} \cup P_0^{Flow}$ for "w/o CG" and $Q_0^{RGB} \cup Q_0^{Flow}$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach;


Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

| Stage | Comparison | LoR (%) |
| 1 | $P_1^{RGB}$ vs. $S_0^{Flow}$ | 67.1 |
| 2 | $P_2^{Flow}$ vs. $P_1^{RGB}$ | 79.2 |
| 3 | $P_3^{RGB}$ vs. $P_2^{Flow}$ | 89.5 |
| 4 | $P_4^{Flow}$ vs. $P_3^{RGB}$ | 97.1 |

(4) At the first few stages, the results of all methods are generally improved as the number of stages increases, and the results become stable after two or three stages.

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-overlap Ratio (LoR) score. Take the proposal sets $P_1^{RGB}$ at Stage 1 and $P_2^{Flow}$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of similar proposals in $P_2^{Flow}$ among all proposals in $P_2^{Flow}$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $P_1^{RGB}$ is higher than a pre-defined threshold of 0.7. The max-tIoU score of one proposal in the corresponding proposal set (say $P_2^{Flow}$) is calculated by comparing it with all proposals in the other proposal set (say $P_1^{RGB}$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
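The LoR score defined above can be computed as in the following sketch, which reuses the temporal_iou helper sketched in Sec. 4.3.1; the function name and the representation of proposals as (start, end) pairs are assumptions made for illustration.

```python
def lor_score(proposals, preceding_proposals, threshold=0.7):
    """Ratio of proposals whose max-tIoU with the preceding proposal set exceeds the threshold."""
    if not proposals or not preceding_proposals:
        return 0.0
    similar = 0
    for p in proposals:
        max_tiou = max(temporal_iou(p, q) for q in preceding_proposals)
        if max_tiou > threshold:
            similar += 1
    return similar / len(proposals)
```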

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $P_4^{Flow}$ and those in $P_3^{RGB}$ are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" and "AB" methods to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.



Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks to the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by the anchor-based method [5], the frame-based method [37], and our PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method takes advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy that encourages collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets demonstrate the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component of our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals, and the proposal that best matches the given description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed one-stage grounding frameworks that do not use a pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals in each frame, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopts a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely $V = \{V_k^{clip}\}_{k=1}^{K}$, where $V_k^{clip}$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.



Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


This task requires both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, produces the text-guided visual features by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes $b_t$ of the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual features. The output of the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, and is employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F_k^{clip} \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete set of textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.

Multi-modal Feature Learning. Given the visual input feature $F_k^{clip}$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (and vice versa). Please refer to the work [47] for further details.

ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT to the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules in the visual branch of ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as the key and value in the multi-head attention block of the textual branch.



Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked by the green dotted box in (a)).


Additionally, the combined visual feature is fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimensions of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs of the visual and textual branches in the last co-attention layer as the text-guided visual feature $F^{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
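The following simplified PyTorch sketch illustrates the combine/decompose flow of the STCD module described above: spatial and temporal pooling, concatenation, text-guided attention, decomposition, replication and the residual add-and-norm. Only the visual side is shown, the textual branch and many ViLBERT details are omitted, and the class name, head count and shapes are assumptions rather than the thesis implementation.

```python
import torch
import torch.nn as nn

class STCDBlock(nn.Module):
    """Simplified sketch of the Spatio-temporal Combination and Decomposition flow."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, textual):
        # visual: (T, HW, C); textual: (N, C)
        t, hw, c = visual.shape
        temporal = visual.mean(dim=1)            # spatial pooling  -> (T, C)
        spatial = visual.mean(dim=0)             # temporal pooling -> (HW, C)
        combined = torch.cat([temporal, spatial], dim=0).unsqueeze(0)   # (1, T+HW, C)

        # Text-guided visual feature: textual tokens serve as key and value
        guided, _ = self.attn(combined, textual.unsqueeze(0), textual.unsqueeze(0))
        guided = guided.squeeze(0)
        g_temporal, g_spatial = guided[:t], guided[t:]                   # decomposition

        # Replicate to (T, HW, C), add to the input visual feature and normalize
        out = g_temporal.unsqueeze(1).expand(t, hw, c) \
            + g_spatial.unsqueeze(0).expand(t, hw, c) + visual
        return self.norm(out), combined.squeeze(0)  # combined also goes to the textual branch

# Example: T=4 frames, HW=49 positions, C=256 channels, N=12 textual tokens
inter_visual, combined_feat = STCDBlock(256)(torch.randn(4, 49, 256), torch.randn(12, 256))
```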

Spatial Localization. Given the text-guided visual feature $F^{tv}$, our SVG-net predicts the bounding box $b_t$ for each frame. We first reshape $F^{tv}$ to the size of $T \times H \times W \times C$. Taking the feature of each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and bottom-right coordinates of the predicted bounding box.
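The decoding of one frame's bounding box from the two detection heads can be sketched as follows. The helper below is hypothetical; it only assumes a (8H, 8W) center heatmap and a (2, 8H, 8W) size map, which matches the description above.

```python
import torch

def decode_bbox(heatmap, size_map):
    """Pick the most confident center and convert it to corner coordinates.

    heatmap:  (8H, 8W) tensor of center probabilities.
    size_map: (2, 8H, 8W) tensor holding the predicted (width, height) per position.
    """
    h, w = heatmap.shape
    idx = torch.argmax(heatmap).item()          # location with the highest probability
    cy, cx = divmod(idx, w)
    bw, bh = size_map[0, cy, cx].item(), size_map[1, cy, cx].item()
    # top-left and bottom-right corners of the predicted bounding box
    return (cx - bw / 2, cy - bh / 2, cx + bw / 2, cy + bh / 2)

box = decode_bbox(torch.rand(96, 96), torch.rand(2, 96, 96) * 20)
```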

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F^{tv}$ to produce the global text-guided visual feature $F^{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that for each frame we can compute the initial relevance score between the corresponding visual feature from $F^{gtv}$ and the intermediate textual feature by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $o^{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and apply a median filter to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes whose smoothed relevance scores are lower than the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t_s^0, t_e^0)$. The initial temporal boundaries $(t_s^0, t_e^0)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B^{init} = \{b_t\}_{t=t_s^0}^{t_e^0}$.
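A minimal sketch of the temporal-boundary selection described above (median filtering, thresholding, and keeping the longest above-threshold run) is given below; the function name, the SciPy median filter and the default values are assumptions made for illustration.

```python
import numpy as np
from scipy.signal import medfilt

def initial_temporal_boundaries(relevance_scores, threshold=0.3, window=3):
    """Smooth per-frame relevance scores and keep the longest above-threshold run.

    relevance_scores: 1D sequence of per-frame scores for the whole video (KT frames).
    Returns (t_s0, t_e0) frame indices, or None if no frame passes the threshold.
    """
    smoothed = medfilt(np.asarray(relevance_scores, dtype=np.float64), window)
    keep = smoothed >= threshold

    best, current_start = None, None
    for t, flag in enumerate(np.append(keep, False)):  # sentinel closes the last run
        if flag and current_start is None:
            current_start = t
        elif not flag and current_start is not None:
            if best is None or (t - current_start) > (best[1] - best[0] + 1):
                best = (current_start, t - 1)
            current_start = None
    return best

print(initial_temporal_boundaries([0.1, 0.5, 0.6, 0.2, 0.7, 0.8, 0.9, 0.1]))  # (1, 6)
```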

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and we select the video clips that contain at least one frame with a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, we have $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A \in \mathbb{R}^{8H \times 8W}$ for each frame by using the Gaussian kernel $a_{xy} = \exp\left(-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{2\sigma^2}\right)$, where $a_{xy}$ is the value of $A$ at the spatial location $(x, y)$ and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(o^{obj}, o) + \lambda_3 L_{size}(\hat{w}_{\bar{x}\bar{y}}, \hat{h}_{\bar{x}\bar{y}}, w, h), \quad (5.1)$$

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $o^{obj}$ is the predicted relevance score, and $\hat{w}_{\bar{x}\bar{y}}$ and $\hat{h}_{\bar{x}\bar{y}}$ are the predicted width and height of the bounding box centered at the position $(\bar{x}, \bar{y})$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
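For illustration, the ground-truth Gaussian center heatmap used in $L_c$ can be generated as in the short sketch below, which directly implements the kernel given above; the function name is hypothetical and the adaptive choice of $\sigma$ from [28] is not shown.

```python
import numpy as np

def gaussian_center_heatmap(height, width, cx, cy, sigma):
    """Ground-truth heatmap a_xy = exp(-((x-cx)^2 + (y-cy)^2) / (2*sigma^2))."""
    xs = np.arange(width)[None, :]    # (1, W)
    ys = np.arange(height)[:, None]   # (H, 1)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

A = gaussian_center_heatmap(96, 96, cx=40.0, cy=25.0, sigma=3.0)  # peak value 1 at (25, 40)
```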

5.2.3 Temporal Boundary Refinement

Accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. In contrast, the frame-based methods can accurately determine boundaries in local starting/ending regions, but they may falsely detect temporal segments because the overall segment-level information is missing. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output of the SVG-net (i.e., $B^{init}$) is used as an anchor, and its temporal boundaries are first refined by the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries $\hat{t}_s$ and $\hat{t}_e$.


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are shown for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


The frame-based branch then continues to refine the boundaries $\hat{t}_s$ and $\hat{t}_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., $t_s^1$ and $t_e^1$ at the first stage) can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries $(t_s^0, t_e^0)$ and the length $l = t_e^0 - t_s^0$, we temporally extend the anchor before the starting frame and after the ending frame by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t_s^0 - l/4, t_e^0 + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features of the 20 sampled frames from the pooled video feature $F^{pool}$ along the temporal dimension, where the pooled video feature $F^{pool}$ is generated by stacking the spatially pooled features of all frames within the video along the temporal dimension.
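A minimal sketch of the anchor-feature construction just described is shown below; the function name, the rounding of sampled positions and the clamping at the video borders are assumptions made for illustration.

```python
import torch

def build_anchor_feature(pooled_video_feature, t_s, t_e, num_samples=20):
    """Extend the anchor by l/4 on both sides and uniformly sample frame features.

    pooled_video_feature: (num_frames, C) spatially pooled per-frame features F^pool.
    t_s, t_e: starting and ending frame indices of the anchor.
    Returns the anchor feature of shape (num_samples, C).
    """
    num_frames = pooled_video_feature.shape[0]
    length = t_e - t_s
    lo, hi = t_s - length / 4.0, t_e + length / 4.0           # temporal extension
    idx = torch.linspace(lo, hi, num_samples).round().long()  # uniform sampling
    idx = idx.clamp(0, num_frames - 1)                        # stay inside the video
    return pooled_video_feature[idx]

anchor_feat = build_anchor_feature(torch.randn(300, 256), t_s=100, t_e=180)
```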

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, on which the average pooling operation is applied to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t_s^0$ and $t_e^0$ and generate the new starting and ending frame positions $\hat{t}_s^1$ and $\hat{t}_e^1$.

Frame-based branch. For the frame-based branch, we extract the feature of each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $\hat{t}_s^1$ generated from the preceding anchor-based branch, we define the starting duration as $D_s = [\hat{t}_s^1 - D/2 + 1, \hat{t}_s^1 + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F^{pool}$. We take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We select the frame with the highest probability as the starting frame position $t_s^1$. Accordingly, we can produce the ending frame position $t_e^1$. Note that the kernel size of the 1D convolution operations is 1, and the parameters of the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t_s^1$ and $t_e^1$ are generated, they can be used as the input for the anchor-based branch at the next stage.
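The boundary selection in the frame-based branch can be sketched as below. The class name, the use of a sigmoid to turn logits into probabilities and the tensor layout are illustrative assumptions; only the kernel-size-1 1D convolution and the argmax selection are taken from the description above.

```python
import torch
import torch.nn as nn

class BoundarySelector(nn.Module):
    """Sketch of the frame-based branch for one boundary (e.g., the starting frame)."""
    def __init__(self, channels):
        super().__init__()
        # Separate, unshared instance per boundary (starting vs. ending)
        self.conv = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, frame_features, window_start):
        # frame_features: (D, C) text-guided features of the D frames in the duration
        logits = self.conv(frame_features.t().unsqueeze(0)).squeeze()  # (D,)
        probs = torch.sigmoid(logits)
        return window_start + int(torch.argmax(probs))  # refined frame index

selector = BoundarySelector(256)
t_s1 = selector(torch.randn(16, 256), window_start=92)
```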

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$$L_{TBR} = \lambda_4 L_{anc}(\Delta \hat{t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e), \quad (5.2)$$

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and ending frames, respectively. In $L_{anc}$, $\Delta \hat{t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training, validation and testing sets with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. The dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. It is divided into the training and testing sets with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θp and the temporal length of each clip are set to 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented in PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|} \sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tube and the ground-truth tube (i.e., the intersection set between the frames of both tubes), and $S_U$ is the set of frames from either the detected tube or the ground-truth tube. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of testing samples with vIoU > R over all the testing samples.
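For reference, the vIoU of a single detected tube against a ground-truth tube can be computed as in the sketch below, where tubes are represented as dictionaries from frame index to box corners; this representation and the helper names are assumptions made for illustration.

```python
def video_iou(pred_boxes, gt_boxes):
    """vIoU = (1/|S_U|) * sum over S_I of the per-frame box IoU.

    pred_boxes, gt_boxes: dicts mapping frame index -> (x1, y1, x2, y2).
    """
    def box_iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter) if inter > 0 else 0.0

    s_i = set(pred_boxes) & set(gt_boxes)   # intersected frames
    s_u = set(pred_boxes) | set(gt_boxes)   # union of frames
    if not s_u:
        return 0.0
    return sum(box_iou(pred_boxes[t], gt_boxes[t]) for t in s_i) / len(s_u)
```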

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not preserve spatial information of the input visual features, the ST-ViLBERT in our SVG-net preserves spatial information from the input visual features and models spatio-temporal information. To evaluate the effectiveness of ST-ViLBERT, we introduce this baseline by replacing ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube, so we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we make the following observations: 1) our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics; 2) SVG-net (ours) achieves better performance than the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module; 3) our proposed temporal boundary refinement method can further boost the performance: compared with the initial tube generation results from our SVG-net, we achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of the different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results of our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement of our SVG-net over the baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset contains many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset.


Table 5.1: Results of different methods on the VidSTG dataset ("*" indicates that the results are quoted from the original work).

| Method | Declarative Sentence Grounding | | | Interrogative Sentence Grounding | | |
| | m_vIoU (%) | vIoU@0.3 (%) | vIoU@0.5 (%) | m_vIoU (%) | vIoU@0.3 (%) | vIoU@0.5 (%) |
| STGRN [102]* | 19.75 | 25.77 | 14.60 | 18.32 | 21.10 | 12.83 |
| Vanilla (ours) | 21.49 | 27.69 | 15.77 | 20.22 | 23.17 | 13.82 |
| SVG-net (ours) | 22.87 | 28.31 | 16.31 | 21.57 | 24.09 | 14.33 |
| SVG-net + TBR-net (ours) | 25.21 | 30.99 | 18.21 | 23.67 | 26.26 | 16.01 |


Table 5.2: Results of different methods on the HC-STVG dataset ("*" indicates that the results are quoted from the original work).

| Method | m_vIoU (%) | vIoU@0.3 (%) | vIoU@0.5 (%) |
| STGVT [76]* | 18.15 | 26.81 | 9.48 |
| Vanilla (ours) | 17.94 | 25.59 | 8.75 |
| SVG-net (ours) | 19.45 | 27.78 | 10.12 |
| SVG-net + TBR-net (ours) | 22.41 | 30.31 | 12.07 |

Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3.


Table 5.3: Results of our method and its variants when using different number of stages on the VidSTG dataset.

                     |        |  Declarative Sentence Grounding  |  Interrogative Sentence Grounding
Methods              | Stages | m_vIoU   vIoU@0.3   vIoU@0.5     | m_vIoU   vIoU@0.3   vIoU@0.5
SVG-net              |   -    | 22.87    28.31      16.31        | 21.57    24.09      14.33
SVG-net + IFB-net    |   1    | 23.41    28.84      16.69        | 22.12    24.48      14.57
                     |   2    | 23.82    28.97      16.85        | 22.55    24.69      14.71
                     |   3    | 23.77    28.91      16.87        | 22.61    24.72      14.79
SVG-net + IAB-net    |   1    | 23.27    29.57      16.39        | 21.78    24.93      14.31
                     |   2    | 23.74    29.98      16.52        | 22.21    25.42      14.59
                     |   3    | 23.85    30.15      16.57        | 22.43    25.71      14.47
SVG-net + TBR-net    |   1    | 24.32    29.92      17.22        | 22.69    25.37      15.29
                     |   2    | 24.95    30.75      17.02        | 23.34    26.03      15.83
                     |   3    | 25.21    30.99      18.21        | 23.67    26.26      16.01


(4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.
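As a rough illustration of the multi-stage refinement protocol evaluated in Table 5.3, the sketch below alternately applies a frame-based refiner and an anchor-based refiner to the temporal boundaries of an initial tube and records the boundaries after every stage. The simple alternation rule and the function names (frame_based_refine, anchor_based_refine) are illustrative assumptions standing in for the learned IFB/IAB branches, not the actual TBR-net implementation.

from typing import Callable, List, Tuple

Segment = Tuple[float, float]           # (start_time, end_time) in seconds
Refiner = Callable[[Segment], Segment]  # maps a segment to a refined segment

def multi_stage_refine(initial: Segment,
                       frame_based_refine: Refiner,
                       anchor_based_refine: Refiner,
                       num_stages: int = 3) -> List[Segment]:
    """Alternate the two complementary refiners and keep the boundaries
    produced after every stage (compare the per-stage rows of Table 5.3)."""
    segment, history = initial, []
    for stage in range(num_stages):
        refine = frame_based_refine if stage % 2 == 0 else anchor_based_refine
        segment = refine(segment)
        history.append(segment)
    return history

# toy usage with dummy refiners that simply nudge the boundaries
if __name__ == "__main__":
    shrink = lambda s: (s[0] + 0.2, s[1] - 0.2)  # stands in for a frame-based pass
    expand = lambda s: (s[0] - 0.1, s[1] + 0.1)  # stands in for an anchor-based pass
    print(multi_stage_refine((10.0, 20.0), shrink, expand))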

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, which is based on the visual-linguistic transformer, consists of SVG-net and TBR-net, and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our works and discuss potential directions for video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, which consists of SVG-net and TBR-net and produces spatio-temporal object tubes for a given query sentence. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential direction for video localization in the future is to unify the spatial video localization and the temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results with two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert framework introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches so that the spatial and the temporal visual grounding results can be generated simultaneously.


Bibliography

[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812

[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100

[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970

[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308

[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139

[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206

[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894


[8] G Cheacuteron A Osokin I Laptev and C Schmid ldquoModeling Spatio-Temporal Human Track Structure for Action Localizationrdquo InCoRR abs180611008 (2018) arXiv 180611008 URL httparxivorgabs180611008

[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802

[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990

[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255

[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941

[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83

[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275

[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636

[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448

[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587


[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015

[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056

[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778

[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017

[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708

[23] E Ilg N Mayer T Saikia M Keuper A Dosovitskiy and T BroxldquoFlownet 20 Evolution of optical flow estimation with deep net-worksrdquo In IEEE conference on computer vision and pattern recogni-tion (CVPR) Vol 2 2017 p 6

[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199

[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014

[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105


[28] H Law and J Deng ldquoCornernet Detecting objects as paired key-pointsrdquo In Proceedings of the European Conference on Computer Vi-sion (ECCV) 2018 pp 734ndash750

[29] C Lea M D Flynn R Vidal A Reiter and G D Hager ldquoTem-poral convolutional networks for action segmentation and detec-tionrdquo In proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2017 pp 156ndash165

[30] J Lei L Yu T L Berg and M Bansal ldquoTvqa+ Spatio-temporalgrounding for video question answeringrdquo In arXiv preprint arXiv190411574(2019)

[31] D Li Z Qiu Q Dai T Yao and T Mei ldquoRecurrent tubelet pro-posal and recognition networks for action detectionrdquo In Proceed-ings of the European conference on computer vision (ECCV) 2018pp 303ndash318

[32] G Li N Duan Y Fang M Gong D Jiang and M Zhou ldquoUnicoder-VL A Universal Encoder for Vision and Language by Cross-ModalPre-Trainingrdquo In AAAI 2020 pp 11336ndash11344

[33] L H Li M Yatskar D Yin C-J Hsieh and K-W Chang ldquoVisu-alBERT A Simple and Performant Baseline for Vision and Lan-guagerdquo In Arxiv 2019

[34] Y Liao S Liu G Li F Wang Y Chen C Qian and B Li ldquoA Real-Time Cross-modality Correlation Filtering Method for ReferringExpression Comprehensionrdquo In Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognition 2020 pp 10880ndash10889

[35] T Lin X Liu X Li E Ding and S Wen ldquoBMN Boundary-MatchingNetwork for Temporal Action Proposal Generationrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 3889ndash3898

[36] T Lin X Zhao and Z Shou ldquoSingle shot temporal action de-tectionrdquo In Proceedings of the 25th ACM international conference onMultimedia 2017 pp 988ndash996

[37] T Lin X Zhao H Su C Wang and M Yang ldquoBsn Boundarysensitive network for temporal action proposal generationrdquo In


Proceedings of the European Conference on Computer Vision (ECCV)2018 pp 3ndash19

[38] T-Y Lin P Dollar R Girshick K He B Hariharan and S Be-longie ldquoFeature Pyramid Networks for Object Detectionrdquo In TheIEEE Conference on Computer Vision and Pattern Recognition (CVPR)2017

[39] T-Y Lin P Goyal R Girshick K He and P Dollaacuter ldquoFocal lossfor dense object detectionrdquo In Proceedings of the IEEE internationalconference on computer vision 2017 pp 2980ndash2988

[40] T-Y Lin M Maire S Belongie J Hays P Perona D RamananP Dollaacuter and C L Zitnick ldquoMicrosoft coco Common objects incontextrdquo In European conference on computer vision Springer 2014pp 740ndash755

[41] D Liu H Zhang F Wu and Z-J Zha ldquoLearning to assembleneural module tree networks for visual groundingrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 4673ndash4682

[42] J Liu L Wang and M-H Yang ldquoReferring expression genera-tion and comprehension via attributesrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 4856ndash4864

[43] L Liu H Wang G Li W Ouyang and L Lin ldquoCrowd countingusing deep recurrent spatial-aware networkrdquo In Proceedings ofthe 27th International Joint Conference on Artificial Intelligence AAAIPress 2018 pp 849ndash855

[44] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu andA C Berg ldquoSsd Single shot multibox detectorrdquo In European con-ference on computer vision Springer 2016 pp 21ndash37

[45] X Liu Z Wang J Shao X Wang and H Li ldquoImproving re-ferring expression grounding with cross-modal attention-guidederasingrdquo In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition 2019 pp 1950ndash1959

[46] Y Liu L Ma Y Zhang W Liu and S-F Chang ldquoMulti-granularityGenerator for Temporal Action Proposalrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2019pp 3604ndash3613


[47] J Lu D Batra D Parikh and S Lee ldquoVilbert Pretraining task-agnostic visiolinguistic representations for vision-and-languagetasksrdquo In Advances in Neural Information Processing Systems 2019pp 13ndash23

[48] G Luo Y Zhou X Sun L Cao C Wu C Deng and R Ji ldquoMulti-task Collaborative Network for Joint Referring Expression Com-prehension and Segmentationrdquo In Proceedings of the IEEECVFConference on Computer Vision and Pattern Recognition 2020 pp 10034ndash10043

[49] S Ma L Sigal and S Sclaroff ldquoLearning activity progression inlstms for activity detection and early detectionrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2016 pp 1942ndash1950

[50] B Ni X Yang and S Gao ldquoProgressively parsing interactionalobjects for fine grained action detectionrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1020ndash1028

[51] W Ouyang K Wang X Zhu and X Wang ldquoChained cascadenetwork for object detectionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2017 pp 1938ndash1946

[52] W Ouyang and X Wang ldquoSingle-pedestrian detection aided bymulti-pedestrian detectionrdquo In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition 2013 pp 3198ndash3205

[53] W Ouyang X Zeng X Wang S Qiu P Luo Y Tian H Li SYang Z Wang H Li et al ldquoDeepID-Net Object detection withdeformable part based convolutional neural networksrdquo In IEEETransactions on Pattern Analysis and Machine Intelligence 397 (2017)pp 1320ndash1334

[54] X Peng and C Schmid ldquoMulti-region two-stream R-CNN for ac-tion detectionrdquo In European Conference on Computer Vision Springer2016 pp 744ndash759

[55] L Pigou A van den Oord S Dieleman M V Herreweghe andJ Dambre ldquoBeyond Temporal Pooling Recurrence and TemporalConvolutions for Gesture Recognition in Videordquo In InternationalJournal of Computer Vision 1262-4 (2018) pp 430ndash439


[56] Z Qiu T Yao and T Mei ldquoLearning spatio-temporal represen-tation with pseudo-3d residual networksrdquo In proceedings of theIEEE International Conference on Computer Vision 2017 pp 5533ndash5541

[57] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once Unified real-time object detectionrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2016pp 779ndash788

[58] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towardsreal-time object detection with region proposal networksrdquo In Ad-vances in neural information processing systems 2015 pp 91ndash99

[59] A Sadhu K Chen and R Nevatia ldquoVideo Object Grounding us-ing Semantic Roles in Language Descriptionrdquo In Proceedings ofthe IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2020 pp 10417ndash10427

[60] S Saha G Singh M Sapienza P H S Torr and F CuzzolinldquoDeep Learning for Detecting Multiple Space-Time Action Tubesin Videosrdquo In Proceedings of the British Machine Vision Conference2016 BMVC 2016 York UK September 19-22 2016 2016

[61] P Sharma N Ding S Goodman and R Soricut ldquoConceptualcaptions A cleaned hypernymed image alt-text dataset for auto-matic image captioningrdquo In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics (Volume 1 LongPapers) 2018 pp 2556ndash2565

[62] Z Shou J Chan A Zareian K Miyazawa and S-F Chang ldquoCdcConvolutional-de-convolutional networks for precise temporal ac-tion localization in untrimmed videosrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 5734ndash5743

[63] Z Shou D Wang and S-F Chang ldquoTemporal action localizationin untrimmed videos via multi-stage cnnsrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1049ndash1058


[64] K Simonyan and A Zisserman ldquoTwo-Stream Convolutional Net-works for Action Recognition in Videosrdquo In Advances in NeuralInformation Processing Systems 27 2014 pp 568ndash576

[65] K Simonyan and A Zisserman ldquoVery deep convolutional net-works for large-scale image recognitionrdquo In arXiv preprint arXiv14091556(2014)

[66] B Singh T K Marks M Jones O Tuzel and M Shao ldquoA multi-stream bi-directional recurrent neural network for fine-grainedaction detectionrdquo In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2016 pp 1961ndash1970

[67] G Singh S Saha M Sapienza P H Torr and F Cuzzolin ldquoOn-line real-time multiple spatiotemporal action localisation and pre-dictionrdquo In Proceedings of the IEEE International Conference on Com-puter Vision 2017 pp 3637ndash3646

[68] K Soomro A R Zamir and M Shah ldquoUCF101 A dataset of 101human actions classes from videos in the wildrdquo In arXiv preprintarXiv12120402 (2012)

[69] R Su W Ouyang L Zhou and D Xu ldquoImproving Action Local-ization by Progressive Cross-stream Cooperationrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2019 pp 12016ndash12025

[70] W Su X Zhu Y Cao B Li L Lu F Wei and J Dai ldquoVL-BERTPre-training of Generic Visual-Linguistic Representationsrdquo In In-ternational Conference on Learning Representations 2020

[71] C Sun A Myers C Vondrick K Murphy and C Schmid ldquoVideobertA joint model for video and language representation learningrdquoIn Proceedings of the IEEE International Conference on Computer Vi-sion 2019 pp 7464ndash7473

[72] C Sun A Shrivastava C Vondrick K Murphy R Sukthankarand C Schmid ldquoActor-centric Relation Networkrdquo In The Euro-pean Conference on Computer Vision (ECCV) 2018

[73] S Sun J Pang J Shi S Yi and W Ouyang ldquoFishnet A versatilebackbone for image region and pixel level predictionrdquo In Ad-vances in Neural Information Processing Systems 2018 pp 762ndash772


[74] C Szegedy W Liu Y Jia P Sermanet S Reed D AnguelovD Erhan V Vanhoucke and A Rabinovich ldquoGoing deeper withconvolutionsrdquo In Proceedings of the IEEE conference on computervision and pattern recognition 2015 pp 1ndash9

[75] H Tan and M Bansal ldquoLXMERT Learning Cross-Modality En-coder Representations from Transformersrdquo In Proceedings of the2019 Conference on Empirical Methods in Natural Language Process-ing 2019

[76] Z Tang Y Liao S Liu G Li X Jin H Jiang Q Yu and D XuldquoHuman-centric Spatio-Temporal Video Grounding With VisualTransformersrdquo In arXiv preprint arXiv201105049 (2020)

[77] D Tran L Bourdev R Fergus L Torresani and M Paluri ldquoLearn-ing spatiotemporal features with 3d convolutional networksrdquo InProceedings of the IEEE international conference on computer vision2015 pp 4489ndash4497

[78] A Vaswani N Shazeer N Parmar J Uszkoreit L Jones A NGomez Ł Kaiser and I Polosukhin ldquoAttention is all you needrdquoIn Advances in neural information processing systems 2017 pp 5998ndash6008

[79] L Wang Y Qiao X Tang and L Van Gool ldquoActionness Estima-tion Using Hybrid Fully Convolutional Networksrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2016pp 2708ndash2717

[80] L Wang Y Xiong D Lin and L Van Gool ldquoUntrimmednets forweakly supervised action recognition and detectionrdquo In Proceed-ings of the IEEE conference on Computer Vision and Pattern Recogni-tion 2017 pp 4325ndash4334

[81] L Wang Y Xiong Z Wang Y Qiao D Lin X Tang and L VanGool ldquoTemporal segment networks Towards good practices fordeep action recognitionrdquo In European conference on computer vi-sion Springer 2016 pp 20ndash36

[82] P Wang Q Wu J Cao C Shen L Gao and A v d HengelldquoNeighbourhood watch Referring expression comprehension vialanguage-guided graph attention networksrdquo In Proceedings of the


IEEE Conference on Computer Vision and Pattern Recognition 2019pp 1960ndash1968

[83] P Weinzaepfel Z Harchaoui and C Schmid ldquoLearning to trackfor spatio-temporal action localizationrdquo In Proceedings of the IEEEinternational conference on computer vision 2015 pp 3164ndash3172

[84] S Xie C Sun J Huang Z Tu and K Murphy ldquoRethinking spa-tiotemporal feature learning Speed-accuracy trade-offs in videoclassificationrdquo In Proceedings of the European Conference on Com-puter Vision (ECCV) 2018 pp 305ndash321

[85] H Xu A Das and K Saenko ldquoR-c3d Region convolutional 3dnetwork for temporal activity detectionrdquo In Proceedings of the IEEEinternational conference on computer vision 2017 pp 5783ndash5792

[86] M Yamaguchi K Saito Y Ushiku and T Harada ldquoSpatio-temporalperson retrieval via natural language queriesrdquo In Proceedings ofthe IEEE International Conference on Computer Vision 2017 pp 1453ndash1462

[87] S Yang G Li and Y Yu ldquoCross-modal relationship inference forgrounding referring expressionsrdquo In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition 2019 pp 4145ndash4154

[88] S Yang G Li and Y Yu ldquoDynamic graph attention for referringexpression comprehensionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2019 pp 4644ndash4653

[89] X Yang X Yang M-Y Liu F Xiao L S Davis and J KautzldquoSTEP Spatio-Temporal Progressive Learning for Video ActionDetectionrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition (CVPR) 2019

[90] X Yang X Liu M Jian X Gao and M Wang ldquoWeakly-SupervisedVideo Object Grounding by Exploring Spatio-Temporal ContextsrdquoIn Proceedings of the 28th ACM International Conference on Multime-dia 2020 pp 1939ndash1947

[91] Z Yang T Chen L Wang and J Luo ldquoImproving One-stage Vi-sual Grounding by Recursive Sub-query Constructionrdquo In arXivpreprint arXiv200801059 (2020)


[92] Z Yang B Gong L Wang W Huang D Yu and J Luo ldquoAfast and accurate one-stage approach to visual groundingrdquo InProceedings of the IEEE International Conference on Computer Vision2019 pp 4683ndash4693

[93] Z Yang J Gao and R Nevatia ldquoSpatio-temporal action detec-tion with cascade proposal and location anticipationrdquo In arXivpreprint arXiv170800042 (2017)

[94] Y Ye X Yang and Y Tian ldquoDiscovering spatio-temporal actiontubesrdquo In Journal of Visual Communication and Image Representa-tion 58 (2019) pp 515ndash524

[95] S Yeung O Russakovsky G Mori and L Fei-Fei ldquoEnd-to-endlearning of action detection from frame glimpses in videosrdquo InProceedings of the IEEE Conference on Computer Vision and PatternRecognition 2016 pp 2678ndash2687

[96] L Yu Z Lin X Shen J Yang X Lu M Bansal and T L BergldquoMattnet Modular attention network for referring expression com-prehensionrdquo In Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition 2018 pp 1307ndash1315

[97] J Yuan B Ni X Yang and A A Kassim ldquoTemporal action local-ization with pyramid of score distribution featuresrdquo In Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recogni-tion 2016 pp 3093ndash3102

[98] Z Yuan J C Stroud T Lu and J Deng ldquoTemporal action local-ization by structured maximal sumsrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 3684ndash3692

[99] R Zeng W Huang M Tan Y Rong P Zhao J Huang and CGan ldquoGraph Convolutional Networks for Temporal Action Lo-calizationrdquo In Proceedings of the IEEE International Conference onComputer Vision 2019 pp 7094ndash7103

[100] W Zhang W Ouyang W Li and D Xu ldquoCollaborative and Ad-versarial Network for Unsupervised domain adaptationrdquo In Pro-ceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition 2018 pp 3801ndash3809


[101] Z Zhang Z Zhao Z Lin B Huai and J Yuan ldquoObject-AwareMulti-Branch Relation Networks for Spatio-Temporal Video Ground-ingrdquo In Proceedings of the Twenty-Ninth International Joint Confer-ence on Artificial Intelligence 2020 pp 1069ndash1075

[102] Z Zhang Z Zhao Y Zhao Q Wang H Liu and L Gao ldquoWhereDoes It Exist Spatio-Temporal Video Grounding for Multi-FormSentencesrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition 2020 pp 10668ndash10677

[103] Y Zhao Y Xiong L Wang Z Wu X Tang and D Lin ldquoTemporalaction detection with structured segment networksrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2017pp 2914ndash2923

[104] X Zhou D Wang and P Kraumlhenbuumlhl ldquoObjects as Pointsrdquo InarXiv preprint arXiv190407850 2019

[105] M Zolfaghari G L Oliveira N Sedaghat and T Brox ldquoChainedmulti-stream networks exploiting pose motion and appearancefor action classification and detectionrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 2904ndash2913


For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor Prof. Dong Xu for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offered me valuable suggestions and inspiration in our collaborated works. Through the useful discussions with them, I worked out many difficult problems and was also encouraged to explore the unknown in the future.

Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.

Rui SU

University of Sydney, June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization", IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, "Improving action localization by progressive cross-stream cooperation", in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.


Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
  1.1 Spatial-temporal Action Localization 2
  1.2 Temporal Action Localization 5
  1.3 Spatio-temporal Visual Grounding 8
  1.4 Thesis Outline 11

2 Literature Review 13
  2.1 Background 13
    2.1.1 Image Classification 13
    2.1.2 Object Detection 15
    2.1.3 Action Recognition 18
  2.2 Spatio-temporal Action Localization 20
  2.3 Temporal Action Localization 22
  2.4 Spatio-temporal Visual Grounding 25
  2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
  3.1 Background 30
    3.1.1 Spatial temporal localization methods 30
    3.1.2 Two-stream R-CNN 31
  3.2 Action Detection Model in Spatial Domain 32
    3.2.1 PCSC Overview 32
    3.2.2 Cross-stream Region Proposal Cooperation 34
    3.2.3 Cross-stream Feature Cooperation 36
    3.2.4 Training Details 38
  3.3 Temporal Boundary Refinement for Action Tubes 38
    3.3.1 Actionness Detector (AD) 41
    3.3.2 Action Segment Detector (ASD) 42
    3.3.3 Temporal PCSC (T-PCSC) 44
  3.4 Experimental results 46
    3.4.1 Experimental Setup 46
    3.4.2 Comparison with the State-of-the-art methods 47
      Results on the UCF-101-24 Dataset 48
      Results on the J-HMDB Dataset 50
    3.4.3 Ablation Study 52
    3.4.4 Analysis of Our PCSC at Different Stages 56
    3.4.5 Runtime Analysis 57
    3.4.6 Qualitative Results 58
  3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
  4.1 Background 62
    4.1.1 Action Recognition 62
    4.1.2 Spatio-temporal Action Localization 62
    4.1.3 Temporal Action Localization 63
  4.2 Our Approach 64
    4.2.1 Overall Framework 64
    4.2.2 Progressive Cross-Granularity Cooperation 67
    4.2.3 Anchor-frame Cooperation Module 68
    4.2.4 Training Details 70
    4.2.5 Discussions 71
  4.3 Experiment 72
    4.3.1 Datasets and Setup 72
    4.3.2 Comparison with the State-of-the-art Methods 73
    4.3.3 Ablation study 76
    4.3.4 Qualitative Results 81
  4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
  5.1 Background 84
    5.1.1 Vision-language Modelling 84
    5.1.2 Visual Grounding in Images/Videos 84
  5.2 Methodology 85
    5.2.1 Overview 85
    5.2.2 Spatial Visual Grounding 87
    5.2.3 Temporal Boundary Refinement 92
  5.3 Experiment 95
    5.3.1 Experiment Setup 95
    5.3.2 Comparison with the State-of-the-art Methods 98
    5.3.3 Ablation Study 98
  5.4 Summary 102

6 Conclusions and Future Work 103
  6.1 Conclusions 103
  6.2 Future Work 104

Bibliography 107


List of Figures

11 Illustration of the spatio-temporal action localization task 212 Illustration of the temporal action localization task 6

31 (a) Overview of our Cross-stream Cooperation framework(four stages are used as an example for better illustration)The region proposals and features from the flow streamhelp improve action detection results for the RGB streamat Stage 1 and Stage 3 while the region proposals and fea-tures from the RGB stream help improve action detectionresults for the flow stream at Stage 2 and Stage 4 Eachstage comprises of two cooperation modules and the de-tection head (b) (c) The details of our feature-level coop-eration modules through message passing from one streamto another stream The flow features are used to help theRGB features in (b) while the RGB features are used tohelp the flow features in (c) 33

32 Overview of two modules in our temporal boundary re-finement framework (a) the Initial Segment GenerationModule and (b) the Temporal Progressive Cross-streamCooperation (T-PCSC) Module The Initial Segment Gen-eration module combines the outputs from Actionness De-tector and Action Segment Detector to generate the initialsegments for both RGB and flow streams Our T-PCSCframework takes these initial segments as the segment pro-posals and iteratively improves the temporal action de-tection results by leveraging complementary informationof two streams at both feature and segment proposal lev-els (two stages are used as an example for better illustra-tion) The segment proposals and features from the flowstream help improve action segment detection results forthe RGB stream at Stage 1 while the segment proposalsand features from the RGB stream help improve action de-tection results for the flow stream at Stage 2 Each stagecomprises of the cooperation module and the detectionhead The segment proposal level cooperation module re-fines the segment proposals P i

t by combining the most re-cent segment proposals from the two streams while thesegment feature level cooperation module improves thefeatures Fi

t where the superscript i isin RGB Flow de-notes the RGBFlow stream and the subscript t denotesthe stage number The detection head is used to estimatethe action segment location and the actionness probability 39

33 Comparison of the single-stream frameworks of the actiondetection method (see Section 3) and our action segmentdetector (see Section 42) The modules within each bluedashed box share similar functions at the spatial domainand the temporal domain respectively 43

34 An example of visualization results from different meth-ods for one frame (a) single-stream Faster-RCNN (RGBstream) (b) single-stream Faster-RCNN (flow stream) (c)a simple combination of two-stream results from Faster-RCNN [38] and (d) our PCSC (Best viewed on screen) 57

35 Example results from our PCSC with temporal boundaryrefinement (TBR) and the two-stream Faster-RCNN methodThere is one ground-truth instance in (a) and no ground-truth instance in (b-d) In (a) both of our PCSC with TBRmethod and the two-stream Faster-RCNN method [38] suc-cessfully capture the ground-truth action instance ldquoTen-nis Swing In (b) and (c) both of our PCSC method andthe two-stream Faster-RCNN method falsely detect thebounding boxes with the class label ldquoTennis Swing for (b)and ldquoSoccer Juggling for (c) However our TBR methodcan remove the falsely detected bounding boxes from ourPCSC in both (b) and (c) In (d) both of our PCSC methodand the two-stream Faster-RCNN method falsely detectthe action bounding boxes with the class label ldquoBasket-ball and our TBR method cannot remove the falsely de-tected bounding box from our PCSC in this failure case(Best viewed on screen) 58

41 (a) Overview of our Progressive Cross-granularity Coop-eration framework At each stage our Anchor-Frame Co-operation (AFC) module focuses on only one stream (ieRGBFlow) The RGB-stream AFC modules at Stage 1 and3(resp the flow-stream AFC modules at Stage 2 and 4)generate anchor-based features and segments as well asframe-based features for the RGB stream (resp the Flowstream) The inputs of each AFC module are the proposalsand features generated by the previous stages (b) (c) De-tails of our RGB-stream and Flow stream AFC modules atdifferent stages n Note that we use the RGB-stream AFCmodule in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number Forbetter representation when n = 1 the initial proposalsSRGB

nminus2 and features PRGBnminus2 QFlow

nminus2 are denoted as SRGB0 PRGB

0

and QFlow0 The initial proposals and features from anchor-

based and frame-based branches are obtained as describedin Section 423 In both (b) and (c) the frame-based andanchor-based features are improved by the cross-granularityand cross-stream message passing modules respectivelyThe improved frame-based features are used to generatebetter frame-based proposals which are then combinedwith anchor-based proposals from both RGB and flow streamsThe newly combined proposals together with the improvedanchor-based features are used as the input to ldquoDetectionHead to generate the updated segments 66

42 ARAN () for the action segments generated from var-ious methods at different number of stages on the THU-MOS14 dataset 80

43 Qualitative comparisons of the results from two baselinemethods (ie the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at differ-ent number of stages on the THUMOS14 dataset Moreframes during the period between the 954-th second tothe 977-th second of the video clip are shown in the blueboxes where the actor slowly walks into the circle spotto start the action The true action instance of the actionclass ldquoThrow Discus with the starting frame and the end-ing frame from the 977-th second to the 1042-th second ofthe video clip are marked in the red box The frames dur-ing the period from the 1042-th second to the 1046-th sec-ond of the video clip are shown in the green boxes wherethe actor just finishes the action and starts to straightenhimself up In this example the anchor-based methoddetects the action segment with less accurate boundarieswhile the frame-based method misses the true action in-stance Our PCG-TAL method can progressively improvethe temporal boundaries of the action segment 81

51 (a) The overview of our spatio-temporal video groundingframework STVGBert Our two-step framework consistsof two sub-networks Spatial Visual Grounding network(SVG-net) and Temporal Boundary Refinement network(TBR-net) which takes video and textual query pairs asthe input to generate spatio-temporal object tubes contain-ing the object of interest In (b) the SVG-net generates theinitial object tubes In (c) the TBR-net progressively re-fines the temporal boundaries of the initial object tubesand generates the final spatio-temporal object tubes 86

52 (a) The overview of one co-attention layer in ViLBERT [47]The co-attention block consisting of a visual branch and atextual branch generates the visual-linguistic representa-tions by exchanging the key-value pairs for the multi-headattention blocks (b) The structure of our Spatio-temporalCombination and Decomposition (STCD) module whichreplaces the ldquoMulti-head attention and ldquoAdd amp Normblocks in the visual branch of ViLBERT (marked in thegreen dotted box in (a)) 89

53 Overview of our Temporal Boundary Refinement network(TBR-net) which is a multi-stage ViLBERT-based frame-work with an anchor-based branch and a frame-based branch(two stages are used for better illustration) The inputstarting and ending frame positions at each stage are thoseof the output object tube from the previous stage At stage1 the input object tube is generated by our SVG-net 93


List of Tables

31 Comparison (mAPs at the frame level) of different meth-ods on the UCF-101-24 dataset when using the IoU thresh-old δ at 05 48

32 Comparison (mAPs at the video level) of different meth-ods on the UCF-101-24 dataset when using different IoUthresholds Here ldquoAD is for actionness detector and ldquoASDis for action segment detector 49

33 Comparison (mAPs at the frame level) of different meth-ods on the J-HMDB dataset when using the IoU thresholdδ at 05 50

34 Comparison (mAPs at the video level) of different meth-ods on the J-HMDB dataset when using different IoU thresh-olds 51

35 Ablation study for our PCSC method at different trainingstages on the UCF-101-24 dataset 52

36 Ablation study for our temporal boundary refinement methodon the UCF-101-24 dataset Here ldquoAD is for the action-ness detector while ldquoASD is for the action segment detec-tor 52

37 Ablation study for our Temporal PCSC (T-PCSC) methodat different training stages on the UCF-101-24 dataset 52

38 The large overlap ratio (LoR) of the region proposals fromthe latest RGB stream with respective to the region pro-posals from the latest flow stream at each stage in ourPCSC action detector method on the UCF-101-24 dataset 54

39 The Mean Average Best Overlap (MABO) of the boundingboxes with the ground-truth boxes at each stage in ourPCSC action detector method on the UCF-101-24 dataset 55

310 The average confident scores () of the bounding boxeswith respect to the IoU threshold θ at different stages onthe UCF-101-24 dataset 55

311 Detection speed in frame per second (FPS) for our actiondetection module in the spatial domain on UCF-101-24when using different number of stages 55

312 Detection speed in video per second (VPS) for our tempo-ral boundary refinement method for action tubes on UCF-101-24 when using different number of stages 55

41 Comparison (mAP ) on the THUMOS14 dataset usingdifferent tIoU thresholds 74

42 Comparison (mAPs ) on the ActivityNet v13 (val) datasetusing different tIoU thresholds No extra external videolabels are used 75

43 Comparison (mAPs ) on the ActivityNet v13 (val) datasetusing different tIoU thresholds The external video labelsfrom UntrimmedNet [80] are applied 75

44 Comparison (mAPs ) on the UCF-101-24 dataset usingdifferent st-IoU thresholds 76

45 Ablation study among different AFC-related variants atdifferent training stages on the THUMOS14 dataset 77

46 LoR () scores of proposals from adjacent stages Thisscore is calculated for each video and averaged for all videosin the test set of the THUMOS14 dataset 78

51 Results of different methods on the VidSTG dataset ldquoindicates the results are quoted from its original work 99

52 Results of different methods on the HC-STVG dataset ldquoindicates the results are quoted from the original work 100

53 Results of our method and its variants when using differ-ent number of stages on the VidSTG dataset 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems to help users analyze the contents of the videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization through three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to only capture the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.


[Figure: Input Videos → Spatio-temporal Action Localization Framework → Action Tubes]

Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios; however, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., tooth brushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as region proposals to improve the action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on the spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely challenging to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.
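As a concrete illustration of this kind of data association (a simplified sketch, not the exact linking procedure used in the works cited above), the snippet below greedily links per-frame detections into an action tube by scoring each candidate box with a weighted sum of its spatial overlap with the previously selected box and its action class score; the weights and helper names are our own illustrative assumptions.

from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Detection = Tuple[Box, float]            # (box, action class score)

def iou(a: Box, b: Box) -> float:
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-8)

def link_tube(per_frame_dets: List[List[Detection]],
              overlap_weight: float = 1.0,
              score_weight: float = 1.0) -> List[Box]:
    """Greedily extend a tube frame by frame, preferring the detection that
    balances spatial overlap with the previous box and its class score."""
    tube: List[Box] = []
    prev: Optional[Box] = None
    for dets in per_frame_dets:
        if not dets:
            continue  # no detection in this frame; a real linker would interpolate
        if prev is None:
            box = max(dets, key=lambda d: d[1])[0]  # start from the highest score
        else:
            box = max(dets, key=lambda d: overlap_weight * iou(prev, d[0])
                                          + score_weight * d[1])[0]
        tube.append(box)
        prev = box
    return tube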

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream, in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve the action localization results at the frame level.
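The two cooperation mechanisms described above can be summarised with the following simplified sketch of a single PCSC-style stage; the data layout (plain dictionaries holding proposals and features) and the callables detector and message_passing are illustrative assumptions rather than the actual implementation described in Chapter 3.

from typing import Callable, Dict

def pcsc_stage(helped: Dict[str, list],
               helper: Dict[str, list],
               detector: Callable,
               message_passing: Callable) -> Dict[str, list]:
    """One illustrative cooperation stage: the helped stream (e.g. RGB) is
    improved by the helper stream (e.g. flow) at two levels."""
    # region proposal level: reuse the union of the latest proposals of both streams
    merged_proposals = helped["proposals"] + helper["proposals"]

    # feature level: refine the helped stream's features with messages
    # passed from the helper stream's features
    refined_features = message_passing(helped["features"], helper["features"])

    # re-run detection for the helped stream with the enriched inputs
    new_proposals, new_features = detector(merged_proposals, refined_features)
    return {"proposals": new_proposals, "features": new_features}

# over several stages the roles of the two streams are swapped, e.g.
# rgb = pcsc_stage(rgb, flow, rgb_detector, mp_flow_to_rgb)    # stage 1
# flow = pcsc_stage(flow, rgb, flow_detector, mp_rgb_to_flow)  # stage 2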

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, through which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector with a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve the spatio-temporal localization results at the video level.
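To illustrate how per-frame actionness scores can be turned into temporal boundaries for an action tube, the sketch below thresholds the scores produced by the class-specific classifier of the tube's predicted class and keeps the longest above-threshold run of frames. The threshold value and the simple grouping rule are illustrative assumptions, not the exact procedure of our temporal boundary refinement module.

import numpy as np

def trim_tube_by_actionness(scores: np.ndarray, threshold: float = 0.5):
    """scores: per-frame actionness probabilities (shape [T]) from the
    class-specific binary classifier of the tube's predicted action class.
    Returns (start, end) frame indices of the longest above-threshold run,
    or None if no frame passes the threshold."""
    above = scores >= threshold
    best, start = None, None
    for t, flag in enumerate(above):
        if flag and start is None:
            start = t
        if start is not None and (not flag or t == len(above) - 1):
            end = t if flag else t - 1
            if best is None or (end - start) > (best[1] - best[0]):
                best = (start, end)
            start = None
    return best

# usage example with synthetic scores
if __name__ == "__main__":
    s = np.array([0.1, 0.2, 0.7, 0.8, 0.9, 0.6, 0.3, 0.2])
    print(trim_tube_by_actionness(s))  # -> (2, 5)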

Our contributions are briefly summarized as follows

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.


Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. (The original figure compares the ground-truth segments with the segments detected by P-GCN and by our method on Frisbee Catch, Golf Swing, and Throw Discus examples, and illustrates anchor-based proposals with confidence scores as well as frame-level actionness starting/ending predictions.)

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3, and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, for which significant progress has been made in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produce more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of these two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows:

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3], and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential future directions for video localization.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize the related work on the spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition (ILSVRC-2012) at that time, which was a huge improvement over the second best (26.2%). This work sparked research interest in using deep learning methods to solve computer vision problems.
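
The following minimal PyTorch sketch (our own illustrative example, not the original AlexNet configuration) shows the basic ingredients described above: stacked convolutional layers with ReLU activations, dropout for regularization, and one training step that minimizes the cross-entropy loss by back-propagation and gradient descent.

```python
import torch
import torch.nn as nn

# A small AlexNet-style classifier: convolution + ReLU blocks followed by a
# dropout-regularized fully connected layer (illustrative, not the original).
class TinyConvNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(64 * 8 * 8, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        return self.classifier(torch.flatten(x, 1))

model = TinyConvNet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
criterion = nn.CrossEntropyLoss()

images = torch.randn(8, 3, 64, 64)      # dummy mini-batch of images
labels = torch.randint(0, 10, (8,))     # dummy ground-truth class labels

logits = model(images)                  # forward pass
loss = criterion(logits, labels)        # loss between prediction and ground truth
optimizer.zero_grad()
loss.backward()                         # back-propagate gradients
optimizer.step()                        # gradient-descent update of the weights
```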

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with smaller kernel sizes. For example, a stack of two 3×3 convolutional layers can achieve the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases without reducing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
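
As a rough illustration of this decomposition (a sketch assuming C input and C output channels and ignoring biases), stacking three 3×3 convolutions covers the same 7×7 effective receptive field as a single 7×7 convolution while using fewer parameters and adding two extra nonlinearities:

```python
import torch.nn as nn

C = 256  # number of input/output channels (illustrative value)

# A single 7x7 convolution: C * C * 7 * 7 weights.
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)

# Three stacked 3x3 convolutions with ReLU in between:
# same 7x7 effective receptive field, but 3 * C * C * 3 * 3 weights.
stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False), nn.ReLU(),
    nn.Conv2d(C, C, kernel_size=3, padding=1, bias=False),
)

def num_params(module):
    return sum(p.numel() for p in module.parameters())

print(num_params(single_7x7))   # 256 * 256 * 49 = 3,211,264
print(num_params(stacked_3x3))  # 3 * 256 * 256 * 9 = 1,769,472
```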

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with smaller kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which feeds the output of the previous inception module into a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
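
A simplified Inception-style module can be sketched as follows (channel numbers are illustrative, and the 1×1 reduction layers used in the actual Inception architecture are omitted for brevity): parallel convolutions with different kernel sizes and a pooling branch are applied to the same input, and their outputs are concatenated along the channel dimension.

```python
import torch
import torch.nn as nn

class NaiveInceptionModule(nn.Module):
    """Parallel 1x1, 3x3, 5x5 convolutions and a pooling branch,
    concatenated along the channel dimension (simplified sketch)."""
    def __init__(self, in_channels, branch_channels=32):
        super().__init__()
        self.branch1 = nn.Conv2d(in_channels, branch_channels, kernel_size=1)
        self.branch3 = nn.Conv2d(in_channels, branch_channels, kernel_size=3, padding=1)
        self.branch5 = nn.Conv2d(in_channels, branch_channels, kernel_size=5, padding=2)
        self.branch_pool = nn.Sequential(
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_channels, branch_channels, kernel_size=1),
        )

    def forward(self, x):
        outputs = [self.branch1(x), self.branch3(x), self.branch5(x), self.branch_pool(x)]
        return torch.cat(outputs, dim=1)  # multi-resolution responses side by side

x = torch.randn(2, 64, 28, 28)
module = NaiveInceptionModule(64)
print(module(x).shape)  # torch.Size([2, 128, 28, 28])
```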

VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture that allows the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely large depth to improve the image classification performance.
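
The shortcut connection can be sketched as a residual block in which the input is added element-wise to the output of a small stack of convolutions, so gradients can flow directly through the identity path (a simplified sketch; the original design also uses batch normalization and projection shortcuts when dimensions change).

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """y = F(x) + x: the identity shortcut lets gradients bypass the conv stack."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # shortcut connection

x = torch.randn(1, 64, 56, 56)
block = ResidualBlock(64)
print(block(x).shape)  # torch.Size([1, 64, 56, 56])
```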

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, with each dense block having shortcut connections to other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
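
This densely connected design can be sketched as follows (growth rate and layer count are illustrative assumptions): each layer consumes the concatenation of all preceding feature maps, so the width of the representation grows with depth.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer consumes the concatenation of all previous outputs,
    so earlier features are reused and the width grows with depth."""
    def __init__(self, in_channels, growth_rate=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.Conv2d(in_channels + i * growth_rate, growth_rate,
                          kernel_size=3, padding=1),
                nn.ReLU(),
            ))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            out = layer(torch.cat(features, dim=1))  # reuse all earlier features
            features.append(out)
        return torch.cat(features, dim=1)

x = torch.randn(1, 64, 32, 32)
print(DenseBlock(64)(x).shape)  # torch.Size([1, 192, 32, 32])
```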

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image into the corresponding patches. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image into the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time are largely reduced.
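
The key idea, extracting a fixed-size feature for every region proposal from a single shared feature map, can be sketched with the ROI pooling operation provided by torchvision (the boxes and the spatial scale below are illustrative assumptions).

```python
import torch
from torchvision.ops import roi_pool

# A shared feature map for one image, e.g. produced by a CNN backbone
# with a total stride of 16 (so spatial_scale = 1 / 16).
feature_map = torch.randn(1, 256, 50, 50)

# Region proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([
    [0.0,  10.0,  20.0, 200.0, 300.0],
    [0.0, 150.0,  40.0, 400.0, 260.0],
])

# Every proposal is pooled into a fixed 7x7 feature, regardless of its size,
# so the same detection head can classify and regress all proposals.
roi_features = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
print(roi_features.shape)  # torch.Size([2, 256, 7, 7])
```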

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors is used at each location of the generated feature map to generate region proposals by a separate network. The final detection results are obtained from a detection head by using the pooled features based on the ROI pooling operation on the generated feature map.
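
The pre-defined anchors of the RPN can be sketched as follows: at every location of the feature map, a set of boxes with several scales and aspect ratios is placed, and the network later scores and regresses each of them (the stride, scales, and ratios below are illustrative, not the exact Faster RCNN configuration).

```python
import torch

def generate_anchors(feat_h, feat_w, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Place scales x ratios anchor boxes (x1, y1, x2, y2) at every feature-map cell."""
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride  # cell center in image coords
            for s in scales:
                for r in ratios:
                    w, h = s * (r ** 0.5), s / (r ** 0.5)
                    anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return torch.tensor(anchors)

anchors = generate_anchors(feat_h=38, feat_w=50)
print(anchors.shape)  # torch.Size([17100, 4]) -> 38 * 50 * 9 anchors
```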

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called YOLO. They split the entire image into an S×S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. YOLO is faster than Faster RCNN, as it does not require region feature extraction through the time-consuming ROI pooling operation.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
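
The anchor-free decoding step can be sketched as follows: local peaks of the predicted center heat map are selected as object centers, and the predicted width and height at those locations give the boxes (a simplified sketch of the idea; the actual CenterNet additionally predicts a center offset and uses per-class heat maps).

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, score_thresh=0.3):
    """heatmap: (H, W) center-point probabilities; wh: (2, H, W) predicted box sizes."""
    # Keep only local maxima: a point survives if it equals the max in its 3x3 window.
    pooled = F.max_pool2d(heatmap[None, None], kernel_size=3, stride=1, padding=1)[0, 0]
    peaks = (heatmap == pooled) & (heatmap > score_thresh)

    ys, xs = torch.nonzero(peaks, as_tuple=True)
    boxes = []
    for y, x in zip(ys.tolist(), xs.tolist()):
        w, h = wh[0, y, x].item(), wh[1, y, x].item()
        boxes.append([x - w / 2, y - h / 2, x + w / 2, y + h / 2, heatmap[y, x].item()])
    return boxes  # [x1, y1, x2, y2, score] in heat-map coordinates

heatmap = torch.rand(128, 128)       # dummy center heat map
wh = torch.rand(2, 128, 128) * 30    # dummy width/height predictions
print(len(decode_centers(heatmap, wh)))
```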

18 Chapter 2 Literature Review

2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which takes the RGB frames from the given videos as input and outputs the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance of the action recognition task, which demonstrates that appearance and motion information are complementary to each other.

In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through their experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks achieves the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus function was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps prevent the framework from overfitting the training samples.

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.
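
A 3×3×3 convolution slides jointly over the temporal and the two spatial dimensions of a clip, as in the short sketch below (channel numbers, clip length, and pooling configuration are illustrative assumptions rather than the exact C3D design).

```python
import torch
import torch.nn as nn

# Input clip: (batch, channels, frames, height, width).
clip = torch.randn(1, 3, 16, 112, 112)

# A C3D-style block: the 3x3x3 convolution models appearance and motion jointly,
# followed by a pooling layer that only reduces the spatial resolution.
block = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=(3, 3, 3), padding=1),
    nn.ReLU(),
    nn.MaxPool3d(kernel_size=(1, 2, 2), stride=(1, 2, 2)),
)

print(block(clip).shape)  # torch.Size([1, 64, 16, 56, 56])
```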

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be added. The proposed framework generates frame-level detections, and the Viterbi algorithm is used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60] in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes that form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form the representation of the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than action tubes linked from frame-level detections, because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors are temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP is able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed into a classification network based on C3D to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consists of two parts. In the first part, 3D convolutional layers are used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers are applied in the spatial domain to reduce the resolution, while deconvolutional layers are employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos are first divided into snippets, and an actionness classifier is used to predict the binary actionness scores for each snippet. Consecutive snippets with high actionness scores are grouped to form proposals, which are used to build a feature pyramid to evaluate their action completeness. Their method improved the performance on the temporal action localization task compared with other state-of-the-art methods.

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict action classes for the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method achieves results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Either line of the aforementioned works could take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with stage-wise training schemes, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph is used to produce cross-modality spatio-temporal representations, which are then used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture is used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube is then used to extract features, which are fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations are then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process is used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage, and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based and anchor-based methods; the recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization, and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on our observations on the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification, and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region (resp. temporal segment) proposals and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatio-temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification, and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, the state-of-the-art human detection methods are utilized to obtain precise object proposals, which includes the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].

Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion, and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization forms action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity, and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26], and thresholding-based refinement [8], etc.

Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score of each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
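
The three fusion strategies discussed in [8] can be sketched as follows (a simplified illustration using generic per-stream outputs; the class count and layer configurations are assumptions for the sketch, not the exact setup of [8]).

```python
import torch
import torch.nn as nn

num_classes = 24
rgb_logits = torch.randn(1, num_classes)    # pre-softmax output of the RGB stream
flow_logits = torch.randn(1, num_classes)   # pre-softmax output of the flow stream

# (i) Average the softmax outputs of the two streams.
fused_i = (torch.softmax(rgb_logits, 1) + torch.softmax(flow_logits, 1)) / 2

# (ii) Learn per-class weights for the pre-softmax outputs, then apply softmax.
w_rgb = nn.Parameter(torch.ones(num_classes))
w_flow = nn.Parameter(torch.ones(num_classes))
fused_ii = torch.softmax(w_rgb * rgb_logits + w_flow * flow_logits, 1)

# (iii) Train a fully connected layer on the concatenated per-stream outputs.
fusion_fc = nn.Linear(2 * num_classes, num_classes)
fused_iii = torch.softmax(fusion_fc(torch.cat([rgb_logits, flow_logits], dim=1)), 1)

print(fused_i.argmax(1), fused_ii.argmax(1), fused_iii.argmax(1))
```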

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy; they are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1(a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results.


Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).


Specifically, feature-level cooperation first extracts RGB/flow features from these ROIs and then refines these RGB/flow features via a message-passing module (shown in Fig. 3.1 (b) and Fig. 3.1 (c)), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.
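For concreteness, the overall training objective can be written as the sum of the per-stage detection losses; the exact composition of each stage's loss (a classification term plus a box regression term, as in Faster R-CNN) is an assumption based on the detection head described above rather than a formula stated in the text:

$$\mathcal{L}_{\mathrm{overall}} = \sum_{t} \Big( \mathcal{L}^{(t)}_{\mathrm{cls}} + \mathcal{L}^{(t)}_{\mathrm{reg}} \Big),$$

where $t$ indexes the cooperation stages.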

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundary is further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method we use the region proposals from one stream to help the other stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of the other stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from the other stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
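A minimal sketch of one such proposal-cooperation step is given below. The helper names (`nms`, `detection_head`) and the concrete NMS thresholds are assumptions for illustration; the only point taken from the text is that the boxes borrowed from the other stream are filtered with a lower NMS threshold before being merged with the current stream's proposals.

```python
# Hypothetical sketch of one proposal-cooperation stage (not the official implementation).
def proposal_cooperation_stage(B_own_prev, B_other_latest, nms, detection_head,
                               nms_thresh_own=0.7, nms_thresh_other=0.5):
    """Refine proposals for the current stream using the other stream's detections.

    B_own_prev     : list of detection boxes of this stream from two stages earlier (B_{t-2})
    B_other_latest : list of the latest detection boxes from the other stream (B_{t-1})
    """
    # A lower NMS threshold removes more redundant boxes borrowed from the other stream.
    borrowed = nms(B_other_latest, thresh=nms_thresh_other)
    own = nms(B_own_prev, thresh=nms_thresh_own)

    # P_t = B_{t-2} (own stream) union B_{t-1} (other stream)
    P_t = own + borrowed

    # B_t = G(P_t): run the detection head on the refined proposal set.
    B_t = detection_head(P_t)
    return P_t, B_t
```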

With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from the other stream, which helps reduce the number of missed bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for the other stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views is not exploited for the testing data. In contrast, in our work the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway, and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f, and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $\{C^{i}_2, C^{i}_3, C^{i}_4, C^{i}_5\}$, where $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $\{U^{i}_2, U^{i}_3, U^{i}_4, U^{i}_5\}$.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they help each other for feature refinement.

We pass messages between the feature maps in the bottom-up pathways of the two streams. Denote by $l$ the index for the set of feature maps $\{C^{i}_2, C^{i}_3, C^{i}_4, C^{i}_5\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_l$ with the help of the features $C^{Flow}_l$ as follows:

$$\hat{C}^{RGB}_l = f_\theta(C^{Flow}_l) \oplus C^{RGB}_l, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{Flow}_l)$ nonlinearly extracts the message from the feature $C^{Flow}_l$ and then uses the extracted message to improve the features $C^{RGB}_l$.

The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and resolution as $C^{RGB}_l$ and $C^{Flow}_l$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps $C^{RGB}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_l$ in the top-down pathway are generated accordingly.
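As an illustration of this design, the following is a minimal PyTorch-style sketch of the message-passing block for 3D (spatio-temporal) I3D feature maps. The module name and the channel-reduction ratio are assumptions for illustration rather than the exact configuration used in the thesis.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """f_theta in Eqn (3.1): two 1x1 convolutions with a ReLU in between.

    The first 1x1 convolution reduces the channel dimension, the second
    restores it, and the extracted message is added element-wise to the
    features of the target stream.
    """
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.reduce = nn.Conv3d(channels, channels // reduction, kernel_size=1)
        self.relu = nn.ReLU(inplace=True)
        self.restore = nn.Conv3d(channels // reduction, channels, kernel_size=1)

    def forward(self, c_source, c_target):
        # Message extracted from the source stream (e.g., flow) ...
        message = self.restore(self.relu(self.reduce(c_source)))
        # ... is added element-wise to the target stream (e.g., RGB).
        return c_target + message
```

The same structure can be applied to ROI features at each cooperation stage; only the feature-map shapes change.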

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the subsequent message-passing stages.


Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the ROI features $F^{RGB}_t$ and $F^{Flow}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{Flow}_t$ of the flow stream is used to help improve the ROI feature $F^{RGB}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{RGB}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{RGB}_t$ of the RGB stream is used to help improve the ROI feature $F^{Flow}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
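A minimal sketch of this label-assignment rule is given below; the box format (x1, y1, x2, y2) and the helper names are assumptions for illustration.

```python
import numpy as np

def iou_matrix(boxes_a, boxes_b):
    """Pairwise IoU between two sets of boxes in (x1, y1, x2, y2) format."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-10)

def assign_labels(proposals, gt_boxes, pos_thresh=0.5):
    """Label proposals as positive (1) if their IoU with any ground-truth box
    is higher than pos_thresh, and negative (0) otherwise."""
    if len(gt_boxes) == 0:
        return np.zeros(len(proposals), dtype=np.int64)
    best_iou = iou_matrix(proposals, gt_boxes).max(axis=1)
    return (best_iou > pos_thresh).astype(np.int64)
```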

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling.


Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector (AD) and the Action Segment Detector (ASD) to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment-proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^{i}_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^{i}_t$, where the superscript $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.


Although this linking strategy is robust to missed detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., whether an action is happening) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundary of the action tubes. These samples are therefore "confusing samples" and are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of this action. The output of the actionness classifier is the probability of actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from this action tube, and we can thereby refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundary to obtain better action tubes at the testing stage.
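A minimal sketch of this test-time refinement step (median filtering of per-frame actionness scores followed by thresholding) is shown below. The window size and threshold follow the values reported in Section 3.4.1, while the score container is a plain NumPy array used only for illustration.

```python
import numpy as np
from scipy.ndimage import median_filter

def refine_tube(actionness_scores, window=17, threshold=0.01):
    """Return a boolean mask of the frames kept in the refined action tube.

    actionness_scores : 1D array with one class-specific actionness score
                        per frame-level bounding box in the tube.
    """
    smoothed = median_filter(actionness_scores, size=window, mode="nearest")
    return smoothed >= threshold

# Example: drop low-actionness frames near the tube boundaries.
scores = np.array([0.002, 0.004, 0.3, 0.8, 0.9, 0.85, 0.4, 0.005, 0.003])
keep = refine_tube(scores, window=3, threshold=0.01)
```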

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction, and the detection head are introduced below.

(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on anchor-specific features.


Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

Specifically, for each anchor scale, a sub-network with the same architecture but a different dilation rate is employed to generate features with different receptive fields and to decide the temporal boundaries. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with a temporal span of s/2 before the start and after the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
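The following PyTorch-style sketch illustrates one such anchor-specific SPN branch: a dilated temporal convolution stack followed by two parallel kernel-size-1 temporal convolutions for boundary regression and actionness prediction. The channel widths, number of layers, and dilation value are assumptions for illustration, not the exact configuration from [5] or the thesis.

```python
import torch
import torch.nn as nn

class SPNBranch(nn.Module):
    """One anchor-specific branch of the Segment Proposal Network (sketch).

    Input : (batch, channels, T) temporal features pooled from an action tube.
    Output: per-location boundary offsets (start, end) and actionness logits.
    """
    def __init__(self, channels=512, dilation=2):
        super().__init__()
        # Dilated temporal convolutions enlarge the receptive field so that it
        # covers roughly s/2 of context before and after an anchor of scale s.
        self.features = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=3, padding=dilation, dilation=dilation),
            nn.ReLU(inplace=True),
        )
        # Two parallel temporal convolutions with kernel size 1.
        self.boundary_reg = nn.Conv1d(channels, 2, kernel_size=1)   # start/end offsets
        self.actionness = nn.Conv1d(channels, 1, kernel_size=1)     # actionness logit

    def forward(self, x):
        h = self.features(x)
        return self.boundary_reg(h), self.actionness(h)
```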

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
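A minimal sketch of this segment-feature construction is given below, using adaptive average pooling as a stand-in for temporal ROI-Pooling; the tube-feature layout (D × T) and the context length 2l_s/5 follow the description above, while the helper names and the pooled output lengths are assumptions.

```python
import torch
import torch.nn.functional as F

def temporal_roi_pool(tube_feat, start, end, out_len):
    """Pool the (D, T) tube features within [start, end) to a fixed length."""
    D, T = tube_feat.shape
    s = max(0, int(round(start)))
    e = min(T, max(s + 1, int(round(end))))
    region = tube_feat[:, s:e].unsqueeze(0)                   # (1, D, e - s)
    return F.adaptive_avg_pool1d(region, out_len).squeeze(0)  # (D, out_len)

def build_segment_feature(tube_feat, t_start, t_end, center_len, edge_len):
    """Concatenate F_start, F_center and F_end into the segment feature F_seg."""
    l_s = t_end - t_start
    ctx = 2.0 * l_s / 5.0   # context length used for the start/end features
    f_center = temporal_roi_pool(tube_feat, t_start, t_end, center_len)
    f_start = temporal_roi_pool(tube_feat, t_start - ctx / 2, t_start + ctx / 2, edge_len)
    f_end = temporal_roi_pool(tube_feat, t_end - ctx / 2, t_end + ctx / 2, edge_len)
    return torch.cat([f_start, f_center, f_end], dim=1)        # (D, edge + center + edge)
```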

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that predicting actionness for each frame helps to extract discriminative features, which eventually helps refine the boundaries of the segments.

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1 × 1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of both stages and applying NMS to remove redundancy and obtain the final results.
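As a small illustration of the temporal variant of the message-passing block described above, the 1 × 1 spatial convolutions are replaced by temporal convolutions with kernel size 1. The sketch below mirrors the earlier MessagePassing module; the layer widths and reduction ratio are again assumptions rather than the exact configuration used in the thesis.

```python
import torch.nn as nn

class TemporalMessagePassing(nn.Module):
    """Temporal counterpart of the message-passing block used in T-PCSC:
    two temporal convolutions with kernel size 1 (channel reduce / restore),
    whose output message is added element-wise to the target stream's
    segment features of shape (batch, channels, T)."""
    def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
        )

    def forward(self, f_source, f_target):
        return f_target + self.net(f_source)
```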


3.4 Experimental Results

We introduce our experimental setup and datasets in Section 3.4.1, compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). For each ground-truth box, we compute the IoU score of the detected bounding box with the largest overlap with that ground-truth box, and average these IoU scores within each class. The MABO is then calculated by averaging the per-class averaged IoU scores over all classes.
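A minimal sketch of this MABO computation is given below, assuming per-class arrays of ground-truth and detected boxes; the helper and container names are for illustration only.

```python
import numpy as np

def box_iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-10)

def mabo(gt_by_class, det_by_class):
    """gt_by_class / det_by_class: dicts mapping class -> (N, 4) arrays of boxes."""
    abo_per_class = []
    for cls, gt_boxes in gt_by_class.items():
        dets = det_by_class.get(cls, np.zeros((0, 4)))
        if len(dets) == 0:
            abo_per_class.append(0.0)
            continue
        # Best overlap of each ground-truth box with any detection, averaged per class.
        best = [box_iou(gt, dets).max() for gt in gt_boxes]
        abo_per_class.append(float(np.mean(best)))
    return float(np.mean(abo_per_class))
```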


To evaluate the classification performance of our method, we report the average confidence score for different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and by another factor of 10 at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of the action tubes, which is empirically set as 0.01. For segment generation, we apply anchors with the following K scales (K = 11 in this work): 6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞), i.e., we have M = 4, and the corresponding feature lengths l_t for these length ranges are set to 15, 30, 60, 120. The numbers of scales and lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and by another factor of 10 at the 5th epoch.

3.4.2 Comparison with the State-of-the-art Methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results when only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                   Streams     mAPs
Weinzaepfel, Harchaoui and Schmid [83]   RGB+Flow    35.8
Peng and Schmid [54]                     RGB+Flow    65.7
Ye, Yang and Tian [94]                   RGB+Flow    67.0
Kalogeiton et al. [26]                   RGB+Flow    69.5
Gu et al. [19]                           RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]            RGB         73.2
Faster R-CNN + FPN (Flow) [38]           Flow        64.7
Faster R-CNN + FPN [38]                  RGB+Flow    75.5
PCSC (Ours)                              RGB+Flow    79.2

Table 3.1 shows the frame-level mAPs from different methods on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for per-frame action detection.


Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector.

Method                                 Streams          0.2    0.5    0.75   0.5:0.95
Weinzaepfel et al. [83]                RGB+Flow         46.8   -      -      -
Zolfaghari et al. [105]                RGB+Flow+Pose    47.6   -      -      -
Peng and Schmid [54]                   RGB+Flow         73.5   32.1   2.7    7.3
Saha et al. [60]                       RGB+Flow         66.6   36.4   0.8    14.4
Yang, Gao and Nevatia [93]             RGB+Flow         73.5   37.8   -      -
Singh et al. [67]                      RGB+Flow         73.5   46.3   15.0   20.4
Ye, Yang and Tian [94]                 RGB+Flow         76.2   -      -      -
Kalogeiton et al. [26]                 RGB+Flow         76.5   49.2   19.7   23.4
Li et al. [31]                         RGB+Flow         77.9   -      -      -
Gu et al. [19]                         RGB+Flow         -      59.9   -      -
Faster R-CNN + FPN (RGB) [38]          RGB              77.5   51.3   14.2   21.9
Faster R-CNN + FPN (Flow) [38]         Flow             72.7   44.2   7.4    17.2
Faster R-CNN + FPN [38]                RGB+Flow         80.1   53.2   15.9   23.7
PCSC + AD [69]                         RGB+Flow         84.3   61.0   23.0   27.8
PCSC + ASD + AD + T-PCSC (Ours)        RGB+Flow         84.7   63.1   23.6   28.9

However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs from different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                           Streams     mAPs
Peng and Schmid [54]             RGB+Flow    58.5
Ye, Yang and Tian [94]           RGB+Flow    63.2
Kalogeiton et al. [26]           RGB+Flow    65.7
Hou, Chen and Shah [21]          RGB         61.3
Gu et al. [19]                   RGB+Flow    73.3
Sun et al. [72]                  RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]    RGB         64.7
Faster R-CNN + FPN (Flow) [38]   Flow        68.4
Faster R-CNN + FPN [38]          RGB+Flow    70.2
PCSC (Ours)                      RGB+Flow    74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4).


Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method                                 Streams          0.2    0.5    0.75   0.5:0.95
Gkioxari and Malik [18]                RGB+Flow         -      53.3   -      -
Wang et al. [79]                       RGB+Flow         -      56.4   -      -
Weinzaepfel et al. [83]                RGB+Flow         63.1   60.7   -      -
Saha et al. [60]                       RGB+Flow         72.6   71.5   43.3   40.0
Peng and Schmid [54]                   RGB+Flow         74.1   73.1   -      -
Singh et al. [67]                      RGB+Flow         73.8   72.0   44.5   41.6
Kalogeiton et al. [26]                 RGB+Flow         74.2   73.7   52.1   44.8
Ye, Yang and Tian [94]                 RGB+Flow         75.8   73.8   -      -
Zolfaghari et al. [105]                RGB+Flow+Pose    78.2   73.4   -      -
Hou, Chen and Shah [21]                RGB              78.4   76.9   -      -
Gu et al. [19]                         RGB+Flow         -      78.6   -      -
Sun et al. [72]                        RGB+Flow         -      80.1   -      -
Faster R-CNN + FPN (RGB) [38]          RGB              74.5   73.9   54.1   44.2
Faster R-CNN + FPN (Flow) [38]         Flow             77.9   76.1   56.1   46.3
Faster R-CNN + FPN [38]                RGB+Flow         79.1   78.5   57.2   47.6
PCSC (Ours)                            RGB+Flow         82.6   82.2   63.1   52.8

When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

At the frame level (see Table 3.3), our PCSC method performs the second best, and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector, while "ASD" is for the action segment detector.

Method                          Video mAP (%)
Faster R-CNN + FPN [38]         53.2
PCSC                            56.4
PCSC + AD                       61.0
PCSC + ASD (single branch)      58.4
PCSC + ASD                      60.7
PCSC + ASD + AD                 61.9
PCSC + ASD + AD + T-PCSC        63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, after which we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method with or without the feature-level cooperation module improves as the number of stages increases. However, the improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both the region-proposal-level and the feature-level cooperation contribute to the performance improvement.

Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%.


Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detection method on the UCF-101-24 dataset.

Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4

After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC is already able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about a 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method with different numbers of stages, to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC + ASD + AD" method.


Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detection method on the UCF-101-24 dataset.

Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ      Stage 1   Stage 2   Stage 3   Stage 4
0.5    55.3      55.7      56.3      57.0
0.6    62.1      63.2      64.3      65.5
0.7    68.0      68.5      69.4      69.5
0.8    73.9      74.5      74.8      75.1
0.9    78.4      78.4      79.5      80.1

Table 3.11: Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                    0     1     2     3     4
Detection speed (FPS)    3.1   2.9   2.8   2.6   2.4

Table 3.12: Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                    0     1     2
Detection speed (VPS)    2.1   1.7   1.5

Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with the flow-stream proposals is larger than 0.7, over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from the two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore the two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
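A minimal sketch of this LoR statistic is given below; it reuses a standard pairwise IoU helper (assumed here, e.g., the iou_matrix sketch shown earlier) and follows the 0.7 mIoU threshold described above.

```python
import numpy as np

def large_overlap_ratio(rgb_proposals, flow_proposals, iou_matrix_fn, thresh=0.7):
    """Fraction of RGB proposals whose maximum IoU (mIoU) with any flow
    proposal exceeds the threshold.

    rgb_proposals, flow_proposals : (N, 4) and (M, 4) arrays of boxes
    iou_matrix_fn                 : function returning the (N, M) IoU matrix
    """
    if len(rgb_proposals) == 0 or len(flow_proposals) == 0:
        return 0.0
    ious = iou_matrix_fn(rgb_proposals, flow_proposals)   # (N, M)
    miou = ious.max(axis=1)                               # best match per RGB proposal
    return float((miou > thresh).mean())
```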

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method at each stage. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, a clear improvement can be observed after each stage, which demonstrates the effectiveness of our PCSC method for both localization and classification.



Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both methods decreases slightly as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.



Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c); however, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process using our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame, although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5 (c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases where our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance could be reduced after the temporal refinement process using our TBR module, the instance could not be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., the two-stream Faster-RCNN [38]) cannot handle this challenging case well either. Therefore, how to reduce false detections from the background frames is still an open issue, which will be studied in our future work.

35 Summary

In this chapter we have proposed the Progressive Cross-stream Coop-eration (PCSC) framework to improve the spatio-temporal action local-ization results in which we first use our PCSC framework for spatiallocalization at the frame level and then apply our temporal PCSC frame-work for temporal localization at the action tube level Our PCSC frame-work consists of several iterative stages At each stage we progressivelyimprove action localization results for one stream (ie RGBflow) byleveraging the information from another stream (ie RGBflow) at bothregion proposal level and feature level The effectiveness of our newlyproposed approaches is demonstrated by extensive experiments on bothUCF-101-24 and J-HMDB datasets

Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.

4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed to apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, temporal action localization methods often employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.

4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first class, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details. However, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine the complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial to fully exploit the complementarity between the appearance and motion clues.

4.2 Our Approach

In the following section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon the effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.

Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features $B^{RGB}$ and $B^{Flow}$, each stream generates a set of temporal segment proposals of interest (called proposals in the later text), i.e., $\mathcal{P}^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$ and $P_0^{Flow}$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$; these features can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S_0^{RGB}$ and $S_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
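The pairing-and-filtering step described above can be summarized with the following sketch, where the per-frame starting and ending probabilities are given as NumPy arrays; the score threshold, the valid duration range and the proposal scoring rule are illustrative placeholders rather than the exact settings used in this chapter.

```python
import numpy as np

def tiou(p, q):
    """Temporal IoU between two (t_start, t_end) pairs."""
    inter = max(0, min(p[1], q[1]) - max(p[0], q[0]))
    union = max(p[1], q[1]) - min(p[0], q[0])
    return inter / union if union > 0 else 0.0

def frame_based_proposals(start_prob, end_prob, score_thr=0.5,
                          min_len=1, max_len=64, nms_thr=0.8):
    """Pair confident starting/ending frames into segment proposals and
    remove redundant pairs with temporal NMS."""
    starts = np.where(start_prob > score_thr)[0]
    ends = np.where(end_prob > score_thr)[0]
    candidates = [(ts, te, start_prob[ts] * end_prob[te])
                  for ts in starts for te in ends
                  if min_len <= te - ts <= max_len]    # keep valid durations only
    candidates.sort(key=lambda c: c[2], reverse=True)  # most confident first
    keep = []
    for c in candidates:
        if all(tiou(c, k) < nms_thr for k in keep):
            keep.append(c)
    return [(ts, te) for ts, te, _ in keep]
```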

Progressive Cross-granularity Cooperation Framework. An overview of this framework is shown in Fig. 4.1(a).



Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $S_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $S_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


Given the two-stream features and proposals generated by the preceding two proposal generation methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there is sufficient cross-granularity and cross-stream cooperation within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $S_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that, within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since rich cooperation can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages will be faithfully propagated to the subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-frame Cooperation Module

The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, depending on the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $S_0^{RGB}$, $S_0^{Flow}$, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as
$$Q_n^{Flow} = \mathcal{M}_{\mathrm{anchor}\rightarrow \mathrm{frame}}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$
where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{\mathrm{anchor}\rightarrow \mathrm{frame}}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and it therefore generates better frame-based proposals $\mathcal{Q}_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{\mathrm{anchor}\rightarrow \mathrm{frame}}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}_n^{RGB}$.
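As a concrete illustration, a minimal PyTorch sketch of the message passing function $\mathcal{M}_{\mathrm{anchor}\rightarrow \mathrm{frame}}$ in Eq. (4.1) is given below; the channel number and the (batch, channels, time) feature layout are assumptions rather than the exact configuration used in our implementation.

```python
import torch
import torch.nn as nn

class AnchorToFrameMP(nn.Module):
    """Cross-granularity message passing (Eq. 4.1): two stacked 1x1
    temporal convolutions with ReLU, whose output is added element-wise
    to the frame-based feature. The channel size C is an assumption."""
    def __init__(self, channels=512):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, anchor_feat, frame_feat):
        # Q_n = M(P_{n-1}) (+) Q_{n-2}, features shaped (batch, C, T)
        return self.transform(anchor_feat) + frame_feat
```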

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q_n^{Flow}$. We then combine $\mathcal{Q}_n^{Flow}$ together with the two-stream anchor-based proposals $S_{n-1}^{Flow}$ and $S_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $\mathcal{P}_n^{RGB} = S_{n-2}^{RGB} \cup S_{n-1}^{Flow} \cup \mathcal{Q}_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}_n^{Flow} = S_{n-2}^{Flow} \cup S_{n-1}^{RGB} \cup \mathcal{Q}_n^{RGB}$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to Eq. (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{\mathrm{flow}\rightarrow \mathrm{rgb}}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{\mathrm{rgb}\rightarrow \mathrm{flow}}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}_n^{RGB}$/$\mathcal{P}_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$/$P_n^{Flow}$. To increase the perceptual aperture of the action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.
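The extended SOI pooling described above can be sketched as follows, assuming 1D proposal features of shape (C, T); linear interpolation is used here as a stand-in for the SOI pooling operation of [5], and the fixed output length is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def soi_pool(feat, t_start, t_end, out_len=16):
    """1D segment-of-interest pooling: crop feat[:, t_start:t_end] and
    resize it to a fixed temporal length. feat has shape (C, T)."""
    seg = feat[:, t_start:t_end + 1].unsqueeze(0)          # (1, C, L)
    return F.interpolate(seg, size=out_len, mode="linear",
                         align_corners=False).squeeze(0)   # (C, out_len)

def proposal_feature(feat, t_start, t_end, context=10, out_len=16):
    """Concatenate the pooled segment feature with features pooled around
    the starting and ending indices (interval `context`)."""
    T = feat.shape[1]
    center = soi_pool(feat, t_start, t_end, out_len)
    start_ctx = soi_pool(feat, max(0, t_start - context),
                         min(T - 1, t_start + context), out_len)
    end_ctx = soi_pool(feat, max(0, t_end - context),
                       min(T - 1, t_end + context), out_len)
    return torch.cat([center, start_ctx, end_ctx], dim=1)  # (C, 3*out_len)
```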

4.2.4 Training Details

At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which indicates the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
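A minimal sketch of the per-stage training objective is given below; the tensor shapes, the use of binary cross-entropy with logits for the frame-wise starting/ending supervision, and the equal loss weights are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels,
                   reg_pred, reg_targets,
                   start_logits, start_labels,
                   end_logits, end_labels):
    """Per-stage loss: action classification (cross-entropy) and boundary
    regression (smooth L1) for the anchor-based branch, plus frame-wise
    starting/ending classification for the frame-based branch."""
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_targets)
    loss_frame = (F.binary_cross_entropy_with_logits(start_logits, start_labels)
                  + F.binary_cross_entropy_with_logits(end_logits, end_labels))
    # all losses are summed and the model is trained jointly
    return loss_cls + loss_reg + loss_frame
```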

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or removing the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework and is thus to some extent similar to the method proposed in [69]. However, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results can capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5), and it is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that even though


each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals will not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. The number of proposals will not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the minibatch size to 256 for the anchor-based proposal generation network and to 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
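For illustration, the learning rate schedule described above can be expressed with a standard step scheduler; the model and the SGD hyper-parameters below are placeholders rather than the actual training setup.

```python
import torch

# Placeholder model; initial learning rate 0.001, divided by 10 every 12 epochs.
model = torch.nn.Conv1d(2048, 256, kernel_size=3, padding=1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=12, gamma=0.1)

for epoch in range(36):
    # ... run one training epoch over the minibatches ...
    scheduler.step()
```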

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold $\delta$ and its action label is correctly predicted. To measure the overlapping area, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
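The correctness criterion can be illustrated with the following sketch, where each segment is represented by its starting and ending times.

```python
def temporal_iou(seg, gt):
    """tIoU between a detected segment and a ground-truth instance,
    each given as a (t_start, t_end) pair."""
    inter = max(0.0, min(seg[1], gt[1]) - max(seg[0], gt[0]))
    union = max(seg[1], gt[1]) - min(seg[0], gt[0])
    return inter / union if union > 0 else 0.0

def is_correct(seg, seg_label, gt, gt_label, delta=0.5):
    """A detection counts as correct if its tIoU with a ground-truth
    instance exceeds the threshold delta and the action label matches."""
    return seg_label == gt_label and temporal_iou(seg, gt) > delta
```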

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues. When using the tIoU threshold $\delta = 0.5$,


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU thresh δ    | 0.1  | 0.2  | 0.3  | 0.4  | 0.5
SCNN [63]        | 47.7 | 43.5 | 36.3 | 28.7 | 19.0
REINFORCE [95]   | 48.9 | 44.0 | 36.0 | 26.4 | 17.1
CDC [62]         | -    | -    | 40.1 | 29.4 | 23.3
SMS [98]         | 51.0 | 45.2 | 36.5 | 27.8 | 17.8
TURN [15]        | 54.0 | 50.9 | 44.1 | 34.9 | 25.6
R-C3D [85]       | 54.5 | 51.5 | 44.8 | 35.6 | 28.9
SSN [103]        | 66.0 | 59.4 | 51.9 | 41.0 | 29.8
BSN [37]         | -    | -    | 53.5 | 45.0 | 36.9
MGG [46]         | -    | -    | 53.9 | 46.8 | 37.4
BMN [35]         | -    | -    | 56.0 | 47.4 | 38.8
TAL-Net [5]      | 59.8 | 57.1 | 53.2 | 48.5 | 42.8
P-GCN [99]       | 69.5 | 67.8 | 63.6 | 57.8 | 49.1
ours             | 71.2 | 68.9 | 65.1 | 59.5 | 51.2

our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. In contrast, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider the proposal-to-proposal relations in our detection head; our method can still achieve an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds $\delta = \{0.5, 0.75, 0.95\}$ on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds $\delta$ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and


Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ | 0.5   | 0.75  | 0.95 | Average
CDC [62]         | 45.30 | 26.00 | 0.20 | 23.80
TCN [9]          | 36.44 | 21.15 | 3.90 | -
R-C3D [85]       | 26.80 | -     | -    | -
SSN [103]        | 39.12 | 23.48 | 5.49 | 23.98
TAL-Net [5]      | 38.23 | 18.30 | 1.30 | 20.22
P-GCN [99]       | 42.90 | 28.14 | 2.47 | 26.99
ours             | 44.31 | 29.85 | 5.47 | 28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ | 0.5   | 0.75  | 0.95 | Average
BSN [37]         | 46.45 | 29.96 | 8.02 | 30.03
P-GCN [99]       | 48.26 | 33.16 | 3.27 | 31.11
BMN [35]         | 50.07 | 34.78 | 8.29 | 33.85
ours             | 52.04 | 35.92 | 7.97 | 34.91

BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN; the resulting average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds $\delta = \{0.2, 0.5, 0.75\}$ as well as the average mAP when $\delta$ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold $\delta = 0.5$.


Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU thresh δ | 0.2  | 0.5  | 0.75 | 0.5:0.95
LTT [83]        | 46.8 | -    | -    | -
MRTSR [54]      | 73.5 | 32.1 | 2.7  | 7.3
MSAT [60]       | 66.6 | 36.4 | 7.9  | 14.4
ROAD [67]       | 73.5 | 46.3 | 15.0 | 20.4
ACT [26]        | 76.5 | 49.2 | 19.7 | 23.4
TAD [19]        | -    | 59.9 | -    | -
PCSC [69]       | 84.3 | 61.0 | 23.0 | 27.8
ours            | 84.9 | 63.5 | 23.7 | 29.2

4.3.3 Ablation study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performances of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which case $\mathcal{Q}_n^{Flow}$ and $\mathcal{Q}_n^{RGB}$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features as lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which case the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ (for the RGB-stream AFC module) and


Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage      | 0     | 1     | 2     | 3     | 4
AFC        | 44.75 | 46.92 | 48.14 | 48.37 | 48.22
w/o CS     | 44.21 | 44.96 | 45.47 | 45.48 | 45.37
w/o CG     | 43.55 | 45.95 | 46.71 | 46.65 | 46.67
w/o CS-MP  | 44.75 | 45.34 | 45.51 | 45.53 | 45.49
w/o CG-MP  | 44.75 | 46.32 | 46.97 | 46.85 | 46.79
w/o PC     | 42.14 | 43.21 | 45.14 | 45.74 | 45.77

$P_{n-1}^{RGB}$ and $P_{n-2}^{Flow}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions on the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold $\delta = 0.5$ as an example to report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $\mathcal{P}_0^{RGB} \cup \mathcal{P}_0^{Flow}$ for "w/o CG" and $\mathcal{Q}_0^{RGB} \cup \mathcal{Q}_0^{Flow}$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved


Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage | Comparison                                        | LoR (%)
1     | $\mathcal{P}_1^{RGB}$ vs. $S_0^{Flow}$             | 67.1
2     | $\mathcal{P}_2^{Flow}$ vs. $\mathcal{P}_1^{RGB}$   | 79.2
3     | $\mathcal{P}_3^{RGB}$ vs. $\mathcal{P}_2^{Flow}$   | 89.5
4     | $\mathcal{P}_4^{Flow}$ vs. $\mathcal{P}_3^{RGB}$   | 97.1

when the number of stages increases, and the results become stable after 2 or 3 stages.

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets $\mathcal{P}_1^{RGB}$ at Stage 1 and $\mathcal{P}_2^{Flow}$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $\mathcal{P}_2^{Flow}$ among all proposals in $\mathcal{P}_2^{Flow}$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $\mathcal{P}_1^{RGB}$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $\mathcal{P}_2^{Flow}$) is calculated by comparing it with all proposals in the other proposal set (say $\mathcal{P}_1^{RGB}$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
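The LoR computation can be summarized with the following sketch, which assumes each proposal is a (start, end) pair and the preceding proposal set is non-empty.

```python
def lor_score(proposals, preceding, thr=0.7):
    """Large-over-Ratio (LoR): the percentage of proposals whose max-tIoU
    with the preceding-stage proposal set exceeds the threshold."""
    def tiou(p, q):
        inter = max(0.0, min(p[1], q[1]) - max(p[0], q[0]))
        union = max(p[1], q[1]) - min(p[0], q[0])
        return inter / union if union > 0 else 0.0

    similar = sum(1 for p in proposals
                  if max(tiou(p, q) for q in preceding) > thr)
    return 100.0 * similar / len(proposals)
```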

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $\mathcal{P}_4^{Flow}$ and those in $\mathcal{P}_3^{RGB}$ are very similar, even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
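A sketch of the AR@AN computation is given below, assuming the proposals of each video are already sorted by confidence; the helper follows the eleven tIoU thresholds between 0.5 and 1.0 described above.

```python
import numpy as np

def average_recall_at_an(proposals_per_video, gts_per_video, an=100,
                         tiou_thresholds=np.arange(0.5, 1.01, 0.05)):
    """AR@AN: keep the top `an` proposals per video, compute the recall of
    ground-truth instances at each tIoU threshold, and average over
    thresholds and videos."""
    def tiou(p, g):
        inter = max(0.0, min(p[1], g[1]) - max(p[0], g[0]))
        union = max(p[1], g[1]) - min(p[0], g[0])
        return inter / union if union > 0 else 0.0

    recalls = []
    for props, gts in zip(proposals_per_video, gts_per_video):
        props = props[:an]
        for thr in tiou_thresholds:
            hit = sum(1 for g in gts if any(tiou(p, g) >= thr for p in props))
            recalls.append(hit / max(len(gts), 1))
    return float(np.mean(recalls))
```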

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and by our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.



Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period from the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and by our


PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy that encourages collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our


newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et


al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely $V = \{V_k^{clip}\}_{k=1}^{K}$, where $V_k^{clip}$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.



Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


This task requires both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, and is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F_k^{clip} \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.

Multi-modal Feature Learning. Given the visual input feature $F_k^{clip}$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes that the bounding boxes are generated by using pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules of the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$.



Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch together with the textual input (i.e., the textual output from the last layer) to generate the initial text-guided visual feature with the size of (T + HW) × C, which is then decomposed into the text-guided temporal feature with the size of T × C and the text-guided spatial feature with the size of HW × C. These two features are then respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of T × HW × C. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the output from the visual and textual branches in the last co-attention layer as the text-guided visual feature F_tv ∈ R^(T×HW×C) and the visual-guided textual feature, respectively.
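A minimal PyTorch-style sketch of the STCD computation described above is given below; tensor shapes follow the text ((B, T, HW, C) for the visual input), while the hidden size, number of attention heads, and the handling of the textual branch are simplifying assumptions.

```python
import torch
import torch.nn as nn

class STCD(nn.Module):
    """Sketch of the Spatio-temporal Combination and Decomposition (STCD) module.
    The combined visual feature would additionally be passed to the textual branch
    as its key/value, which is not shown here."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, vis, txt):
        B, T, HW, C = vis.shape
        temporal_feat = vis.mean(dim=2)                         # spatial avg pooling -> (B, T, C)
        spatial_feat = vis.mean(dim=1)                          # temporal avg pooling -> (B, HW, C)
        combined = torch.cat([temporal_feat, spatial_feat], 1)  # (B, T + HW, C)

        # initial text-guided visual feature: combined feature attends to the textual input
        guided, _ = self.attn(query=combined, key=txt, value=txt)

        # decompose, then replicate back to the full spatio-temporal resolution
        t_guided = guided[:, :T].unsqueeze(2).expand(B, T, HW, C)   # replicated HW times
        s_guided = guided[:, T:].unsqueeze(1).expand(B, T, HW, C)   # replicated T times

        # add to the visual input and normalize -> intermediate visual feature (B, T, HW, C)
        return self.norm(vis + t_guided + s_guided)
```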

Spatial Localization. Given the text-guided visual feature F_tv, our SVG-net predicts the bounding box b_t at each frame. We first reshape F_tv to the size of T × H × W × C. Taking the feature from each individual frame (with the size of H × W × C) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a 3 × 3 convolution layer for feature extraction and a 1 × 1 convolution layer for dimension reduction. The first detection head outputs a heatmap A ∈ R^(8H×8W), where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
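The following is a small sketch of this CenterNet-style prediction step, assuming PyTorch; the channel dimensions and the exact deconvolution configuration are illustrative assumptions, and the decoded coordinates are expressed in the upsampled 8H × 8W heatmap grid.

```python
import torch
import torch.nn as nn

class CenterHead(nn.Module):
    """Sketch of the per-frame spatial localization step: three deconvolution layers
    (8x upsampling) followed by a heatmap head and a size head; input is one frame's
    feature of shape (B, C, H, W)."""
    def __init__(self, in_dim=768, mid_dim=256):
        super().__init__()
        layers, dim = [], in_dim
        for _ in range(3):                                   # 2^3 = 8x spatial upsampling
            layers += [nn.ConvTranspose2d(dim, mid_dim, 4, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            dim = mid_dim
        self.upsample = nn.Sequential(*layers)
        self.heat = nn.Sequential(nn.Conv2d(mid_dim, mid_dim, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid_dim, 1, 1))  # center heatmap
        self.size = nn.Sequential(nn.Conv2d(mid_dim, mid_dim, 3, padding=1), nn.ReLU(inplace=True),
                                  nn.Conv2d(mid_dim, 2, 1))  # box width and height

    def forward(self, frame_feat):
        x = self.upsample(frame_feat)                        # (B, mid_dim, 8H, 8W)
        return torch.sigmoid(self.heat(x)), self.size(x)

def decode_box(heatmap, size):
    """Pick the highest-probability center and read off the predicted width/height."""
    B, _, _, W8 = heatmap.shape
    flat_idx = heatmap.view(B, -1).argmax(dim=1)
    cy, cx = flat_idx // W8, flat_idx % W8
    w = size[torch.arange(B), 0, cy, cx]
    h = size[torch.arange(B), 1, cy, cx]
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)
```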

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature F_tv to produce the global text-guided visual feature F_gtv ∈ R^(T×C) for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from F_gtv and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score o_obj for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result (t^0_s, t^0_e). The initial temporal boundaries (t^0_s, t^0_e) and the predicted bounding boxes b_t form the initial tube prediction result B_init = {b_t}_{t=t^0_s}^{t^0_e}.
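A minimal sketch of this post-processing step is shown below; it assumes the per-frame relevance scores and predicted boxes for the whole video are already available, and uses the median-filter window size (9) and threshold (0.3) reported later in the experimental setup.

```python
import numpy as np
from scipy.ndimage import median_filter

def generate_initial_tube(frame_scores, boxes, window=9, threshold=0.3):
    """Sketch of the initial tube generation: smooth the per-frame relevance scores
    with a median filter, threshold them, and keep the longest surviving segment."""
    smoothed = median_filter(np.asarray(frame_scores, dtype=np.float32), size=window)
    keep = smoothed >= threshold

    segments, start = [], None
    for t, k in enumerate(keep):                  # collect contiguous above-threshold runs
        if k and start is None:
            start = t
        elif not k and start is not None:
            segments.append((start, t - 1))
            start = None
    if start is not None:
        segments.append((start, len(keep) - 1))
    if not segments:
        return None, None, []

    t_s, t_e = max(segments, key=lambda se: se[1] - se[0])   # longest segment
    return t_s, t_e, boxes[t_s:t_e + 1]                      # initial tube B_init
```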

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as (x̄, ȳ), w̄, h̄ and ō, respectively. When the frame contains a ground-truth bounding box, we have ō = 1; otherwise, we have ō = 0. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap Ā for each frame by using the Gaussian kernel ā_xy = exp(−((x − x̄)² + (y − ȳ)²) / (2σ²)), where ā_xy is the value of Ā ∈ R^(8H×8W) at the spatial location (x, y), and σ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function L_SVG for each frame in the training video clip is defined as follows:

L_SVG = λ1 L_c(A, Ā) + λ2 L_rel(o_obj, ō) + λ3 L_size((w_xy, h_xy), (w̄, h̄)),    (5.1)

where A ∈ R^(8H×8W) is the predicted heatmap, o_obj is the predicted relevance score, and w_xy and h_xy are the predicted width and height of the bounding box centered at the position (x̄, ȳ). L_c is the focal loss [39] for predicting the bounding box center, L_size is an L1 loss for regressing the size of the bounding box, and L_rel is a cross-entropy loss for relevance score prediction. We empirically set the loss weights λ1 = λ2 = 1 and λ3 = 0.1. We then average L_SVG over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
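A simplified sketch of Eq. (5.1) is given below. The center term uses the commonly used CenterNet-style penalty-reduced focal loss, and the focal hyper-parameters (alpha, beta) as well as the reduction choices are assumptions rather than values specified in the thesis; in practice the size loss is only evaluated at the ground-truth center locations.

```python
import torch
import torch.nn.functional as F

def svg_loss(pred_heat, gt_heat, pred_size, gt_size, pred_rel, gt_rel,
             lam1=1.0, lam2=1.0, lam3=0.1, alpha=2.0, beta=4.0):
    """Simplified sketch of Eq. (5.1); `pred_heat` and `pred_rel` are assumed to be
    sigmoid outputs in (0, 1)."""
    eps = 1e-6
    pos = gt_heat.eq(1).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred_heat) ** alpha) * torch.log(pred_heat + eps) * pos
    neg_loss = -((1 - gt_heat) ** beta) * (pred_heat ** alpha) * torch.log(1 - pred_heat + eps) * neg
    l_center = (pos_loss.sum() + neg_loss.sum()) / pos.sum().clamp(min=1.0)

    l_size = F.l1_loss(pred_size, gt_size)            # L1 loss on (w, h)
    l_rel = F.binary_cross_entropy(pred_rel, gt_rel)  # relevance score prediction

    return lam1 * l_center + lam2 * l_rel + lam3 * l_size
```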

5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine boundaries in local starting/ending regions, they may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., B_init) is used as an anchor and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries t_s and t_e.

Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


The frame-based branch then continues to refine the boundaries t_s and t_e within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries (t^0_s, t^0_e) and the length l = t^0_e − t^0_s, we temporally extend the anchor before and after the starting and the ending frames by l/4 to include more context information. We then uniformly sample 20 frames within the temporal duration [t^0_s − l/4, t^0_e + l/4] and generate the anchor feature F by stacking the corresponding features along the temporal dimension from the pooled video feature F_pool at the 20 sampled frames, where the pooled video feature F_pool is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature F, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to t^0_s and t^0_e and generate the new starting and ending frame positions t^1_s and t^1_e.
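The two helper functions below sketch, under simplifying assumptions, how the anchor feature could be assembled from the pooled video feature and how the regressed offsets could be applied; the offset parameterization in the thesis follows the regression loss of [14], so the simple additive form here is only illustrative.

```python
import torch

def build_anchor_feature(pooled_video_feat, t_s, t_e, num_samples=20):
    """Sketch of the anchor feature construction: extend the anchor by l/4 on both
    sides and uniformly sample 20 frames from the spatially pooled video feature
    F_pool of shape (num_frames, C)."""
    length = float(t_e - t_s)
    idx = torch.linspace(t_s - length / 4.0, t_e + length / 4.0, num_samples)
    idx = idx.round().long().clamp(0, pooled_video_feat.shape[0] - 1)
    return pooled_video_feat[idx]                     # anchor feature F: (num_samples, C)

def apply_offsets(t_s, t_e, offsets):
    """The MLP regresses relative positions w.r.t. (t0_s, t0_e); the additive update
    below is only one plausible parameterization."""
    return t_s + float(offsets[0]), t_e + float(offsets[1])
```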

Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position t^1_s generated from the previous anchor-based branch, we define the starting duration as D_s = [t^1_s − D/2 + 1, t^1_s + D/2], where D is the length of the starting duration. We then construct the frame-level feature F_s by stacking the corresponding features of all frames within the starting duration from the pooled video feature F_pool. We then take the frame-level feature F_s and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position t^1_s. Accordingly, we can produce the ending frame position t^1_e. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions t^1_s and t^1_e are generated, they can be used as the input for the anchor-based branch at the next stage.
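The following sketch illustrates the frame-based refinement of one boundary; `text_guided_fn` stands in for the ViLBERT module producing the text-guided frame-level feature, `conv1d` is a kernel-size-1 1D convolution, and the local window length D is an illustrative assumption (its value is not specified in this section).

```python
import torch

def refine_boundary(pooled_video_feat, t_anchor, text_guided_fn, conv1d, D=8):
    """Sketch of the frame-based refinement for one boundary (starting or ending)."""
    lo = t_anchor - D // 2 + 1
    idx = torch.arange(lo, lo + D).clamp(0, pooled_video_feat.shape[0] - 1)
    frame_feat = pooled_video_feat[idx]                      # frame-level feature: (D, C)
    guided = text_guided_fn(frame_feat)                      # text-guided feature: (D, C)
    logits = conv1d(guided.t().unsqueeze(0)).squeeze()       # per-frame scores: (D,)
    probs = torch.softmax(logits, dim=0)
    return int(idx[probs.argmax()])                          # refined boundary position
```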

Loss Functions. To train our TBR-net, we define the training objective function L_TBR as follows:

L_TBR = λ4 L_anc(Δt, Δt̄) + λ5 L_s(p_s, p̄_s) + λ6 L_e(p_e, p̄_e),    (5.2)

where L_anc is the regression loss used in [14] for regressing the temporal offsets, and L_s and L_e are the cross-entropy losses for predicting the starting and the ending frames, respectively. In L_anc, Δt and Δt̄ are the predicted offsets and the ground-truth offsets, respectively. In L_s, p_s is the predicted probability of being the starting frame for the frames within the starting duration, while p̄_s is the ground-truth label for starting frame position prediction. Similarly, we use p_e and p̄_e for the loss L_e. In this work, we empirically set the loss weights λ4 = λ5 = λ6 = 1.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG: This dataset consists of 5,660 video-description pairs and all videos are untrimmed. This dataset is human-centric since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames in the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold θ_p and the temporal length of each clip K are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as vIoU = (1/|S_U|) Σ_{t∈S_I} r_t, where r_t is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set S_I contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and S_U is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
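A small sketch of these metrics is given below, assuming each tube is represented as a mapping from frame index to a bounding box; this is only an illustrative implementation of the definitions above.

```python
import numpy as np

def frame_iou(box_a, box_b):
    """IoU between two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-6)

def video_iou(pred_tube, gt_tube):
    """vIoU = (1/|S_U|) * sum of per-frame IoUs over the intersected frames S_I."""
    s_i = set(pred_tube) & set(gt_tube)
    s_u = set(pred_tube) | set(gt_tube)
    if not s_u:
        return 0.0
    return sum(frame_iou(pred_tube[t], gt_tube[t]) for t in s_i) / len(s_u)

def evaluate(pred_tubes, gt_tubes, thresholds=(0.3, 0.5)):
    """m_vIoU and vIoU@R over all testing samples (tubes map frame index -> box)."""
    vious = np.array([video_iou(p, g) for p, g in zip(pred_tubes, gt_tubes)])
    return vious.mean(), {r: float((vious > r).mean()) for r in thresholds}
```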

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than the VidSTG dataset.


Table 5.1: Results of different methods on the VidSTG dataset ("*" indicates that the results are quoted from the original work).

                              Declarative Sentence Grounding             Interrogative Sentence Grounding
  Method                      m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)        m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
  STGRN [102]*                19.75      25.77        14.60              18.32      21.10        12.83
  Vanilla (ours)              21.49      27.69        15.77              20.22      23.17        13.82
  SVG-net (ours)              22.87      28.31        16.31              21.57      24.09        14.33
  SVG-net + TBR-net (ours)    25.21      30.99        18.21              23.67      26.26        16.01


Table 5.2: Results of different methods on the HC-STVG dataset ("*" indicates that the results are quoted from the original work).

  Method                      m_vIoU   vIoU@0.3   vIoU@0.5
  STGVT [76]*                 18.15    26.81      9.48
  Vanilla (ours)              17.94    25.59      8.75
  SVG-net (ours)              19.45    27.78      10.12
  SVG-net + TBR-net (ours)    22.41    30.31      12.07

Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

Through the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) Also, the results show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3.


Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                  Declarative Sentence Grounding        Interrogative Sentence Grounding
  Method              Stages    m_vIoU   vIoU@0.3   vIoU@0.5          m_vIoU   vIoU@0.3   vIoU@0.5
  SVG-net             -         22.87    28.31      16.31             21.57    24.09      14.33
  SVG-net + IFB-net   1         23.41    28.84      16.69             22.12    24.48      14.57
                      2         23.82    28.97      16.85             22.55    24.69      14.71
                      3         23.77    28.91      16.87             22.61    24.72      14.79
  SVG-net + IAB-net   1         23.27    29.57      16.39             21.78    24.93      14.31
                      2         23.74    29.98      16.52             22.21    25.42      14.59
                      3         23.85    30.15      16.57             22.43    25.71      14.47
  SVG-net + TBR-net   1         24.32    29.92      17.22             22.69    25.37      15.29
                      2         24.95    30.75      17.02             23.34    26.03      15.83
                      3         25.21    30.99      18.21             23.67    26.26      16.01


(4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework STVGBert based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module ST-ViLBERT in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have also developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our works and discuss the potential directions of video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework STVGBert based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module ST-ViLBERT in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential direction of video localization in the future could be unifying the spatial video localization and the temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812

[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100

[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970

[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308

[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139

[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206

[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894


[8] G. Chéron, A. Osokin, I. Laptev and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008

[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802

[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990

[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255

[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941

[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83

[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275

[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636

[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448

[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587


[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015

[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056

[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778

[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017

[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708

[23] E Ilg N Mayer T Saikia M Keuper A Dosovitskiy and T BroxldquoFlownet 20 Evolution of optical flow estimation with deep net-worksrdquo In IEEE conference on computer vision and pattern recogni-tion (CVPR) Vol 2 2017 p 6

[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199

[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014

[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105


[28] H Law and J Deng ldquoCornernet Detecting objects as paired key-pointsrdquo In Proceedings of the European Conference on Computer Vi-sion (ECCV) 2018 pp 734ndash750

[29] C Lea M D Flynn R Vidal A Reiter and G D Hager ldquoTem-poral convolutional networks for action segmentation and detec-tionrdquo In proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2017 pp 156ndash165

[30] J Lei L Yu T L Berg and M Bansal ldquoTvqa+ Spatio-temporalgrounding for video question answeringrdquo In arXiv preprint arXiv190411574(2019)

[31] D Li Z Qiu Q Dai T Yao and T Mei ldquoRecurrent tubelet pro-posal and recognition networks for action detectionrdquo In Proceed-ings of the European conference on computer vision (ECCV) 2018pp 303ndash318

[32] G Li N Duan Y Fang M Gong D Jiang and M Zhou ldquoUnicoder-VL A Universal Encoder for Vision and Language by Cross-ModalPre-Trainingrdquo In AAAI 2020 pp 11336ndash11344

[33] L H Li M Yatskar D Yin C-J Hsieh and K-W Chang ldquoVisu-alBERT A Simple and Performant Baseline for Vision and Lan-guagerdquo In Arxiv 2019

[34] Y Liao S Liu G Li F Wang Y Chen C Qian and B Li ldquoA Real-Time Cross-modality Correlation Filtering Method for ReferringExpression Comprehensionrdquo In Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognition 2020 pp 10880ndash10889

[35] T Lin X Liu X Li E Ding and S Wen ldquoBMN Boundary-MatchingNetwork for Temporal Action Proposal Generationrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 3889ndash3898

[36] T Lin X Zhao and Z Shou ldquoSingle shot temporal action de-tectionrdquo In Proceedings of the 25th ACM international conference onMultimedia 2017 pp 988ndash996

[37] T. Lin, X. Zhao, H. Su, C. Wang and M. Yang. "BSN: Boundary Sensitive Network for Temporal Action Proposal Generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3-19.

[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980-2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740-755.

[41] D Liu H Zhang F Wu and Z-J Zha ldquoLearning to assembleneural module tree networks for visual groundingrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 4673ndash4682

[42] J Liu L Wang and M-H Yang ldquoReferring expression genera-tion and comprehension via attributesrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 4856ndash4864

[43] L Liu H Wang G Li W Ouyang and L Lin ldquoCrowd countingusing deep recurrent spatial-aware networkrdquo In Proceedings ofthe 27th International Joint Conference on Artificial Intelligence AAAIPress 2018 pp 849ndash855

[44] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu andA C Berg ldquoSsd Single shot multibox detectorrdquo In European con-ference on computer vision Springer 2016 pp 21ndash37

[45] X Liu Z Wang J Shao X Wang and H Li ldquoImproving re-ferring expression grounding with cross-modal attention-guidederasingrdquo In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition 2019 pp 1950ndash1959

[46] Y Liu L Ma Y Zhang W Liu and S-F Chang ldquoMulti-granularityGenerator for Temporal Action Proposalrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2019pp 3604ndash3613


[47] J Lu D Batra D Parikh and S Lee ldquoVilbert Pretraining task-agnostic visiolinguistic representations for vision-and-languagetasksrdquo In Advances in Neural Information Processing Systems 2019pp 13ndash23

[48] G Luo Y Zhou X Sun L Cao C Wu C Deng and R Ji ldquoMulti-task Collaborative Network for Joint Referring Expression Com-prehension and Segmentationrdquo In Proceedings of the IEEECVFConference on Computer Vision and Pattern Recognition 2020 pp 10034ndash10043

[49] S Ma L Sigal and S Sclaroff ldquoLearning activity progression inlstms for activity detection and early detectionrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2016 pp 1942ndash1950

[50] B Ni X Yang and S Gao ldquoProgressively parsing interactionalobjects for fine grained action detectionrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1020ndash1028

[51] W Ouyang K Wang X Zhu and X Wang ldquoChained cascadenetwork for object detectionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2017 pp 1938ndash1946

[52] W Ouyang and X Wang ldquoSingle-pedestrian detection aided bymulti-pedestrian detectionrdquo In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition 2013 pp 3198ndash3205

[53] W Ouyang X Zeng X Wang S Qiu P Luo Y Tian H Li SYang Z Wang H Li et al ldquoDeepID-Net Object detection withdeformable part based convolutional neural networksrdquo In IEEETransactions on Pattern Analysis and Machine Intelligence 397 (2017)pp 1320ndash1334

[54] X Peng and C Schmid ldquoMulti-region two-stream R-CNN for ac-tion detectionrdquo In European Conference on Computer Vision Springer2016 pp 744ndash759

[55] L Pigou A van den Oord S Dieleman M V Herreweghe andJ Dambre ldquoBeyond Temporal Pooling Recurrence and TemporalConvolutions for Gesture Recognition in Videordquo In InternationalJournal of Computer Vision 1262-4 (2018) pp 430ndash439


[56] Z Qiu T Yao and T Mei ldquoLearning spatio-temporal represen-tation with pseudo-3d residual networksrdquo In proceedings of theIEEE International Conference on Computer Vision 2017 pp 5533ndash5541

[57] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once Unified real-time object detectionrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2016pp 779ndash788

[58] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towardsreal-time object detection with region proposal networksrdquo In Ad-vances in neural information processing systems 2015 pp 91ndash99

[59] A Sadhu K Chen and R Nevatia ldquoVideo Object Grounding us-ing Semantic Roles in Language Descriptionrdquo In Proceedings ofthe IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2020 pp 10417ndash10427

[60] S Saha G Singh M Sapienza P H S Torr and F CuzzolinldquoDeep Learning for Detecting Multiple Space-Time Action Tubesin Videosrdquo In Proceedings of the British Machine Vision Conference2016 BMVC 2016 York UK September 19-22 2016 2016

[61] P Sharma N Ding S Goodman and R Soricut ldquoConceptualcaptions A cleaned hypernymed image alt-text dataset for auto-matic image captioningrdquo In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics (Volume 1 LongPapers) 2018 pp 2556ndash2565

[62] Z Shou J Chan A Zareian K Miyazawa and S-F Chang ldquoCdcConvolutional-de-convolutional networks for precise temporal ac-tion localization in untrimmed videosrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 5734ndash5743

[63] Z Shou D Wang and S-F Chang ldquoTemporal action localizationin untrimmed videos via multi-stage cnnsrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1049ndash1058


[64] K Simonyan and A Zisserman ldquoTwo-Stream Convolutional Net-works for Action Recognition in Videosrdquo In Advances in NeuralInformation Processing Systems 27 2014 pp 568ndash576

[65] K Simonyan and A Zisserman ldquoVery deep convolutional net-works for large-scale image recognitionrdquo In arXiv preprint arXiv14091556(2014)

[66] B Singh T K Marks M Jones O Tuzel and M Shao ldquoA multi-stream bi-directional recurrent neural network for fine-grainedaction detectionrdquo In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2016 pp 1961ndash1970

[67] G Singh S Saha M Sapienza P H Torr and F Cuzzolin ldquoOn-line real-time multiple spatiotemporal action localisation and pre-dictionrdquo In Proceedings of the IEEE International Conference on Com-puter Vision 2017 pp 3637ndash3646

[68] K Soomro A R Zamir and M Shah ldquoUCF101 A dataset of 101human actions classes from videos in the wildrdquo In arXiv preprintarXiv12120402 (2012)

[69] R Su W Ouyang L Zhou and D Xu ldquoImproving Action Local-ization by Progressive Cross-stream Cooperationrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2019 pp 12016ndash12025

[70] W Su X Zhu Y Cao B Li L Lu F Wei and J Dai ldquoVL-BERTPre-training of Generic Visual-Linguistic Representationsrdquo In In-ternational Conference on Learning Representations 2020

[71] C Sun A Myers C Vondrick K Murphy and C Schmid ldquoVideobertA joint model for video and language representation learningrdquoIn Proceedings of the IEEE International Conference on Computer Vi-sion 2019 pp 7464ndash7473

[72] C Sun A Shrivastava C Vondrick K Murphy R Sukthankarand C Schmid ldquoActor-centric Relation Networkrdquo In The Euro-pean Conference on Computer Vision (ECCV) 2018

[73] S Sun J Pang J Shi S Yi and W Ouyang ldquoFishnet A versatilebackbone for image region and pixel level predictionrdquo In Ad-vances in Neural Information Processing Systems 2018 pp 762ndash772


[74] C Szegedy W Liu Y Jia P Sermanet S Reed D AnguelovD Erhan V Vanhoucke and A Rabinovich ldquoGoing deeper withconvolutionsrdquo In Proceedings of the IEEE conference on computervision and pattern recognition 2015 pp 1ndash9

[75] H Tan and M Bansal ldquoLXMERT Learning Cross-Modality En-coder Representations from Transformersrdquo In Proceedings of the2019 Conference on Empirical Methods in Natural Language Process-ing 2019

[76] Z Tang Y Liao S Liu G Li X Jin H Jiang Q Yu and D XuldquoHuman-centric Spatio-Temporal Video Grounding With VisualTransformersrdquo In arXiv preprint arXiv201105049 (2020)

[77] D Tran L Bourdev R Fergus L Torresani and M Paluri ldquoLearn-ing spatiotemporal features with 3d convolutional networksrdquo InProceedings of the IEEE international conference on computer vision2015 pp 4489ndash4497

[78] A Vaswani N Shazeer N Parmar J Uszkoreit L Jones A NGomez Ł Kaiser and I Polosukhin ldquoAttention is all you needrdquoIn Advances in neural information processing systems 2017 pp 5998ndash6008

[79] L Wang Y Qiao X Tang and L Van Gool ldquoActionness Estima-tion Using Hybrid Fully Convolutional Networksrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2016pp 2708ndash2717

[80] L Wang Y Xiong D Lin and L Van Gool ldquoUntrimmednets forweakly supervised action recognition and detectionrdquo In Proceed-ings of the IEEE conference on Computer Vision and Pattern Recogni-tion 2017 pp 4325ndash4334

[81] L Wang Y Xiong Z Wang Y Qiao D Lin X Tang and L VanGool ldquoTemporal segment networks Towards good practices fordeep action recognitionrdquo In European conference on computer vi-sion Springer 2016 pp 20ndash36

[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960-1968.

[83] P Weinzaepfel Z Harchaoui and C Schmid ldquoLearning to trackfor spatio-temporal action localizationrdquo In Proceedings of the IEEEinternational conference on computer vision 2015 pp 3164ndash3172

[84] S Xie C Sun J Huang Z Tu and K Murphy ldquoRethinking spa-tiotemporal feature learning Speed-accuracy trade-offs in videoclassificationrdquo In Proceedings of the European Conference on Com-puter Vision (ECCV) 2018 pp 305ndash321

[85] H Xu A Das and K Saenko ldquoR-c3d Region convolutional 3dnetwork for temporal activity detectionrdquo In Proceedings of the IEEEinternational conference on computer vision 2017 pp 5783ndash5792

[86] M Yamaguchi K Saito Y Ushiku and T Harada ldquoSpatio-temporalperson retrieval via natural language queriesrdquo In Proceedings ofthe IEEE International Conference on Computer Vision 2017 pp 1453ndash1462

[87] S Yang G Li and Y Yu ldquoCross-modal relationship inference forgrounding referring expressionsrdquo In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition 2019 pp 4145ndash4154

[88] S Yang G Li and Y Yu ldquoDynamic graph attention for referringexpression comprehensionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2019 pp 4644ndash4653

[89] X Yang X Yang M-Y Liu F Xiao L S Davis and J KautzldquoSTEP Spatio-Temporal Progressive Learning for Video ActionDetectionrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition (CVPR) 2019

[90] X Yang X Liu M Jian X Gao and M Wang ldquoWeakly-SupervisedVideo Object Grounding by Exploring Spatio-Temporal ContextsrdquoIn Proceedings of the 28th ACM International Conference on Multime-dia 2020 pp 1939ndash1947

[91] Z Yang T Chen L Wang and J Luo ldquoImproving One-stage Vi-sual Grounding by Recursive Sub-query Constructionrdquo In arXivpreprint arXiv200801059 (2020)


[92] Z Yang B Gong L Wang W Huang D Yu and J Luo ldquoAfast and accurate one-stage approach to visual groundingrdquo InProceedings of the IEEE International Conference on Computer Vision2019 pp 4683ndash4693

[93] Z Yang J Gao and R Nevatia ldquoSpatio-temporal action detec-tion with cascade proposal and location anticipationrdquo In arXivpreprint arXiv170800042 (2017)

[94] Y Ye X Yang and Y Tian ldquoDiscovering spatio-temporal actiontubesrdquo In Journal of Visual Communication and Image Representa-tion 58 (2019) pp 515ndash524

[95] S Yeung O Russakovsky G Mori and L Fei-Fei ldquoEnd-to-endlearning of action detection from frame glimpses in videosrdquo InProceedings of the IEEE Conference on Computer Vision and PatternRecognition 2016 pp 2678ndash2687

[96] L Yu Z Lin X Shen J Yang X Lu M Bansal and T L BergldquoMattnet Modular attention network for referring expression com-prehensionrdquo In Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition 2018 pp 1307ndash1315

[97] J Yuan B Ni X Yang and A A Kassim ldquoTemporal action local-ization with pyramid of score distribution featuresrdquo In Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recogni-tion 2016 pp 3093ndash3102

[98] Z Yuan J C Stroud T Lu and J Deng ldquoTemporal action local-ization by structured maximal sumsrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 3684ndash3692

[99] R Zeng W Huang M Tan Y Rong P Zhao J Huang and CGan ldquoGraph Convolutional Networks for Temporal Action Lo-calizationrdquo In Proceedings of the IEEE International Conference onComputer Vision 2019 pp 7094ndash7103

[100] W Zhang W Ouyang W Li and D Xu ldquoCollaborative and Ad-versarial Network for Unsupervised domain adaptationrdquo In Pro-ceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition 2018 pp 3801ndash3809


[101] Z Zhang Z Zhao Z Lin B Huai and J Yuan ldquoObject-AwareMulti-Branch Relation Networks for Spatio-Temporal Video Ground-ingrdquo In Proceedings of the Twenty-Ninth International Joint Confer-ence on Artificial Intelligence 2020 pp 1069ndash1075

[102] Z Zhang Z Zhao Y Zhao Q Wang H Liu and L Gao ldquoWhereDoes It Exist Spatio-Temporal Video Grounding for Multi-FormSentencesrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition 2020 pp 10668ndash10677

[103] Y Zhao Y Xiong L Wang Z Wu X Tang and D Lin ldquoTemporalaction detection with structured segment networksrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2017pp 2914ndash2923

[104] X. Zhou, D. Wang and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850. 2019.

[105] M Zolfaghari G L Oliveira N Sedaghat and T Brox ldquoChainedmulti-stream networks exploiting pose motion and appearancefor action classification and detectionrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 2904ndash2913

Page 6: Deep Learning for Spatial and Temporal Video Localization

i

Declaration

I Rui SU declare that this thesis titled ldquoDeep Learning for Spatialand Temporal Video Localizationrdquo which is submitted in fulfillment ofthe requirements for the Degree of Doctor of Philosophy represents myown work except where due acknowledgement have been made I fur-ther declared that it has not been previously included in a thesis disser-tation or report submitted to this University or to any other institutionfor a degree diploma or other qualifications

Signed

Date June 17 2021

For Mama and PapaA sA s

ii

Acknowledgements

First I would like to thank my supervisor Prof Dong Xu for his valu-able guidance and support during the past four years of my PhD studyUnder his supervision I have learned useful research techniques includ-ing how to write high quality research papers which help me to beco-man an independent researcher

I would also like to specially thank Dr Wanli Ouyang and Dr Lup-ing Zhou who offer me valuable suggestions and inspirations in ourcollaborated works With the useful discussions with them I work outmany difficult problems and am also encouraged to explore unknownsin the future

Last but not least I sincerely thank my parents and my wife fromthe bottom of my heart for everything

Rui SU

University of Sydney, June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou and Wanli Ouyang, “Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng and Wanli Ouyang, “PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization”, IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou and Dong Xu, “Improving action localization by progressive cross-stream cooperation”, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.


Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11

2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental Results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art Methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation Study 76
4.3.4 Qualitative Results 81
4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102

6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104

Bibliography 107


List of Figures

1.1 Illustration of the spatio-temporal action localization task 2

1.2 Illustration of the temporal action localization task 6

3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33

3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals P^i_t by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features F^i_t, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39

3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43

3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57

3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance “Tennis Swing”. In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label “Tennis Swing” for (b) and “Soccer Juggling” for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label “Basketball”, and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals S^RGB_{n-2} and features P^RGB_{n-2}, Q^Flow_{n-2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the “Detection Head” to generate the updated segments. 66

4.2 AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset 80

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class “Throw Discus”, with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the “Multi-head attention” and “Add & Norm” blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At Stage 1, the input object tube is generated by our SVG-net. 93


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here “AD” is for actionness detector and “ASD” is for action segment detector. 49

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here “AD” is for the actionness detector, while “ASD” is for the action segment detector. 52

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55

3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55

3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55

4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds 74

4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used. 75

4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied. 75

4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds 76

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset. 78

5.1 Results of different methods on the VidSTG dataset. Marked results are quoted from the original work. 99

5.2 Results of different methods on the HC-STVG dataset. Marked results are quoted from the original work. 100

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to capture only the starting and ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.

Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. Cluttered backgrounds, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
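As a concrete illustration of the two cooperation mechanisms described above, the following sketch shows how a small message-passing block could inject flow-stream features into the RGB stream, and how the two region-proposal sets could simply be merged. It is only a minimal PyTorch example; the module name CrossStreamMessagePassing, the 1 x 1 convolution design and all tensor shapes are illustrative assumptions rather than the exact architecture used in Chapter 3.

    import torch
    import torch.nn as nn

    class CrossStreamMessagePassing(nn.Module):
        # Passes flow-stream features to the RGB stream (illustrative sketch).
        def __init__(self, channels=256):
            super().__init__()
            # transform the helper-stream features before adding them
            self.transform = nn.Sequential(
                nn.Conv2d(channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
            )

        def forward(self, rgb_feat, flow_feat):
            # refined RGB features = original RGB features + message from the flow stream
            return rgb_feat + self.transform(flow_feat)

    def merge_proposals(rgb_boxes, flow_boxes):
        # Region-proposal-level cooperation: union of the latest proposals from both streams.
        return torch.cat([rgb_boxes, flow_boxes], dim=0)

    mp = CrossStreamMessagePassing(channels=256)
    refined_rgb = mp(torch.randn(1, 256, 38, 38), torch.randn(1, 256, 38, 38))
    proposals = merge_proposals(torch.rand(100, 4), torch.rand(80, 4))  # (180, 4)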

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on “confusing” samples from the action tubes of the same class and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
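To make the role of the class-specific actionness detector more tangible, the short sketch below thresholds per-frame actionness scores of one class and keeps the longest run of high-scoring frames as the trimmed temporal extent of an action tube. The threshold value and the helper name trim_tube are assumptions for illustration only; the actual detector in Chapter 3 is learned and more involved.

    import numpy as np

    def trim_tube(actionness, threshold=0.5):
        # Return (start, end) frame indices of the longest run of frames whose
        # class-specific actionness score exceeds the threshold (illustrative).
        keep = actionness >= threshold
        best, cur_start, best_len = None, None, 0
        for i, k in enumerate(np.append(keep, False)):  # sentinel closes the last run
            if k and cur_start is None:
                cur_start = i
            elif not k and cur_start is not None:
                if i - cur_start > best_len:
                    best, best_len = (cur_start, i - 1), i - cur_start
                cur_start = None
        return best

    scores = np.array([0.1, 0.2, 0.7, 0.8, 0.9, 0.6, 0.2, 0.1])
    print(trim_tube(scores))  # (2, 5)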

Our contributions are briefly summarized as follows

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image from each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.

Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information about the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals having higher overlapping areas with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
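A minimal sketch of the proposal-level cooperation idea is given below: the temporal segments produced by the anchor-based and frame-based branches are simply pooled together and de-duplicated with 1D non-maximum suppression. The tIoU threshold and the function names are assumptions; the actual combination strategy used in PCG-TAL is described in Chapter 4.

    import numpy as np

    def temporal_iou(seg, segs):
        # tIoU between one segment [s, e] and an array of segments (M x 2).
        inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
        union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
        return inter / np.maximum(union, 1e-8)

    def merge_and_nms(anchor_segs, frame_segs, scores_a, scores_f, iou_thr=0.7):
        # Proposal-level cooperation sketch: union of both granularities + 1D NMS.
        segs = np.concatenate([anchor_segs, frame_segs], axis=0)
        scores = np.concatenate([scores_a, scores_f], axis=0)
        order = np.argsort(-scores)
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(i)
            ious = temporal_iou(segs[i], segs[order[1:]])
            order = order[1:][ious < iou_thr]
        return segs[keep], scores[keep]

    a = np.array([[10.0, 20.0], [30.0, 45.0]])
    f = np.array([[11.0, 19.5], [60.0, 70.0]])
    segs, sc = merge_and_nms(a, f, np.array([0.8, 0.6]), np.array([0.9, 0.7]))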

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance relies heavily on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because these methods regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved on various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks, THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential directions for video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers, with nonlinear activation functions in between, to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work has sparked strong research interest in using deep learning methods to solve computer vision problems.
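The sketch below illustrates, in PyTorch, the basic ingredients mentioned above: convolutional layers with ReLU activations, dropout as a regularizer, and one gradient-descent update driven by back-propagation. The layer sizes are toy values and the network is only AlexNet-flavoured, not the original architecture.

    import torch
    import torch.nn as nn

    # A tiny AlexNet-flavoured classifier: conv + ReLU blocks, dropout, cross-entropy loss.
    model = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=5, stride=2, padding=2), nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.AdaptiveAvgPool2d(4),
        nn.Flatten(),
        nn.Dropout(p=0.5),          # regularization against overfitting
        nn.Linear(32 * 4 * 4, 10),  # 10 output classes
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    images = torch.randn(8, 3, 64, 64)        # a toy mini-batch
    labels = torch.randint(0, 10, (8,))
    loss = criterion(model(images), labels)   # forward pass and loss
    optimizer.zero_grad()
    loss.backward()                           # back-propagation
    optimizer.step()                          # gradient-descent update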

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with smaller kernel sizes. For example, a stack of two 3 × 3 convolutional layers achieves the same effective receptive field as a 5 × 5 convolutional layer, and the effective receptive field of a 7 × 7 convolutional layer can be achieved by stacking three 3 × 3 convolutional layers. By decomposing convolutional layers with large kernel sizes into several 3 × 3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
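The receptive-field and parameter argument can be checked with a few lines of arithmetic, as in the sketch below (the channel count is chosen arbitrarily for illustration).

    def conv_params(k, c_in, c_out):
        # Number of weights in a k x k convolution (bias ignored).
        return k * k * c_in * c_out

    C = 256
    one_5x5 = conv_params(5, C, C)                         # 1,638,400 weights
    two_3x3 = conv_params(3, C, C) + conv_params(3, C, C)  # 1,179,648 weights
    print(one_5x5, two_3x3)

    def stacked_receptive_field(kernels):
        # Effective receptive field of stacked stride-1 convolutions.
        rf = 1
        for k in kernels:
            rf += k - 1
        return rf

    print(stacked_receptive_field([3, 3]))     # 5, same as one 5 x 5 layer
    print(stacked_receptive_field([3, 3, 3]))  # 7, same as one 7 x 7 layer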

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with smaller kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which feeds the output of the previous inception module into a set of parallel convolutional layers with different kernel sizes, followed by pooling layers, and concatenates the outputs of these layers to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
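A simplified inception-style module is sketched below: parallel 1 × 1, 3 × 3 and 5 × 5 branches plus a pooling branch whose outputs are concatenated along the channel dimension. The branch widths are illustrative and do not correspond to any particular stage of the original network.

    import torch
    import torch.nn as nn

    class InceptionBlock(nn.Module):
        # Simplified inception-style module: parallel branches with different
        # kernel sizes whose outputs are concatenated along the channel axis.
        def __init__(self, c_in):
            super().__init__()
            self.b1 = nn.Conv2d(c_in, 32, kernel_size=1)
            self.b3 = nn.Sequential(nn.Conv2d(c_in, 32, kernel_size=1),
                                    nn.Conv2d(32, 64, kernel_size=3, padding=1))
            self.b5 = nn.Sequential(nn.Conv2d(c_in, 16, kernel_size=1),
                                    nn.Conv2d(16, 32, kernel_size=5, padding=2))
            self.bp = nn.Sequential(nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
                                    nn.Conv2d(c_in, 32, kernel_size=1))

        def forward(self, x):
            return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.bp(x)], dim=1)

    x = torch.randn(1, 64, 28, 28)
    print(InceptionBlock(64)(x).shape)  # torch.Size([1, 160, 28, 28])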

VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
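The shortcut idea can be summarized by a basic residual block such as the sketch below, where the input is added back to the output of two convolutions so that gradients can flow directly to earlier layers.

    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Basic residual block: the input is added back to the output of two
        # convolutions, giving gradients a shortcut path to earlier layers.
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn1 = nn.BatchNorm2d(channels)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
            self.bn2 = nn.BatchNorm2d(channels)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            out = self.relu(self.bn1(self.conv1(x)))
            out = self.bn2(self.conv2(out))
            return self.relu(out + x)   # shortcut connection

    x = torch.randn(2, 64, 16, 16)
    print(ResidualBlock(64)(x).shape)   # torch.Size([2, 64, 16, 16])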

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.
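A simplified dense block is sketched below: each layer receives the concatenation of all preceding feature maps, so earlier features are reused by later layers. The growth rate and the number of layers are illustrative choices.

    import torch
    import torch.nn as nn

    class DenseBlock(nn.Module):
        # Simplified dense block: each layer takes the concatenation of all
        # previous feature maps as input, so features are reused downstream.
        def __init__(self, c_in, growth=32, num_layers=4):
            super().__init__()
            self.layers = nn.ModuleList(
                nn.Conv2d(c_in + i * growth, growth, kernel_size=3, padding=1)
                for i in range(num_layers)
            )

        def forward(self, x):
            features = [x]
            for layer in self.layers:
                features.append(layer(torch.cat(features, dim=1)))
            return torch.cat(features, dim=1)

    x = torch.randn(1, 64, 8, 8)
    print(DenseBlock(64)(x).shape)  # torch.Size([1, 192, 8, 8])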

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.

Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image into the corresponding patches. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time were largely reduced.
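The sketch below shows how RoI pooling turns arbitrarily sized region proposals into fixed-size features, using the torchvision implementation; the feature-map size, the boxes and the 1/16 spatial scale are made-up values for illustration.

    import torch
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 256, 50, 50)     # feature map of the whole image
    # two region proposals in image coordinates (x1, y1, x2, y2)
    proposals = [torch.tensor([[10.0, 10.0, 200.0, 300.0],
                               [50.0, 60.0, 400.0, 350.0]])]
    # spatial_scale maps image coordinates to feature-map coordinates
    # (e.g. 1/16 for a backbone with stride 16); 7x7 is a typical output size
    pooled = roi_pool(feature_map, proposals, output_size=(7, 7), spatial_scale=1.0 / 16)
    print(pooled.shape)  # torch.Size([2, 256, 7, 7])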

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors was used at each location of the generated feature map to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
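The following sketch enumerates the kind of pre-defined anchors placed at every feature-map location; the stride, scales and aspect ratios are illustrative defaults rather than the exact values used in Faster RCNN.

    import numpy as np

    def generate_anchors(feat_h, feat_w, stride=16,
                         scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
        # Generate (x1, y1, x2, y2) anchors centred on every feature-map location.
        anchors = []
        for y in range(feat_h):
            for x in range(feat_w):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                for s in scales:
                    for r in ratios:
                        w, h = s * np.sqrt(r), s / np.sqrt(r)
                        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        return np.array(anchors)

    print(generate_anchors(2, 3).shape)  # (54, 4): 2*3 locations x 9 anchors each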

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals and then the ROI pooling operation and a detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, where each location in the grid contained several pre-defined anchors. For each anchor, the network predicted the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require region feature extraction through the time-consuming ROI pooling operation.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes were used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map was predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicted the bounding box size (height and width) for each point in the heat map.
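The decoding step of such a center-point formulation can be sketched as below: local maxima of the heat map are kept via a max-pooling trick, and the predicted width/height are read off at those positions. The tensor shapes and the top-k value are illustrative assumptions, not the exact CenterNet post-processing.

    import torch
    import torch.nn.functional as F

    def decode_centers(heatmap, wh, k=10):
        # Pick the top-k local maxima of a centre-point heat map and read off the
        # predicted box width/height at those locations (illustrative sketch).
        # keep only points that equal their 3x3 neighbourhood maximum
        peaks = heatmap * (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1)).float()
        scores, idx = peaks.flatten().topk(k)
        w = heatmap.shape[-1]
        ys, xs = idx // w, idx % w
        sizes = wh[:, ys, xs].t()          # (k, 2) predicted (width, height)
        return torch.stack([xs.float(), ys.float()], dim=1), sizes, scores

    heat = torch.rand(1, 96, 96)           # one class channel
    wh = torch.rand(2, 96, 96) * 50        # per-pixel width/height predictions
    centers, sizes, scores = decode_centers(heat, wh)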


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-level computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks can further improve the performance on the action recognition task, which demonstrates that appearance and motion information are complementary to each other.

In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the action recognition performance. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB differences, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps keep the framework from overfitting the training samples.
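The segment-based sampling strategy can be illustrated with a few lines of NumPy, as below: the video is split into equal segments and one random frame index is drawn from each. The number of segments is an illustrative default.

    import numpy as np

    def sample_segment_frames(num_frames, num_segments=3, seed=None):
        # Segment-based sampling sketch: split the video into equal segments and
        # pick one random frame index from each segment.
        rng = np.random.default_rng(seed)
        edges = np.linspace(0, num_frames, num_segments + 1, dtype=int)
        return [int(rng.integers(edges[i], max(edges[i] + 1, edges[i + 1])))
                for i in range(num_segments)]

    # three indices, one from each third of a 300-frame video
    print(sample_segment_frames(300, num_segments=3))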

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3 × 3 × 3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] to a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.

2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be added. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.
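The linking step can be sketched as a greedy frame-by-frame procedure that extends a tube with the detection maximizing a combination of class confidence and spatial overlap with the previous box. This is only a simplified single-tube illustration with an assumed scoring weight, not the exact association algorithm of [60].

    import numpy as np

    def iou(a, b):
        # IoU between two boxes in (x1, y1, x2, y2) format.
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def link_detections(per_frame_dets, iou_weight=1.0):
        # Greedy linking sketch: at each frame pick the detection that best continues
        # the tube, scoring by class confidence plus spatial overlap.
        # per_frame_dets: list over frames of (boxes Nx4, scores N) for one action class.
        boxes0, scores0 = per_frame_dets[0]
        tube = [boxes0[int(np.argmax(scores0))]]
        for boxes, scores in per_frame_dets[1:]:
            overlaps = np.array([iou(tube[-1], b) for b in boxes])
            best = int(np.argmax(scores + iou_weight * overlaps))
            tube.append(boxes[best])
        return np.stack(tube)   # (num_frames, 4) action tube for this class

    frames = [(np.array([[0, 0, 10, 10], [50, 50, 60, 60]]), np.array([0.9, 0.4])),
              (np.array([[1, 1, 11, 11], [48, 49, 61, 62]]), np.array([0.5, 0.6]))]
    print(link_detections(frames))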

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60] in order to remove the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector took a sequence of RGB frames as input and output a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network was used to extract features for each of these RGB frames, which were stacked to form the representation for the whole segment. The representation was then used to regress the coordinates of the output tubelet. The generated tubelets were more robust than the action tubes linked from frame-level detections because they were produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for actions of interest and then progressively refined the coordinates of the anchors over iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.

2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions in each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating this pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. Then they used the trained classification model to initialize another C3D model as a localization model, which took the proposals and their overlap scores as inputs, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features from the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting and ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consisted of two parts. In the first part, 3D convolution layers were used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers were applied in the spatial domain to reduce the resolution, while deconvolutional layers were employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so the proposed framework was able to model the spatio-temporal information and predict the action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos were first divided into snippets, and an actionness classifier was used to predict binary actionness scores for each snippet. Consecutive snippets with high actionness scores were grouped to form proposals, which were then used to build a feature pyramid for evaluating their action completeness. Their method improved the temporal action localization performance compared with other state-of-the-art methods.

For the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied the anchor-based object detection framework, which first generated a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers were then used to refine the locations and predict the action classes for the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method can achieve results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments were less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposed long videos into short clip units and built an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process was significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster R-CNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster R-CNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that extreme variations of action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

For the second category, as the proposals are generated by pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision-and-language task. The goal of spatio-temporal visual grounding is to localize the target objects in the given videos by using textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consisted of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link the proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select the target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then, we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review of the new vision-and-language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework can remove the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from the frames in the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatial temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot MultiBox Detector (SSD) in [67, 26].

Second, discriminative features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminative features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization forms action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.

Finally, several methods were also proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from these existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation


Figure 3.1: (a) Overview of our cross-stream cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results of the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results of the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).


by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module, shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate the candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately, in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let P_t^(i) and B_t^(i) denote the set of region proposals and the set of detection boxes, respectively, where i indicates the i-th stream and t denotes the t-th stage. The region proposal set P_t^(i) and the detection box set B_t^(i) are updated as

    P_t^(i) = B_{t-2}^(i) ∪ B_{t-1}^(j),    B_t^(i) = G(P_t^(i)),

where G(·) denotes the mapping function of the detection head module. Initially, when t = 0, P_0^(i) is the set of region proposals from the RPN of the i-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
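To make the update rule above concrete, the following sketch (not the actual implementation) alternates the proposal-level cooperation between the two streams. The `detection_head` callable is a placeholder for the detection head mapping G(·), boxes are treated as opaque list items, and the lower-threshold NMS applied to the cross-stream boxes is only indicated by a comment.

    def cooperate_proposals(P0_rgb, P0_flow, detection_head, num_stages=4):
        """Sketch of the proposal-level cooperation.
        P0_rgb / P0_flow : initial RPN proposals of the two streams (lists of boxes)
        detection_head(stream, proposals) -> detection boxes, i.e. B = G(P)
        """
        # Before a stream is updated for the first time, its "detections"
        # are simply its own RPN proposals (the t = 0 case in the text).
        B_hist = {"rgb": [P0_rgb], "flow": [P0_flow]}
        order = ["rgb", "flow", "rgb", "flow"][:num_stages]   # Stages 1..4
        for cur in order:
            other = "flow" if cur == "rgb" else "rgb"
            # P_t^(i) = B_{t-2}^(i) U B_{t-1}^(j); in the full method a lower NMS
            # threshold is applied to the cross-stream boxes before the union.
            P_t = B_hist[cur][-1] + B_hist[other][-1]
            B_t = detection_head(cur, P_t)                    # B_t^(i) = G(P_t^(i))
            B_hist[cur].append(B_t)
        return B_hist["rgb"][-1], B_hist["flow"][-1]

    # usage with a dummy detection head that just returns its input proposals
    rgb_boxes, flow_boxes = cooperate_proposals(
        [(0, 0, 10, 10)], [(1, 1, 9, 9)],
        detection_head=lambda stream, props: props)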

With our approach, the diversity of region proposals for one stream is enhanced by using the complementary boxes from another stream. This helps reduce the number of missing bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data at the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used at the training stage is also adopted for the testing samples at the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c of I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as C_2^i, C_3^i, C_4^i and C_5^i, where i ∈ {RGB, Flow} indicates the RGB and the flow stream, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as U_2^i, U_3^i, U_4^i and U_5^i.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge the two streams so that they can help each other for feature refinement.

We pass the messages between the feature maps in the bottom-up pathways of the two streams. Denote by l the index of the feature maps in {C_2^i, C_3^i, C_4^i, C_5^i}, l ∈ {2, 3, 4, 5}. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features C_l^RGB with the help of the features C_l^Flow as follows:

    Ĉ_l^RGB = f_θ(C_l^Flow) ⊕ C_l^RGB,        (3.1)

where Ĉ_l^RGB denotes the refined RGB feature map, ⊕ denotes the element-wise addition of the feature maps, and f_θ(·) is the mapping function (parameterized by θ) of our message-passing module. The function f_θ(C_l^Flow) nonlinearly extracts the message from the feature C_l^Flow, and the extracted message is then used to improve the features C_l^RGB.

The output of f_θ(·) has to produce feature maps with the same number of channels and resolution as C_l^RGB and C_l^Flow. To this end, we design our message-passing module by stacking two 1×1 convolutional layers with ReLU as the activation function. The first 1×1 convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design saves the number of parameters to be learnt in the module and exchanges messages through a two-layer non-linear transform. Once the feature maps C_l^RGB in the bottom-up pathway are refined, the corresponding feature maps U_l^RGB in the top-down pathway are generated accordingly.

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.
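A minimal PyTorch sketch of the message-passing function f_θ in Eqn. (3.1) is given below: two 1×1 convolutions with ReLU that first reduce and then restore the channel dimension, followed by element-wise addition. The channel-reduction ratio and the feature map sizes are illustrative assumptions, not values taken from the thesis.

    import torch
    import torch.nn as nn

    class MessagePassing(nn.Module):
        """Sketch of f_theta in Eqn. (3.1): reduce channels, apply ReLU,
        restore channels, and add the message to the target stream."""

        def __init__(self, channels, reduction=4):  # reduction ratio is an assumption
            super().__init__()
            self.reduce = nn.Conv2d(channels, channels // reduction, kernel_size=1)
            self.restore = nn.Conv2d(channels // reduction, channels, kernel_size=1)
            self.relu = nn.ReLU(inplace=True)

        def forward(self, c_flow, c_rgb):
            # message extracted from the flow features ...
            message = self.restore(self.relu(self.reduce(c_flow)))
            # ... added element-wise to the RGB features, as in Eqn. (3.1)
            return c_rgb + message

    # toy feature maps: batch=1, 256 channels, 14x14 spatial resolution
    mp = MessagePassing(256)
    refined_rgb = mp(torch.randn(1, 256, 14, 14), torch.randn(1, 256, 14, 14))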


Denote the image-level feature map sets of the RGB and flow streams by I^RGB and I^Flow, respectively. They are used to extract the ROI features F_t^RGB and F_t^Flow by ROI pooling at each stage t, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature F_t^Flow of the flow stream is used to help improve the ROI feature F_t^RGB of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature F_t^RGB of the RGB stream is used to help improve the ROI feature F_t^Flow of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] and stack neighbouring frames to exploit temporal context and improve the action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated by the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
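The label assignment rule above can be sketched as follows. The box format (x1, y1, x2, y2) and the helper names are assumptions for illustration only.

    def iou(a, b):
        """IoU of two boxes given as (x1, y1, x2, y2)."""
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-8)

    def assign_labels(proposals, gt_boxes, iou_thresh=0.5):
        """Positive label if the best IoU with any ground-truth box exceeds the
        threshold, otherwise negative; the same rule is applied to the boxes
        borrowed from the assistant stream."""
        labels = []
        for p in proposals:
            best_iou = max((iou(p, g) for g in gt_boxes), default=0.0)
            labels.append(1 if best_iou > iou_thresh else 0)
        return labels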

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labelling. Although this


Figure 3.2: Overview of the two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging the complementary information of the two streams at both the feature and the segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results of the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results of the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals P_t^i by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features F_t^i, where the superscript i ∈ {RGB, Flow} denotes the RGB/Flow stream and the subscript t denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.


linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: a Pre-processing module, an Initial Segment Generation module, a Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and a Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7×7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster R-CNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization; it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level, in order to improve action segment detection. After multiple stages of cooperation, the output action segments from all stages are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labelled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundaries of the action tubes. These samples are therefore "confusing samples" and are useful for the actionness classifier to learn the subtle but critical features that better determine the beginning and the end of the action. The output of the actionness classifier is the probability of the actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then, a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it is filtered out from the action tube, and we can then refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across temporal boundaries to obtain better action tubes at the testing stage.
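The tube trimming step can be sketched as below. The median filter is implemented directly with NumPy for self-containedness, and the default window size and threshold follow the values given later in Section 3.4.1.

    import numpy as np

    def refine_tube(actionness, window=17, threshold=0.01):
        """Smooth the per-frame actionness scores of one action tube with a
        median filter and keep only the frames whose smoothed score is at
        least the threshold."""
        actionness = np.asarray(actionness, dtype=float)
        half = window // 2
        padded = np.pad(actionness, half, mode="edge")
        smoothed = np.array([np.median(padded[t:t + window])
                             for t in range(len(actionness))])
        keep = smoothed >= threshold          # boolean mask of retained frames
        return keep, smoothed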

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster R-CNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and the optical flow features from the action tubes as the input, and follows the Faster R-CNN based framework in [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using the two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, the segment feature extraction and the detection head are introduced below.

(a) SPN. In Faster R-CNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary within a relatively small range, but the temporal span of an action can change dramatically, from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction, based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture



Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

but with a different dilation rate is employed to generate features with different receptive fields and to decide the temporal boundaries. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale s, we require the receptive field to cover the context with a temporal span of s/2 before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
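A rough PyTorch sketch of one anchor-specific SPN branch is given below. The hidden width, the kernel size, and the rule that maps an anchor scale to a dilation rate are illustrative assumptions and are not taken from [5] or from the thesis.

    import torch
    import torch.nn as nn

    class SPNBranch(nn.Module):
        """One anchor-specific branch: two dilated temporal convolutions
        followed by two parallel kernel-size-1 heads for boundary
        regression and actionness prediction."""

        def __init__(self, in_channels, anchor_scale, hidden=256):
            super().__init__()
            dilation = max(1, anchor_scale // 6)   # assumption: larger anchors -> larger receptive field
            self.features = nn.Sequential(
                nn.Conv1d(in_channels, hidden, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
                nn.Conv1d(hidden, hidden, kernel_size=3,
                          padding=dilation, dilation=dilation),
                nn.ReLU(inplace=True),
            )
            self.regress = nn.Conv1d(hidden, 2, kernel_size=1)      # start / end offsets
            self.actionness = nn.Conv1d(hidden, 1, kernel_size=1)   # anchor actionness

        def forward(self, tube_features):           # (batch, channels, time)
            h = self.features(tube_features)
            return self.regress(h), torch.sigmoid(self.actionness(h))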

(b) Segment feature extraction. After generating the segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length l_s, start point t_start and end point t_end, we extract three types of features: F_center, F_start and F_end. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features F_center ∈ R^{D×l_t}, where l_t is the temporal length and D is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful for deciding the temporal boundaries of an action, this information should also be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features F_start (resp. F_end) from the segments centered at t_start (resp. t_end) with a length of 2*l_s/5, to represent the starting (resp. ending) information. We then concatenate F_start, F_center and F_end to construct the segment features F_seg.
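A NumPy sketch of this segment feature construction is given below. The temporal ROI-Pooling here is a simple average-pooling stand-in, and the pooled length `out_len` is an assumption (the thesis uses length-dependent values l_t, see Section 3.4.1).

    import numpy as np

    def temporal_roi_pool(tube_feat, start, end, out_len):
        """Average-pool the tube features (D x T) over [start, end] into out_len bins."""
        T = tube_feat.shape[1]
        start = float(min(max(0.0, start), T - 1))
        end = float(min(max(start + 1.0, end), T))
        edges = np.linspace(start, end, out_len + 1)
        pooled = []
        for k in range(out_len):
            a = int(edges[k])
            b = max(a + 1, int(np.ceil(edges[k + 1])))
            pooled.append(tube_feat[:, a:min(b, T)].mean(axis=1))
        return np.stack(pooled, axis=1)          # D x out_len

    def segment_features(tube_feat, t_start, t_end, out_len=8):
        """Concatenate F_start, F_center and F_end; the boundary segments are
        centered at t_start / t_end with total length 2*l_s/5."""
        half = (t_end - t_start) / 5.0           # half of 2*l_s/5
        f_center = temporal_roi_pool(tube_feat, t_start, t_end, out_len)
        f_start = temporal_roi_pool(tube_feat, t_start - half, t_start + half, out_len)
        f_end = temporal_roi_pool(tube_feat, t_end - half, t_end + half, out_len)
        return np.concatenate([f_start, f_center, f_end], axis=1)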

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine the temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting the actionness for each frame, which eventually helps refine the boundaries of the segments.
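A PyTorch sketch of one detection-head sub-network under illustrative assumptions (hidden width, temporal pooling choice) is shown below; it mirrors the structure described above: a spatial fully-connected layer, two temporal convolutions with kernel size 3, and parallel outputs for boundary regression, segment-level actionness and the auxiliary frame-level actionness.

    import torch
    import torch.nn as nn

    class DetectionHeadBranch(nn.Module):
        def __init__(self, in_dim, hidden=512):
            super().__init__()
            self.spatial_fc = nn.Linear(in_dim * 7 * 7, hidden)   # collapse the 7x7 spatial grid
            self.temporal = nn.Sequential(
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(inplace=True),
            )
            self.regress = nn.Linear(hidden, 2)               # start / end offsets of the segment
            self.seg_actionness = nn.Linear(hidden, 1)        # segment-level actionness
            self.frame_actionness = nn.Conv1d(hidden, 1, 1)   # frame-level actionness (auxiliary loss)

        def forward(self, x):                    # x: (batch, time, channels, 7, 7)
            h = self.spatial_fc(x.flatten(2))    # (batch, time, hidden)
            h = self.temporal(h.transpose(1, 2)) # (batch, hidden, time)
            pooled = h.mean(dim=2)               # temporal pooling for segment-level outputs
            offsets = self.regress(pooled)
            seg_score = torch.sigmoid(self.seg_actionness(pooled))
            frame_scores = torch.sigmoid(self.frame_actionness(h))
            return offsets, seg_score, frame_scores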

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve the action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of a feature cooperation module and a proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1×1 convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detection method, due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features, as described in Section 3.3.2 (b), for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2 and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes; it is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered correct when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap with the ground-truth boxes. The MABO is calculated by averaging these per-class average IoU scores over all classes.
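MABO can be computed as sketched below. The per-class input format is an assumption for illustration, and the iou() helper from the earlier label-assignment sketch in Section 3.2.4 is reused.

    def mean_average_best_overlap(per_class_results):
        """per_class_results maps a class name to a list of (gt_box, detections)
        pairs; for each ground-truth box the best IoU among the detections is
        taken, ABO averages these per class, and MABO averages ABO over classes."""
        abos = []
        for cls, pairs in per_class_results.items():
            best_ious = [max((iou(det, gt) for det in dets), default=0.0)
                         for gt, dets in pairs]
            if best_ious:
                abos.append(sum(best_ious) / len(best_ious))
        return sum(abos) / len(abos) if abos else 0.0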


To evaluate the classification performance of our method, we report the average confidence score at different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which is decreased by a factor of 10 at the 5th epoch and again at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations of the action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as [20, 80], [80, 160], [160, 320] and [320, ∞), i.e., we have M = 4, and the corresponding feature lengths l_t for these length ranges are set to 15, 30, 60 and 120. The numbers of scales and lengths are empirically decided based on the frame numbers. In this work, we apply NMS with an overlap threshold of 0.5 to remove redundancy and combine the detection results from different stages. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which is decreased by a factor of 10 at the 4th epoch and again at the 5th epoch.
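For illustration, the grouping of segment proposals by length range can be expressed as a simple lookup over the M = 4 ranges listed above; the fallback for proposals shorter than 20 frames is an assumption.

    def length_group(segment_length):
        """Map a segment proposal length (in frames) to its length-range group
        index and the corresponding pooled feature length l_t."""
        ranges = [(20, 80, 15), (80, 160, 30), (160, 320, 60), (320, float("inf"), 120)]
        for idx, (lo, hi, lt) in enumerate(ranges):
            if lo <= segment_length < hi:
                return idx, lt
        return 0, 15   # proposals shorter than 20 frames fall back to the first group (assumption)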

3.4.2 Comparison with the State-of-the-art methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, for which we report the results obtained by only using the single-stream (i.e., RGB or flow) features, as well as the result obtained by applying NMS to the union set of the bounding boxes from the two streams. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                    Streams     mAP
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2

Table 3.1 shows the frame-level mAPs of different methods on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] with an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for frame-level action detection.


Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here, "AD" stands for the actionness detector and "ASD" for the action segment detector.

Method (IoU threshold δ)            Streams          0.2    0.5    0.75   0.5:0.95
Weinzaepfel et al. [83]             RGB+Flow         46.8   -      -      -
Zolfaghari et al. [105]             RGB+Flow+Pose    47.6   -      -      -
Peng and Schmid [54]                RGB+Flow         73.5   32.1   2.7    7.3
Saha et al. [60]                    RGB+Flow         66.6   36.4   0.8    14.4
Yang, Gao and Nevatia [93]          RGB+Flow         73.5   37.8   -      -
Singh et al. [67]                   RGB+Flow         73.5   46.3   15.0   20.4
Ye, Yang and Tian [94]              RGB+Flow         76.2   -      -      -
Kalogeiton et al. [26]              RGB+Flow         76.5   49.2   19.7   23.4
Li et al. [31]                      RGB+Flow         77.9   -      -      -
Gu et al. [19]                      RGB+Flow         -      59.9   -      -
Faster R-CNN + FPN (RGB) [38]       RGB              77.5   51.3   14.2   21.9
Faster R-CNN + FPN (Flow) [38]      Flow             72.7   44.2   7.4    17.2
Faster R-CNN + FPN [38]             RGB+Flow         80.1   53.2   15.9   23.7
PCSC + AD [69]                      RGB+Flow         84.3   61.0   23.0   27.8
PCSC + ASD + AD + T-PCSC (Ours)     RGB+Flow         84.7   63.1   23.6   28.9

However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they perform worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metric [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated by our PCSC method.


Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                              Streams    mAP (%)
Peng and Schmid [54]                RGB+Flow   58.5
Ye, Yang and Tian [94]              RGB+Flow   63.2
Kalogeiton et al. [26]              RGB+Flow   65.7
Hou, Chen and Shah [21]             RGB        61.3
Gu et al. [19]                      RGB+Flow   73.3
Sun et al. [72]                     RGB+Flow   77.9
Faster R-CNN + FPN (RGB) [38]       RGB        64.7
Faster R-CNN + FPN (Flow) [38]      Flow       68.4
Faster R-CNN + FPN [38]             RGB+Flow   70.2
PCSC (Ours)                         RGB+Flow   74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4).


Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method                              Streams          0.2    0.5    0.75   0.5:0.95
Gkioxari and Malik [18]             RGB+Flow         -      53.3   -      -
Wang et al. [79]                    RGB+Flow         -      56.4   -      -
Weinzaepfel et al. [83]             RGB+Flow         63.1   60.7   -      -
Saha et al. [60]                    RGB+Flow         72.6   71.5   43.3   40.0
Peng and Schmid [54]                RGB+Flow         74.1   73.1   -      -
Singh et al. [67]                   RGB+Flow         73.8   72.0   44.5   41.6
Kalogeiton et al. [26]              RGB+Flow         74.2   73.7   52.1   44.8
Ye, Yang and Tian [94]              RGB+Flow         75.8   73.8   -      -
Zolfaghari et al. [105]             RGB+Flow+Pose    78.2   73.4   -      -
Hou, Chen and Shah [21]             RGB              78.4   76.9   -      -
Gu et al. [19]                      RGB+Flow         -      78.6   -      -
Sun et al. [72]                     RGB+Flow         -      80.1   -      -
Faster R-CNN + FPN (RGB) [38]       RGB              74.5   73.9   54.1   44.2
Faster R-CNN + FPN (Flow) [38]      Flow             77.9   76.1   56.1   46.3
Faster R-CNN + FPN [38]             RGB+Flow         79.1   78.5   57.2   47.6
PCSC (Ours)                         RGB+Flow         82.6   82.2   63.1   52.8

When using the IoU threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset (mAP, %).

Stage   PCSC w/o feature cooperation   PCSC
0       75.5                           78.1
1       76.1                           78.6
2       76.4                           78.8
3       76.7                           79.1
4       76.7                           79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here, "AD" stands for the actionness detector and "ASD" for the action segment detector.

Method                             Video mAP (%)
Faster R-CNN + FPN [38]            53.2
PCSC                               56.4
PCSC + AD                          61.0
PCSC + ASD (single branch)         58.4
PCSC + ASD                         60.7
PCSC + ASD + AD                    61.9
PCSC + ASD + AD + T-PCSC           63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage   Video mAP (%)
0       61.9
1       62.8
2       63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then applying non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from the RGB and Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method, with or without the feature-level cooperation module, improves as the number of stages increases. However, such improvement becomes saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
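To make the stage-wise combination concrete, the sketch below shows how detections from all stages could be pooled and filtered with a standard greedy NMS. It only illustrates the procedure described above; the box format ([x1, y1, x2, y2, score]) and the NMS threshold are assumptions, not values taken from this thesis.

```python
import numpy as np

def iou(box, boxes):
    # IoU between one box and an array of boxes, format [x1, y1, x2, y2]
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter + 1e-8)

def nms(dets, thr=0.5):
    # greedy non-maximum suppression; dets: (N, 5) array of [x1, y1, x2, y2, score]
    order = dets[:, 4].argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        if order.size == 1:
            break
        rest = order[1:]
        order = rest[iou(dets[i, :4], dets[rest, :4]) <= thr]
    return dets[keep]

def stage_output(per_stage_dets, stage, thr=0.5):
    # output at a given stage = NMS over the union of detections from Stage 0 .. stage
    pooled = np.concatenate(per_stage_dets[:stage + 1], axis=0)
    return nms(pooled, thr)
```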

Action tubes refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD" we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD" we use the PCSC method for spatial localization while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)" we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD" the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC" the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%.


Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage of our PCSC action detector on the UCF-101-24 dataset.

Stage   LoR (%)
1       61.1
2       71.9
3       95.2
4       95.4

After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC already achieves a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results are complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method with different numbers of stages, to verify the benefit of this progressive strategy in the temporal domain.


Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage of our PCSC action detector on the UCF-101-24 dataset.

Stage   MABO (%)
1       84.1
2       84.3
3       84.5
4       84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ      Stage 1   Stage 2   Stage 3   Stage 4
0.5    55.3      55.7      56.3      57.0
0.6    62.1      63.2      64.3      65.5
0.7    68.0      68.5      69.4      69.5
0.8    73.9      74.5      74.8      75.1
0.9    78.4      78.4      79.5      80.1

Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                    0    1    2    3    4
Detection speed (FPS)    31   29   28   26   24

Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                    0    1    2
Detection speed (VPS)    21   17   15

Stage 0 is our "PCSC + ASD + AD" method, while Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method improves as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the two streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
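A minimal sketch of how this LoR statistic could be computed from two sets of per-frame region proposals is given below; the box layout ([x1, y1, x2, y2]) and helper names are my own choices, and only the 0.7 mIoU threshold comes from the text above.

```python
import numpy as np

def pairwise_iou(a, b):
    # a: (N, 4), b: (M, 4) boxes as [x1, y1, x2, y2]; returns the (N, M) IoU matrix
    x1 = np.maximum(a[:, None, 0], b[None, :, 0]); y1 = np.maximum(a[:, None, 1], b[None, :, 1])
    x2 = np.minimum(a[:, None, 2], b[None, :, 2]); y2 = np.minimum(a[:, None, 3], b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (a[:, 2] - a[:, 0]) * (a[:, 3] - a[:, 1])
    area_b = (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

def large_overlap_ratio(rgb_proposals, flow_proposals, thr=0.7):
    # mIoU of each RGB proposal = its maximum IoU over all flow proposals;
    # LoR = fraction of RGB proposals whose mIoU exceeds thr
    miou = pairwise_iou(rgb_proposals, flow_proposals).max(axis=1)
    return float((miou > thr).mean())
```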

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be clearly observed, which demonstrates the effectiveness of our PCSC method for both localization and classification.
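For reference, MABO can be read as the mean, over action classes, of the average best overlap between each ground-truth box and the proposals of its frame. The sketch below assumes that ground-truth boxes and proposals have already been grouped per class (and per frame), which is a simplification of the actual evaluation, not the thesis implementation.

```python
import numpy as np

def best_overlap(gt_box, proposals):
    # highest IoU between one ground-truth box and a set of proposal boxes [x1, y1, x2, y2]
    x1 = np.maximum(gt_box[0], proposals[:, 0]); y1 = np.maximum(gt_box[1], proposals[:, 1])
    x2 = np.minimum(gt_box[2], proposals[:, 2]); y2 = np.minimum(gt_box[3], proposals[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = ((gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
             + (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1]) - inter)
    return float((inter / (union + 1e-8)).max())

def mabo(gt_boxes_per_class, proposals_per_class):
    # both arguments: dicts mapping class name -> (N, 4) numpy arrays of boxes
    abos = []
    for cls, gts in gt_boxes_per_class.items():
        props = proposals_per_class[cls]
        abos.append(np.mean([best_overlap(g, props) for g in gts]))
    return float(np.mean(abos))
```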



Figure 3.4: An example of the visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and our temporal boundary refinement method slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.



Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class labels "Tennis Swing" for (b) and "Soccer Juggling" for (c); however, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams, and our two-stream cooperation method.


The detection method using only the RGB or the flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process by our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases in which our method does not work (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by our TBR module, it cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) do not work well for this challenging case either. Therefore, how to reduce false detections from background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. However, each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve a high recall rate and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed that apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited RGB frames and optical flow maps to extract appearance and motion features, and then combined the two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos; for example, many temporal action localization methods employ the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories together with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize actions for each proposal. Most recent works can be roughly categorized into two types: (1) the works [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied a two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details; however, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances; unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial to fully exploit the complementarity between the appearance and motion clues.

4.2 Our Approach

In the following sections, we will first briefly introduce our overall temporal action localization framework (in Sec. 4.2.1), and then present how to perform progressive cross-granularity cooperation (in Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (in Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as V = {I_t ∈ R^{H×W×3}}_{t=1}^{T}, where I_t is an RGB frame at time t with height H and width W. We seek to detect the action categories {y_i}_{i=1}^{N} and their temporal durations {(t_{i,s}, t_{i,e})}_{i=1}^{N} in this untrimmed video, where N is the total number of detected action instances, and t_{i,s} and t_{i,e} indicate the starting and ending time of each detected action y_i.

The proposed progressive cross-granularity cooperation framework builds upon the effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video V, we extract its RGB and flow features B^RGB and B^Flow from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is C × T, where C indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. Based on the base features B^RGB and B^Flow, each stream generates a set of temporal segment proposals of interest (called proposals in later text), namely 𝒫^RGB = {p_i^RGB | p_i^RGB = (t_{i,s}^RGB, t_{i,e}^RGB)}_{i=1}^{N^RGB} and 𝒫^Flow = {p_i^Flow | p_i^Flow = (t_{i,s}^Flow, t_{i,e}^Flow)}_{i=1}^{N^Flow}, where N^RGB and N^Flow are the numbers of proposals for the two streams. We then extract the anchor-based features P^RGB_0 and P^Flow_0 for each stream by applying two temporal convolutional layers on top of B^RGB/B^Flow, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs of the detection head are the refined action segments S^RGB_0 and S^Flow_0 and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features B^RGB and B^Flow, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed Q^RGB_0 and Q^Flow_0, respectively. The frame-based feature in each stream is then processed by two parallel 1×1 convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., 𝒬^RGB_0 and 𝒬^Flow_0, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
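A rough sketch of the pairing step described above follows: positions with high predicted starting and ending probabilities are paired into candidate segments, scored, and de-duplicated with a 1D NMS. The probability threshold, the maximum duration, and the choice of scoring by the product of the two probabilities are assumptions made for illustration only.

```python
import numpy as np

def temporal_nms(segs, thr=0.7):
    # segs: (N, 3) array of [t_start, t_end, score]; greedy NMS over temporal IoU
    order = segs[:, 2].argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]; keep.append(i)
        rest = order[1:]
        inter = np.clip(np.minimum(segs[i, 1], segs[rest, 1]) -
                        np.maximum(segs[i, 0], segs[rest, 0]), 0, None)
        union = (segs[i, 1] - segs[i, 0]) + (segs[rest, 1] - segs[rest, 0]) - inter
        order = rest[inter / (union + 1e-8) <= thr]
    return segs[keep]

def frame_based_proposals(start_prob, end_prob, prob_thr=0.5, max_len=128):
    # start_prob / end_prob: (T,) per-position probabilities from the two 1x1 conv + sigmoid heads
    starts = np.where(start_prob > prob_thr)[0]
    ends = np.where(end_prob > prob_thr)[0]
    cands = [(s, e, start_prob[s] * end_prob[e])
             for s in starts for e in ends if 0 < e - s <= max_len]
    if not cands:
        return np.zeros((0, 3))
    return temporal_nms(np.array(cands, dtype=float))
```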

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation modules, our method progressively applies the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results.


Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated at the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at stage n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better presentation, when n = 1 the initial proposals S^RGB_{n−2} and features P^RGB_{n−2}, Q^Flow_{n−2} are denoted as S^RGB_0, P^RGB_0 and Q^Flow_0. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing an NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage n (n is an odd number) employs the preceding two-stream detected action segments S^Flow_{n−1} and S^RGB_{n−2} as the input two-stream proposals, and uses the preceding two-stream anchor-based features P^Flow_{n−1} and P^RGB_{n−2} and the flow-stream frame-based feature Q^Flow_{n−2} as the input features. This AFC stage outputs the RGB-stream action segments S^RGB_n and anchor-based feature P^RGB_n, and updates the flow-stream frame-based features as Q^Flow_n. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively pass the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that within each AFC stage this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since these rich cooperations are performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.
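The three building blocks of each AFC module are detailed in the next subsection (Sec. 4.2.3); as a preview, the sketch below outlines how one RGB-stream AFC stage could combine them. All sub-modules are passed in as callables, and every name here is illustrative rather than taken from the actual implementation.

```python
def rgb_stream_afc_stage(p_rgb_prev2, p_flow_prev, q_flow_prev2,
                         s_rgb_prev2, s_flow_prev,
                         cross_granularity_mp, cross_stream_mp,
                         frame_proposal_generator, detection_head):
    # (1) cross-granularity message passing: improve the flow-stream frame-based feature
    q_flow_n = cross_granularity_mp(p_flow_prev, q_flow_prev2)
    # frame-based proposals generated from the improved frame-based feature
    frame_proposals = frame_proposal_generator(q_flow_n)
    # (2) proposal combination: union of two-stream anchor-based segments and frame-based proposals
    proposals_rgb_n = list(s_rgb_prev2) + list(s_flow_prev) + list(frame_proposals)
    # (3) cross-stream anchor-to-anchor message passing: improve the RGB anchor-based feature
    p_rgb_n = cross_stream_mp(p_flow_prev, p_rgb_prev2)
    # the detection head refines the combined proposals and predicts action categories
    s_rgb_n = detection_head(p_rgb_n, proposals_rgb_n)
    return s_rgb_n, p_rgb_n, q_flow_n
```

The flow-stream stage at an even n would be obtained by swapping the RGB and flow roles above.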

4.2.3 Anchor-frame Cooperation Module

The Anchor-frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features P^RGB_0, P^Flow_0 and Q^RGB_0, Q^Flow_0, as well as the proposals S^RGB_0 and S^Flow_0, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features P^Flow_{n−1} to the frame-based features Q^Flow_{n−2}, both from the flow stream. The message passing operation is defined as

    Q^Flow_n = M_{anchor→frame}(P^Flow_{n−1}) ⊕ Q^Flow_{n−2},    (4.1)

where ⊕ is the element-wise addition operation and M_{anchor→frame}(·) is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked 1×1 convolutional layers with ReLU non-linear activation functions. The improved feature Q^Flow_n can generate more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and thus produces better frame-based proposals 𝒬^Flow_n, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as Q^RGB_n = M_{anchor→frame}(P^RGB_{n−1}) ⊕ Q^RGB_{n−2}, which simply flips the superscripts from Flow to RGB; Q^RGB_n likewise leads to better frame-based RGB-stream proposals 𝒬^RGB_n.
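A minimal PyTorch sketch of the message passing function M_{anchor→frame} in Eq. (4.1) is given below: two stacked 1×1 temporal convolutions with ReLU, whose output is added element-wise to the frame-based feature. The assumption that both features share the same channel dimension C (with shape (batch, C, T)) is mine, not stated in the text.

```python
import torch
import torch.nn as nn

class CrossGranularityMessagePassing(nn.Module):
    """M_{anchor->frame}: two stacked 1x1 convolutions with ReLU (cf. Eq. 4.1)."""
    def __init__(self, channels):
        super().__init__()
        self.message = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1), nn.ReLU(inplace=True))

    def forward(self, p_anchor, q_frame):
        # Q_n = M(P_{n-1}) (+) Q_{n-2}: element-wise addition of the passed message
        return self.message(p_anchor) + q_frame

# illustrative usage (channel size 256 is an assumption):
# mp = CrossGranularityMessagePassing(256)
# q_flow_n = mp(p_flow_prev, q_flow_prev2)
```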

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals 𝒬^Flow_n by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features Q^Flow_n. We then combine 𝒬^Flow_n together with the two-stream anchor-based proposals S^Flow_{n−1} and S^RGB_{n−2} to form an updated set of anchor-based proposals at stage n, i.e., 𝒫^RGB_n = S^RGB_{n−2} ∪ S^Flow_{n−1} ∪ 𝒬^Flow_n. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as 𝒫^Flow_n = S^Flow_{n−2} ∪ S^RGB_{n−1} ∪ 𝒬^RGB_n.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to equation (4.1) by a message passing operation, i.e., P^RGB_n = M_{flow→rgb}(P^Flow_{n−1}) ⊕ P^RGB_{n−2} for the RGB stream and P^Flow_n = M_{rgb→flow}(P^RGB_{n−1}) ⊕ P^Flow_{n−2} for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category of each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals 𝒫^RGB_n / 𝒫^Flow_n and the updated anchor-based features P^RGB_n / P^Flow_n. To increase the perceptual aperture around action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work), and then concatenate these pooled features after the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.

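The boundary-context pooling can be pictured as below: the proposal span and two windows around its start and end are each pooled from the anchor-based feature map and concatenated before being fed to the regressor and classifier. The pooled length, the use of adaptive max pooling, and the tensor layout are assumptions made for this sketch only.

```python
import torch
import torch.nn.functional as F

def extended_soi_feature(feat, start, end, interval=10, out_len=8):
    # feat: (C, T) anchor-based feature; start/end: temporal indices of one proposal
    C, T = feat.shape
    def pool(lo, hi):
        lo, hi = max(0, int(lo)), min(T, max(int(hi), int(lo) + 1))
        return F.adaptive_max_pool1d(feat[:, lo:hi].unsqueeze(0), out_len).flatten()
    # proposal span plus two boundary windows of the given interval, concatenated
    return torch.cat([pool(start, end),
                      pool(start - interval, start + interval),
                      pool(end - interval, end + interval)])
```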
4.2.4 Training Details

At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss of each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. On the other hand, the training loss of each frame-based branch is also a cross-entropy loss, which indicates the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
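A compact sketch of the per-stage objective described above follows, with equal weights on all terms. The weighting and the use of binary cross-entropy for the frame-wise start/end heads are my assumptions; the text only states that cross-entropy and smooth-L1 losses are summed and optimized jointly.

```python
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels,        # anchor-based branch: action classification
                   reg_pred, reg_target,          # anchor-based branch: boundary regression
                   start_logits, start_labels,    # frame-based branch: frame-wise start labels
                   end_logits, end_labels):       # frame-based branch: frame-wise end labels
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_target)
    loss_frame = (F.binary_cross_entropy_with_logits(start_logits, start_labels) +
                  F.binary_cross_entropy_with_logits(end_logits, end_labels))
    # all stage losses are summed and the whole model is trained jointly
    return loss_cls + loss_reg + loss_frame
```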

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module as illustrated in Fig. 4.1(b-c), one can also simplify the structure by either removing the entire frame-based branch or removing the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework and is thus to some extent similar to the method proposed in [69]; however, this variant is applied entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and higher recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features thanks to the gradually absorbed cross-stream information and cross-granularity knowledge, and thus significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. However, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation of the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions in each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the minibatch size to 256 for the anchor-based proposal generation network and 64 for the detection head. The initial learning rate is 0.001 for both the THUMOS14 and the UCF-101-24 datasets, and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlapping area, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
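For clarity, the tIoU criterion can be written out as below; the same logic applied to spatio-temporal volumes gives the st-IoU used on UCF-101-24. Function names are illustrative only.

```python
def tiou(seg_a, seg_b):
    # temporal IoU between two segments given as (start, end)
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def is_correct_detection(pred_seg, pred_label, gt_seg, gt_label, delta=0.5):
    # a detection is counted as correct when its overlap with a ground-truth instance
    # exceeds the threshold delta and the predicted action label matches
    return pred_label == gt_label and tiou(pred_seg, gt_seg) > delta
```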

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL method outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues.


Table 4.1: Comparison (mAP, %) on the THUMOS14 dataset using different tIoU thresholds.

Method             0.1    0.2    0.3    0.4    0.5
SCNN [63]          47.7   43.5   36.3   28.7   19.0
REINFORCE [95]     48.9   44.0   36.0   26.4   17.1
CDC [62]           -      -      40.1   29.4   23.3
SMS [98]           51.0   45.2   36.5   27.8   17.8
TURN [15]          54.0   50.9   44.1   34.9   25.6
R-C3D [85]         54.5   51.5   44.8   35.6   28.9
SSN [103]          66.0   59.4   51.9   41.0   29.8
BSN [37]           -      -      53.5   45.0   36.9
MGG [46]           -      -      53.9   46.8   37.4
BMN [35]           -      -      56.0   47.4   38.8
TAL-Net [5]        59.8   57.1   53.2   48.5   42.8
P-GCN [99]         69.5   67.8   63.6   57.8   49.1
Ours               71.2   68.9   65.1   59.5   51.2

When using the tIoU threshold δ = 0.5, our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds (δ = 0.5, 0.75, 0.95) on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. The results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 6.87% and 8.63%, respectively.


Table 4.2: Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

Method          0.5     0.75    0.95   Average
CDC [62]        45.30   26.00   0.20   23.80
TCN [9]         36.44   21.15   3.90   -
R-C3D [85]      26.80   -       -      -
SSN [103]       39.12   23.48   5.49   23.98
TAL-Net [5]     38.23   18.30   1.30   20.22
P-GCN [99]      42.90   28.14   2.47   26.99
Ours            44.31   29.85   5.47   28.85

Table 4.3: Comparison (mAPs, %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

Method          0.5     0.75    0.95   Average
BSN [37]        46.45   29.96   8.02   30.03
P-GCN [99]      48.26   33.16   3.27   31.11
BMN [35]        50.07   34.78   8.29   33.85
Ours            52.04   35.92   7.97   34.91

In Table 4.3, it is observed that both BSN [37] and BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, in which case its average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = 0.2, 0.5 and 0.75, as well as the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAPs, %) on the UCF-101-24 dataset using different st-IoU thresholds.

Method          0.2    0.5    0.75   0.5:0.95
LTT [83]        46.8   -      -      -
MRTSR [54]      73.5   32.1   2.7    7.3
MSAT [60]       66.6   36.4   7.9    14.4
ROAD [67]       73.5   46.3   15.0   20.4
ACT [26]        76.5   49.2   19.7   23.4
TAD [19]        -      59.9   -      -
PCSC [69]       84.3   61.0   23.0   27.8
Ours            84.9   63.5   23.7   29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) "w/o CS-MP": we remove the "Cross-stream Message Passing" block from our AFC module; (2) "w/o CG-MP": we remove the "Cross-granularity Message Passing" block from our AFC module; (3) "w/o PC": we remove the "Proposal Combination" block from our AFC module, in which case 𝒬^Flow_n and 𝒬^RGB_n are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) "w/o CS": we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which case the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) "w/o CG": we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) "AFC": our complete method.


Table 4.5: Ablation study (mAP, %, at tIoU threshold δ = 0.5) among different AFC-related variants at different training stages on the THUMOS14 dataset.

Method        Stage 0   Stage 1   Stage 2   Stage 3   Stage 4
AFC           44.75     46.92     48.14     48.37     48.22
w/o CS        44.21     44.96     45.47     45.48     45.37
w/o CG        43.55     45.95     46.71     46.65     46.67
w/o CS-MP     44.75     45.34     45.51     45.53     45.49
w/o CG-MP     44.75     46.32     46.97     46.85     46.79
w/o PC        42.14     43.21     45.14     45.74     45.77

Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively P^Flow_{n−1} and P^RGB_{n−2} (for the RGB-stream AFC module) and P^RGB_{n−1} and P^Flow_{n−2} (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as for our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are 𝒫^RGB_0 ∪ 𝒫^Flow_0 for "w/o CG" and 𝒬^RGB_0 ∪ 𝒬^Flow_0 for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved as the number of stages increases, and the results become stable after two or three stages.

78Chapter 4 PCG-TAL Progressive Cross-granularity Cooperation for

Temporal Action Localization

Table 46 LoR () scores of proposals from adjacent stages This scoreis calculated for each video and averaged for all videos in the test set ofthe THUMOS14 dataset

Stage Comparisons LoR ()

1 PRGB1 vs SFlow

0 6712 PFlow

2 vs PRGB1 792

3 PRGB3 vs PFlow

2 8954 PFlow

4 vs PRGB3 971

when the number of stages increases and the results will become stableafter 2 or 3 stages

Similarity of Proposals. To evaluate the similarity of the proposals from different stages for our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets $P^{RGB}_1$ at Stage 1 and $P^{Flow}_2$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $P^{Flow}_2$ among all proposals in $P^{Flow}_2$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $P^{RGB}_1$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $P^{Flow}_2$) is calculated by comparing it with all proposals in the other proposal set (say $P^{RGB}_1$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
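For clarity, the LoR computation described above can be sketched with a minimal NumPy snippet; the proposal sets are assumed to be arrays of (start, end) segments in seconds, and the 0.7 threshold follows the definition above.

import numpy as np

def t_iou(seg, segs):
    # Temporal IoU between one segment and an array of segments.
    inter = np.maximum(0.0, np.minimum(seg[1], segs[:, 1]) - np.maximum(seg[0], segs[:, 0]))
    union = (seg[1] - seg[0]) + (segs[:, 1] - segs[:, 0]) - inter
    return inter / np.maximum(union, 1e-8)

def lor_score(curr_props, prev_props, thresh=0.7):
    # Ratio of proposals in curr_props whose max-tIoU with prev_props exceeds thresh.
    if len(curr_props) == 0:
        return 0.0
    similar = sum(t_iou(p, prev_props).max() > thresh for p in curr_props)
    return similar / len(curr_props)

# Example: LoR at Stage 2 compares P^Flow_2 against P^RGB_1 (toy segments).
p_flow_2 = np.array([[10.0, 20.0], [35.0, 50.0]])
p_rgb_1 = np.array([[11.0, 21.0], [80.0, 90.0]])
print(lor_score(p_flow_2, p_rgb_1))  # -> 0.5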

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $P^{Flow}_4$ and those in $P^{RGB}_3$ are very similar even though they are generated from different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated from our proposed method. We follow the work in [37] to calculate the average recall (AR) when using different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
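As a rough illustration, the AR@AN computation described above can be sketched as follows; this is a simplified NumPy version which assumes the proposals of each video are already sorted by confidence score.

import numpy as np

def t_iou_matrix(gts, props):
    # Pairwise temporal IoU between ground-truth segments and proposals.
    inter = np.maximum(0.0, np.minimum(gts[:, None, 1], props[None, :, 1])
                       - np.maximum(gts[:, None, 0], props[None, :, 0]))
    union = ((gts[:, 1] - gts[:, 0])[:, None]
             + (props[:, 1] - props[:, 0])[None, :] - inter)
    return inter / np.maximum(union, 1e-8)

def average_recall_at_an(gts, props, an):
    # AR@AN: recall averaged over tIoU thresholds 0.5:0.05:1.0 using the top-AN proposals.
    ious = t_iou_matrix(gts, props[:an])
    thresholds = np.arange(0.5, 1.01, 0.05)       # eleven thresholds
    recalls = [(ious.max(axis=1) >= th).mean() for th in thresholds]
    return float(np.mean(recalls))

# Toy example with two ground-truth segments and three ranked proposals (in seconds).
gts = np.array([[10.0, 20.0], [40.0, 55.0]])
props = np.array([[10.5, 19.5], [41.0, 54.0], [70.0, 80.0]])
print(average_recall_at_an(gts, props, an=2))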

In Fig. 4.2, we compare the AR@AN results of the action segments generated from the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results from our PCG-TAL method at stage 0 are the results from the "SC" method. We observe that the action segments generated from the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments as used in the "SC" method can lead to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates from our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated from various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.


Figure 4.3: Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37] and our PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting or ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and the frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals. The proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using the pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et al. [102] proposed a new method (referred to as STGRN) that does not rely on the pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. In the concurrent work STGVT [76], similar to our proposed framework, a visual-linguistic transformer is also adopted to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], the pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], in this work we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely, we have $V = \{V^{clip}_k\}_{k=1}^{K}$, where $V^{clip}_k$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which take video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We will describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token.
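As a rough sketch of this visual encoding step, the clip feature can be constructed as follows with torchvision's ResNet-101; the "4th residual block" is taken here as the last residual stage, and the input frame resolution is a hypothetical choice.

import torch
import torchvision

# Backbone up to the last residual stage, assuming an ImageNet-pretrained ResNet-101.
resnet = torchvision.models.resnet101(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

def encode_clip(frames):
    # frames: (T, 3, H_img, W_img) RGB frames of one video clip.
    with torch.no_grad():
        feat = backbone(frames)                   # (T, C, H, W) with C = 2048
    # Reshape each frame feature map to HW x C and stack over the clip.
    clip_feat = feat.flatten(2).permute(0, 2, 1)  # (T, HW, C)
    return clip_feat

clip_feat = encode_clip(torch.randn(4, 3, 224, 224))
print(clip_feat.shape)  # torch.Size([4, 49, 2048]) for 224x224 inputs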

Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space will be lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by using the pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$.


Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


We then pass the combined visual feature to the textual branch, where it is used as the key and value in the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F^{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
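A minimal PyTorch sketch of this combination/decomposition step is given below; the module and parameter names are hypothetical, and the interaction with the textual branch is abstracted as a single multi-head attention call rather than the full co-attention layer.

import torch
import torch.nn as nn

class STCD(nn.Module):
    # Spatio-temporal Combination and Decomposition (sketch): pool, combine,
    # attend over textual tokens, then decompose and broadcast back to T x HW x C.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, text):
        # visual: (B, T, HW, C), text: (B, N, C)
        B, T, HW, C = visual.shape
        temporal = visual.mean(dim=2)                 # spatial pooling  -> (B, T, C)
        spatial = visual.mean(dim=1)                  # temporal pooling -> (B, HW, C)
        combined = torch.cat([temporal, spatial], 1)  # (B, T + HW, C)

        # Text-guided visual feature: combined features attend to the textual tokens.
        guided, _ = self.attn(query=combined, key=text, value=text)

        # Decompose, replicate back to the input resolution, add and normalize.
        t_part = guided[:, :T].unsqueeze(2).expand(B, T, HW, C)
        s_part = guided[:, T:].unsqueeze(1).expand(B, T, HW, C)
        return self.norm(visual + t_part + s_part)    # intermediate visual feature

stcd = STCD(dim=256)
out = stcd(torch.randn(2, 4, 49, 256), torch.randn(2, 12, 256))
print(out.shape)  # torch.Size([2, 4, 49, 256])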

Spatial Localization. Given the text-guided visual feature $F^{tv}$, our SVG-net predicts the bounding box $b_t$ for each frame. We first reshape $F^{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
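The two detection heads and the center/size decoding step can be sketched as follows; this is a simplified single-frame version with hypothetical layer names, and the three deconvolution layers that perform the 8x upsampling are assumed to have been applied already.

import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    # Center-heatmap head and box-size head (CenterNet-style sketch).
    def __init__(self, in_ch, mid_ch=64):
        super().__init__()
        self.center_head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, 1))                 # 1-channel center heatmap
        self.size_head = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 2, 1))                 # (w, h) at every position

    def forward(self, feat):
        # feat: (B, C, 8H, 8W) upsampled per-frame feature
        heatmap = torch.sigmoid(self.center_head(feat)).squeeze(1)  # (B, 8H, 8W)
        sizes = self.size_head(feat)                                # (B, 2, 8H, 8W)
        return heatmap, sizes

def decode_box(heatmap, sizes):
    # Pick the highest-probability center and read off the predicted width/height.
    B, Hm, Wm = heatmap.shape
    idx = heatmap.flatten(1).argmax(dim=1)
    cy, cx = idx // Wm, idx % Wm
    w = sizes[torch.arange(B), 0, cy, cx]
    h = sizes[torch.arange(B), 1, cy, cx]
    # top-left and bottom-right corners
    return torch.stack([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2], dim=1)

heads = DetectionHeads(in_ch=256)
hm, sz = heads(torch.randn(2, 256, 64, 64))
print(decode_box(hm, sz).shape)  # torch.Size([2, 4])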

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F^{tv}$ to produce the global text-guided visual feature $F^{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F^{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}^{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold. Namely, the bounding boxes with the smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B^{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
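A small NumPy/SciPy sketch of this temporal selection step (median filtering, thresholding, and keeping the longest contiguous run of above-threshold frames) is shown below; the window size of 9 and threshold of 0.3 follow the implementation details reported later in Sec. 5.3.1, and the toy window used in the example is smaller only for readability.

import numpy as np
from scipy.ndimage import median_filter

def initial_temporal_boundaries(rel_scores, window=9, thresh=0.3):
    # Return (t_s, t_e) of the longest run of frames whose smoothed relevance
    # score exceeds the threshold; rel_scores is a length-KT array in [0, 1].
    smoothed = median_filter(rel_scores, size=window)
    keep = smoothed >= thresh

    best, start = (0, 0, 0), None
    for t, k in enumerate(keep):
        if k and start is None:
            start = t
        if (not k or t == len(keep) - 1) and start is not None:
            end = t if k else t - 1
            if end - start > best[0]:
                best = (end - start, start, end)
            start = None
    return best[1], best[2]

scores = np.array([0.1, 0.2, 0.6, 0.7, 0.8, 0.75, 0.2, 0.1, 0.9, 0.85])
print(initial_temporal_boundaries(scores, window=3))  # -> (2, 5)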

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, we have $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A$ for each frame by using the Gaussian kernel

$a_{xy} = \exp\left(-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{2\sigma^2}\right)$,

where $a_{xy}$ is the value of $A \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$, and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(\hat{o}^{obj}, o) + \lambda_3 L_{size}(\hat{w}_{xy}, \hat{h}_{xy}, w, h)$    (5.1)

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $\hat{o}^{obj}$ is the predicted relevance score, and $\hat{w}_{xy}$ and $\hat{h}_{xy}$ are the predicted width and height of the bounding box centered at the position $(x, y)$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
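As a condensed illustration of Eq. (5.1), a PyTorch sketch of the per-frame objective could look as follows; this assumes the focal loss takes the CenterNet-style form and treats the relevance term as a binary cross-entropy, and the helper names are hypothetical.

import torch
import torch.nn.functional as F

def center_focal_loss(pred, gt, alpha=2.0, beta=4.0, eps=1e-6):
    # CenterNet-style focal loss over the predicted center heatmap (sketch).
    pos = gt.eq(1.0).float()
    neg = 1.0 - pos
    pos_loss = -((1 - pred) ** alpha) * torch.log(pred + eps) * pos
    neg_loss = -((1 - gt) ** beta) * (pred ** alpha) * torch.log(1 - pred + eps) * neg
    num_pos = pos.sum().clamp(min=1.0)
    return (pos_loss.sum() + neg_loss.sum()) / num_pos

def svg_loss(pred_heat, gt_heat, pred_rel, gt_rel, pred_size, gt_size,
             lambdas=(1.0, 1.0, 0.1)):
    # Per-frame SVG-net objective: focal + relevance cross-entropy + L1 size loss.
    l_c = center_focal_loss(pred_heat, gt_heat)
    l_rel = F.binary_cross_entropy(pred_rel, gt_rel)
    l_size = F.l1_loss(pred_size, gt_size)
    return lambdas[0] * l_c + lambdas[1] * l_rel + lambdas[2] * l_size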

5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine the boundaries in local starting/ending regions, they may falsely detect the temporal segments because the overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B^{init}$) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries $t_s$ and $t_e$.


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


The frame-based branch then continues to refine the boundaries $t_s$ and $t_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can then be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details of the other stages are similar to those of the first stage.

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before the starting frame and after the ending frame by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features at the 20 sampled frames along the temporal dimension from the pooled video feature $F^{pool}$, where the pooled video feature $F^{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $t^1_s$ and $t^1_e$.
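A condensed PyTorch-style sketch of this anchor-based refinement step is given below; the ViLBERT co-attention module is abstracted as a generic callable that is assumed to return (text-guided visual feature, visual-guided textual feature), and all module names are hypothetical.

import torch
import torch.nn as nn

class AnchorBranch(nn.Module):
    # Anchor-based temporal refinement (sketch): sample 20 frames around the
    # extended anchor, fuse with text via a co-attention module, regress offsets.
    def __init__(self, co_attention, dim, num_samples=20):
        super().__init__()
        self.co_attention = co_attention      # e.g. a ViLBERT-style co-attention module
        self.num_samples = num_samples
        self.offset_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, pooled_video, text_tokens, t_s, t_e):
        # pooled_video: (KT, C) spatially pooled per-frame features of the whole video
        length = t_e - t_s
        lo, hi = t_s - length / 4.0, t_e + length / 4.0       # extend by l/4 on both sides
        idx = torch.linspace(lo, hi, self.num_samples).round().long()
        idx = idx.clamp(0, pooled_video.shape[0] - 1)
        anchor_feat = pooled_video[idx].unsqueeze(0)          # (1, 20, C)

        guided, _ = self.co_attention(anchor_feat, text_tokens)   # text-guided anchor feature
        offsets = self.offset_mlp(guided.mean(dim=1))              # (1, 2) offsets for (t_s, t_e)
        return t_s + offsets[0, 0], t_e + offsets[0, 1]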

Frame-based branch. For the frame-based branch, we extract the feature of each frame within the starting/ending duration and predict the probability of this frame being the starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $t^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [t^1_s - D/2 + 1, t^1_s + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F^{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $t^1_s$. Accordingly, we can produce the ending frame position $t^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t^1_s$ and $t^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.
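The frame-based refinement can be sketched similarly, again with a generic co-attention callable and hypothetical names; the kernel size of 1 for the 1D convolution follows the description above, while the window length D used here is an illustrative choice.

import torch
import torch.nn as nn

class FrameBranch(nn.Module):
    # Frame-based boundary refinement (sketch): score every frame inside a local
    # window around the current boundary and pick the most probable one.
    def __init__(self, co_attention, dim, window=16):
        super().__init__()
        self.co_attention = co_attention
        self.window = window
        # Separate kernel-size-1 1D convolutions for the starting and ending boundaries.
        self.start_conv = nn.Conv1d(dim, 1, kernel_size=1)
        self.end_conv = nn.Conv1d(dim, 1, kernel_size=1)

    def refine(self, pooled_video, text_tokens, t, conv):
        lo = max(int(t) - self.window // 2 + 1, 0)
        hi = min(int(t) + self.window // 2, pooled_video.shape[0] - 1)
        local = pooled_video[lo:hi + 1].unsqueeze(0)               # (1, D, C)
        guided, _ = self.co_attention(local, text_tokens)           # text-guided frame features
        probs = conv(guided.transpose(1, 2)).squeeze(1).sigmoid()   # (1, D)
        return lo + probs.argmax(dim=1).item()

    def forward(self, pooled_video, text_tokens, t_s, t_e):
        new_s = self.refine(pooled_video, text_tokens, t_s, self.start_conv)
        new_e = self.refine(pooled_video, text_tokens, t_e, self.end_conv)
        return new_s, new_e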

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$L_{TBR} = \lambda_4 L_{anc}(\Delta\hat{t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e)$    (5.2)

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\Delta\hat{t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


-VidSTG: This dataset consists of 99943 sentence descriptions, with 44808 declarative sentences and 55135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36202 (resp. 44482), 3996 (resp. 4960) and 4610 (resp. 5693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

-HC-STVG: This dataset consists of 5660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4500 and 1160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pre-trained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pre-trained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at the frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold $\theta_p$ and the temporal length of each clip $T$ are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|} \sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tube and the ground-truth tube (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tube or the ground-truth tube. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
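For concreteness, a small NumPy sketch of the vIoU, m_vIoU and vIoU@R computation defined above is given below; per-frame boxes are assumed to be given as dictionaries keyed by frame index.

import numpy as np

def box_iou(a, b):
    # IoU between two boxes in (x1, y1, x2, y2) format.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / max(area_a + area_b - inter, 1e-8)

def v_iou(pred_tube, gt_tube):
    # pred_tube / gt_tube: dicts mapping frame index -> box.
    s_i = set(pred_tube) & set(gt_tube)          # intersected frames S_I
    s_u = set(pred_tube) | set(gt_tube)          # union of frames S_U
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / max(len(s_u), 1)

def m_viou_and_recall(pred_tubes, gt_tubes, r=0.5):
    scores = np.array([v_iou(p, g) for p, g in zip(pred_tubes, gt_tubes)])
    return scores.mean(), (scores > r).mean()    # m_vIoU and vIoU@R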

Baseline Methods

-STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

-STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

-Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

-SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed temporal boundary refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset has many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual feature in our ST-ViLBERT.

Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates that the results are quoted from the original work.

                            Declarative Sentence Grounding            Interrogative Sentence Grounding
Method                      m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)       m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*                19.75      25.77        14.60             18.32      21.10        12.83
Vanilla (ours)              21.49      27.69        15.77             20.22      23.17        13.82
SVG-net (ours)              22.87      28.31        16.31             21.57      24.09        14.33
SVG-net + TBR-net (ours)    25.21      30.99        18.21             23.67      26.26        16.01

Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates that the results are quoted from the original work.

Method                      m_vIoU     vIoU@0.3     vIoU@0.5
STGVT [76]*                 18.15      26.81        9.48
Vanilla (ours)              17.94      25.59        8.75
SVG-net (ours)              19.45      27.78        10.12
SVG-net + TBR-net (ours)    22.41      30.31        12.07

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

Through the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                  Declarative Sentence Grounding      Interrogative Sentence Grounding
Method               Stages       m_vIoU   vIoU@0.3   vIoU@0.5        m_vIoU   vIoU@0.3   vIoU@0.5
SVG-net              -            22.87    28.31      16.31           21.57    24.09      14.33
SVG-net + IFB-net    1            23.41    28.84      16.69           22.12    24.48      14.57
                     2            23.82    28.97      16.85           22.55    24.69      14.71
                     3            23.77    28.91      16.87           22.61    24.72      14.79
SVG-net + IAB-net    1            23.27    29.57      16.39           21.78    24.93      14.31
                     2            23.74    29.98      16.52           22.21    25.42      14.59
                     3            23.85    30.15      16.57           22.43    25.71      14.47
SVG-net + TBR-net    1            24.32    29.92      17.22           22.69    25.37      15.29
                     2            24.95    30.75      17.02           23.34    26.03      15.83
                     3            25.21    30.99      18.21           23.67    26.26      16.01

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continually grow in the future. The learning techniques for video localization will therefore attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we summarize the contributions of our work and discuss the potential directions of video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows:

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level.


The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential future direction of video localization could be unifying spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812

[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100

[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970

[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308

[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139

[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206

[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894

108 Bibliography

[8] G Cheacuteron A Osokin I Laptev and C Schmid ldquoModeling Spatio-Temporal Human Track Structure for Action Localizationrdquo InCoRR abs180611008 (2018) arXiv 180611008 URL httparxivorgabs180611008

[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802

[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990

[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255

[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941

[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83

[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275

[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636

[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448

[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587

Bibliography 109

[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015

[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056

[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778

[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017

[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708

[23] E Ilg N Mayer T Saikia M Keuper A Dosovitskiy and T BroxldquoFlownet 20 Evolution of optical flow estimation with deep net-worksrdquo In IEEE conference on computer vision and pattern recogni-tion (CVPR) Vol 2 2017 p 6

[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199

[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014

[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105

[28] H. Law and J. Deng. "Cornernet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.

[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.

[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "Tvqa+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).

[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European conference on computer vision (ECCV). 2018, pp. 303–318.

[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.

[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: Arxiv. 2019.

[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.

[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.

[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM international conference on Multimedia. 2017, pp. 988–996.

[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "Bsn: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.

[38] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 2980–2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft coco: Common objects in context". In: European conference on computer vision. Springer. 2014, pp. 740–755.

[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.

[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.

[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press. 2018, pp. 849–855.

[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "Ssd: Single shot multibox detector". In: European conference on computer vision. Springer. 2016, pp. 21–37.

[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.

[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.

[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.

[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.

[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in lstms for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.

[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.

[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.

[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.

[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.

[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer. 2016, pp. 744–759.

[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.

[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3d residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.

[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 779–788.

[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster r-cnn: Towards real-time object detection with region proposal networks". In: Advances in neural information processing systems. 2015, pp. 91–99.

[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.

[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.

[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.

[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.

[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage cnns". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.

[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.

[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.

[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.

[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).

[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.

[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.

[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "Videobert: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.

[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.

[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "Fishnet: A versatile backbone for image, region and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.

[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2015, pp. 1–9.

[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).

[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3d convolutional networks". In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 4489–4497.

[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in neural information processing systems. 2017, pp. 5998–6008.

[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.

[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "Untrimmednets for weakly supervised action recognition and detection". In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.

[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European conference on computer vision. Springer. 2016, pp. 20–36.

[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.

[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE international conference on computer vision. 2015, pp. 3164–3172.

[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.

[85] H. Xu, A. Das, and K. Saenko. "R-c3d: Region convolutional 3d network for temporal activity detection". In: Proceedings of the IEEE international conference on computer vision. 2017, pp. 5783–5792.

[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.

[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.

[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.

[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.

[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).

[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.

[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).

[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.

[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.

[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "Mattnet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.

[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.

[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.

[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.

[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised domain adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.

[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.

[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist: Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.

[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).

[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.


For Mama and Papa


Acknowledgements

First, I would like to thank my supervisor, Prof. Dong Xu, for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offer me valuable suggestions and inspirations in our collaborated works. With the useful discussions with them, I work out many difficult problems and am also encouraged to explore unknowns in the future.

Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.

Rui SU

University of Sydney
June 17, 2021


List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.


Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
1.1 Spatial-temporal Action Localization 2
1.2 Temporal Action Localization 5
1.3 Spatio-temporal Visual Grounding 8
1.4 Thesis Outline 11

2 Literature Review 13
2.1 Background 13
2.1.1 Image Classification 13
2.1.2 Object Detection 15
2.1.3 Action Recognition 18
2.2 Spatio-temporal Action Localization 20
2.3 Temporal Action Localization 22
2.4 Spatio-temporal Visual Grounding 25
2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
3.1 Background 30
3.1.1 Spatial temporal localization methods 30
3.1.2 Two-stream R-CNN 31
3.2 Action Detection Model in Spatial Domain 32
3.2.1 PCSC Overview 32
3.2.2 Cross-stream Region Proposal Cooperation 34
3.2.3 Cross-stream Feature Cooperation 36
3.2.4 Training Details 38
3.3 Temporal Boundary Refinement for Action Tubes 38
3.3.1 Actionness Detector (AD) 41
3.3.2 Action Segment Detector (ASD) 42
3.3.3 Temporal PCSC (T-PCSC) 44
3.4 Experimental results 46
3.4.1 Experimental Setup 46
3.4.2 Comparison with the State-of-the-art methods 47
Results on the UCF-101-24 Dataset 48
Results on the J-HMDB Dataset 50
3.4.3 Ablation Study 52
3.4.4 Analysis of Our PCSC at Different Stages 56
3.4.5 Runtime Analysis 57
3.4.6 Qualitative Results 58
3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
4.1 Background 62
4.1.1 Action Recognition 62
4.1.2 Spatio-temporal Action Localization 62
4.1.3 Temporal Action Localization 63
4.2 Our Approach 64
4.2.1 Overall Framework 64
4.2.2 Progressive Cross-Granularity Cooperation 67
4.2.3 Anchor-frame Cooperation Module 68
4.2.4 Training Details 70
4.2.5 Discussions 71
4.3 Experiment 72
4.3.1 Datasets and Setup 72
4.3.2 Comparison with the State-of-the-art Methods 73
4.3.3 Ablation study 76
4.3.4 Qualitative Results 81
4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
5.1 Background 84
5.1.1 Vision-language Modelling 84
5.1.2 Visual Grounding in Images/Videos 84
5.2 Methodology 85
5.2.1 Overview 85
5.2.2 Spatial Visual Grounding 87
5.2.3 Temporal Boundary Refinement 92
5.3 Experiment 95
5.3.1 Experiment Setup 95
5.3.2 Comparison with the State-of-the-art Methods 98
5.3.3 Ablation Study 98
5.4 Summary 102

6 Conclusions and Future Work 103
6.1 Conclusions 103
6.2 Future Work 104

Bibliography 107


List of Figures

1.1 Illustration of the spatio-temporal action localization task 2
1.2 Illustration of the temporal action localization task 6

3.1 (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b), (c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c). 33

3.2 Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of two streams at both feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features $F^i_t$, where the superscript $i \in \{RGB, Flow\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability. 39

3.3 Comparison of the single-stream frameworks of the action detection method (see Section 3) and our action segment detector (see Section 4.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. 43

3.4 An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.) 57

3.5 Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect the bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.) 58

4.1 (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stage 1 and 3 (resp. the flow-stream AFC modules at Stage 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better representation, when n = 1, the initial proposals $S^{RGB}_{n-2}$ and features $P^{RGB}_{n-2}$, $Q^{Flow}_{n-2}$ are denoted as $S^{RGB}_0$, $P^{RGB}_0$ and $Q^{Flow}_0$. The initial proposals and features from anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both RGB and flow streams. The newly combined proposals together with the improved anchor-based features are used as the input to the "Detection Head" to generate the updated segments. 66

4.2 AR@AN (%) for the action segments generated from various methods at different number of stages on the THUMOS14 dataset. 80

4.3 Qualitative comparisons of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different number of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame from the 97.7-th second to the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor just finishes the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment. 81

5.1 (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes. 86

5.2 (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)). 89

5.3 Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net. 93


List of Tables

3.1 Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5 48

3.2 Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector 49

3.3 Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5 50

3.4 Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds 51

3.5 Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset 52

3.6 Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector while "ASD" is for the action segment detector 52

3.7 Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset 52

3.8 The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset 54

3.9 The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset 55

3.10 The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset 55

3.11 Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages 55

3.12 Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages 55

4.1 Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds 74

4.2 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used 75

4.3 Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied 75

4.4 Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds 76

4.5 Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset 77

4.6 LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged for all videos in the test set of the THUMOS14 dataset 78

5.1 Results of different methods on the VidSTG dataset. Quoted results are taken from their original works 99

5.2 Results of different methods on the HC-STVG dataset. Quoted results are taken from the original works 100

5.3 Results of our method and its variants when using different numbers of stages on the VidSTG dataset 101


Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., a frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of action instances. The temporal action localization task aims to only capture the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.


Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is the untrimmed videos, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion, and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture human actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large amount of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segment them out from the entire video clip. Data association methods based on the spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains an extremely hard challenge to precisely decide the temporal boundary. As a result, there is a large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations by capturing both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
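To make the feature-level cooperation more concrete, the following PyTorch-style sketch illustrates how RoI features from one stream could be refined with a message computed from the other stream. The module name, channel sizes and the simple additive fusion are illustrative assumptions rather than the exact design described in Chapter 3.

import torch
import torch.nn as nn

class CrossStreamMessagePassing(nn.Module):
    # Toy sketch: refine RGB-stream region features with a message
    # computed from the corresponding flow-stream region features.
    def __init__(self, channels=256):
        super().__init__()
        self.message = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=1),
        )

    def forward(self, rgb_feat, flow_feat):
        # rgb_feat, flow_feat: (num_proposals, C, H, W) RoI-pooled features
        return rgb_feat + self.message(flow_feat)

# usage: refine RGB RoI features with a message from the flow stream
rgb, flow = torch.randn(8, 256, 7, 7), torch.randn(8, 256, 7, 7)
refined_rgb = CrossStreamMessagePassing()(rgb, flow)

In the same spirit, the roles of the two streams can be swapped at the next stage so that the flow features are refined with a message from the RGB features.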

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we can extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and therefore can learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better when compared with an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues to detect accurate temporal boundaries and further improve spatio-temporal localization results at the video level.
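As a rough illustration of the class-specific design, the sketch below uses one independent binary head per action class and scores the frames of a tube with the head of the predicted class. The feature dimension and the number of classes are assumptions for illustration only.

import torch
import torch.nn as nn

class ClassSpecificActionness(nn.Module):
    # Toy sketch: one binary actionness classifier per action class,
    # instead of a single class-agnostic actionness detector.
    def __init__(self, feat_dim=1024, num_classes=24):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, 1) for _ in range(num_classes)]
        )

    def forward(self, frame_feat, class_id):
        # frame_feat: (T, feat_dim) per-frame features of one action tube
        # class_id: action class predicted for this tube
        logits = self.heads[class_id](frame_feat).squeeze(-1)  # (T,)
        return torch.sigmoid(logits)  # per-frame actionness probabilities

scores = ClassSpecificActionness()(torch.randn(16, 1024), class_id=3)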

Our contributions are briefly summarized as follows:

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector that applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to the object detection methods like [58], the prior methods [9, 15, 85, 5] handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate the proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.
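For reference, the second stage of such a two-stage pipeline typically reduces to a classification head and a boundary regression head applied to each pooled segment-proposal feature, as in the minimal sketch below (the feature dimension and the number of classes are illustrative assumptions).

import torch
import torch.nn as nn

class SegmentProposalHead(nn.Module):
    # Toy sketch of the second stage of a two-stage temporal detector:
    # classify each 1D segment proposal and regress its boundary offsets.
    def __init__(self, feat_dim=512, num_classes=20):
        super().__init__()
        self.cls = nn.Linear(feat_dim, num_classes + 1)  # +1 for background
        self.reg = nn.Linear(feat_dim, 2)                # (start, end) offsets

    def forward(self, proposal_feat):
        # proposal_feat: (num_proposals, feat_dim), pooled over each segment
        return self.cls(proposal_feat), self.reg(proposal_feat)

cls_scores, boundary_offsets = SegmentProposalHead()(torch.randn(32, 512))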

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.


Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods.

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. But we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered as complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy will improve the frame-based features and filter out false starting/ending point pairs with invalid durations, and thus enhance the frame-based methods to generate better proposals having higher overlapping areas with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy will collect complete and accurate proposals with the aid of the frame-based approaches, which can help the subsequent action classification and boundary regression process. In order to effectively integrate complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, it is suboptimal to trivially repeat the aforementioned cross-granularity cooperation process, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with the cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.
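A minimal sketch of the proposal-level cooperation is shown below: the 1D segment proposals produced by an anchor-based branch and a frame-based branch are simply pooled together and de-duplicated with non-maximum suppression. The merging rule here is an illustrative assumption and not the exact combination strategy used in Chapter 4.

import torch
from torchvision.ops import nms

def combine_segment_proposals(anchor_segs, frame_segs, anchor_scores, frame_scores, iou_thr=0.7):
    # anchor_segs, frame_segs: (N, 2) tensors of (start, end) times in seconds
    segs = torch.cat([anchor_segs, frame_segs], dim=0)
    scores = torch.cat([anchor_scores, frame_scores], dim=0)
    # reuse 2D NMS by lifting each 1D segment to a thin box (x1, y1, x2, y2),
    # so that box IoU equals temporal IoU
    boxes = torch.stack(
        [segs[:, 0], torch.zeros_like(scores), segs[:, 1], torch.ones_like(scores)], dim=1
    )
    keep = nms(boxes, scores, iou_thr)
    return segs[keep], scores[keep]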

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module to improve the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also to update the segment proposal set by combining the RGB and flow segment proposals generated from the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatial-temporal action localization.

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles for humans to understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localizations are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both spatial and temporal domains is also a key issue for accurately localizing the target object, especially in the challenging scenarios where different persons often perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to be well generalized to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on the frame-level prediction scores, which often produce more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework called STVGBert for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results achieved for various vision-language tasks [47, 70, 33].
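The two-step inference can be summarized by the following sketch, where svg_net and tbr_net stand for the two sub-networks with assumed (hypothetical) interfaces; it only illustrates the data flow, not the actual implementation.

def spatio_temporal_grounding(video_clip, query_tokens, svg_net, tbr_net):
    # svg_net(video, text) -> per-frame boxes (T, 4) forming an initial tube
    # tbr_net(video, text, boxes) -> refined (t_start, t_end) frame indices
    boxes = svg_net(video_clip, query_tokens)                   # step 1: spatial grounding
    t_start, t_end = tbr_net(video_clip, query_tokens, boxes)   # step 2: temporal refinement
    return boxes[t_start:t_end + 1], (t_start, t_end)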

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47] named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input feature. As a result, our SVG-net can effectively learn the cross-modal representation and produce the initial spatio-temporal tubes with reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows:

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we summarize the contributions of this work and point out potential directions of video localization in the future.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks and then summarize the related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between to build a convolutional neural network named AlexNet for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work has stimulated the research interest in using deep learning methods to solve computer vision problems.
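To make the training procedure described above concrete, the following minimal PyTorch sketch performs one gradient-descent update for a small convolutional classifier with ReLU activations; the network, optimizer settings and dummy data are illustrative choices and not the actual AlexNet configuration.

```python
import torch
import torch.nn as nn

# A small convolutional classifier with ReLU activations (illustrative only).
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 10),
)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# One training step: compute the loss between the prediction and the ground
# truth, back-propagate the gradients and update the convolutional weights.
images = torch.randn(8, 3, 32, 32)          # a dummy mini-batch
labels = torch.randint(0, 10, (8,))
loss = criterion(model(images), labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```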

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network named VGG-Net to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3 × 3 convolutional layers can achieve the same effective receptive field as a 5 × 5 convolutional layer, and the effective receptive field of a 7 × 7 convolutional layer can be achieved by stacking three 3 × 3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3 × 3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
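The parameter saving can be checked with a quick calculation. The sketch below compares, for an arbitrary channel number (C = 64 is assumed here purely for illustration), a single large-kernel convolution against the stack of 3 × 3 convolutions that covers the same effective receptive field.

```python
def conv_params(k, c_in, c_out, bias=True):
    """Number of weights in a k x k convolutional layer."""
    return k * k * c_in * c_out + (c_out if bias else 0)

C = 64  # an arbitrary channel number used only for illustration
# One 5x5 layer vs. two stacked 3x3 layers (same 5x5 receptive field).
print(conv_params(5, C, C), "vs", 2 * conv_params(3, C, C))
# One 7x7 layer vs. three stacked 3x3 layers (same 7x7 receptive field).
print(conv_params(7, C, C), "vs", 3 * conv_params(3, C, C))
```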

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.
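The following PyTorch-style sketch illustrates the parallel-branch idea of an inception module in simplified form; the branch widths and the placement of the 1 × 1 convolutions are illustrative and omit details of the actual Inception architecture (e.g., dimension-reduction layers before the large kernels).

```python
import torch
import torch.nn as nn

class SimpleInception(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 convolutions and a pooling branch,
    concatenated along the channel dimension (simplified sketch)."""
    def __init__(self, c_in, c_branch=32):
        super().__init__()
        self.b1 = nn.Conv2d(c_in, c_branch, kernel_size=1)
        self.b3 = nn.Conv2d(c_in, c_branch, kernel_size=3, padding=1)
        self.b5 = nn.Conv2d(c_in, c_branch, kernel_size=5, padding=2)
        self.pool = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),
                                  nn.Conv2d(c_in, c_branch, kernel_size=1))

    def forward(self, x):
        # Concatenate the outputs of all branches to form the module output.
        return torch.cat([self.b1(x), self.b3(x), self.b5(x), self.pool(x)], dim=1)

out = SimpleInception(64)(torch.randn(1, 64, 28, 28))  # -> (1, 128, 28, 28)
```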

VGG-Net and Inception have shown that the performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values, which results in large updates during the training process and leads to an unstable network. In order to address the gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with extremely high depth to improve the image classification performance.
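A minimal sketch of a residual block with an identity shortcut is given below; batch normalization and the projection shortcut used in the actual ResNet are omitted for brevity.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two 3x3 convolutions with an identity shortcut (simplified sketch).
    The shortcut lets gradients flow directly to shallower layers."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # element-wise addition of the shortcut

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
```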

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections with the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from each layer can be reused in the following layers. Note that the network becomes wider as it goes deeper.
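The dense connectivity pattern can be sketched as follows; the growth rate, the number of layers and the omission of batch normalization are simplifications made for illustration.

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """Each layer takes the concatenation of all previous feature maps as its
    input, so earlier features are reused by later layers (simplified sketch)."""
    def __init__(self, c_in, growth=16, num_layers=3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Conv2d(c_in + i * growth, growth, kernel_size=3, padding=1)
            for i in range(num_layers))

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(torch.relu(layer(torch.cat(features, dim=1))))
        return torch.cat(features, dim=1)  # the block output grows in width

y = DenseBlock(32)(torch.randn(1, 32, 16, 16))  # -> (1, 32 + 3*16, 16, 16)
```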

2.1.2 Object Detection

The object detection task aims to detect the locations and the categories of the objects appearing in a given image. With the promising image classification performance achieved by using deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were the candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference time.

To make the convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost for detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image to the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category prediction and the offsets. By doing so, the training and inference time were largely reduced.
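As an illustration of the ROI pooling operation, the sketch below uses the roi_pool operator from torchvision to pool one shared feature map into fixed-size features for two region proposals; the feature map and box coordinates are made up.

```python
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)       # features of the whole image
# Region proposals in (batch_index, x1, y1, x2, y2) format, given directly in
# feature-map coordinates here for simplicity (made-up values).
rois = torch.tensor([[0, 4.0, 4.0, 20.0, 30.0],
                     [0, 10.0, 2.0, 45.0, 18.0]])
pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]): fixed-size per-proposal features
```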

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of the object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image was fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors was used at each location of the generated feature map to generate region proposals by a separate network. The final detection results were obtained from a detection head by using the pooled features based on the ROI pooling operation on the generated feature map.

Faster RCNN is considered as a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require region feature extraction by using the time-consuming ROI pooling operation.

Similarly, Liu et al. [44] proposed a single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
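A simplified sketch of how such center-point predictions can be decoded into bounding boxes is given below: local maxima of the heat map are kept (a common max-pooling trick) and combined with the predicted width and height at the same locations. The tensor shapes, the score threshold and the absence of an offset head are illustrative simplifications rather than the actual CenterNet decoding.

```python
import torch
import torch.nn.functional as F

def decode_centers(heatmap, wh, score_thresh=0.3):
    """heatmap: (C, H, W) center probabilities; wh: (2, H, W) predicted box sizes.
    Returns a list of (class, x1, y1, x2, y2, score) tuples (simplified sketch)."""
    # Keep only local maxima of the heat map (acts as a simple NMS).
    peaks = (heatmap == F.max_pool2d(heatmap[None], 3, stride=1, padding=1)[0])
    keep = peaks & (heatmap > score_thresh)
    boxes = []
    for cls, y, x in keep.nonzero(as_tuple=False).tolist():
        w, h = wh[0, y, x].item(), wh[1, y, x].item()
        boxes.append((cls, x - w / 2, y - h / 2, x + w / 2, y + h / 2,
                      heatmap[cls, y, x].item()))
    return boxes

boxes = decode_centers(torch.rand(3, 64, 64), torch.rand(2, 64, 64) * 10)
```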


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, the research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on the convolutional neural network, which takes the RGB frames from the given videos as input and outputs the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helps the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict the action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance of the action recognition task, which demonstrates that the appearance and motion information are complementary to each other.

In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights on how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks to improve the action recognition performance. They claimed that temporal information is important, but consecutive frames from a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and then a consensus function was applied to the outputs of each stream. Finally, the outputs from the two streams were combined to consider the complementary information between the two modalities. Additionally, they also took advantage of the features extracted from warped flow and RGB difference, which further benefits the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization help prevent the framework from overfitting the training samples.
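The segment-based sampling strategy can be sketched as follows; the number of segments and the uniform segment split are illustrative choices.

```python
import random

def sample_frames(num_frames, num_segments=3):
    """Divide a video of num_frames frames into equal segments and randomly
    pick one frame index from each segment (sketch of the sampling strategy)."""
    seg_len = num_frames // num_segments
    return [seg * seg_len + random.randrange(seg_len) for seg in range(num_segments)]

print(sample_frames(90))  # e.g. [12, 47, 75]: one representative frame per segment
```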

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] further investigated the effect of 3D convolution on feature extraction for videos. They claimed that while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3 × 3 × 3 is a reasonable setting among several proposed settings and designed a 3D convolutional network with such convolutional layers, named C3D. Compared with the two-stream framework, the 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network called I3D for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] to a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme in the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be incorporated. The proposed framework generated frame-level detections, and the Viterbi algorithm was used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated from the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by ensuring label consistency.

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for the two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN framework used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed the ACT-detector to consider the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes to form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form the representation of the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than the action tubes linked from frame-level detections because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization named STEP [89] was proposed by Yang et al. They first used several hand-designed coarse anchors to search for the actions of interest, and then progressively refined the coordinates of the anchors through iterations in a coarse-to-fine manner. In each iteration, the anchors were temporally extended to gradually include more temporal context in order to generate high quality proposals. Moreover, by doing so, STEP was able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of the untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution feature, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. Then they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consists of two parts. In the first part, 3D convolutional layers are used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers are applied in the spatial domain to reduce the resolution, while deconvolutional layers are employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict the action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method called temporal actionness grouping to address the temporal action localization problem. The input videos are first divided into snippets, and an actionness classifier is used to predict the binary actionness score for each snippet. Consecutive snippets with high actionness scores are grouped to form proposals, which are then used to build a feature pyramid to evaluate their action completeness. Their method improves the performance on the temporal action localization task compared with other state-of-the-art methods.
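A minimal sketch of this grouping idea is shown below: consecutive snippets whose actionness scores exceed a threshold are merged into candidate segments. The threshold and scores are made up, and the actual temporal actionness grouping algorithm is more elaborate (e.g., it tolerates short gaps and sweeps multiple thresholds).

```python
def group_snippets(actionness, thresh=0.5):
    """Group consecutive snippets with high actionness scores into
    (start, end) segment proposals (simplified sketch)."""
    proposals, start = [], None
    for t, score in enumerate(actionness):
        if score >= thresh and start is None:
            start = t                            # a segment begins
        elif score < thresh and start is not None:
            proposals.append((start, t - 1))     # the segment ends
            start = None
    if start is not None:
        proposals.append((start, len(actionness) - 1))
    return proposals

print(group_snippets([0.1, 0.7, 0.8, 0.9, 0.2, 0.1, 0.6, 0.7]))  # [(1, 3), (6, 7)]
```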

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict low actionness scores for frames that actually lie within action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied the anchor-based object detection framework, which first generates a set of temporal segment anchors, and then detection heads consisting of boundary regressors and action classifiers are used to refine the locations and predict the action classes for the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea came from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method achieves results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict the offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue brought by the temporal localization problem, they used a multi-scale architecture so that the extreme variation of action durations can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated by the pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision and language task. The goal of spatio-temporal visual grounding is to localize the target objects from the given videos by using the textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals from each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of the implicit and explicit spatial subgraphs, which reason about the spatial relationships within each frame, and the dynamic temporal subgraph, which exploits the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, which were then used to generate spatio-temporal tubes by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationship between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link the proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select the target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems that are studied in this thesis. Specifically, we first briefly introduce the existing methods for the image classification problem. Then we review the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarize the two-stream framework and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also review the existing spatio-temporal action localization methods. Additionally, we review the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods. The recent works on combining these two types of methods to leverage the complementary information are also discussed. Finally, we give a review on the new vision and language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3, we propose a new algorithm called progressive cross-stream cooperation for spatio-temporal action localization, in which we build a two-stream two-stage object detection framework and exchange complementary information between these two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on the observations about the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework can remove the dependence on pre-trained object detectors. We use the anchor-free object detection framework to directly generate object tubes from the frames of the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region proposals (resp. temporal segment proposals) and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but also the action classes can be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples, which helps learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatial-temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving the three tasks from different perspectives. First, for spatial localization, the state-of-the-art human detection methods are utilized to obtain precise object proposals, which include the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].

Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features, in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatial-temporal action localization.

Third, temporal localization is to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve the localization accuracy, a variety of temporal refinement methods have been proposed, e.g., the traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.

Finally, several methods were also proposed to improve the action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] followed the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve the action detection performance. For example, in [60], the softmax score from each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams, ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum, and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
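As a concrete illustration of fusion strategy i) above, the sketch below simply averages the per-class softmax scores of the two streams for one detected box; the logits are made-up values.

```python
import torch

rgb_logits = torch.tensor([2.0, 0.5, -1.0])    # appearance-stream outputs (made up)
flow_logits = torch.tensor([1.2, 1.5, -0.5])   # motion-stream outputs (made up)

# Strategy i): average the softmax scores of the two streams and take the argmax.
fused = 0.5 * (rgb_logits.softmax(0) + flow_logits.softmax(0))
print(fused, fused.argmax().item())
```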

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance for another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results.

Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve the action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).

Based on the refined region proposals, we also perform feature-level cooperation by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundaries are further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate the candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method, we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage. The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $\mathcal{P}^{(i)}_t$ and $\mathcal{B}^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $\mathcal{P}^{(i)}_t$ is updated as $\mathcal{P}^{(i)}_t = \mathcal{B}^{(i)}_{t-2} \cup \mathcal{B}^{(j)}_{t-1}$, and the detection box set $\mathcal{B}^{(j)}_t$ is updated as $\mathcal{B}^{(j)}_t = G(\mathcal{P}^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $\mathcal{P}^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
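The cooperation schedule can be summarized by the following schematic sketch, in which the helper functions rpn() and detect() are hypothetical stand-ins for the RPN and the detection head G(·) and simply return dummy box lists; it only illustrates how the proposal sets are combined across stages, not the actual implementation.

```python
def rpn(stream):
    # Hypothetical stand-in for the Region Proposal Network of one stream.
    return [f"{stream}_rpn_box"]

def detect(stream, proposals):
    # Hypothetical stand-in for G(.): classify/regress the proposals (plus NMS).
    return [f"{stream}_det_from_{len(proposals)}_proposals"]

# B_0: initial detection boxes obtained from the RPN proposals P_0 of each stream.
B = {0: {s: detect(s, rpn(s)) for s in ("rgb", "flow")}}
stream_at = {1: "rgb", 2: "flow", 3: "rgb", 4: "flow"}   # stream refined at each stage

for t in range(1, 5):
    i = stream_at[t]
    j = "flow" if i == "rgb" else "rgb"
    # P_t = B_{t-2} (own stream) combined with B_{t-1} (the other stream).
    P_t = B[max(t - 2, 0)][i] + B[t - 1][j]
    B[t] = dict(B[t - 1])          # carry over the other stream's latest boxes
    B[t][i] = detect(i, P_t)       # B_t = G(P_t)
```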

With our approach, the diversity of the region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missing bounding boxes. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of the predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers are directly used for predicting the labels of the testing data in the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work, the same pipeline used in the training stage is also adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. Then these features are upsampled by the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $\{C^{i}_{2}, C^{i}_{3}, C^{i}_{4}, C^{i}_{5}\}$, where $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $\{U^{i}_{2}, U^{i}_{3}, U^{i}_{4}, U^{i}_{5}\}$.

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing the softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information from one stream to another and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge these two streams so that they help each other for feature refinement.

We pass the messages between the feature maps in the bottom-up pathway of the two streams. Denote $l$ as the index for the set of feature maps in $\{C^{i}_{2}, C^{i}_{3}, C^{i}_{4}, C^{i}_{5}\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{RGB}_{l}$ with the help of the features $C^{Flow}_{l}$ as follows:

$$\hat{C}^{RGB}_{l} = f_{\theta}(C^{Flow}_{l}) \oplus C^{RGB}_{l}, \qquad (3.1)$$

where $\oplus$ denotes the element-wise addition of the feature maps, and $f_{\theta}(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_{\theta}(C^{Flow}_{l})$ nonlinearly extracts the message from the feature $C^{Flow}_{l}$ and then uses the extracted message to improve the features $C^{RGB}_{l}$.

The output of $f_{\theta}(\cdot)$ has to produce feature maps with the same number of channels and the same resolution as $C^{RGB}_{l}$ and $C^{Flow}_{l}$. To this end, we design our message-passing module by stacking two $1 \times 1$ convolutional layers with ReLU as the activation function. The first $1 \times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design saves the number of parameters to be learnt in the module and exchanges messages by using a two-layer non-linear transform. Once the feature maps $C^{RGB}_{l}$ in the bottom-up pathway are refined, the corresponding feature maps $U^{RGB}_{l}$ in the top-down pathway are generated accordingly.
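A PyTorch-style sketch of the message-passing module f_θ in Eqn. (3.1) is given below; the channel number, the reduction ratio and the exact placement of the ReLU activations are illustrative assumptions rather than the actual implementation.

```python
import torch
import torch.nn as nn

class MessagePassing(nn.Module):
    """f_theta in Eqn. (3.1): two 1x1 convolutions with ReLU in between.
    The first layer reduces the channel dimension, the second restores it."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1), nn.ReLU(),
            nn.Conv2d(channels // reduction, channels, kernel_size=1))

    def forward(self, c_flow, c_rgb):
        # Refined RGB features: f_theta(C^Flow_l) added element-wise to C^RGB_l.
        return self.f(c_flow) + c_rgb

mp = MessagePassing(256)
refined = mp(torch.randn(1, 256, 28, 28), torch.randn(1, 256, 28, 28))
```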

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.


Denote the image-level feature map sets for the RGB and flow streams by $I^{RGB}$ and $I^{Flow}$, respectively. They are used to extract the features $F^{RGB}_{t}$ and $F^{Flow}_{t}$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{Flow}_{t}$ of the flow stream is used to help improve the ROI feature $F^{RGB}_{t}$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{RGB}_{t}$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{RGB}_{t}$ of the RGB stream is also used to help improve the ROI feature $F^{Flow}_{t}$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between the ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit the temporal context and improve the action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of k neighbouring frames with the key frame in the middle. The region proposals generated from the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region proposal-level cooperation process.
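The label assignment rule can be written compactly as follows; the 0.5 threshold follows the text, while the (x1, y1, x2, y2) box format and the helper functions are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection-over-union of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def assign_label(proposal, gt_boxes, thresh=0.5):
    """Positive (1) if the proposal overlaps any ground-truth box with IoU > thresh."""
    return 1 if any(iou(proposal, gt) > thresh for gt in gt_boxes) else 0

print(assign_label((10, 10, 50, 50), [(12, 8, 48, 52)]))  # 1 (positive sample)
```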

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we then build the action tubes by linking them. Here, we use the same linking strategy as in [67], except that we do not apply temporal labeling.

Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation Module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) Module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve the action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve the action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment proposal level cooperation module refines the segment proposals $P^{i}_{t}$ by combining the most recent segment proposals from the two streams, while the segment feature level cooperation module improves the features $F^{i}_{t}$, where the superscript $i \in \{\mathrm{RGB}, \mathrm{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.

Although this linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from the action tubes to refine their temporal boundaries. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the 7 × 7 feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate the actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses the features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundary of the action tubes. Therefore, these samples are the "confusing" samples and are useful for the actionness classifier to learn the subtle but critical features which better determine the beginning and the end of this action. The output of the actionness classifier is the probability of actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it will be filtered out from this action tube, and then we can refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundary to obtain better action tubes at the testing stage.
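This test-time refinement amounts to a median filter followed by thresholding. Below is a minimal sketch, assuming the per-frame actionness scores of one action tube are already collected into a 1D array; the window size (17) and threshold (0.01) follow the implementation details given later in Section 3.4.1, while the function name and the toy scores are purely illustrative.

```python
import numpy as np
from scipy.signal import medfilt

def refine_tube_boundaries(actionness_scores, window_size=17, threshold=0.01):
    """Smooth per-frame actionness scores with a median filter and keep
    only the frames whose smoothed score exceeds the threshold.

    actionness_scores: 1D array of per-frame actionness probabilities
                       for one action tube (one score per linked box).
    Returns the indices of the frames retained in the refined tube.
    """
    scores = np.asarray(actionness_scores, dtype=np.float32)
    smoothed = medfilt(scores, kernel_size=window_size)
    return np.where(smoothed >= threshold)[0]

# Toy example: a tube whose first and last frames are likely background.
scores = np.concatenate([np.full(5, 0.002), np.full(30, 0.9), np.full(5, 0.003)])
kept_frames = refine_tube_boundaries(scores)
print(kept_frames.min(), kept_frames.max())  # refined start/end frame indices
```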

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector which directly utilizes segment proposals to detect action segments. Our action segment detector takes both the RGB and optical flow features from the action tubes as the input and follows the Faster-RCNN framework [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively. Our implementation details for the SPN, feature extraction and the detection head are introduced below.

(a) SPN. In Faster-RCNN, the anchors of different object sizes are predefined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are predefined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using the RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can dramatically change in a range from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture



Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with the scale s, we require the receptive field to cover the context information with temporal span s/2 before and after the start and the end of the anchor. We then regress temporal boundaries and predict actionness probabilities for the anchors by applying two parallel temporal convolutional layers with the kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
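To make the multi-branch design more concrete, the following PyTorch-style sketch builds one sub-network per anchor scale, each with its own dilation rate and two parallel kernel-size-1 heads for actionness prediction and boundary regression. The anchor scales are those listed in Section 3.4.1, but the channel sizes, the hidden width and the rule mapping an anchor scale to a dilation rate are illustrative assumptions rather than the exact settings used in the thesis.

```python
import torch
import torch.nn as nn

class AnchorBranch(nn.Module):
    """One SPN branch: two dilated temporal convolutions whose dilation grows
    with the anchor scale (so larger anchors see a wider temporal context),
    followed by two parallel 1x1 heads for actionness and boundary regression."""
    def __init__(self, in_channels, anchor_scale, hidden=256):
        super().__init__()
        dilation = max(anchor_scale // 4, 1)  # illustrative rule, not the thesis setting
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size=3, dilation=dilation, padding=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size=3, dilation=dilation, padding=dilation),
            nn.ReLU(inplace=True),
        )
        self.actionness = nn.Conv1d(hidden, 1, kernel_size=1)  # proposal actionness score
        self.regression = nn.Conv1d(hidden, 2, kernel_size=1)  # start/end offsets

    def forward(self, x):  # x: (batch, channels, T)
        f = self.features(x)
        return torch.sigmoid(self.actionness(f)), self.regression(f)

# K anchor scales, one sub-network per scale (scale values from Section 3.4.1).
anchor_scales = [6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228]
spn = nn.ModuleList([AnchorBranch(in_channels=1024, anchor_scale=s) for s in anchor_scales])

tube_features = torch.randn(2, 1024, 120)   # toy action-tube features for one stream
scores, offsets = spn[0](tube_features)
print(scores.shape, offsets.shape)           # (2, 1, 120), (2, 2, 120)
```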

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with the length $l_s$, the start point $t_{start}$ and the end point $t_{end}$, we extract three


types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool the action tube features and obtain the features $F_{center} \in \mathbb{R}^{D \times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with the length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
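As a rough illustration of this step, the sketch below average-pools a fixed number of temporal bins from the proposal region and from the two boundary-context regions of length 2*ls/5, and concatenates the three parts. The bin count (8 per region) and the use of plain average pooling are simplifying assumptions; the thesis uses temporal ROI-Pooling with the length-dependent feature lengths given in Section 3.4.1.

```python
import numpy as np

def temporal_roi_pool(features, t_start, t_end, out_len):
    """Average-pool D x T features over out_len equal temporal bins in [t_start, t_end)."""
    D, T = features.shape
    t_start, t_end = max(t_start, 0), min(t_end, T)
    bins = np.linspace(t_start, t_end, out_len + 1)
    pooled = np.stack(
        [features[:, int(bins[i]):max(int(bins[i + 1]), int(bins[i]) + 1)].mean(axis=1)
         for i in range(out_len)], axis=1)
    return pooled  # D x out_len

def build_segment_feature(tube_features, t_start, t_end, out_len=8):
    """Concatenate F_start, F_center and F_end, where the start/end regions are
    centred at the proposal boundaries with length 2*ls/5 (ls = proposal length)."""
    ls = t_end - t_start
    ctx = ls / 5.0                         # half of the 2*ls/5 boundary window
    f_center = temporal_roi_pool(tube_features, t_start, t_end, out_len)
    f_start = temporal_roi_pool(tube_features, int(t_start - ctx), int(t_start + ctx), out_len)
    f_end = temporal_roi_pool(tube_features, int(t_end - ctx), int(t_end + ctx), out_len)
    return np.concatenate([f_start, f_center, f_end], axis=1)  # D x 3*out_len

feat = np.random.rand(512, 200)            # toy D x T action-tube features
seg_feat = build_segment_feature(feat, t_start=40, t_end=120)
print(seg_feat.shape)                      # (512, 24)
```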

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of the temporal lengths of the segment proposals resulting from the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with the kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting actionness for each frame, which eventually helps refine the boundaries of segments.

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework


from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two 1x1 convolution layers are replaced by two temporal convolution layers with the kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2 and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered as a correct one when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and by the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with the interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap to the ground-truth boxes. The MABO is calculated by averaging the averaged IoU scores over all classes.
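For clarity, the sketch below computes MABO from per-class sets of ground-truth and detected boxes; the (x1, y1, x2, y2) box format, the dictionary-based grouping and the toy example are assumptions made only for illustration.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(x2 - x1, 0.0) * max(y2 - y1, 0.0)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-8)

def mean_average_best_overlap(gt_boxes_per_class, det_boxes_per_class):
    """MABO: for every ground-truth box take its best IoU with any detection,
    average these best overlaps within each class, then average over classes."""
    abos = []
    for cls, gt_boxes in gt_boxes_per_class.items():
        dets = det_boxes_per_class.get(cls, [])
        best = [max((iou(g, d) for d in dets), default=0.0) for g in gt_boxes]
        abos.append(np.mean(best) if best else 0.0)
    return float(np.mean(abos))

# toy example with one class and one ground-truth box
gt = {"basketball": [(10, 10, 50, 80)]}
det = {"basketball": [(12, 8, 52, 78), (100, 100, 120, 140)]}
print(mean_average_best_overlap(gt, det))
```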


To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops by 10% at the 5th epoch and by another 10% at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations for action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞], i.e., we have M = 4, and the corresponding feature lengths lt for the length ranges are set to 15, 30, 60 and 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops by 10% at the 4th epoch and by another 10% at the 5th epoch.
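For convenience, the hyper-parameters listed above can be collected into a single configuration object, as sketched below; the key names are illustrative and do not come from any released code.

```python
# Hyper-parameters from Section 3.4.1, gathered into one configuration
# dictionary (the key names are illustrative, not from the released code).
CONFIG = {
    "rpn_batch_size": 256,
    "detection_head_batch_size": 512,
    "spatial_epochs": 6,
    "spatial_lr": 0.01,            # decayed at the 5th and 6th epochs
    "actionness_median_window": 17,
    "actionness_threshold": 0.01,
    "anchor_scales": [6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228],   # K = 11
    "length_ranges": [(20, 80), (80, 160), (160, 320), (320, float("inf"))],  # M = 4
    "segment_feature_lengths": [15, 30, 60, 120],
    "nms_threshold": 0.5,
    "spn_batch_size": 128,
    "segment_head_batch_size": 32,
    "temporal_epochs": 5,
    "temporal_lr": 0.001,          # decayed at the 4th and 5th epochs
}
```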

3.4.2 Comparison with the State-of-the-art Methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted


as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features as well as the result obtained by applying NMS to the union set of the two-stream bounding boxes. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                    Streams     mAP (%)
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2

Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] by an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for


Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for the actionness detector and "ASD" is for the action segment detector.

Method                             Streams          0.2     0.5     0.75    0.5:0.95
Weinzaepfel et al. [83]            RGB+Flow         46.8    -       -       -
Zolfaghari et al. [105]            RGB+Flow+Pose    47.6    -       -       -
Peng and Schmid [54]               RGB+Flow         73.5    32.1    2.7     7.3
Saha et al. [60]                   RGB+Flow         66.6    36.4    0.8     14.4
Yang, Gao and Nevatia [93]         RGB+Flow         73.5    37.8    -       -
Singh et al. [67]                  RGB+Flow         73.5    46.3    15.0    20.4
Ye, Yang and Tian [94]             RGB+Flow         76.2    -       -       -
Kalogeiton et al. [26]             RGB+Flow         76.5    49.2    19.7    23.4
Li et al. [31]                     RGB+Flow         77.9    -       -       -
Gu et al. [19]                     RGB+Flow         -       59.9    -       -
Faster R-CNN + FPN (RGB) [38]      RGB              77.5    51.3    14.2    21.9
Faster R-CNN + FPN (Flow) [38]     Flow             72.7    44.2    7.4     17.2
Faster R-CNN + FPN [38]            RGB+Flow         80.1    53.2    15.9    23.7
PCSC + AD [69]                     RGB+Flow         84.3    61.0    23.0    27.8
PCSC + ASD + AD + T-PCSC (Ours)    RGB+Flow         84.7    63.1    23.6    28.9

per-frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary


work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve the mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                            Streams     mAP (%)
Peng and Schmid [54]              RGB+Flow    58.5
Ye, Yang and Tian [94]            RGB+Flow    63.2
Kalogeiton et al. [26]            RGB+Flow    65.7
Hou, Chen and Shah [21]           RGB         61.3
Gu et al. [19]                    RGB+Flow    73.3
Sun et al. [72]                   RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]     RGB         64.7
Faster R-CNN + FPN (Flow) [38]    Flow        68.4
Faster R-CNN + FPN [38]           RGB+Flow    70.2
PCSC (Ours)                       RGB+Flow    74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU


Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

Method                            Streams          0.2     0.5     0.75    0.5:0.95
Gkioxari and Malik [18]           RGB+Flow         -       53.3    -       -
Wang et al. [79]                  RGB+Flow         -       56.4    -       -
Weinzaepfel et al. [83]           RGB+Flow         63.1    60.7    -       -
Saha et al. [60]                  RGB+Flow         72.6    71.5    43.3    40.0
Peng and Schmid [54]              RGB+Flow         74.1    73.1    -       -
Singh et al. [67]                 RGB+Flow         73.8    72.0    44.5    41.6
Kalogeiton et al. [26]            RGB+Flow         74.2    73.7    52.1    44.8
Ye, Yang and Tian [94]            RGB+Flow         75.8    73.8    -       -
Zolfaghari et al. [105]           RGB+Flow+Pose    78.2    73.4    -       -
Hou, Chen and Shah [21]           RGB              78.4    76.9    -       -
Gu et al. [19]                    RGB+Flow         -       78.6    -       -
Sun et al. [72]                   RGB+Flow         -       80.1    -       -
Faster R-CNN + FPN (RGB) [38]     RGB              74.5    73.9    54.1    44.2
Faster R-CNN + FPN (Flow) [38]    Flow             77.9    76.1    56.1    46.3
Faster R-CNN + FPN [38]           RGB+Flow         79.1    78.5    57.2    47.6
PCSC (Ours)                       RGB+Flow         82.6    82.2    63.1    52.8

threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage    PCSC w/o feature cooperation    PCSC
0        75.5                            78.1
1        76.1                            78.6
2        76.4                            78.8
3        76.7                            79.1
4        76.7                            79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector, while "ASD" is for the action segment detector.

Method                          Video mAP (%)
Faster R-CNN + FPN [38]         53.2
PCSC                            56.4
PCSC + AD                       61.0
PCSC + ASD (single branch)      58.4
PCSC + ASD                      60.7
PCSC + ASD + AD                 61.9
PCSC + ASD + AD + T-PCSC        63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage    Video mAP (%)
0        61.9
1        62.8
2        63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called PCSC w/o feature cooperation). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to


verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from both the RGB and the Flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method with or without the feature-level cooperation module is improved as the number of stages increases. However, such improvement seems to become saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach PCSC w/o feature cooperation at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.
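The stage-wise output described above amounts to taking the union of all detections produced so far and then suppressing duplicates. A minimal sketch of this combination step is given below, assuming boxes are stored as (x1, y1, x2, y2) arrays with one confidence score per box; the greedy NMS routine and the toy boxes are illustrative rather than the exact implementation used in the thesis.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression for (x1, y1, x2, y2) boxes."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        rest = boxes[order[1:]]
        xx1 = np.maximum(boxes[i, 0], rest[:, 0])
        yy1 = np.maximum(boxes[i, 1], rest[:, 1])
        xx2 = np.minimum(boxes[i, 2], rest[:, 2])
        yy2 = np.minimum(boxes[i, 3], rest[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (rest[:, 2] - rest[:, 0]) * (rest[:, 3] - rest[:, 1])
        iou = inter / (area_i + area_r - inter + 1e-8)
        order = order[1:][iou <= iou_threshold]
    return keep

def combine_stages(stage_outputs, iou_threshold=0.5):
    """Union the (boxes, scores) detected by the current and all previous
    stages, then apply NMS to obtain the final boxes for this stage."""
    boxes = np.concatenate([b for b, _ in stage_outputs], axis=0)
    scores = np.concatenate([s for _, s in stage_outputs], axis=0)
    keep = nms(boxes, scores, iou_threshold)
    return boxes[keep], scores[keep]

# toy example: detections produced by Stages 0-2 for one frame
rng = np.random.default_rng(0)
def toy_boxes(n):
    xy = rng.uniform(0, 80, size=(n, 2))
    wh = rng.uniform(10, 40, size=(n, 2))
    return np.hstack([xy, xy + wh])

stages = [(toy_boxes(5), rng.uniform(size=5)) for _ in range(3)]
boxes, scores = combine_stages(stages)
print(len(boxes), "boxes kept after cross-stage NMS")
```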

Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher-quality detection results at each frame, our PCSC method in the spatial domain outperforms the work


Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage of our PCSC action detector on the UCF-101-24 dataset.

Stage    LoR (%)
1        61.1
2        71.9
3        95.2
4        95.4

in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC has already been able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results could be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 obtained by using our temporal PCSC method at different stages to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +


Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage of our PCSC action detector on the UCF-101-24 dataset.

Stage    MABO (%)
1        84.1
2        84.3
3        84.5
4        84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ      Stage 1    Stage 2    Stage 3    Stage 4
0.5    55.3       55.7       56.3       57.0
0.6    62.1       63.2       64.3       65.5
0.7    68.0       68.5       69.4       69.5
0.8    73.9       74.5       74.8       75.1
0.9    78.4       78.4       79.5       80.1

Table 3.11: Detection speed in frames per second (FPS) of our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                    0      1      2      3      4
Detection speed (FPS)    3.1    2.9    2.8    2.6    2.4

Table 3.12: Detection speed in videos per second (VPS) of our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                    0      1      2
Detection speed (VPS)    2.1    1.7    1.5

ASD + AD" method, while Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using the one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method is improved as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated from the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the different streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
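The LoR statistic can be computed directly from the two proposal sets, as in the sketch below; the vectorized IoU routine and the toy proposals are illustrative assumptions.

```python
import numpy as np

def pairwise_iou(boxes_a, boxes_b):
    """IoU matrix between two sets of (x1, y1, x2, y2) boxes."""
    x1 = np.maximum(boxes_a[:, None, 0], boxes_b[None, :, 0])
    y1 = np.maximum(boxes_a[:, None, 1], boxes_b[None, :, 1])
    x2 = np.minimum(boxes_a[:, None, 2], boxes_b[None, :, 2])
    y2 = np.minimum(boxes_a[:, None, 3], boxes_b[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (boxes_a[:, 2] - boxes_a[:, 0]) * (boxes_a[:, 3] - boxes_a[:, 1])
    area_b = (boxes_b[:, 2] - boxes_b[:, 0]) * (boxes_b[:, 3] - boxes_b[:, 1])
    return inter / (area_a[:, None] + area_b[None, :] - inter + 1e-8)

def large_overlap_ratio(rgb_proposals, flow_proposals, miou_threshold=0.7):
    """LoR: fraction of RGB-stream proposals whose maximum IoU (mIoU) over
    all flow-stream proposals exceeds the threshold."""
    miou = pairwise_iou(rgb_proposals, flow_proposals).max(axis=1)
    return float((miou > miou_threshold).mean())

rgb = np.array([[10, 10, 60, 90], [100, 40, 150, 120]], dtype=float)
flow = np.array([[12, 12, 62, 92], [300, 40, 350, 120]], dtype=float)
print(large_overlap_ratio(rgb, flow))   # 0.5 in this toy example
```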

To further evaluate the improvement of our PCSC method over the stages, we report MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.


Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of the two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and of our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38], and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the optimal number of stages.


Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect the action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,


and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missing detections (see Fig. 3.4 (a)) or false detections (see Fig. 3.4 (b)). As shown in Fig. 3.4 (c), a simple combination of these two results can solve the missing detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4 (d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5 (a)), after the refinement process by using our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5 (a)) and the background frame (see Fig. 3.5 (b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our temporal boundary refinement (TBR) module can significantly decrease the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be found in Fig. 3.5 (c).

While our method can remove some falsely detected bounding boxes, as shown in Fig. 3.5 (b) and Fig. 3.5 (c), there are still some challenging cases in which our method does not work (see Fig. 3.5 (d)). Although the actionness score of the falsely detected action instance could be reduced after the temporal refinement process by using our TBR module, it could not be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot work well for this challenging case either. Therefore, how to reduce false detections from the background frames is still an open issue, which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization, but each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates and accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, as well as within each AFC module and across adjacent AFC modules. In particular, the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion, whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed to apply convolutional neural networks to solve this classification problem. It is well known that the appearance and motion clues are complementary to each other and discriminative for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined the two streams for complementary information exchange between both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, the action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos. For example, our temporal action localization method employs the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream RCNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to classify actions for each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which were then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals, and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on a given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details. However, these segments often have a low recall rate. In the second category, as the proposals are generated by the predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized end-to-end and also allows rich feature-level and segment-level interactions between action semantics extracted from these two complementary lines of works, which offers a significant performance gain to our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial to fully exploit the complementarity between the appearance and motion clues.

4.2 Our Approach

In the following sections, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with the height $H$ and the width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon the effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. This module is based on the base features $B^{RGB}$ and $B^{Flow}$, and each stream generates a set of temporal segment proposals of interest (called proposals in later text), i.e., $P^{RGB} = \{p^{RGB}_i \mid p^{RGB}_i = (t^{RGB}_{i,s}, t^{RGB}_{i,e})\}_{i=1}^{N^{RGB}}$ and $P^{Flow} = \{p^{Flow}_i \mid p^{Flow}_i = (t^{Flow}_{i,s}, t^{Flow}_{i,e})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P^{RGB}_0$ and $P^{Flow}_0$ for each stream by applying two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $S^{RGB}_0$ and $S^{Flow}_0$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q^{RGB}_0$ and $Q^{Flow}_0$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $Q^{RGB}_0$ and $Q^{Flow}_0$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
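A rough sketch of this pairing step is shown below; the probability threshold, the product-based proposal score and the duration limits are illustrative assumptions, and in practice the redundant pairs would additionally be removed by NMS as described above.

```python
import numpy as np

def frame_based_proposals(start_prob, end_prob, prob_threshold=0.5,
                          min_len=2, max_len=200):
    """Pair confident starting and ending positions into temporal proposals.
    start_prob / end_prob: per-frame probabilities from the two 1x1 conv heads.
    Returns (start, end, score) triples for every valid pairing."""
    starts = np.where(start_prob >= prob_threshold)[0]
    ends = np.where(end_prob >= prob_threshold)[0]
    proposals = []
    for s in starts:
        for e in ends:
            if min_len <= e - s <= max_len:
                proposals.append((int(s), int(e), float(start_prob[s] * end_prob[e])))
    return sorted(proposals, key=lambda p: p[2], reverse=True)

T = 100
start_prob = np.zeros(T)
end_prob = np.zeros(T)
start_prob[[10, 55]] = [0.9, 0.8]
end_prob[[40, 90]] = [0.85, 0.7]
for s, e, score in frame_based_proposals(start_prob, end_prob):
    print(s, e, round(score, 3))
```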

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals generated by the preceding two proposal generation


Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated by the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at different stages n. Note that we use the RGB-stream AFC module in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number. For better presentation, when n = 1, the initial proposals $S^{RGB}_{n-2}$ and features $P^{RGB}_{n-2}$, $Q^{Flow}_{n-2}$ are denoted as $S^{RGB}_0$, $P^{RGB}_0$ and $Q^{Flow}_0$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage n (n is an odd number) employs the preceding two-stream detected action segments $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ and the flow-stream frame-based feature $Q^{Flow}_{n-2}$ as the input features. This AFC stage outputs the RGB-stream action segments $S^{RGB}_{n}$ and anchor-based feature $P^{RGB}_{n}$, and updates the flow-stream frame-based features as $Q^{Flow}_{n}$. These outputs will be used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage which will be discussed in Sec 423 we in-troduce a set of feature-level message passing and proposal-level co-operation strategies to effectively integrate the complementary knowl-edge from the anchor-based branch to the frame-based branch throughldquoCross-granularity Message Passing and then back to the anchor-basedbranch We argue that within each AFC stage this particular structurewill seamlessly encourage collaboration through ldquoProposal Combina-tion (see Sec 423 for more details) by exploiting cross-granularity andcross-stream complementarty at both the feature and proposal levels aswell as between two temporal action localization schemes

Since the rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-frame Cooperation Module

The Anchor-frame Cooperation (AFC) module is based on the intuition that complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods would not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., n = 1), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P^{RGB}_0$, $P^{Flow}_0$ and $Q^{RGB}_0$, $Q^{Flow}_0$, as well as the proposals $S^{RGB}_0$, $S^{Flow}_0$, as the input, as shown in Sec. 4.2.1 and Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). Take the RGB-stream AFC module as an example; the network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P^{Flow}_{n-1}$ to the frame-based features $Q^{Flow}_{n-2}$, both from the flow stream. The message passing operation is defined as

$$Q^{Flow}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{Flow}_{n-1}) \oplus Q^{Flow}_{n-2}, \qquad (4.1)$$

where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\cdot)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q^{Flow}_n$ can produce more accurate action starting and ending probability distributions (i.e., more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can therefore generate better frame-based proposals $\mathcal{Q}^{Flow}_n$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q^{RGB}_n = \mathcal{M}_{anchor \rightarrow frame}(P^{RGB}_{n-1}) \oplus Q^{RGB}_{n-2}$, which flips the superscripts from Flow to RGB. $Q^{RGB}_n$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}^{RGB}_n$.
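A minimal PyTorch sketch of the message passing function $\mathcal{M}_{anchor \rightarrow frame}(\cdot)$ in Eq. (4.1) is given below. The channel size, the (N, C, T) feature layout and the placement of the ReLU activations are illustrative assumptions.

```python
# Cross-granularity message passing: two stacked 1x1 convolutions with ReLU,
# whose output is added element-wise to the frame-based features.
import torch
import torch.nn as nn

class CrossGranularityMessagePassing(nn.Module):
    def __init__(self, channels=256):
        super().__init__()
        # 1x1 convolutions applied along the temporal axis (features: N x C x T)
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, p_anchor, q_frame):
        """p_anchor: anchor-based features P (N x C x T);
        q_frame: frame-based features Q (N x C x T).
        Returns the improved frame-based features M(P) + Q."""
        return self.transform(p_anchor) + q_frame

# usage sketch
mp = CrossGranularityMessagePassing(channels=256)
p_flow = torch.randn(2, 256, 100)   # anchor-based flow features
q_flow = torch.randn(2, 256, 100)   # frame-based flow features
q_flow_new = mp(p_flow, q_flow)     # improved frame-based features
```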

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again take the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}^{Flow}_n$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q^{Flow}_n$. We then combine $\mathcal{Q}^{Flow}_n$ with the two-stream anchor-based proposals $S^{Flow}_{n-1}$ and $S^{RGB}_{n-2}$ to form an updated set of anchor-based proposals at stage n, i.e., $\mathcal{P}^{RGB}_n = S^{RGB}_{n-2} \cup S^{Flow}_{n-1} \cup \mathcal{Q}^{Flow}_n$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}^{Flow}_n = S^{Flow}_{n-2} \cup S^{RGB}_{n-1} \cup \mathcal{Q}^{RGB}_n$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated by a message passing operation similar to Equation (4.1), i.e., $P^{RGB}_n = \mathcal{M}_{flow \rightarrow rgb}(P^{Flow}_{n-1}) \oplus P^{RGB}_{n-2}$ for the RGB stream and $P^{Flow}_n = \mathcal{M}_{rgb \rightarrow flow}(P^{RGB}_{n-1}) \oplus P^{Flow}_{n-2}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}^{RGB}_n$, $\mathcal{P}^{Flow}_n$, and the updated anchor-based features $P^{RGB}_n$, $P^{Flow}_n$. To increase the perceptual aperture around action boundaries, we add two more SOI pooling operations around the starting and ending indices of the proposals with a given interval (fixed to 10 in this paper), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which additionally provides performance gains independent of our cooperation schemes.
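The boundary-augmented pooling can be sketched as follows. This sketch replaces the SOI pooling of [5] with simple temporal averaging for illustration; the 10-frame boundary interval follows the text, while the (T, C) feature layout is an assumption.

```python
# Pool a proposal feature from the segment interval plus two extra windows
# around the starting and ending indices, then concatenate the three vectors.
import torch

def soi_pool(features, start, end):
    """features: (T, C) temporal feature map; average over [start, end)."""
    T = features.shape[0]
    start = max(0, min(int(start), T - 1))
    end = max(start + 1, min(int(end), T))
    return features[start:end].mean(dim=0)

def proposal_feature(features, start, end, boundary_interval=10):
    center = soi_pool(features, start, end)
    left = soi_pool(features, start - boundary_interval, start + boundary_interval)
    right = soi_pool(features, end - boundary_interval, end + boundary_interval)
    # fed to the segment regressor and the action classifier in the detection head
    return torch.cat([center, left, right], dim=0)
```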

4.2.4 Training Details

At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: a cross-entropy loss for the action classification task and a smooth-L1 loss for the boundary regression task. The training loss in each frame-based branch is a cross-entropy loss on the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
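A minimal PyTorch sketch of the per-stage training loss is given below; the tensor shapes and the equal weighting of the loss terms are illustrative assumptions.

```python
# Per-stage loss: classification + boundary regression (anchor-based branch)
# and frame-wise starting/ending cross-entropy (frame-based branch).
import torch
import torch.nn.functional as F

def afc_stage_loss(cls_logits, cls_labels,        # (P, num_classes), (P,)
                   reg_pred, reg_target,          # (P, 2), (P, 2) boundary offsets
                   start_logits, start_labels,    # (T, 2), (T,) frame-wise labels
                   end_logits, end_labels):       # (T, 2), (T,)
    loss_cls = F.cross_entropy(cls_logits, cls_labels)
    loss_reg = F.smooth_l1_loss(reg_pred, reg_target)
    loss_frame = F.cross_entropy(start_logits, start_labels) + \
                 F.cross_entropy(end_logits, end_labels)
    # losses from all AFC stages are summed and the model is trained jointly
    return loss_cls + loss_reg + loss_frame
```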

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework and is thus, to some extent, similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, whereas the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and higher recall rates. Thus, the final localization results are more likely to capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features into lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block for fusing the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features, as cross-stream information and cross-granularity knowledge are gradually absorbed, which significantly boosts the temporal action localization performance. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while preserving highly confident temporal segments. The number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use 200 temporally annotated videos in the validation set as the training samples and 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation for the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pretrained on Kinetics as the feature extractor to extract the base features. For training, we set the mini-batch size to 256 for the anchor-based proposal generation network and 64 for the detection head, respectively. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.

Evaluation Metrics. We evaluate our model by mean average precision (mAP). A detected action segment or action tube is considered correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our method PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between appearance and motion clues. When using the tIoU threshold δ = 0.5,


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU thresh. δ      0.1    0.2    0.3    0.4    0.5
SCNN [63]           47.7   43.5   36.3   28.7   19.0
REINFORCE [95]      48.9   44.0   36.0   26.4   17.1
CDC [62]            -      -      40.1   29.4   23.3
SMS [98]            51.0   45.2   36.5   27.8   17.8
TURN [15]           54.0   50.9   44.1   34.9   25.6
R-C3D [85]          54.5   51.5   44.8   35.6   28.9
SSN [103]           66.0   59.4   51.9   41.0   29.8
BSN [37]            -      -      53.5   45.0   36.9
MGG [46]            -      -      53.9   46.8   37.4
BMN [35]            -      -      56.0   47.4   38.8
TAL-Net [5]         59.8   57.1   53.2   48.5   42.8
P-GCN [99]          69.5   67.8   63.6   57.8   49.1
Ours                71.2   68.9   65.1   59.5   51.2

our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework, which fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot fully capture the complementary information between these two methods (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and


Table 4.2: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]            45.30   26.00   0.20    23.80
TCN [9]             36.44   21.15   3.90    -
R-C3D [85]          26.80   -       -       -
SSN [103]           39.12   23.48   5.49    23.98
TAL-Net [5]         38.23   18.30   1.30    20.22
P-GCN [99]          42.90   28.14   2.47    26.99
Ours                44.31   29.85   5.47    28.85

Table 4.3: Comparison (mAP %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]            46.45   29.96   8.02    30.03
P-GCN [99]          48.26   33.16   3.27    31.11
BMN [35]            50.07   34.78   8.29    33.85
Ours                52.04   35.92   7.97    34.91

BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN rely on extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using extra external labels as in [35] and proposals from BMN, yielding an average mAP of 34.91%, which is better than both BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAP %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU thresh. δ    0.2    0.5    0.75   0.5:0.95
LTT [83]            46.8   -      -      -
MRTSR [54]          73.5   32.1   2.7    7.3
MSAT [60]           66.6   36.4   7.9    14.4
ROAD [67]           73.5   46.3   15.0   20.4
ACT [26]            76.5   49.2   19.7   23.4
TAD [19]            -      59.9   -      -
PCSC [69]           84.3   61.0   23.0   27.8
Ours                84.9   63.5   23.7   29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performances of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which $\mathcal{Q}^{Flow}_n$ and $\mathcal{Q}^{RGB}_n$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features into lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P^{Flow}_{n-1}$ and $P^{RGB}_{n-2}$ (for the RGB-stream AFC module) and $P^{RGB}_{n-1}$ and $P^{Flow}_{n-2}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions for the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example and report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

Table 4.5: Ablation study (mAP %) among different AFC-related variants at different training stages on the THUMOS14 dataset.

Stage        0       1       2       3       4
AFC          44.75   46.92   48.14   48.37   48.22
w/o CS       44.21   44.96   45.47   45.48   45.37
w/o CG       43.55   45.95   46.71   46.65   46.67
w/o CS-MP    44.75   45.34   45.51   45.53   45.49
w/o CG-MP    44.75   46.32   46.97   46.85   46.79
w/o PC       42.14   43.21   45.14   45.74   45.77

In Table 4.5, the results for stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $\mathcal{P}^{RGB}_0 \cup \mathcal{P}^{Flow}_0$ for "w/o CG" and $\mathcal{Q}^{RGB}_0 \cup \mathcal{Q}^{Flow}_0$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results as the number of stages increases, since these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods generally improve as the number of stages increases, and the results become stable after 2 or 3 stages.

Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                                      LoR (%)
1       $\mathcal{P}^{RGB}_1$ vs. $S^{Flow}_0$          67.1
2       $\mathcal{P}^{Flow}_2$ vs. $\mathcal{P}^{RGB}_1$   79.2
3       $\mathcal{P}^{RGB}_3$ vs. $\mathcal{P}^{Flow}_2$   89.5
4       $\mathcal{P}^{Flow}_4$ vs. $\mathcal{P}^{RGB}_3$   97.1

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets $\mathcal{P}^{RGB}_1$ at Stage 1 and $\mathcal{P}^{Flow}_2$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $\mathcal{P}^{Flow}_2$ among all proposals in $\mathcal{P}^{Flow}_2$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $\mathcal{P}^{RGB}_1$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $\mathcal{P}^{Flow}_2$) is calculated by comparing it with all proposals in the other proposal set (say $\mathcal{P}^{RGB}_1$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
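The LoR computation can be sketched as follows; proposals are assumed to be (start, end) pairs, and the 0.7 threshold follows the text.

```python
# LoR: fraction of proposals in the current set whose max-tIoU with the
# preceding proposal set exceeds the threshold, reported as a percentage.
def temporal_iou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current_proposals, preceding_proposals, thresh=0.7):
    if not current_proposals:
        return 0.0
    similar = 0
    for p in current_proposals:
        max_tiou = max((temporal_iou(p, q) for q in preceding_proposals), default=0.0)
        if max_tiou > thresh:
            similar += 1
    return 100.0 * similar / len(current_proposals)
```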

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities between the RGB-stream proposals and the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities between the flow-stream proposals and the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $\mathcal{P}^{Flow}_4$ and those in $\mathcal{P}^{RGB}_3$ are very similar even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" and "AB" methods to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, (d) AR@500.



Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period between the 95.4-th second and the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by the anchor-based method [5], the frame-based method [37], and our PCG-TAL method at different numbers of stages. Although the anchor-based method successfully captures the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. In contrast, our PCG-TAL method takes advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment, and eventually generates the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module that takes advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component of our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our


newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from Regions of Interest (RoIs) in either images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals, and the proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks that do not use a pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et


al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals in each frame, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopts a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. However, this work [76] also needs to first generate the tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detector. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video V with KT frames as a set of non-overlapping video clips, namely $V = \{V^{clip}_k\}_{k=1}^{K}$, where $V^{clip}_k$ indicates the k-th video clip consisting of T frames and K is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the n-th word in the description S and N is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the t-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube B.


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


In this task, it is required to perform both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking the pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, then produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding boxes $b_t$ for the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with H, W and C indicating the height, width and the number of channels of the feature map, respectively, and is employed as the extracted visual feature. For the k-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the n-th textual input token.

Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens E, we develop an improved ViLBERT module called ST-ViLBERT to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space is lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT, as it assumes the bounding boxes are generated by pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules of the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations to the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$. We then pass the combined visual feature to the textual branch, where it is used as the key and value in



Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed into the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are respectively replicated HW and T times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the outputs from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F_{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
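A minimal PyTorch sketch of the STCD module is given below. The batch handling, the number of attention heads and the use of LayerNorm follow common transformer practice and are assumptions; only the pooling, combination, decomposition and replication steps mirror the description above.

```python
# Spatio-temporal Combination and Decomposition (STCD) sketch.
import torch
import torch.nn as nn

class STCD(nn.Module):
    def __init__(self, dim=768, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual, textual):
        """visual: (B, T, HW, C) visual output of the previous layer;
        textual: (B, N, C) textual output of the previous layer.
        Returns the intermediate visual feature (B, T, HW, C) and the
        combined visual feature (B, T+HW, C) sent to the textual branch."""
        B, T, HW, C = visual.shape
        temp_feat = visual.mean(dim=2)                    # spatial pooling  -> (B, T, C)
        spat_feat = visual.mean(dim=1)                    # temporal pooling -> (B, HW, C)
        combined = torch.cat([temp_feat, spat_feat], 1)   # (B, T+HW, C)

        # text-guided attention: combined visual features attend to the textual tokens
        guided, _ = self.attn(combined, textual, textual)  # (B, T+HW, C)

        # decompose and replicate back to the full spatio-temporal resolution
        guided_temp = guided[:, :T].unsqueeze(2).expand(B, T, HW, C)
        guided_spat = guided[:, T:].unsqueeze(1).expand(B, T, HW, C)

        intermediate = self.norm(visual + guided_temp + guided_spat)
        return intermediate, combined
```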

Spatial Localization. Given the text-guided visual feature $F_{tv}$, our SVG-net predicts the bounding box $b_t$ at each frame. We first reshape $F_{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and bottom-right coordinates of the predicted bounding box.
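The bounding box read-out from the two detection heads can be sketched as follows; tensor shapes are illustrative assumptions.

```python
# Decode one bounding box from the center heatmap and the size map of a frame.
import torch

def decode_bbox(heatmap, size_map):
    """heatmap: (8H, 8W) center probabilities for one frame;
    size_map: (2, 8H, 8W) predicted (width, height) at every location."""
    H, W = heatmap.shape
    idx = torch.argmax(heatmap)
    cy, cx = (idx // W).item(), (idx % W).item()
    w, h = size_map[0, cy, cx].item(), size_map[1, cy, cx].item()
    x1, y1 = cx - w / 2.0, cy - h / 2.0   # top-left corner
    x2, y2 = cx + w / 2.0, cy + h / 2.0   # bottom-right corner
    return x1, y1, x2, y2
```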

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F_{tv}$ to produce the global text-guided visual feature $F_{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with T frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the C-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F_{gtv}$ and the intermediate textual feature by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}_{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all KT frames in the whole video, and apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold: the bounding boxes with smoothed relevance scores less than the threshold are removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B_{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
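The temporal part of the initial tube generation can be sketched as follows. The median filter window (9) and threshold (0.3) are the values reported in Sec. 5.3.1; the run-selection logic is written out only for illustration.

```python
# Smooth the per-frame relevance scores, threshold them, and keep the longest
# surviving run of frames as the initial temporal boundaries (t_s^0, t_e^0).
import numpy as np
from scipy.signal import medfilt

def initial_boundaries(relevance_scores, window=9, thresh=0.3):
    """relevance_scores: (KT,) relevance score of every frame in the video."""
    smoothed = medfilt(np.asarray(relevance_scores, dtype=np.float32), kernel_size=window)
    keep = smoothed >= thresh
    best, start = None, None
    for t, k in enumerate(keep):
        if k and start is None:
            start = t
        if (not k or t == len(keep) - 1) and start is not None:
            end = t if k else t - 1
            if best is None or end - start > best[1] - best[0]:
                best = (start, end)
            start = None
    return best  # (t_s^0, t_e^0), or None if no frame passes the threshold
```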

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with T consecutive frames from each input video, and then select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below, we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height, as well as the relevance label, of the ground-truth bounding box as $(\bar{x}, \bar{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A$ for each frame by using the Gaussian kernel $a_{xy} = \exp\!\left(-\frac{(x-\bar{x})^2 + (y-\bar{y})^2}{2\sigma^2}\right)$, where $a_{xy}$ is the value of $A \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$ and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(\hat{o}_{obj}, o) + \lambda_3 L_{size}(\hat{w}_{xy}, \hat{h}_{xy}, w, h), \qquad (5.1)$$

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $\hat{o}_{obj}$ is the predicted relevance score, and $\hat{w}_{xy}$ and $\hat{h}_{xy}$ are the predicted width and height of the bounding box centered at the position $(\bar{x}, \bar{y})$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary Material for further details.
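The ground-truth center heatmap generation can be sketched as follows. In our method the size-adaptive rule for σ follows [28]; the rule used here (one sixth of the shorter box side) is only an assumption for illustration.

```python
# Gaussian center heatmap target: a_xy = exp(-((x-cx)^2+(y-cy)^2)/(2*sigma^2)).
import numpy as np

def center_heatmap(height, width, cx, cy, box_w, box_h):
    """Returns A in R^{height x width} centred at the ground-truth box centre (cx, cy)."""
    sigma = max(1.0, min(box_w, box_h) / 6.0)     # assumed size-adaptive bandwidth
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))
```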

5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate. While the frame-based methods can accurately determine the boundaries in local starting/ending regions, they may falsely detect temporal segments, because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B_{init}$) is used as an anchor, and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries


Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


$t_s$ and $t_e$. The frame-based branch then continues to refine the boundaries $t_s$ and $t_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch can then be used as a better anchor for the anchor-based branch at the next stage. This iterative process repeats until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries and predict the offsets.

Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before the starting frame and after the ending frame by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4, t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features at the 20 sampled frames from the pooled video feature $F_{pool}$ along the temporal dimension, where the pooled video feature $F_{pool}$ is generated by stacking the spatially pooled features from all frames within the video along the temporal dimension.

The anchor feature $F$, together with the textual input tokens E, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, on which the average pooling operation is applied to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $t^1_s$ and $t^1_e$.

Frame-based branch. For the frame-based branch, we extract the feature of each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $t^1_s$ generated by the previous anchor-based branch, we define the starting duration as $D_s = [t^1_s - D/2 + 1, t^1_s + D/2]$, where D is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F_{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens E as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We select the frame with the highest probability as the refined starting frame position; accordingly, we can produce the refined ending frame position. Note that the kernel size of the 1D convolution operations is 1, and the parameters of the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions are generated, they can be used as the input to the anchor-based branch at the next stage.
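A minimal PyTorch sketch of the frame-based refinement step is given below. The window length and the feature dimension are illustrative assumptions, and the text-guided frame-level features would come from the ViLBERT module in the full model.

```python
# Score each frame inside a local window around the current boundary with a
# kernel-size-1 1D convolution, and take the argmax as the refined boundary.
import torch
import torch.nn as nn

class FrameBoundaryHead(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # separate (non-shared) 1D convolutions for the starting and ending boundaries
        self.start_conv = nn.Conv1d(dim, 1, kernel_size=1)
        self.end_conv = nn.Conv1d(dim, 1, kernel_size=1)

    def refine(self, pooled_video, t_start, t_end, window=16):
        """pooled_video: (KT, C) spatially pooled per-frame features of the whole video."""
        def local_argmax(conv, center):
            lo = max(0, center - window // 2 + 1)
            hi = min(pooled_video.shape[0], center + window // 2)
            local = pooled_video[lo:hi].t().unsqueeze(0)        # (1, C, D)
            probs = torch.sigmoid(conv(local)).squeeze()        # (D,)
            return lo + int(torch.argmax(probs).item())
        return local_argmax(self.start_conv, t_start), local_argmax(self.end_conv, t_end)
```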

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$$L_{TBR} = \lambda_4 L_{anc}(\Delta\hat{t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e), \qquad (5.2)$$

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and ending frames, respectively. In $L_{anc}$, $\Delta\hat{t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


-VidSTG: This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences, describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

-HC-STVG: This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. The dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. The dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold $\theta_p$ and the temporal length T of each clip are set to 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented in PyTorch on a machine with a single GTX 1080Ti GPU.

Evaluation metrics. We follow [102] and use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|}\sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame t, the set $S_I$ contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
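The vIoU-based metrics can be sketched as follows; tubes are assumed to be dictionaries mapping frame indexes to bounding boxes.

```python
# vIoU, m_vIoU and vIoU@R for spatio-temporal tubes.
def box_iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def viou(pred_tube, gt_tube):
    """Tubes: dict {frame_index: (x1, y1, x2, y2)}."""
    s_i = set(pred_tube) & set(gt_tube)     # frames present in both tubes
    s_u = set(pred_tube) | set(gt_tube)     # frames present in either tube
    return sum(box_iou(pred_tube[t], gt_tube[t]) for t in s_i) / max(len(s_u), 1)

def m_viou_and_recall(pairs, r=0.5):
    """pairs: list of (pred_tube, gt_tube); returns (m_vIoU, vIoU@R)."""
    scores = [viou(p, g) for p, g in pairs]
    return sum(scores) / len(scores), sum(s > r for s in scores) / len(scores)
```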

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours): In our proposed SVG-net, the key part is ST-ViLBERT. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

- SVG-net (ours): As the first step of our proposed framework, the SVG-net can predict the initial tube. Here we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.

5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2, respectively. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed Temporal Boundary Refinement method can further boost the performance. When compared with the initial tube generation results from our SVG-net, we can further achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much higher than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset contains many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset. Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual features in our ST-ViLBERT.

Table 5.1: Results of different methods on the VidSTG dataset. "∗" indicates the results are quoted from its original work.

                          | Declarative Sentence Grounding            | Interrogative Sentence Grounding
Method                    | m_vIoU (%)  vIoU@0.3 (%)  vIoU@0.5 (%)    | m_vIoU (%)  vIoU@0.3 (%)  vIoU@0.5 (%)
--------------------------+-------------------------------------------+------------------------------------------
STGRN [102]               |   19.75        25.77         14.60        |   18.32        21.10         12.83
Vanilla (ours)            |   21.49        27.69         15.77        |   20.22        23.17         13.82
SVG-net (ours)            |   22.87        28.31         16.31        |   21.57        24.09         14.33
SVG-net + TBR-net (ours)  |   25.21        30.99         18.21        |   23.67        26.26         16.01

Table 5.2: Results of different methods on the HC-STVG dataset. "∗" indicates the results are quoted from the original work.

Method                    | m_vIoU | vIoU@0.3 | vIoU@0.5
--------------------------+--------+----------+---------
STGVT [76]                | 18.15  |  26.81   |   9.48
Vanilla (ours)            | 17.94  |  25.59   |   8.75
SVG-net (ours)            | 19.45  |  27.78   |  10.12
SVG-net + TBR-net (ours)  | 22.41  |  30.31   |  12.07


Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

From the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net can bring more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3. (4) Our full model SVG-net + TBR-net outperforms the two alternative methods, SVG-net + IFB-net and SVG-net + IAB-net, and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

Table 5.3: Results of our method and its variants when using different number of stages on the VidSTG dataset.

                    |        | Declarative Sentence Grounding    | Interrogative Sentence Grounding
Methods             | Stages | m_vIoU  vIoU@0.3  vIoU@0.5        | m_vIoU  vIoU@0.3  vIoU@0.5
--------------------+--------+-----------------------------------+---------------------------------
SVG-net             |   -    | 22.87    28.31     16.31          | 21.57    24.09     14.33
SVG-net + IFB-net   |   1    | 23.41    28.84     16.69          | 22.12    24.48     14.57
                    |   2    | 23.82    28.97     16.85          | 22.55    24.69     14.71
                    |   3    | 23.77    28.91     16.87          | 22.61    24.72     14.79
SVG-net + IAB-net   |   1    | 23.27    29.57     16.39          | 21.78    24.93     14.31
                    |   2    | 23.74    29.98     16.52          | 22.21    25.42     14.59
                    |   3    | 23.85    30.15     16.57          | 22.43    25.71     14.47
SVG-net + TBR-net   |   1    | 24.32    29.92     17.22          | 22.69    25.37     15.29
                    |   2    | 24.95    30.75     17.02          | 23.34    26.03     15.83
                    |   3    | 25.21    30.99     18.21          | 23.67    26.26     16.01

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, so learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our work and discuss potential directions of video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows:

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., the RGB/flow stream) by leveraging the information from another stream (i.e., the flow/RGB stream) at both the region proposal level and the feature level.

The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer to produce spatio-temporal object tubes for a given query sentence, which consists of SVG-net and TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential future direction for video localization is to unify spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.

Bibliography

[1] L Anne Hendricks O Wang E Shechtman J Sivic T Darrelland B Russell ldquoLocalizing moments in video with natural lan-guagerdquo In Proceedings of the IEEE international conference on com-puter vision 2017 pp 5803ndash5812

[2] A Blum and T Mitchell ldquoCombining Labeled and Unlabeled Datawith Co-trainingrdquo In Proceedings of the Eleventh Annual Conferenceon Computational Learning Theory COLTrsquo 98 Madison WisconsinUSA ACM 1998 pp 92ndash100

[3] F Caba Heilbron V Escorcia B Ghanem and J Carlos NieblesldquoActivitynet A large-scale video benchmark for human activityunderstandingrdquo In Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition 2015 pp 961ndash970

[4] J Carreira and A Zisserman ldquoQuo vadis action recognition anew model and the kinetics datasetrdquo In proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 6299ndash6308

[5] Y-W Chao S Vijayanarasimhan B Seybold D A Ross J Dengand R Sukthankar ldquoRethinking the faster r-cnn architecture fortemporal action localizationrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2018 pp 1130ndash1139

[6] S Chen and Y-G Jiang ldquoSemantic proposal for activity localiza-tion in videos via sentence queryrdquo In Proceedings of the AAAI Con-ference on Artificial Intelligence Vol 33 2019 pp 8199ndash8206

[7] Z Chen L Ma W Luo and K-Y Wong ldquoWeakly-SupervisedSpatio-Temporally Grounding Natural Sentence in Videordquo In Pro-ceedings of the 57th Annual Meeting of the Association for Computa-tional Linguistics 2019 pp 1884ndash1894

[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.

[9] X Dai B Singh G Zhang L S Davis and Y Qiu Chen ldquoTempo-ral context network for activity localization in videosrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 5793ndash5802

[10] A Dave O Russakovsky and D Ramanan ldquoPredictive-correctivenetworks for action detectionrdquo In Proceedings of the IEEE Confer-ence on Computer Vision and Pattern Recognition 2017 pp 981ndash990

[11] J Deng W Dong R Socher L-J Li K Li and L Fei-Fei ldquoIma-genet A large-scale hierarchical image databaserdquo In Proceedingsof the IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2009 pp 248ndash255

[12] C Feichtenhofer A Pinz and A Zisserman ldquoConvolutional two-stream network fusion for video action recognitionrdquo In Proceed-ings of the IEEE conference on computer vision and pattern recognition2016 pp 1933ndash1941

[13] J Gao K Chen and R Nevatia ldquoCtap Complementary temporalaction proposal generationrdquo In Proceedings of the European Confer-ence on Computer Vision (ECCV) 2018 pp 68ndash83

[14] J Gao C Sun Z Yang and R Nevatia ldquoTall Temporal activitylocalization via language queryrdquo In Proceedings of the IEEE inter-national conference on computer vision 2017 pp 5267ndash5275

[15] J Gao Z Yang K Chen C Sun and R Nevatia ldquoTurn tap Tem-poral unit regression network for temporal action proposalsrdquo InProceedings of the IEEE International Conference on Computer Vision2017 pp 3628ndash3636

[16] R Girshick ldquoFast R-CNNrdquo In Proceedings of the IEEE InternationalConference on Computer Vision (ICCV) 2015 pp 1440ndash1448

[17] R Girshick J Donahue T Darrell and J Malik ldquoRich featurehierarchies for accurate object detection and semantic segmenta-tionrdquo In Proceedings of the IEEE conference on computer vision andpattern recognition 2014 pp 580ndash587

[18] G Gkioxari and J Malik ldquoFinding action tubesrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2015

[19] C Gu C Sun D A Ross C Vondrick C Pantofaru Y Li S Vi-jayanarasimhan G Toderici S Ricco R Sukthankar et al ldquoAVAA video dataset of spatio-temporally localized atomic visual ac-tionsrdquo In Proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2018 pp 6047ndash6056

[20] K He X Zhang S Ren and J Sun ldquoDeep residual learning forimage recognitionrdquo In Proceedings of the IEEE conference on com-puter vision and pattern recognition 2016 pp 770ndash778

[21] R Hou C Chen and M Shah ldquoTube Convolutional Neural Net-work (T-CNN) for Action Detection in Videosrdquo In The IEEE In-ternational Conference on Computer Vision (ICCV) 2017

[22] G Huang Z Liu L Van Der Maaten and K Q WeinbergerldquoDensely connected convolutional networksrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2017pp 4700ndash4708

[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.

[24] H Jhuang J Gall S Zuffi C Schmid and M J Black ldquoTowardsunderstanding action recognitionrdquo In Proceedings of the IEEE in-ternational conference on computer vision 2013 pp 3192ndash3199

[25] Y-G Jiang J Liu A R Zamir G Toderici I Laptev M Shahand R Sukthankar THUMOS challenge Action recognition with alarge number of classes 2014

[26] V Kalogeiton P Weinzaepfel V Ferrari and C Schmid ldquoActiontubelet detector for spatio-temporal action localizationrdquo In Pro-ceedings of the IEEE International Conference on Computer Vision2017 pp 4405ndash4413

[27] A Krizhevsky I Sutskever and G E Hinton ldquoImagenet classifi-cation with deep convolutional neural networksrdquo In Advances inneural information processing systems 2012 pp 1097ndash1105

[28] H Law and J Deng ldquoCornernet Detecting objects as paired key-pointsrdquo In Proceedings of the European Conference on Computer Vi-sion (ECCV) 2018 pp 734ndash750

[29] C Lea M D Flynn R Vidal A Reiter and G D Hager ldquoTem-poral convolutional networks for action segmentation and detec-tionrdquo In proceedings of the IEEE Conference on Computer Vision andPattern Recognition 2017 pp 156ndash165

[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).

[31] D Li Z Qiu Q Dai T Yao and T Mei ldquoRecurrent tubelet pro-posal and recognition networks for action detectionrdquo In Proceed-ings of the European conference on computer vision (ECCV) 2018pp 303ndash318

[32] G Li N Duan Y Fang M Gong D Jiang and M Zhou ldquoUnicoder-VL A Universal Encoder for Vision and Language by Cross-ModalPre-Trainingrdquo In AAAI 2020 pp 11336ndash11344

[33] L H Li M Yatskar D Yin C-J Hsieh and K-W Chang ldquoVisu-alBERT A Simple and Performant Baseline for Vision and Lan-guagerdquo In Arxiv 2019

[34] Y Liao S Liu G Li F Wang Y Chen C Qian and B Li ldquoA Real-Time Cross-modality Correlation Filtering Method for ReferringExpression Comprehensionrdquo In Proceedings of the IEEECVF Con-ference on Computer Vision and Pattern Recognition 2020 pp 10880ndash10889

[35] T Lin X Liu X Li E Ding and S Wen ldquoBMN Boundary-MatchingNetwork for Temporal Action Proposal Generationrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 3889ndash3898

[36] T Lin X Zhao and Z Shou ldquoSingle shot temporal action de-tectionrdquo In Proceedings of the 25th ACM international conference onMultimedia 2017 pp 988ndash996

[37] T Lin X Zhao H Su C Wang and M Yang ldquoBsn Boundarysensitive network for temporal action proposal generationrdquo In

Proceedings of the European Conference on Computer Vision (ECCV)2018 pp 3ndash19

[38] T-Y Lin P Dollar R Girshick K He B Hariharan and S Be-longie ldquoFeature Pyramid Networks for Object Detectionrdquo In TheIEEE Conference on Computer Vision and Pattern Recognition (CVPR)2017

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer, 2014, pp. 740–755.

[41] D Liu H Zhang F Wu and Z-J Zha ldquoLearning to assembleneural module tree networks for visual groundingrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2019pp 4673ndash4682

[42] J Liu L Wang and M-H Yang ldquoReferring expression genera-tion and comprehension via attributesrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 4856ndash4864

[43] L Liu H Wang G Li W Ouyang and L Lin ldquoCrowd countingusing deep recurrent spatial-aware networkrdquo In Proceedings ofthe 27th International Joint Conference on Artificial Intelligence AAAIPress 2018 pp 849ndash855

[44] W Liu D Anguelov D Erhan C Szegedy S Reed C-Y Fu andA C Berg ldquoSsd Single shot multibox detectorrdquo In European con-ference on computer vision Springer 2016 pp 21ndash37

[45] X Liu Z Wang J Shao X Wang and H Li ldquoImproving re-ferring expression grounding with cross-modal attention-guidederasingrdquo In Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition 2019 pp 1950ndash1959

[46] Y Liu L Ma Y Zhang W Liu and S-F Chang ldquoMulti-granularityGenerator for Temporal Action Proposalrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2019pp 3604ndash3613

[47] J Lu D Batra D Parikh and S Lee ldquoVilbert Pretraining task-agnostic visiolinguistic representations for vision-and-languagetasksrdquo In Advances in Neural Information Processing Systems 2019pp 13ndash23

[48] G Luo Y Zhou X Sun L Cao C Wu C Deng and R Ji ldquoMulti-task Collaborative Network for Joint Referring Expression Com-prehension and Segmentationrdquo In Proceedings of the IEEECVFConference on Computer Vision and Pattern Recognition 2020 pp 10034ndash10043

[49] S Ma L Sigal and S Sclaroff ldquoLearning activity progression inlstms for activity detection and early detectionrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2016 pp 1942ndash1950

[50] B Ni X Yang and S Gao ldquoProgressively parsing interactionalobjects for fine grained action detectionrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1020ndash1028

[51] W Ouyang K Wang X Zhu and X Wang ldquoChained cascadenetwork for object detectionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2017 pp 1938ndash1946

[52] W Ouyang and X Wang ldquoSingle-pedestrian detection aided bymulti-pedestrian detectionrdquo In Proceedings of the IEEE Conferenceon Computer Vision and Pattern Recognition 2013 pp 3198ndash3205

[53] W Ouyang X Zeng X Wang S Qiu P Luo Y Tian H Li SYang Z Wang H Li et al ldquoDeepID-Net Object detection withdeformable part based convolutional neural networksrdquo In IEEETransactions on Pattern Analysis and Machine Intelligence 397 (2017)pp 1320ndash1334

[54] X Peng and C Schmid ldquoMulti-region two-stream R-CNN for ac-tion detectionrdquo In European Conference on Computer Vision Springer2016 pp 744ndash759

[55] L Pigou A van den Oord S Dieleman M V Herreweghe andJ Dambre ldquoBeyond Temporal Pooling Recurrence and TemporalConvolutions for Gesture Recognition in Videordquo In InternationalJournal of Computer Vision 1262-4 (2018) pp 430ndash439

[56] Z Qiu T Yao and T Mei ldquoLearning spatio-temporal represen-tation with pseudo-3d residual networksrdquo In proceedings of theIEEE International Conference on Computer Vision 2017 pp 5533ndash5541

[57] J Redmon S Divvala R Girshick and A Farhadi ldquoYou onlylook once Unified real-time object detectionrdquo In Proceedings ofthe IEEE conference on computer vision and pattern recognition 2016pp 779ndash788

[58] S Ren K He R Girshick and J Sun ldquoFaster r-cnn Towardsreal-time object detection with region proposal networksrdquo In Ad-vances in neural information processing systems 2015 pp 91ndash99

[59] A Sadhu K Chen and R Nevatia ldquoVideo Object Grounding us-ing Semantic Roles in Language Descriptionrdquo In Proceedings ofthe IEEECVF Conference on Computer Vision and Pattern Recogni-tion 2020 pp 10417ndash10427

[60] S Saha G Singh M Sapienza P H S Torr and F CuzzolinldquoDeep Learning for Detecting Multiple Space-Time Action Tubesin Videosrdquo In Proceedings of the British Machine Vision Conference2016 BMVC 2016 York UK September 19-22 2016 2016

[61] P Sharma N Ding S Goodman and R Soricut ldquoConceptualcaptions A cleaned hypernymed image alt-text dataset for auto-matic image captioningrdquo In Proceedings of the 56th Annual Meet-ing of the Association for Computational Linguistics (Volume 1 LongPapers) 2018 pp 2556ndash2565

[62] Z Shou J Chan A Zareian K Miyazawa and S-F Chang ldquoCdcConvolutional-de-convolutional networks for precise temporal ac-tion localization in untrimmed videosrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 5734ndash5743

[63] Z Shou D Wang and S-F Chang ldquoTemporal action localizationin untrimmed videos via multi-stage cnnsrdquo In Proceedings of theIEEE Conference on Computer Vision and Pattern Recognition 2016pp 1049ndash1058

[64] K Simonyan and A Zisserman ldquoTwo-Stream Convolutional Net-works for Action Recognition in Videosrdquo In Advances in NeuralInformation Processing Systems 27 2014 pp 568ndash576

[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[66] B Singh T K Marks M Jones O Tuzel and M Shao ldquoA multi-stream bi-directional recurrent neural network for fine-grainedaction detectionrdquo In Proceedings of the IEEE Conference on Com-puter Vision and Pattern Recognition 2016 pp 1961ndash1970

[67] G Singh S Saha M Sapienza P H Torr and F Cuzzolin ldquoOn-line real-time multiple spatiotemporal action localisation and pre-dictionrdquo In Proceedings of the IEEE International Conference on Com-puter Vision 2017 pp 3637ndash3646

[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).

[69] R Su W Ouyang L Zhou and D Xu ldquoImproving Action Local-ization by Progressive Cross-stream Cooperationrdquo In Proceedingsof the IEEE Conference on Computer Vision and Pattern Recognition2019 pp 12016ndash12025

[70] W Su X Zhu Y Cao B Li L Lu F Wei and J Dai ldquoVL-BERTPre-training of Generic Visual-Linguistic Representationsrdquo In In-ternational Conference on Learning Representations 2020

[71] C Sun A Myers C Vondrick K Murphy and C Schmid ldquoVideobertA joint model for video and language representation learningrdquoIn Proceedings of the IEEE International Conference on Computer Vi-sion 2019 pp 7464ndash7473

[72] C Sun A Shrivastava C Vondrick K Murphy R Sukthankarand C Schmid ldquoActor-centric Relation Networkrdquo In The Euro-pean Conference on Computer Vision (ECCV) 2018

[73] S Sun J Pang J Shi S Yi and W Ouyang ldquoFishnet A versatilebackbone for image region and pixel level predictionrdquo In Ad-vances in Neural Information Processing Systems 2018 pp 762ndash772

[74] C Szegedy W Liu Y Jia P Sermanet S Reed D AnguelovD Erhan V Vanhoucke and A Rabinovich ldquoGoing deeper withconvolutionsrdquo In Proceedings of the IEEE conference on computervision and pattern recognition 2015 pp 1ndash9

[75] H Tan and M Bansal ldquoLXMERT Learning Cross-Modality En-coder Representations from Transformersrdquo In Proceedings of the2019 Conference on Empirical Methods in Natural Language Process-ing 2019

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).

[77] D Tran L Bourdev R Fergus L Torresani and M Paluri ldquoLearn-ing spatiotemporal features with 3d convolutional networksrdquo InProceedings of the IEEE international conference on computer vision2015 pp 4489ndash4497

[78] A Vaswani N Shazeer N Parmar J Uszkoreit L Jones A NGomez Ł Kaiser and I Polosukhin ldquoAttention is all you needrdquoIn Advances in neural information processing systems 2017 pp 5998ndash6008

[79] L Wang Y Qiao X Tang and L Van Gool ldquoActionness Estima-tion Using Hybrid Fully Convolutional Networksrdquo In The IEEEConference on Computer Vision and Pattern Recognition (CVPR) 2016pp 2708ndash2717

[80] L Wang Y Xiong D Lin and L Van Gool ldquoUntrimmednets forweakly supervised action recognition and detectionrdquo In Proceed-ings of the IEEE conference on Computer Vision and Pattern Recogni-tion 2017 pp 4325ndash4334

[81] L Wang Y Xiong Z Wang Y Qiao D Lin X Tang and L VanGool ldquoTemporal segment networks Towards good practices fordeep action recognitionrdquo In European conference on computer vi-sion Springer 2016 pp 20ndash36

[82] P Wang Q Wu J Cao C Shen L Gao and A v d HengelldquoNeighbourhood watch Referring expression comprehension vialanguage-guided graph attention networksrdquo In Proceedings of the

IEEE Conference on Computer Vision and Pattern Recognition 2019pp 1960ndash1968

[83] P Weinzaepfel Z Harchaoui and C Schmid ldquoLearning to trackfor spatio-temporal action localizationrdquo In Proceedings of the IEEEinternational conference on computer vision 2015 pp 3164ndash3172

[84] S Xie C Sun J Huang Z Tu and K Murphy ldquoRethinking spa-tiotemporal feature learning Speed-accuracy trade-offs in videoclassificationrdquo In Proceedings of the European Conference on Com-puter Vision (ECCV) 2018 pp 305ndash321

[85] H Xu A Das and K Saenko ldquoR-c3d Region convolutional 3dnetwork for temporal activity detectionrdquo In Proceedings of the IEEEinternational conference on computer vision 2017 pp 5783ndash5792

[86] M Yamaguchi K Saito Y Ushiku and T Harada ldquoSpatio-temporalperson retrieval via natural language queriesrdquo In Proceedings ofthe IEEE International Conference on Computer Vision 2017 pp 1453ndash1462

[87] S Yang G Li and Y Yu ldquoCross-modal relationship inference forgrounding referring expressionsrdquo In Proceedings of the IEEE Con-ference on Computer Vision and Pattern Recognition 2019 pp 4145ndash4154

[88] S Yang G Li and Y Yu ldquoDynamic graph attention for referringexpression comprehensionrdquo In Proceedings of the IEEE Interna-tional Conference on Computer Vision 2019 pp 4644ndash4653

[89] X Yang X Yang M-Y Liu F Xiao L S Davis and J KautzldquoSTEP Spatio-Temporal Progressive Learning for Video ActionDetectionrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition (CVPR) 2019

[90] X Yang X Liu M Jian X Gao and M Wang ldquoWeakly-SupervisedVideo Object Grounding by Exploring Spatio-Temporal ContextsrdquoIn Proceedings of the 28th ACM International Conference on Multime-dia 2020 pp 1939ndash1947

[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).

[92] Z Yang B Gong L Wang W Huang D Yu and J Luo ldquoAfast and accurate one-stage approach to visual groundingrdquo InProceedings of the IEEE International Conference on Computer Vision2019 pp 4683ndash4693

[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).

[94] Y Ye X Yang and Y Tian ldquoDiscovering spatio-temporal actiontubesrdquo In Journal of Visual Communication and Image Representa-tion 58 (2019) pp 515ndash524

[95] S Yeung O Russakovsky G Mori and L Fei-Fei ldquoEnd-to-endlearning of action detection from frame glimpses in videosrdquo InProceedings of the IEEE Conference on Computer Vision and PatternRecognition 2016 pp 2678ndash2687

[96] L Yu Z Lin X Shen J Yang X Lu M Bansal and T L BergldquoMattnet Modular attention network for referring expression com-prehensionrdquo In Proceedings of the IEEE Conference on Computer Vi-sion and Pattern Recognition 2018 pp 1307ndash1315

[97] J Yuan B Ni X Yang and A A Kassim ldquoTemporal action local-ization with pyramid of score distribution featuresrdquo In Proceed-ings of the IEEE Conference on Computer Vision and Pattern Recogni-tion 2016 pp 3093ndash3102

[98] Z Yuan J C Stroud T Lu and J Deng ldquoTemporal action local-ization by structured maximal sumsrdquo In Proceedings of the IEEEConference on Computer Vision and Pattern Recognition 2017 pp 3684ndash3692

[99] R Zeng W Huang M Tan Y Rong P Zhao J Huang and CGan ldquoGraph Convolutional Networks for Temporal Action Lo-calizationrdquo In Proceedings of the IEEE International Conference onComputer Vision 2019 pp 7094ndash7103

[100] W Zhang W Ouyang W Li and D Xu ldquoCollaborative and Ad-versarial Network for Unsupervised domain adaptationrdquo In Pro-ceedings of the IEEE Conference on Computer Vision and Pattern Recog-nition 2018 pp 3801ndash3809

[101] Z Zhang Z Zhao Z Lin B Huai and J Yuan ldquoObject-AwareMulti-Branch Relation Networks for Spatio-Temporal Video Ground-ingrdquo In Proceedings of the Twenty-Ninth International Joint Confer-ence on Artificial Intelligence 2020 pp 1069ndash1075

[102] Z Zhang Z Zhao Y Zhao Q Wang H Liu and L Gao ldquoWhereDoes It Exist Spatio-Temporal Video Grounding for Multi-FormSentencesrdquo In Proceedings of the IEEECVF Conference on ComputerVision and Pattern Recognition 2020 pp 10668ndash10677

[103] Y Zhao Y Xiong L Wang Z Wu X Tang and D Lin ldquoTemporalaction detection with structured segment networksrdquo In Proceed-ings of the IEEE International Conference on Computer Vision 2017pp 2914ndash2923

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).

[105] M Zolfaghari G L Oliveira N Sedaghat and T Brox ldquoChainedmulti-stream networks exploiting pose motion and appearancefor action classification and detectionrdquo In Proceedings of the IEEEInternational Conference on Computer Vision 2017 pp 2904ndash2913

Page 8: Deep Learning for Spatial and Temporal Video Localization

Acknowledgements

First, I would like to thank my supervisor Prof. Dong Xu for his valuable guidance and support during the past four years of my PhD study. Under his supervision, I have learned useful research techniques, including how to write high-quality research papers, which help me to become an independent researcher.

I would also like to specially thank Dr. Wanli Ouyang and Dr. Luping Zhou, who offer me valuable suggestions and inspiration in our collaborated works. Through the useful discussions with them, I work out many difficult problems and am also encouraged to explore the unknown in the future.

Last but not least, I sincerely thank my parents and my wife from the bottom of my heart for everything.

Rui SU

University of Sydney
June 17, 2021

List of Publications

JOURNALS

[1] Rui Su, Dong Xu, Luping Zhou, and Wanli Ouyang. "Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization". IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020.

[2] Rui Su, Dong Xu, Lu Sheng, and Wanli Ouyang. "PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization". IEEE Transactions on Image Processing, 2020.

CONFERENCES

[1] Rui Su, Wanli Ouyang, Luping Zhou, and Dong Xu. "Improving action localization by progressive cross-stream cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12016-12025, 2019.

Contents

Abstract i

Declaration i

Acknowledgements ii

List of Publications iii

List of Figures ix

List of Tables xv

1 Introduction 1
  1.1 Spatial-temporal Action Localization 2
  1.2 Temporal Action Localization 5
  1.3 Spatio-temporal Visual Grounding 8
  1.4 Thesis Outline 11

2 Literature Review 13
  2.1 Background 13
    2.1.1 Image Classification 13
    2.1.2 Object Detection 15
    2.1.3 Action Recognition 18
  2.2 Spatio-temporal Action Localization 20
  2.3 Temporal Action Localization 22
  2.4 Spatio-temporal Visual Grounding 25
  2.5 Summary 26

3 Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization 29
  3.1 Background 30
    3.1.1 Spatial temporal localization methods 30
    3.1.2 Two-stream R-CNN 31
  3.2 Action Detection Model in Spatial Domain 32
    3.2.1 PCSC Overview 32
    3.2.2 Cross-stream Region Proposal Cooperation 34
    3.2.3 Cross-stream Feature Cooperation 36
    3.2.4 Training Details 38
  3.3 Temporal Boundary Refinement for Action Tubes 38
    3.3.1 Actionness Detector (AD) 41
    3.3.2 Action Segment Detector (ASD) 42
    3.3.3 Temporal PCSC (T-PCSC) 44
  3.4 Experimental results 46
    3.4.1 Experimental Setup 46
    3.4.2 Comparison with the State-of-the-art methods 47
      Results on the UCF-101-24 Dataset 48
      Results on the J-HMDB Dataset 50
    3.4.3 Ablation Study 52
    3.4.4 Analysis of Our PCSC at Different Stages 56
    3.4.5 Runtime Analysis 57
    3.4.6 Qualitative Results 58
  3.5 Summary 60

4 PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization 61
  4.1 Background 62
    4.1.1 Action Recognition 62
    4.1.2 Spatio-temporal Action Localization 62
    4.1.3 Temporal Action Localization 63
  4.2 Our Approach 64
    4.2.1 Overall Framework 64
    4.2.2 Progressive Cross-Granularity Cooperation 67
    4.2.3 Anchor-frame Cooperation Module 68
    4.2.4 Training Details 70
    4.2.5 Discussions 71
  4.3 Experiment 72
    4.3.1 Datasets and Setup 72
    4.3.2 Comparison with the State-of-the-art Methods 73
    4.3.3 Ablation study 76
    4.3.4 Qualitative Results 81
  4.4 Summary 82

5 STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding 83
  5.1 Background 84
    5.1.1 Vision-language Modelling 84
    5.1.2 Visual Grounding in Images/Videos 84
  5.2 Methodology 85
    5.2.1 Overview 85
    5.2.2 Spatial Visual Grounding 87
    5.2.3 Temporal Boundary Refinement 92
  5.3 Experiment 95
    5.3.1 Experiment Setup 95
    5.3.2 Comparison with the State-of-the-art Methods 98
    5.3.3 Ablation Study 98
  5.4 Summary 102

6 Conclusions and Future Work 103
  6.1 Conclusions 103
  6.2 Future Work 104

Bibliography 107

List of Figures

11 Illustration of the spatio-temporal action localization task 212 Illustration of the temporal action localization task 6

31 (a) Overview of our Cross-stream Cooperation framework(four stages are used as an example for better illustration)The region proposals and features from the flow streamhelp improve action detection results for the RGB streamat Stage 1 and Stage 3 while the region proposals and fea-tures from the RGB stream help improve action detectionresults for the flow stream at Stage 2 and Stage 4 Eachstage comprises of two cooperation modules and the de-tection head (b) (c) The details of our feature-level coop-eration modules through message passing from one streamto another stream The flow features are used to help theRGB features in (b) while the RGB features are used tohelp the flow features in (c) 33

32 Overview of two modules in our temporal boundary re-finement framework (a) the Initial Segment GenerationModule and (b) the Temporal Progressive Cross-streamCooperation (T-PCSC) Module The Initial Segment Gen-eration module combines the outputs from Actionness De-tector and Action Segment Detector to generate the initialsegments for both RGB and flow streams Our T-PCSCframework takes these initial segments as the segment pro-posals and iteratively improves the temporal action de-tection results by leveraging complementary informationof two streams at both feature and segment proposal lev-els (two stages are used as an example for better illustra-tion) The segment proposals and features from the flowstream help improve action segment detection results forthe RGB stream at Stage 1 while the segment proposalsand features from the RGB stream help improve action de-tection results for the flow stream at Stage 2 Each stagecomprises of the cooperation module and the detectionhead The segment proposal level cooperation module re-fines the segment proposals P i

t by combining the most re-cent segment proposals from the two streams while thesegment feature level cooperation module improves thefeatures Fi

t where the superscript i isin RGB Flow de-notes the RGBFlow stream and the subscript t denotesthe stage number The detection head is used to estimatethe action segment location and the actionness probability 39

33 Comparison of the single-stream frameworks of the actiondetection method (see Section 3) and our action segmentdetector (see Section 42) The modules within each bluedashed box share similar functions at the spatial domainand the temporal domain respectively 43

34 An example of visualization results from different meth-ods for one frame (a) single-stream Faster-RCNN (RGBstream) (b) single-stream Faster-RCNN (flow stream) (c)a simple combination of two-stream results from Faster-RCNN [38] and (d) our PCSC (Best viewed on screen) 57

35 Example results from our PCSC with temporal boundaryrefinement (TBR) and the two-stream Faster-RCNN methodThere is one ground-truth instance in (a) and no ground-truth instance in (b-d) In (a) both of our PCSC with TBRmethod and the two-stream Faster-RCNN method [38] suc-cessfully capture the ground-truth action instance ldquoTen-nis Swing In (b) and (c) both of our PCSC method andthe two-stream Faster-RCNN method falsely detect thebounding boxes with the class label ldquoTennis Swing for (b)and ldquoSoccer Juggling for (c) However our TBR methodcan remove the falsely detected bounding boxes from ourPCSC in both (b) and (c) In (d) both of our PCSC methodand the two-stream Faster-RCNN method falsely detectthe action bounding boxes with the class label ldquoBasket-ball and our TBR method cannot remove the falsely de-tected bounding box from our PCSC in this failure case(Best viewed on screen) 58

41 (a) Overview of our Progressive Cross-granularity Coop-eration framework At each stage our Anchor-Frame Co-operation (AFC) module focuses on only one stream (ieRGBFlow) The RGB-stream AFC modules at Stage 1 and3(resp the flow-stream AFC modules at Stage 2 and 4)generate anchor-based features and segments as well asframe-based features for the RGB stream (resp the Flowstream) The inputs of each AFC module are the proposalsand features generated by the previous stages (b) (c) De-tails of our RGB-stream and Flow stream AFC modules atdifferent stages n Note that we use the RGB-stream AFCmodule in (b) when n is an odd number and the Flow-stream AFC module in (c) when n is an even number Forbetter representation when n = 1 the initial proposalsSRGB

nminus2 and features PRGBnminus2 QFlow

nminus2 are denoted as SRGB0 PRGB

0

and QFlow0 The initial proposals and features from anchor-

based and frame-based branches are obtained as describedin Section 423 In both (b) and (c) the frame-based andanchor-based features are improved by the cross-granularityand cross-stream message passing modules respectivelyThe improved frame-based features are used to generatebetter frame-based proposals which are then combinedwith anchor-based proposals from both RGB and flow streamsThe newly combined proposals together with the improvedanchor-based features are used as the input to ldquoDetectionHead to generate the updated segments 66

42 ARAN () for the action segments generated from var-ious methods at different number of stages on the THU-MOS14 dataset 80

43 Qualitative comparisons of the results from two baselinemethods (ie the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at differ-ent number of stages on the THUMOS14 dataset Moreframes during the period between the 954-th second tothe 977-th second of the video clip are shown in the blueboxes where the actor slowly walks into the circle spotto start the action The true action instance of the actionclass ldquoThrow Discus with the starting frame and the end-ing frame from the 977-th second to the 1042-th second ofthe video clip are marked in the red box The frames dur-ing the period from the 1042-th second to the 1046-th sec-ond of the video clip are shown in the green boxes wherethe actor just finishes the action and starts to straightenhimself up In this example the anchor-based methoddetects the action segment with less accurate boundarieswhile the frame-based method misses the true action in-stance Our PCG-TAL method can progressively improvethe temporal boundaries of the action segment 81

51 (a) The overview of our spatio-temporal video groundingframework STVGBert Our two-step framework consistsof two sub-networks Spatial Visual Grounding network(SVG-net) and Temporal Boundary Refinement network(TBR-net) which takes video and textual query pairs asthe input to generate spatio-temporal object tubes contain-ing the object of interest In (b) the SVG-net generates theinitial object tubes In (c) the TBR-net progressively re-fines the temporal boundaries of the initial object tubesand generates the final spatio-temporal object tubes 86

52 (a) The overview of one co-attention layer in ViLBERT [47]The co-attention block consisting of a visual branch and atextual branch generates the visual-linguistic representa-tions by exchanging the key-value pairs for the multi-headattention blocks (b) The structure of our Spatio-temporalCombination and Decomposition (STCD) module whichreplaces the ldquoMulti-head attention and ldquoAdd amp Normblocks in the visual branch of ViLBERT (marked in thegreen dotted box in (a)) 89

53 Overview of our Temporal Boundary Refinement network(TBR-net) which is a multi-stage ViLBERT-based frame-work with an anchor-based branch and a frame-based branch(two stages are used for better illustration) The inputstarting and ending frame positions at each stage are thoseof the output object tube from the previous stage At stage1 the input object tube is generated by our SVG-net 93

List of Tables

31 Comparison (mAPs at the frame level) of different meth-ods on the UCF-101-24 dataset when using the IoU thresh-old δ at 05 48

32 Comparison (mAPs at the video level) of different meth-ods on the UCF-101-24 dataset when using different IoUthresholds Here ldquoAD is for actionness detector and ldquoASDis for action segment detector 49

33 Comparison (mAPs at the frame level) of different meth-ods on the J-HMDB dataset when using the IoU thresholdδ at 05 50

34 Comparison (mAPs at the video level) of different meth-ods on the J-HMDB dataset when using different IoU thresh-olds 51

35 Ablation study for our PCSC method at different trainingstages on the UCF-101-24 dataset 52

36 Ablation study for our temporal boundary refinement methodon the UCF-101-24 dataset Here ldquoAD is for the action-ness detector while ldquoASD is for the action segment detec-tor 52

37 Ablation study for our Temporal PCSC (T-PCSC) methodat different training stages on the UCF-101-24 dataset 52

38 The large overlap ratio (LoR) of the region proposals fromthe latest RGB stream with respective to the region pro-posals from the latest flow stream at each stage in ourPCSC action detector method on the UCF-101-24 dataset 54

39 The Mean Average Best Overlap (MABO) of the boundingboxes with the ground-truth boxes at each stage in ourPCSC action detector method on the UCF-101-24 dataset 55

310 The average confident scores () of the bounding boxeswith respect to the IoU threshold θ at different stages onthe UCF-101-24 dataset 55

311 Detection speed in frame per second (FPS) for our actiondetection module in the spatial domain on UCF-101-24when using different number of stages 55

312 Detection speed in video per second (VPS) for our tempo-ral boundary refinement method for action tubes on UCF-101-24 when using different number of stages 55

41 Comparison (mAP ) on the THUMOS14 dataset usingdifferent tIoU thresholds 74

42 Comparison (mAPs ) on the ActivityNet v13 (val) datasetusing different tIoU thresholds No extra external videolabels are used 75

43 Comparison (mAPs ) on the ActivityNet v13 (val) datasetusing different tIoU thresholds The external video labelsfrom UntrimmedNet [80] are applied 75

44 Comparison (mAPs ) on the UCF-101-24 dataset usingdifferent st-IoU thresholds 76

45 Ablation study among different AFC-related variants atdifferent training stages on the THUMOS14 dataset 77

46 LoR () scores of proposals from adjacent stages Thisscore is calculated for each video and averaged for all videosin the test set of the THUMOS14 dataset 78

51 Results of different methods on the VidSTG dataset ldquoindicates the results are quoted from its original work 99

52 Results of different methods on the HC-STVG dataset ldquoindicates the results are quoted from the original work 100

53 Results of our method and its variants when using differ-ent number of stages on the VidSTG dataset 101

Chapter 1

Introduction

With the massive use of digital cameras, smart phones and webcams, a large amount of video data is available on the Internet, which brings an increasing demand for intelligent systems that help users analyze the contents of these videos. In this context, video localization, as one of the most discussed video analysis problems, has attracted more and more attention from both academia and industry in recent decades.

In this thesis, we mainly investigate video localization in three tasks: (1) spatio-temporal action localization, (2) temporal action localization, and (3) spatio-temporal visual grounding. Specifically, given an untrimmed video, the spatio-temporal action localization task requires finding the spatial coordinates (bounding boxes) of action instances in every action frame (i.e., every frame that contains action instances), which are then associated into action tubes in order to indicate the spatio-temporal locations of the action instances. The temporal action localization task aims to only capture the starting and the ending time points of action instances within the untrimmed videos, without caring about the spatial locations of the action instances. Given an untrimmed video with textual queries describing objects appearing within the video, the goal of the spatio-temporal visual grounding task is to localize the queried target objects from the given video by outputting the corresponding spatial coordinates (bounding boxes). We develop novel algorithms based on deep neural networks to solve the aforementioned tasks and conduct extensive experiments to evaluate our proposed methods. A brief introduction to our contributions to these three tasks is provided as follows.
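To make the expected outputs of the three tasks concrete, the following illustrative Python type definitions sketch one possible way to represent them; all class and field names are our own and are not part of the thesis.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]   # (x1, y1, x2, y2)

@dataclass
class ActionTube:                 # output of spatio-temporal action localization
    label: str                    # action class
    boxes: List[Tuple[int, Box]]  # per-frame (frame index, bounding box) pairs

@dataclass
class ActionSegment:              # output of temporal action localization
    label: str
    t_start: float                # starting time point
    t_end: float                  # ending time point

@dataclass
class GroundedTube:               # output of spatio-temporal visual grounding
    query: str                    # textual query describing the target object
    boxes: List[Tuple[int, Box]]
```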

Figure 1.1: Illustration of the spatio-temporal action localization task. The spatio-temporal action localization framework takes untrimmed videos as input and generates the required action tubes.

1.1 Spatial-temporal Action Localization

Recently, deep neural networks have significantly improved the performance of various computer vision tasks [20, 53, 51, 52, 73, 43, 100], including human action detection. Human action detection, also known as spatio-temporal action localization, aims to recognize the actions of interest presented in videos and localize them in space and time. As shown in Figure 1.1, for the spatio-temporal action localization task, the input is an untrimmed video, while the output is the corresponding action tubes (i.e., sets of bounding boxes). Due to its wide spectrum of applications, it has attracted increasing research interest. The cluttered background, occlusion and large intra-class variance make spatio-temporal action localization a very challenging task.

Existing works [64, 60, 54] have demonstrated the complementarity between appearance and motion information at the feature level in recognizing and localizing human actions. We further observe that these two types of information are also complementary to each other at the region proposal level. That is, either appearance or motion clues alone may succeed or fail to detect region proposals in different scenarios. However, they can help each other by providing region proposals to each other.

For example, the region proposal detectors based on appearance information may fail when certain human actions exhibit extreme poses (e.g., toothbrushing, where only the movement of the hand changes while the whole scene remains the same), but motion information from human movements could help capture such actions. In another example, when the motion clue is noisy because of subtle movement or cluttered background, it is hard to detect positive region proposals based on motion information, while the appearance clue may still provide enough information to successfully find candidate action regions and remove a large number of background or non-action regions. Therefore, we can use the bounding boxes detected from the motion information as the region proposals to improve the action detection results based on the appearance information, and vice versa. This underpins our work in this thesis to fully exploit the interactions between appearance and motion information at both the region proposal and feature levels to improve spatial action localization, which is our first motivation.

On the other hand, to generate spatio-temporal action localization results, we need to form action tubes by linking the detected bounding boxes for actions in individual frames and temporally segmenting them out from the entire video clip. Data association methods based on spatial overlap and action class scores are mainly used in the current works [18, 54, 60, 67]. However, it is difficult for such methods to precisely identify the temporal boundaries of actions. In some cases, due to the subtle difference between the frames near temporal boundaries, it remains extremely hard to precisely decide the temporal boundary. As a result, there is large room for further improvement over the existing temporal refinement methods, which is our second motivation.

Based on the first motivation, in this thesis we propose a progressive framework called Progressive Cross-stream Cooperation (PCSC) to iteratively use both region proposals and features from one stream to progressively help learn better action detection models for another stream. To exploit the information from both streams at the region proposal level, we collect a larger set of training samples by combining the latest region proposals from both streams. At the feature level, we propose a new message passing module to pass information from one stream to another stream in order to learn better representations that capture both the motion and the appearance information at the same time. By exploiting the complementary information between appearance and motion clues in this way, we can progressively learn better action detection models and improve action localization results at the frame level.
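
To make the region-proposal-level cooperation concrete, the short Python sketch below shows one way the latest detections of one stream could be merged into the proposal set of the other and deduplicated. It is only an illustrative simplification under assumed inputs (boxes as (x1, y1, x2, y2) tensors), not the exact implementation described in Chapter 3; the helper name and IoU threshold are hypothetical.

    import torch
    from torchvision.ops import nms

    def merge_cross_stream_proposals(own_proposals, own_scores,
                                     other_detections, other_scores,
                                     iou_thresh=0.7):
        # Combine one stream's region proposals with the other stream's latest
        # detections, then suppress near-duplicate boxes with NMS.
        boxes = torch.cat([own_proposals, other_detections], dim=0)   # (N + M, 4)
        scores = torch.cat([own_scores, other_scores], dim=0)         # (N + M,)
        keep = nms(boxes, scores, iou_thresh)                         # indices kept by NMS
        return boxes[keep], scores[keep]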

Based on the second motivation, we propose a new temporal boundary refinement method consisting of an actionness detector and an action segment detector, in which we extend our PCSC framework to temporal localization. For the actionness detector, we train a set of class-specific binary classifiers (actionness detectors) to detect the occurrence of a certain type of action. These actionness detectors are trained by focusing on "confusing" samples from the action tubes of the same class, and can therefore learn critical features that are good at discriminating the subtle changes across the action boundaries. Our approach works better than an alternative approach that learns a general actionness detector for all actions. For the action segment detector, we propose a segment-proposal-based two-stream detector using a multi-branch architecture to address the large temporal scale variation problem. Our PCSC framework can be readily applied to the action segment detector to take advantage of the complementary information between appearance and motion clues, detect accurate temporal boundaries, and further improve the spatio-temporal localization results at the video level.
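
As a rough illustration of how a class-specific actionness detector could be used to trim an action tube in time, the sketch below thresholds per-frame actionness scores and keeps sufficiently long runs of high-score frames. The threshold, minimum length and function name are assumptions made for illustration, not the actual procedure used in Chapter 3.

    import numpy as np

    def trim_tube_by_actionness(actionness, threshold=0.5, min_len=3):
        # actionness: 1D numpy array of per-frame scores produced by one
        # class-specific binary classifier evaluated along a linked action tube.
        keep = actionness >= threshold
        segments, start = [], None
        for t, k in enumerate(keep):
            if k and start is None:
                start = t                      # a run of action frames begins
            elif not k and start is not None:
                if t - start >= min_len:
                    segments.append((start, t - 1))
                start = None
        if start is not None and len(keep) - start >= min_len:
            segments.append((start, len(keep) - 1))
        return segments                        # list of (start_frame, end_frame)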

Our contributions are briefly summarized as follows:

• For spatial localization, we propose the Progressive Cross-stream Cooperation (PCSC) framework to iteratively use both features and region proposals from one stream to help learn better action detection models for another stream, which includes a new message passing approach and a simple region proposal combination strategy.

• We also propose a temporal boundary refinement method to learn class-specific actionness detectors and an action segment detector, which applies the newly proposed PCSC framework from the spatial domain to the temporal domain to improve the temporal localization results.

• Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate that our approach outperforms the state-of-the-art methods for localizing human actions both spatially and temporally in realistic scenarios.

1.2 Temporal Action Localization

Temporal action localization aims at simultaneously classifying the actions of interest and localizing the starting and ending frames of every action instance in untrimmed videos. Many real-world applications, such as abnormal event detection, only require localizing the actions of interest in the temporal domain. Unlike the spatio-temporal action localization task, the temporal action localization task does not localize the actors in each frame; it can only detect the temporal starting and ending frames based on the whole input image of each frame. As a result, the input is less discriminative, which makes the temporal action localization task more challenging.

Similar to object detection methods such as [58], prior methods [9, 15, 85, 5] have been proposed to handle this task through a two-stage pipeline, which first generates a set of 1D temporal segment proposals and then performs classification and temporal boundary regression on each individual proposal. However, this task is still challenging, and we are confronted with two major issues: (1) how to generate proposals with high recall rates and accurate boundaries, and (2) how to effectively make complementary clues cooperate with each other across the whole video localization framework.

For the first issue, as shown in Figure 1.2, we observe that the proposal generation methods with different granularities (i.e., the anchor-based and frame-based methods) are often complementary to each other.


Figure 1.2: Illustration of the temporal action localization task. The starting and the ending frames of the actions within the input videos can be localized by the anchor-based or the frame-based temporal action localization methods. (The figure compares ground-truth segments with the segments predicted by P-GCN and by our method on Frisbee Catch, Golf Swing and Throw Discus examples, and contrasts anchor-based proposals, with their confidence scores, against frame-level actionness and starting/ending scores.)

Specifically, the anchor-based methods generate segment proposals by regressing boundaries for the pre-defined anchors, which generally produces coarse proposals with a high recall rate. Meanwhile, the frame-based methods predict per-frame actionness scores to generate segment proposals, which often detect accurate boundaries but suffer from low recall rates. The prior works [37, 13, 46] have attempted to make these two lines of schemes cooperate with each other by using either a bottom-up module stacking strategy or guided proposal filtering. However, we argue that the features from the anchor-based methods and the frame-based methods represent different information of the input videos, as they are learned based on different proposal generation mechanisms. Consequently, the features and segment proposals from the two lines of proposal generation schemes are considered complementary to each other for the temporal action localization task, and it is possible to leverage this cross-granularity complementary information to generate better action segments with high recall rates and accurate temporal boundaries.

For the second issue, the aforementioned two lines of proposal generation methods can cooperate with each other at both the feature and segment levels. To be more specific, the feature-level cooperation strategy improves the frame-based features and filters out false starting/ending point pairs with invalid durations, and thus enhances the frame-based methods to generate better proposals with higher overlap with the ground-truth action segments. Meanwhile, the proposal-level cooperation strategy collects complete and accurate proposals with the aid of the frame-based approaches, which helps the subsequent action classification and boundary regression process. In order to effectively integrate the complementary information between the anchor-based and frame-based methods, one straightforward approach is to progressively leverage the cross-granularity complementarity to improve the temporal action localization performance and eventually fully exploit the complementary information. However, trivially repeating the aforementioned cross-granularity cooperation process is suboptimal, as it can easily lead to performance saturation. Building upon the well-known observation that appearance and motion clues are also complementary to each other, our work augments the cross-granularity collaboration process with a cross-stream cooperation strategy, which enhances the representations for the anchor-based methods and better exploits the complementarity between the two-granularity approaches at the feature level by iteratively using the features from one stream (i.e., RGB/flow) to improve the features from another stream (i.e., flow/RGB) for the anchor-based method.

To this end, building upon the two-granularity and two-stream pipeline, we propose a unified temporal action localization framework, named PCG-TAL, which enables progressive cross-granularity cooperation. More specifically, our PCG-TAL includes a new Anchor-Frame Cooperation (AFC) module, which improves the frame-based features by using a message passing operation to pass complementary information from the anchor-based features, and also updates the segment proposal set by combining the RGB and flow segment proposals generated by the anchor-based scheme with the improved segment proposals from the frame-based approach based on the updated frame-based features. Moreover, our AFC module incorporates the aforementioned message passing and segment proposal combination strategies into the two-stream anchor-based segment proposal generation pipeline, in which the RGB-stream AFC module and the flow-stream AFC module are stacked sequentially in order to exchange cross-stream and cross-granularity information at the feature and segment proposal levels over multiple stages. Specifically, we iteratively update the RGB-stream AFC module by using the updated features and proposals from the flow-stream AFC module, and vice versa. As a result, our framework can progressively improve the temporal action localization performance by leveraging the cross-granularity and cross-stream complementary information. The entire framework is learned in an end-to-end fashion; a minimal schematic of this alternating stage-wise update is sketched after the contribution list below. Our contributions are three-fold:

(1) We propose an AFC module to exploit the cross-granularity and cross-stream complementarity at both the feature and segment proposal levels.

(2) We introduce a new multi-stage framework to effectively integrate the cross-granularity and cross-stream information from the two-stream anchor-based and frame-based methods in a progressive manner.

(3) Comprehensive experiments on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, demonstrate that our approach outperforms the state-of-the-art methods for both temporal and spatio-temporal action localization.
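
The schematic referred to above is sketched below: a minimal Python loop in which, at every stage, one stream's Anchor-Frame Cooperation module consumes the other stream's latest features and segment proposals. The callables afc_rgb and afc_flow and the stage count are placeholders; this is a sketch of the alternating update schedule, not the released PCG-TAL implementation.

    def progressive_cooperation(rgb_feats, flow_feats, rgb_props, flow_props,
                                afc_rgb, afc_flow, num_stages=4):
        # Alternate between the two streams: each AFC call returns updated
        # features and an updated segment proposal set for its own stream,
        # conditioned on the other stream's most recent outputs.
        for stage in range(num_stages):
            if stage % 2 == 0:
                rgb_feats, rgb_props = afc_rgb(rgb_feats, rgb_props,
                                               flow_feats, flow_props)
            else:
                flow_feats, flow_props = afc_flow(flow_feats, flow_props,
                                                  rgb_feats, rgb_props)
        return rgb_props, flow_props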

1.3 Spatio-temporal Visual Grounding

Vision and language play important roles in how humans understand the world. In recent years, with the remarkable progress of deep neural networks, various vision-language tasks (e.g., image captioning and visual grounding) have attracted increasing attention from researchers.

Spatial-Temporal Video Grounding (STVG), which was introduced in the recent work [102], is a novel and challenging vision-language task. Given an untrimmed video and a textual description of an object, the STVG task aims to produce a spatio-temporal tube (i.e., a sequence of bounding boxes) for the target object described by the given text description. Different from the existing grounding tasks in images, both spatial and temporal localization are required in the STVG task. Besides, how to effectively align visual and textual information through cross-modal feature learning in both the spatial and temporal domains is also a key issue for accurately localizing the target object, especially in challenging scenarios where different persons perform similar actions within one scene.

Spatial localization in images/videos is a related visual grounding task, and spatial localization results have been improved in recent works [42, 96, 45, 82, 87, 88, 41, 86, 7]. In most existing works, a pre-trained object detector is often required to pre-generate object proposals. However, these approaches suffer from two limitations: (1) the localization performance heavily relies on the quality of the pre-generated object proposals; (2) it is difficult for a pre-trained object detector to generalize well to new datasets with unseen classes. Although the recent works [92, 34, 48, 91] have attempted to remove the dependency on the pre-generation process in the image grounding tasks, such efforts have not been made for the video grounding tasks.

Beyond spatial localization, temporal localization also plays an important role in the STVG task, and significant progress has been made on it in recent years. The existing methods can be mainly divided into anchor-based and frame-based approaches. In the anchor-based approaches [141], the output segments often have relatively high overlap with the ground-truth ones, but the temporal boundaries are less accurate because they regress the ground-truth boundaries by using the features extracted from the anchors (i.e., the candidate segments). The frame-based approaches [6] directly determine the temporal boundaries based on frame-level prediction scores, which often produces more accurate temporal boundaries but may falsely detect the temporal segments. Intuitively, it is desirable to explore the complementarity of the two lines of methods in order to improve the temporal localization performance.

Motivated by the above observations, in this work we propose a visual-linguistic transformer based framework, called STVGBert, for the STVG task, which includes a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). The SVG-net first takes a pair of video clip and textual query as the input and produces an initial spatio-temporal tube. The initial tube is then fed into the TBR-net, where its temporal boundaries are further refined. Both SVG-net and TBR-net employ a visual-linguistic transformer for cross-modal feature learning, given its promising results on various vision-language tasks [47, 70, 33].

Furthermore, both the SVG-net and the TBR-net have their own special designs. Specifically, the key component of the SVG-net is an improved version of ViLBERT [47], named ST-ViLBERT. In addition to encoding temporal information as in the original ViLBERT, our ST-ViLBERT also preserves spatial information in the visual input features. As a result, our SVG-net can effectively learn the cross-modal representation and produce initial spatio-temporal tubes of reasonable quality without requiring any pre-trained object detectors. Besides, we also propose a novel ViLBERT-based temporal localization network, referred to as TBR-net, to take advantage of the complementarity between an anchor-based branch and a frame-based branch. Our TBR-net iteratively refines the temporal boundaries by using the output from the frame/anchor-based branch as the input for the anchor/frame-based branch. This multi-stage, two-branch design turns out to be effective for progressively refining the temporal boundaries of the initial tube produced by the SVG-net. We evaluate our proposed framework STVGBert on two benchmark datasets, VidSTG [102] and HC-STVG [76], and our framework outperforms the state-of-the-art methods.

Our contributions can be summarized as follows:

(1) We propose a novel two-step visual-linguistic transformer based framework, STVGBert, for the spatio-temporal video grounding task. To the best of our knowledge, this is the first STVG framework that does not require any pre-trained object detectors.

(2) We introduce a spatial visual grounding network (SVG-net) with a newly proposed cross-modal feature learning module. Moreover, we propose a temporal boundary refinement network (TBR-net), which progressively refines the temporal boundaries of the object tubes by exploiting the complementarity of the frame-based and anchor-based methods.

(3) Comprehensive experiments conducted on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our framework for the STVG task.

1.4 Thesis Outline

The rest of this thesis is organized into five chapters. We summarize the main content of each chapter as follows.

Chapter 2: Literature Review. In this chapter, we give a comprehensive review of the background and the three video localization tasks to help readers better understand the studies in the following chapters.

Chapter 3: Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization. In this chapter, we propose a Progressive Cross-stream Cooperation framework in both the spatial and temporal domains to solve the spatio-temporal action localization task by leveraging the complementary information between appearance and motion clues at the proposal and feature levels. We also evaluate the effectiveness of our proposed method on two benchmarks, UCF-101-24 [68] and J-HMDB [24].

Chapter 4: PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization. In this chapter, in order to solve the temporal action localization task, we integrate two lines of previous work (i.e., anchor-based and frame-based methods) and propose a new Anchor-Frame Cooperation (AFC) module to exchange cross-granularity information, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. The effectiveness of our proposed method is validated on three benchmarks, THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].


Chapter 5: STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding. In this chapter, we develop a spatio-temporal visual grounding framework based on a visual-linguistic transformer to learn cross-modality representations in order to better align the visual and textual information. We also propose a temporal boundary refinement network to progressively refine the temporal boundaries of the generated results by leveraging the complementarity of the frame-based and the anchor-based methods. Two benchmarks, VidSTG [102] and HC-STVG [76], are used to evaluate our proposed method.

Chapter 6: Conclusions and Future Work. In this chapter, we conclude the contributions of this work and point out potential future directions for video localization.


Chapter 2

Literature Review

In this chapter, we review the related work on deep neural networks and the three tasks studied in this thesis. We first introduce the background of deep neural networks, and then summarize related work on the spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding tasks.

2.1 Background

2.1.1 Image Classification

Recently, deep neural networks have been widely used in computer vision tasks. Krizhevsky et al. [27] used a set of convolutional layers with nonlinear activation functions in between each layer to build a convolutional neural network, named AlexNet, for the image classification task. Each convolutional layer contains a set of weights, and the number of weights in each layer depends on the kernel size used in the convolutional filters. The convolutional neural network can be trained to classify the categories of images by using a set of training samples. Specifically, the loss between the ground truth and the prediction for each training sample is calculated, and the weights of the convolutional neural network are updated by using back-propagation and gradient descent to minimize the loss. Additionally, due to the gradient vanishing issue caused by the sigmoid function, a new nonlinear function named Rectified Linear Unit (ReLU) was introduced and used as the activation function to address this issue. However, the large number of parameters (weights) used in the convolutional neural network can easily lead the network to overfit the training samples. In order to ensure the generalization ability of the network, strategies such as data augmentation and dropout were introduced to avoid the overfitting problem. This network achieved a winning top-5 test error rate of 15.3% in the largest image classification competition at that time (ILSVRC-2012), which was a huge improvement over the second best (26.2%). This work has stimulated research interest in using deep learning methods to solve computer vision problems.
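
The training recipe described above (convolutional layers with ReLU activations, a classification loss, back-propagation and gradient descent) can be made concrete with a toy PyTorch example; the architecture, input size and hyper-parameters below are illustrative stand-ins and not AlexNet itself.

    import torch
    import torch.nn as nn

    model = nn.Sequential(                       # a toy convolutional classifier
        nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(32, 10),
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

    images = torch.randn(8, 3, 32, 32)           # a dummy mini-batch of training samples
    labels = torch.randint(0, 10, (8,))
    loss = criterion(model(images), labels)      # loss between prediction and ground truth
    optimizer.zero_grad()
    loss.backward()                              # back-propagation of the error gradients
    optimizer.step()                             # gradient-descent update of the weights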

Later, Simonyan and Zisserman [65] proposed a deeper convolutional neural network, named VGG-Net, to further improve the image classification performance. They observed that the convolutional layers with large kernel sizes used in AlexNet can be replaced by a stack of several convolutional layers with small kernel sizes. For example, a stack of two 3×3 convolutional layers achieves the same effective receptive field as a 5×5 convolutional layer, and the effective receptive field of a 7×7 convolutional layer can be achieved by stacking three 3×3 convolutional layers. By decomposing the convolutional layers with large kernel sizes into several 3×3 convolutional layers with activation functions in between, the nonlinearity of the convolutional neural network increases and the number of parameters used in the network decreases, without losing the effective receptive field of the network. VGG-Net achieved a 7.32% top-5 error rate in the ILSVRC-2014 competition.
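
The receptive-field and parameter counts behind this argument are easy to verify; in the small check below, C denotes the (assumed equal) number of input and output channels and biases are ignored.

    def receptive_field(num_3x3_layers):
        # Effective receptive field of a stack of stride-1 3x3 convolutions.
        rf = 1
        for _ in range(num_3x3_layers):
            rf += 2                              # each 3x3 layer widens the field by 2
        return rf

    C = 64
    print(receptive_field(2), receptive_field(3))   # 5, 7
    print(2 * 3 * 3 * C * C, 5 * 5 * C * C)         # 18*C^2 vs 25*C^2 weights
    print(3 * 3 * 3 * C * C, 7 * 7 * C * C)         # 27*C^2 vs 49*C^2 weights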

Similarly, Szegedy et al. [74] proposed a deep convolutional neural network named Inception for the image classification task, which achieved a 6.67% top-5 error rate and defeated VGG-Net in the ILSVRC-2014 competition. Instead of decomposing convolutional layers with large kernel sizes into several convolutional layers with small kernel sizes, Inception employs multiple kernel sizes in each convolutional layer in order to take advantage of the complementary information brought by multiple resolutions. Specifically, they introduced an inception module, which takes the output of the previous inception module as the input of a set of parallel convolutional layers with different kernel sizes followed by pooling layers, and the outputs of these layers are concatenated to form the final output of the inception module. The deep convolutional neural network is built by stacking such inception modules to achieve high image classification performance.

VGG-Net and Inception have shown that performance on the image classification problem can be improved by making the convolutional neural network deeper. However, when simply increasing the number of convolutional layers and stacking them to build the network, the image classification performance easily gets saturated and then degrades rapidly. This is because, as the convolutional neural network goes deeper, the error gradients can accumulate to very large values. This results in large updates during the training process and leads to an unstable network. In order to address this gradient explosion issue, He et al. [20] proposed a shortcut connection architecture to allow the gradients to be easily back-propagated from the deep layers to the shallow layers without gradient accumulation. By using the shortcut connection architecture, the convolutional neural network can be built with an extremely large depth to improve the image classification performance.
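
A minimal residual block with an identity shortcut is sketched below in PyTorch; it is a simplified version (no bottleneck or downsampling) of the blocks in [20], shown only to make the shortcut-connection idea concrete.

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # Identity-shortcut block: gradients can flow directly through the skip path.
        def __init__(self, channels):
            super().__init__()
            self.body = nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
            )
            self.relu = nn.ReLU(inplace=True)

        def forward(self, x):
            return self.relu(self.body(x) + x)   # shortcut connection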

Instead of increasing the depth of the convolutional neural network, Huang et al. [22] introduced another shortcut connection architecture to enlarge the width of the convolutional neural network. The proposed network is built by stacking a set of dense blocks, where each dense block has shortcut connections to the other blocks. Specifically, each dense block takes the combination of the outputs from all the previous dense blocks as its input, so that the features from earlier layers can be reused in the following layers. Note that the network becomes wider as it goes deeper.

2.1.2 Object Detection

The object detection task aims to detect the locations and categories of the objects appearing in a given image. With the promising image classification performance achieved by deep convolutional neural networks, researchers have recently investigated how to use deep learning methods to solve the object detection problem.


Girshick et al. [17] proposed a convolutional neural network based object detection method called RCNN. They used selective search to generate a set of region proposals, which were then used to crop the given image to obtain the corresponding patches of the input image. All the cropped patches were candidates that potentially contain objects. Each of these patches was fed into a convolutional neural network followed by a detection head, which contains a classification head and a regression head to predict the category of the patch and regress the offsets to the coordinates of the corresponding region proposal, respectively. However, one of the drawbacks of RCNN is that all the generated patches have to be fed into the convolutional neural network, which leads to high computational cost and long training and inference times.

To make convolutional neural network based object detection frameworks more efficient, Girshick et al. [16] proposed Fast RCNN to reduce the time cost of detecting objects in a given image. Specifically, similar to RCNN, they first used selective search to generate the region proposal set. However, instead of generating patch candidates from the original image to be fed into the convolutional neural network, they only fed the entire image into the network to generate a feature map, and then applied a newly proposed Region-of-Interest (ROI) pooling operation on top of the generated feature map to obtain fixed-size feature maps for the corresponding region proposals, so that they can be passed to a fully connected layer followed by a detection head to produce the category predictions and the offsets. By doing so, the training and inference times are largely reduced.
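
The ROI pooling step can be exercised directly with the operator provided by torchvision, as in the short example below; the feature-map size, proposal coordinates and 1/16 spatial scale are illustrative values.

    import torch
    from torchvision.ops import roi_pool

    feature_map = torch.randn(1, 256, 50, 50)        # features of the entire image
    # Region proposals in image coordinates: (batch_index, x1, y1, x2, y2).
    rois = torch.tensor([[0.0,  32.0,  32.0, 256.0, 256.0],
                         [0.0, 100.0,  40.0, 400.0, 300.0]])
    pooled = roi_pool(feature_map, rois, output_size=(7, 7), spatial_scale=1.0 / 16)
    print(pooled.shape)   # torch.Size([2, 256, 7, 7]): a fixed-size feature map per proposal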

Both RCNN and Fast RCNN use selective search to generate region proposals, which is very time-consuming and affects the object detection performance. As a result, in order to further improve the efficiency of object detection frameworks, Ren et al. [58] introduced a new Region Proposal Network (RPN) that can learn to generate region proposals, and proposed a new object detection framework, Faster RCNN. Similar to Fast RCNN, the entire image is fed into a convolutional neural network to generate a feature map. A set of pre-defined anchors is then used at each location of the generated feature map to generate region proposals by a separate network. The final detection results are obtained from a detection head by using the features pooled from the generated feature map with the ROI pooling operation.
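
A hedged sketch of tiling pre-defined anchors over every location of a feature map is given below; the stride, scales and aspect ratios are placeholder values rather than the exact RPN configuration.

    import numpy as np

    def generate_anchors(feat_h, feat_w, stride=16,
                         scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
        # Place one anchor box per (scale, ratio) pair at the centre of every
        # feature-map cell, expressed in image coordinates (x1, y1, x2, y2).
        anchors = []
        for y in range(feat_h):
            for x in range(feat_w):
                cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
                for s in scales:
                    for r in ratios:
                        w, h = s * np.sqrt(r), s / np.sqrt(r)
                        anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
        return np.array(anchors)

    print(generate_anchors(38, 50).shape)   # (17100, 4): 38 * 50 locations * 9 anchors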

Faster RCNN is considered a two-stage object detection framework, where an RPN is first used to generate region proposals, and then the ROI pooling operation and the detection head are applied to obtain the final results. Redmon et al. [57] proposed a one-stage real-time object detection framework called Yolo. They split the entire image into an S × S grid, with each location in the grid containing several pre-defined anchors. For each anchor, the network predicts the class probability and the offsets to the anchor. Yolo is faster than Faster RCNN as it does not require the time-consuming region feature extraction performed by the ROI pooling operation.

Similarly, Liu et al. [44] proposed the single shot multi-box detector (SSD) to generate bounding boxes in one stage. In [44], a set of boxes with different sizes and aspect ratios is predefined at each location of the feature maps. These predefined boxes are used to directly regress the coordinates of the target objects based on the features at the corresponding locations of the feature maps.

All the aforementioned object detection frameworks are anchor-based methods, which require pre-defined anchors or generated region proposals. Zhou et al. [104] proposed CenterNet for object detection without using any pre-defined anchors, which predicts the locations of the center points of the objects. Specifically, a dense heat map is predicted by the convolutional neural network to represent the probability of each point in the heat map being the center point of an object. Additionally, the network also predicts the bounding box size (height and width) for each point in the heat map.
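
Decoding such a centre-point heat map can be sketched as below: local maxima above a score threshold are kept and combined with the predicted width and height at the same locations. The tensor layout and the threshold are assumptions made for illustration, not the exact CenterNet post-processing.

    import torch
    import torch.nn.functional as F

    def decode_center_heatmap(heatmap, wh, score_thresh=0.3):
        # heatmap: (C, H, W) centre-point probabilities; wh: (2, H, W) predicted
        # box width and height at every heat-map location.
        peaks = (heatmap == F.max_pool2d(heatmap, 3, stride=1, padding=1))  # local maxima
        scores = heatmap.max(dim=0).values * peaks.any(dim=0).float()
        ys, xs = torch.where(scores > score_thresh)
        boxes = []
        for y, x in zip(ys.tolist(), xs.tolist()):
            w, h = wh[0, y, x].item(), wh[1, y, x].item()
            cls = int(heatmap[:, y, x].argmax())
            boxes.append((x - w / 2, y - h / 2, x + w / 2, y + h / 2, cls))
        return boxes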


2.1.3 Action Recognition

Convolutional neural networks have been used to solve image-based computer vision problems such as classification and object detection. Recently, research interest in using convolutional neural networks to address computer vision problems on videos has also risen. Simonyan and Zisserman [64] introduced Two-Stream Convolutional Networks for the action recognition task in videos. To classify the action categories of videos, they designed a spatial network based on a convolutional neural network, which took the RGB frames from the given videos as input and output the action class predictions. In addition to the appearance information, they also found that a stack of multi-frame dense optical flow can represent motion information, which also helped the action recognition performance. Specifically, a separate temporal network was designed to take the optical flow maps as input to predict action classes, which can also achieve good action recognition performance. Moreover, combining the spatial and temporal networks further improves the performance of the action recognition task, which demonstrates that appearance and motion information are complementary to each other.

In [12], Feichtenhofer et al. further discussed the effects of several different strategies for combining the spatial and the temporal networks proposed in [64] on the action recognition performance. Through experiments, they concluded that, among the proposed strategies, using a convolutional layer to fuse the features from the last convolutional layers of the spatial and the temporal networks allows the combined network to achieve the best action recognition performance, which provides insights into how to effectively combine information from different complementary modalities.

Based on the two-stream architecture in [64], Wang et al. [81] proposed Temporal Segment Networks (TSN) to improve the performance of action recognition. They claimed that temporal information is important, but consecutive frames in a video are redundant because of their high similarity. As a result, they divided a video into segments and randomly chose a frame from each segment to represent the whole segment. The chosen frames and the corresponding optical flow maps were used as input to the two-stream network, and a consensus was then applied to the outputs of each stream. Finally, the outputs from the two streams were combined to exploit the complementary information between the two modalities. Additionally, they also took advantage of features extracted from warped flow and RGB difference, which can further benefit the action recognition performance. Besides, using a pretrained model as initialization and applying partial batch normalization helps keep the framework from overfitting the training samples.
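
The segment-based sampling scheme can be illustrated with the few lines below, which split a video into equal temporal segments and draw one frame index at random from each; the segment count is an illustrative default and the handling of leftover frames is simplified.

    import random

    def sample_segment_frames(num_frames, num_segments=3):
        # One randomly chosen frame index per equal-length temporal segment
        # (the corresponding optical-flow stack would be sampled around it).
        seg_len = num_frames // num_segments
        return [random.randint(i * seg_len, (i + 1) * seg_len - 1)
                for i in range(num_segments)]

    print(sample_segment_frames(300))   # e.g. [57, 133, 244]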

2D convolutional neural networks have proven effective for the action recognition task. Tran et al. [77] also investigated the effect of 3D convolution on feature extraction for videos. They claimed that, while 2D convolution only considers spatial information, 3D convolution is able to take advantage of the temporal information of the input, which is also important in a video. They showed that 3D convolution with a kernel size of 3×3×3 is a reasonable setting among several proposed settings, and designed a 3D convolutional network with such convolutional layers, named C3D. Compared to the two-stream framework, a 3D convolutional neural network can model the spatio-temporal information simultaneously, which is more time-efficient.

Similar to C3D, Carreira et al. [4] proposed a deeper 3D convolutional neural network, called I3D, for the action recognition task. I3D was built by inflating every 2D convolutional layer in Inception [74] into a 3D convolutional layer to learn spatio-temporal representations for videos. They trained the proposed network on a large action recognition dataset named Kinetics and demonstrated that it can significantly improve the action recognition performance on other relatively small datasets such as UCF-101. I3D is widely used to extract spatio-temporal features for videos.
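
The inflation of 2D filters into 3D filters can be sketched as below: a pre-trained 2D kernel is repeated along a new temporal axis and rescaled so that, on a video made of identical frames, the 3D filter initially produces the same response as the 2D one. Tensor shapes follow the PyTorch convention; the temporal extent is an illustrative value.

    import torch

    def inflate_conv_weight(weight_2d, time_dim=3):
        # weight_2d: (out_c, in_c, kH, kW) from a pre-trained 2D convolution.
        # Returns (out_c, in_c, T, kH, kW), replicated over time and divided by T.
        weight_3d = weight_2d.unsqueeze(2).repeat(1, 1, time_dim, 1, 1)
        return weight_3d / time_dim

    w2d = torch.randn(64, 3, 7, 7)
    print(inflate_conv_weight(w2d).shape)   # torch.Size([64, 3, 3, 7, 7])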


2.2 Spatio-temporal Action Localization

Spatio-temporal action localization aims to localize action instances within untrimmed videos. The input of this task is an untrimmed video, while the output is a set of bounding boxes representing the spatial locations in each frame within the temporal durations of the corresponding action instances. Recently, deep convolutional neural networks have been widely used to solve this computer vision task.

Weinzaepfel et al. [83] proposed a spatio-temporal action localization framework built upon two-stream (RGB and optical flow) features. They used object detection methods to detect action locations in each frame and then tracked the proposals with high scores by using tracking-by-detection methods. A multi-scale sliding-window approach was then used as the temporal anchors to detect the temporal segments and perform the temporal localization.

Similar to action recognition, both appearance and motion clues are important for spatio-temporal action localization. Peng et al. [54] proposed a multi-region two-stream RCNN framework for spatio-temporal action localization. In addition to an RCNN framework that takes RGB images as input to generate region proposals, they also developed an optical flow based RCNN framework and showed that it can generate high-quality proposals by using motion information. Moreover, instead of detecting the human body as a whole, they introduced a multi-region scheme into the RCNN framework to generate proposals for different regions of the human body, so that complementary information from human body parts can be exploited. The proposed framework generates frame-level detections, and the Viterbi algorithm is used to link these detections for temporal localization.

Saha et al. [60] proposed a three-stage framework for detecting the spatio-temporal locations of action instances. They first took advantage of both appearance (RGB) and motion (optical flow) information to localize actions and predict action scores. Then, the appearance and motion scores were combined to leverage the complementary information from these two modalities. Finally, action tubes were constructed by linking the detections generated in the previous stages. The detections were first linked based on their action class scores and spatial overlap, and then the linked action tubes were temporally trimmed by enforcing label consistency.
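
The linking step shared by these works can be summarised with a hedged sketch: detections in consecutive frames are greedily connected by maximising the sum of their class scores and their spatial overlap. The scoring weight and the greedy strategy below are simplifications for illustration, not a particular published linking algorithm.

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter + 1e-6)

    def link_detections(per_frame_dets, lam=1.0):
        # per_frame_dets: for every frame, a non-empty list of (box, class_score).
        # Greedily extend the tube with the detection that maximises
        # class score + lam * IoU with the previously selected box.
        tube = [max(per_frame_dets[0], key=lambda d: d[1])]
        for dets in per_frame_dets[1:]:
            prev_box = tube[-1][0]
            tube.append(max(dets, key=lambda d: d[1] + lam * iou(prev_box, d[0])))
        return tube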

As computing optical flow from consecutive RGB images is very time-consuming, it is hard for two-stream based spatio-temporal action localization methods to achieve real-time performance. To address this issue, Singh et al. [67] used a real-time one-stage object detection method to replace the RCNN frameworks used in [54] and [60], in order to reduce the additional time consumption brought by the two-stage object detection framework. Additionally, they also introduced an efficient online algorithm to incrementally link the generated frame-level detections and construct action tubes. By doing so, the proposed method is capable of performing real-time spatio-temporal action localization.

The aforementioned spatio-temporal action localization methods rely on frame-level detections. Kalogeiton et al. [26] proposed an ACT-detector to exploit the temporal information of videos. Instead of handling one frame at a time, the ACT-detector takes a sequence of RGB frames as input and outputs a set of bounding boxes that form a tubelet. Specifically, a convolutional neural network is used to extract features for each of these RGB frames, which are stacked to form a representation of the whole segment. The representation is then used to regress the coordinates of the output tubelet. The generated tubelets are more robust than action tubes linked from frame-level detections because they are produced by leveraging the temporal continuity of videos.

A progressive learning framework for spatio-temporal action localization, named STEP [89], was proposed by Yang et al. They first used several hand-designed coarse anchors to search for the actions of interest, and then progressively refined the coordinates of the anchors over iterations in a coarse-to-fine manner. In each iteration, the anchors are temporally extended to gradually include more temporal context in order to generate high-quality proposals. Moreover, by doing so, STEP is able to naturally handle the spatial displacement within action tubes.


2.3 Temporal Action Localization

Unlike spatio-temporal action localization, which requires obtaining the spatial locations of the action instances, temporal action localization focuses on accurately detecting when the action instances start and end in the time domain of untrimmed videos. Early works used sliding windows to generate segment proposals and then applied classifiers to classify the actions for each proposal. Yuan et al. [97] captured motion information at multiple resolutions by using sliding windows to obtain a Pyramid of Score Distribution Feature. They then considered the inter-frame consistency by incorporating the pyramid of score distribution features, which effectively boosted the temporal action localization performance. Shou et al. [63] used sliding windows and C3D [77] in a binary classification task to generate background and action proposals. All these proposals were fed to a classification network based on C3D to predict labels. Then, they used the trained classification model to initialize a C3D model as a localization model, took the proposals and their overlap scores as the inputs of this localization model, and proposed a new loss function to reduce the classification error.

Most recent temporal action localization methods can be roughly categorized into two types. The first set of works [49, 10, 29, 98, 103] used convolutional neural networks to extract frame-level or snippet-level features for the given videos, which were then used to predict frame-wise or snippet-wise actionness scores. By simply thresholding the actionness scores, the action starting or ending times can be selected, which are then used to determine the temporal boundaries of actions for proposal generation.

Shou et al. [62] developed a new 3D convolutional architecture named CDC to predict the actionness scores for each frame in order to determine the temporal boundaries of action instances. CDC consists of two parts. In the first part, 3D convolutional layers are used to reduce the spatial and the temporal resolutions simultaneously. In the second part, 2D convolutional layers are applied in the spatial domain to reduce the resolution, while deconvolutional layers are employed in the temporal domain to restore the resolution. As a result, the output of CDC has the same length as the input in the temporal domain, so that the proposed framework is able to model the spatio-temporal information and predict an action label for each frame.

Zhao et al. [103] introduced a new and effective proposal generation method, called temporal actionness grouping, to address the temporal action localization problem. The input videos are first divided into snippets, and an actionness classifier is used to predict binary actionness scores for each snippet. Consecutive snippets with high actionness scores are grouped to form proposals, which are used to build a feature pyramid to evaluate their action completeness. Their method improves the performance on the temporal action localization task compared with other state-of-the-art methods.

In the first type of methods, the generated proposals often have relatively accurate temporal boundaries, as they are determined based on frame-wise or snippet-wise labels at a finer level with a larger aperture of temporal details. However, these methods can easily predict incorrectly low actionness scores for frames within the action instances, which results in low recall rates.

Another set of works [9, 15, 85, 5] applied anchor-based object detection frameworks, which first generate a set of temporal segment anchors; detection heads consisting of boundary regressors and action classifiers are then used to refine the locations and predict action classes for the anchors.

Lin et al. [36] introduced a single shot detection network to predict the temporal locations of action instances in videos. The idea comes from the Single Shot Detection [44] framework used for the object detection task. They first used convolutional neural networks to extract appearance and motion features, which were then used to build a base layer and three anchor layers to generate prediction anchor instances. Offset regression and action class prediction were then performed based on the prediction anchor instances. Their method achieves results comparable to those of the state-of-the-art methods, but the temporal boundaries of the detected action segments are less precise.

Gao et al. [15] proposed a temporal action localization network named TURN, which decomposes long videos into short clip units and builds an anchor pyramid to predict offsets and action class labels for these clip units in order to generate temporal action proposals. By doing so, the time cost of the proposal generation process is significantly reduced, and TURN can achieve a run time of 880 frames per second on a TITAN X GPU.

Inspired by the Faster RCNN [58] object detection framework, Chao et al. [5] proposed an improved approach for temporal action localization named TAL-Net. They applied the two-stage Faster RCNN framework in the temporal domain to detect the starting and ending points of action instances. In addition, in order to address the receptive field alignment issue in the temporal localization problem, they used a multi-scale architecture so that extreme variations in action duration can be accommodated. Moreover, they also considered the temporal context of action instances for better proposal generation and action class prediction.

In the second category, as the proposals are generated from pre-defined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, these methods usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between boundary accuracy and recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way.


2.4 Spatio-temporal Visual Grounding

Spatio-temporal visual grounding is a newly proposed vision-and-language task. Its goal is to localize the target objects in the given videos by using textual queries that describe the target objects.

Zhang et al. [102] proposed STGRN to address the spatio-temporal visual grounding task. They first used a pre-trained object detector to detect object proposals in each frame of the input videos. These object proposals were then used to extract region features to build a spatio-temporal graph for relationship modelling. The spatio-temporal graph consists of implicit and explicit spatial subgraphs to reason about the spatial relationships within each frame, and a dynamic temporal subgraph to exploit the temporal relationships across consecutive frames. The proposed graph was used to produce cross-modality spatio-temporal representations, from which spatio-temporal tubes were generated by a spatial localizer and a temporal localizer.

In [101], Zhang et al. proposed a spatial-temporal video grounding framework to model the relationships between the target objects and the related objects while suppressing the relations of unnecessary objects. Specifically, a multi-branch architecture was used to handle different types of relations, with each branch focusing on its corresponding objects.

Similar to the work in [102], Tang et al. [76] also used a pre-trained object detector to generate human proposals. Instead of using these proposals to model the relationships within each frame, they used a linking algorithm to link proposals across frames to generate human tubes. Each human tube was then used to extract features, which were fed into a visual-linguistic transformer to learn cross-modality representations and model temporal information. The cross-modality representations were then used to select target human tubes from a set of human tube proposals. Additionally, a temporal trimming process was used to refine the temporal boundaries of the selected human tubes.

One of the drawbacks of the aforementioned spatio-temporal visual grounding methods is that they rely on pre-trained object detectors to generate a set of object proposals, which requires additional training data. Also, when the categories of the target objects do not appear in the training data of the pre-trained object detectors, it is hard for them to generate proposals for these categories, which degrades the spatio-temporal visual grounding performance.

2.5 Summary

In this chapter, we have summarized the related works on the background vision tasks and the three vision problems studied in this thesis. Specifically, we first briefly introduced the existing methods for the image classification problem. Then, we reviewed the one-stage, two-stage and anchor-free convolutional neural network based object detection frameworks. We then summarized the two-stream frameworks and 3D convolutional neural networks for spatio-temporal representation learning. Besides, we also reviewed the existing spatio-temporal action localization methods. Additionally, we reviewed the related works on different types of temporal action localization methods, including frame-based methods and anchor-based methods; the recent works on combining these two types of methods to leverage their complementary information were also discussed. Finally, we gave a review of the new vision-and-language task, spatio-temporal visual grounding.

In this thesis, we develop new deep learning approaches for spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding. Although there are many works related to these vision problems in the literature, as summarized above, in this thesis we propose new frameworks to address these tasks by using deep neural networks and correspondingly design effective algorithms.

Specifically, in Chapter 3, we propose a new algorithm, called progressive cross-stream cooperation, for spatio-temporal action localization, in which we build a two-stream, two-stage object detection framework and exchange complementary information between the two streams at the proposal and feature levels. The newly proposed strategy allows the information from one stream to help better train another stream and progressively refine the spatio-temporal action localization results.

In Chapter 4, based on our observations of the existing anchor-based and frame-based temporal action localization methods, we propose a novel anchor-frame cooperation module, which can exploit the complementarity between the anchor-based and frame-based temporal action localization methods. In addition, we also leverage the complementary information between appearance and motion clues to enhance the anchor-based features, which can further benefit the information fusion between the two granularities.

In Chapter 5, we build upon the recently proposed visual-linguistic transformer to learn cross-modality representations for the spatio-temporal visual grounding task. Additionally, by introducing a simple but effective spatio-temporal combination and decomposition module, our proposed framework removes the dependence on pre-trained object detectors. We use an anchor-free object detection framework to directly generate object tubes from the frames of the input videos.


Chapter 3

Progressive Cross-stream Cooperation in Spatial and Temporal Domain for Action Localization

Spatio-temporal action localization consists of three levels of tasks: spatial localization, action classification and temporal localization. In this chapter, we propose a new Progressive Cross-stream Cooperation (PCSC) framework that improves all three tasks above. The basic idea is to utilize both spatial region (resp. temporal segment) proposals and features from one stream (i.e., the Flow/RGB stream) to help another stream (i.e., the RGB/Flow stream) to iteratively generate better bounding boxes in the spatial domain (resp. temporal segments in the temporal domain). In this way, not only can the actions be more accurately localized both spatially and temporally, but the action classes can also be predicted more precisely. Specifically, we first combine the latest region proposals (for spatial detection) or segment proposals (for temporal localization) from both streams to form a larger set of labelled training samples to help learn better action detection or segment detection models. Second, to learn better representations, we also propose a new message passing approach to pass information from one stream to another stream, which also leads to better action detection and segment detection models. By first using our newly proposed PCSC framework for spatial localization at the frame level and then applying our temporal PCSC framework for temporal localization at the tube level, the action localization results are progressively improved at both the frame level and the video level. Comprehensive experiments on two benchmark datasets, UCF-101-24 and J-HMDB, demonstrate the effectiveness of our newly proposed approaches for spatio-temporal action localization in realistic scenarios.

3.1 Background

3.1.1 Spatial-temporal localization methods

Spatio-temporal action localization involves three types of tasks: spatial localization, action classification and temporal localization. A huge amount of effort has been dedicated to improving these three tasks from different perspectives. First, for spatial localization, state-of-the-art human detection methods are utilized to obtain precise object proposals, which includes the use of Fast and Faster R-CNN in [8, 60] as well as the Single Shot Multibox Detector (SSD) in [67, 26].

Second, discriminant features are also employed for both spatial localization and action classification. For example, some methods [26, 8] stack neighbouring frames near one key frame to extract more discriminant features in order to remove the ambiguity of actions in each single frame and to better represent this key frame. Other methods [66, 55, 8] utilize recurrent neural networks to link individual frames, or use 3D CNNs [4, 77] to exploit temporal information. Several works [31, 93] detect actions in a single frame and predict action locations in the neighbouring frames to exploit the spatio-temporal consistency. Additionally, another work [105] uses a Markov chain model to integrate the appearance, motion and pose clues for action classification and spatio-temporal action localization.

Third, temporal localization aims to form action tubes from per-frame detection results. The methods for this task are largely based on the association of per-frame detection results, such as the overlap, continuity and smoothness of objects, as well as the action class scores. To improve localization accuracy, a variety of temporal refinement methods have been proposed, e.g., traditional temporal sliding windows [54], dynamic programming [67, 60], tubelet linking [26] and thresholding-based refinement [8], etc.

Finally, several methods have also been proposed to improve action detection efficiency. For example, without requiring a time-consuming multi-stage training process, the works in [60, 54] proposed to train a single CNN model by simultaneously performing action classification and bounding box regression. More recently, an online real-time spatio-temporal localization method was also proposed in [67].

3.1.2 Two-stream R-CNN

Based on the observation that appearance and motion clues are often complementary to each other, several state-of-the-art action detection methods [60, 26, 54, 8] follow the standard two-stream R-CNN approach. The features extracted from the two streams are fused to improve the action detection performance. For example, in [60], the softmax score of each motion bounding box is used to help the appearance bounding boxes with the largest overlap. In [8], three types of fusion strategies are discussed: i) simply averaging the softmax outputs of the two streams; ii) learning per-class weights to weigh the original pre-softmax outputs and applying softmax on the weighted sum; and iii) training a fully connected layer on top of the concatenated outputs from the two streams. It is reported in [8] that the third fusion strategy achieves the best performance. In [94], the final detection results are generated by integrating the results from an early-fusion method, which concatenates the appearance and motion clues as the input to generate detection results, and a late-fusion method, which combines the outputs from the two streams.
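
Strategies i) and iii) can be written out in a few lines of PyTorch, as sketched below; the class count and tensor names are placeholders, and strategy iii)'s fusion layer would of course be trained rather than randomly initialised.

    import torch
    import torch.nn as nn

    num_classes = 24
    rgb_logits = torch.randn(10, num_classes)    # per-box outputs of the RGB stream
    flow_logits = torch.randn(10, num_classes)   # per-box outputs of the flow stream

    # i) simply average the softmax outputs of the two streams
    fused_avg = 0.5 * (rgb_logits.softmax(dim=1) + flow_logits.softmax(dim=1))

    # iii) learn a fully connected layer on top of the concatenated stream outputs
    fusion_fc = nn.Linear(2 * num_classes, num_classes)
    fused_fc = fusion_fc(torch.cat([rgb_logits, flow_logits], dim=1)).softmax(dim=1)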

Please note that most appearance and motion fusion approaches in the existing works, as discussed above, are based on the late fusion strategy. They are only trained (if there is any training process) on top of the detection networks of the two streams. In contrast, in this work, we iteratively use both types of features and bounding boxes from one stream to progressively help learn better action detection models for another stream, which is intrinsically different from the existing approaches [67, 26] that fuse two-stream information only at the feature level in a late fusion fashion.

3.2 Action Detection Model in Spatial Domain

Building upon the two-stream framework [54], we propose a Progressive Cross-stream Cooperation (PCSC) method for action detection at the frame level. In this method, the RGB (appearance) stream and the flow (motion) stream iteratively help each other at both the feature level and the region proposal level in order to achieve better localization results.

3.2.1 PCSC Overview

The overview of our PCSC method (i.e., the frame-level detection method) is provided in Fig. 3.1. As shown in Fig. 3.1 (a), our PCSC is composed of a set of "stages". Each stage refers to one round of the cross-stream cooperation process, in which the features and region proposals from one stream help improve the action localization performance of another stream. Specifically, each stage comprises two cooperation modules and a detection head module. Our detection head module is a standard one, which consists of several layers for region classification and regression.

The two cooperation modules include region-proposal-level cooperation and feature-level cooperation, which are introduced in detail in Section 3.2.2 and Section 3.2.3, respectively. For region-proposal-level cooperation, the detection results from one stream (e.g., the RGB stream) are used as additional region proposals, which are combined with the region proposals from the other stream (e.g., the flow stream) to refine the region proposals and improve the action localization results. Based on the refined region proposals, we also perform feature-level cooperation


Figure 3.1: (a) Overview of our Cross-stream Cooperation framework (four stages are used as an example for better illustration). The region proposals and features from the flow stream help improve action detection results for the RGB stream at Stage 1 and Stage 3, while the region proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2 and Stage 4. Each stage comprises two cooperation modules and the detection head. (b)(c) The details of our feature-level cooperation modules through message passing from one stream to another stream. The flow features are used to help the RGB features in (b), while the RGB features are used to help the flow features in (c).


by first extracting RGB/flow features from these ROIs and then refining these RGB/flow features via a message-passing module shown in Fig. 3.1 (b) and Fig. 3.1 (c), which will be introduced in Section 3.2.3. The refined ROI features in turn lead to better region classification and regression results in the detection head module, which benefits the subsequent action localization process in the next stage. By performing the aforementioned processes for multiple rounds, we can progressively improve the action detection results. The whole network is trained in an end-to-end fashion by minimizing the overall loss, which is the summation of the losses from all stages.

After performing frame-level action detection, our approach links the per-frame detection results to form action tubes, in which the temporal boundary is further refined by using our proposed temporal boundary refinement module. The details are provided in Section 3.3.

3.2.2 Cross-stream Region Proposal Cooperation

We employ the two-stream Faster R-CNN method [58] for frame-level action localization. Each stream has its own Region Proposal Network (RPN) [58] to generate the candidate action regions, and these candidates are then used as the training samples to train a bounding box regression network for action localization. Based on our observation, the region proposals generated by each stream can only partially cover the true action regions, which degrades the detection performance. Therefore, in our method we use the region proposals from one stream to help another stream.

In this chapter, the bounding boxes from the RPN are called region proposals, while the bounding boxes from the detection head are called detection boxes.

The region proposals from the RPNs of the two streams are first used to train their own detection heads separately in order to obtain their own corresponding detection boxes. The set of region proposals for each stream is then refined by combining two subsets. The first subset is from the detection boxes of its own stream (e.g., RGB) at the previous stage.


The second subset is from the detection boxes of another stream (e.g., flow) at the current stage. To remove more redundant boxes, a lower NMS threshold is used when the detection boxes from another stream (e.g., flow) are used for the current stream (e.g., RGB).

Mathematically, let $P^{(i)}_t$ and $B^{(i)}_t$ denote the set of region proposals and the set of detection boxes, respectively, where $i$ indicates the $i$-th stream and $t$ denotes the $t$-th stage. The region proposal set $P^{(i)}_t$ is updated as $P^{(i)}_t = B^{(i)}_{t-2} \cup B^{(j)}_{t-1}$, and the detection box set $B^{(j)}_t$ is updated as $B^{(j)}_t = G(P^{(j)}_t)$, where $G(\cdot)$ denotes the mapping function of the detection head module. Initially, when $t = 0$, $P^{(i)}_0$ is the set of region proposals from the RPN for the $i$-th stream. This process is repeated between the two streams for several stages, which progressively improves the detection performance of both streams.
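To make the update above concrete, the following sketch (assumed interfaces, not the authors' implementation) iterates the region-proposal cooperation for four stages; `detect` and `nms` are hypothetical stand-ins for a per-stream detection head and non-maximum suppression, and the lower NMS threshold applied to cross-stream boxes is omitted for brevity.

```python
# Minimal sketch of the cross-stream region-proposal cooperation
# P_t = B_{t-2}(own stream) U B_{t-1}(other stream).
def pcsc_proposal_cooperation(rpn_proposals, detect, nms, num_stages=4):
    # rpn_proposals: {"RGB": [...], "Flow": [...]}, the stage-0 proposals from each RPN.
    B = {s: {0: detect(s, rpn_proposals[s])} for s in ("RGB", "Flow")}

    # Odd stages refine the RGB stream, even stages refine the flow stream.
    for t in range(1, num_stages + 1):
        cur, other = ("RGB", "Flow") if t % 2 == 1 else ("Flow", "RGB")
        own_boxes = B[cur][max(t - 2, 0)]          # B_{t-2} of the current stream
        cross_boxes = B[other][max(B[other])]      # latest boxes of the other stream
        proposals = nms(own_boxes + cross_boxes)   # refined proposal set P_t
        B[cur][t] = detect(cur, proposals)         # detection boxes B_t of this stage

    # Final frame-level output: union of the boxes from all stages, followed by NMS.
    all_boxes = [box for stream in B.values() for boxes in stream.values() for box in boxes]
    return nms(all_boxes)
```

This mirrors the stage ordering in Fig. 3.1, where each stream alternately consumes the most recent detection boxes produced by the other stream.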

With our approach, the diversity of region proposals from one stream is enhanced by using the complementary boxes from another stream, which helps reduce missed detections. Moreover, only the bounding boxes with high confidence are utilized in our approach, which increases the chance of adding more precisely detected bounding boxes and thus further helps improve the detection performance. These aforementioned strategies, together with the cross-modality feature cooperation strategy that will be described in Section 3.2.3, effectively improve the frame-level action detection results.

Our cross-stream cooperation strategy at the region proposal level shares similar high-level ideas with the two-view learning method co-training [2], as both methods make use of predictions from one stream/view to generate more training samples (i.e., the additional region proposals in our approach) to improve the prediction results for another stream/view. However, our approach is intrinsically different from co-training in the following two aspects. As a semi-supervised learning method, the co-training approach [2] selects unlabelled testing samples and assigns pseudo-labels to those selected samples to enlarge the training set. In contrast, the additional region proposals in our work still come from the training videos instead of the testing videos, so we know the labels of the new training samples by simply comparing these additional region proposals with the ground-truth bounding boxes. Also, in co-training [2], the learnt classifiers will be directly used for predicting the labels of testing data in the testing stage, so the complementary information from the two views for the testing data is not exploited. In contrast, in our work the same pipeline used in the training stage will also be adopted for the testing samples in the testing stage, and thus we can further exploit the complementary information from the two streams for the testing data.

3.2.3 Cross-stream Feature Cooperation

To extract spatio-temporal features for action detection, similar to [19], we use the I3D network as the backbone network for each stream. Moreover, we follow the Feature Pyramid Network (FPN) [38] to build a feature pyramid with high-level semantics, which has been demonstrated to be useful for improving bounding box proposals and object detection. This involves a bottom-up pathway, a top-down pathway and lateral connections. The bottom-up pathway uses the feed-forward computation along the backbone I3D network, which produces a feature hierarchy with increasing semantic levels but decreasing spatial resolutions. These features are then upsampled by using the top-down pathway and merged with the corresponding features in the bottom-up pathway through lateral connections.

Following [38], we use the feature maps at the layers Conv2c, Mixed3d, Mixed4f and Mixed5c in I3D to construct the feature hierarchy in the bottom-up pathway, and denote these feature maps as $\{C^i_2, C^i_3, C^i_4, C^i_5\}$, where $i \in \{\text{RGB}, \text{Flow}\}$ indicates the RGB and the flow streams, respectively. Accordingly, the corresponding feature maps in the top-down pathway are denoted as $\{U^i_2, U^i_3, U^i_4, U^i_5\}$.
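As a reference for how such a pyramid can be assembled, the sketch below builds the top-down pathway with lateral connections on four bottom-up feature maps; the channel sizes and the use of 2D convolutions (i.e., features whose temporal dimension has already been collapsed) are assumptions, not the exact I3D configuration used in this thesis.

```python
import torch.nn as nn
import torch.nn.functional as F

class TopDownPyramid(nn.Module):
    """Minimal FPN-style top-down pathway with lateral connections (a sketch only)."""
    def __init__(self, in_channels=(192, 480, 832, 1024), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, c2, c3, c4, c5):
        # C2..C5 are the bottom-up feature maps (e.g. from Conv2c .. Mixed5c).
        laterals = [lat(c) for lat, c in zip(self.lateral, (c2, c3, c4, c5))]
        u = [laterals[3]]                       # U5
        for lvl in (2, 1, 0):                   # build U4, U3, U2 top-down
            up = F.interpolate(u[-1], size=laterals[lvl].shape[-2:], mode="nearest")
            u.append(laterals[lvl] + up)        # lateral connection + upsampled feature
        u2, u3, u4, u5 = u[3], u[2], u[1], u[0]
        return [s(x) for s, x in zip(self.smooth, (u2, u3, u4, u5))]
```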

Most two-stream action detection frameworks [67, 54, 26] only exploit the complementary RGB and flow information by fusing softmax scores or concatenating the features before the final classifiers, which is insufficient for the features from the two streams to exchange information with each other and benefit from such information exchange. Based on this observation, we develop a message-passing module to bridge these two streams so that they help each other for feature refinement.

We pass the messages between the feature maps in the bottom-up pathway of the two streams. Denote $l$ as the index for the set of feature maps in $\{C^i_2, C^i_3, C^i_4, C^i_5\}$, $l \in \{2, 3, 4, 5\}$. Let us use the improvement of the RGB features as an example (the same method is also applicable to the improvement of the flow features). Our message-passing module improves the features $C^{\text{RGB}}_l$ with the help of the features $C^{\text{Flow}}_l$ as follows:
$$\hat{C}^{\text{RGB}}_l = f_\theta(C^{\text{Flow}}_l) \oplus C^{\text{RGB}}_l, \qquad (3.1)$$
where $\hat{C}^{\text{RGB}}_l$ denotes the improved RGB features, $\oplus$ denotes the element-wise addition of the feature maps, and $f_\theta(\cdot)$ is the mapping function (parameterized by $\theta$) of our message-passing module. The function $f_\theta(C^{\text{Flow}}_l)$ nonlinearly extracts the message from the feature $C^{\text{Flow}}_l$ and then uses the extracted message to improve the features $C^{\text{RGB}}_l$.

The output of $f_\theta(\cdot)$ has to produce feature maps with the same number of channels and resolution as $C^{\text{RGB}}_l$ and $C^{\text{Flow}}_l$. To this end, we design our message-passing module by stacking two $1\times 1$ convolutional layers with ReLU as the activation function. The first $1\times 1$ convolutional layer reduces the channel dimension, and the second convolutional layer restores the channel dimension back to its original number. This design reduces the number of parameters to be learnt in the module and exchanges messages by using a two-layer non-linear transform. Once the feature maps $C^{\text{RGB}}_l$ in the bottom-up pathway are refined, the corresponding feature maps $U^{\text{RGB}}_l$ in the top-down pathway are generated accordingly.
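A minimal PyTorch sketch of this bottleneck message-passing module is given below; the channel reduction ratio and the placement of the second ReLU are assumptions.

```python
import torch.nn as nn

class MessagePassing(nn.Module):
    """Sketch of f_theta in Eqn (3.1): two stacked 1x1 convolutions with ReLU that
    first reduce and then restore the channel dimension; the extracted message is
    added element-wise to the target-stream features."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, c_helper, c_target):
        # e.g. c_helper = C_l^Flow and c_target = C_l^RGB at Stage 1 and Stage 3.
        return c_target + self.f(c_helper)
```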

The above process is for image-level message passing only. The image-level message passing is performed only once, from the flow stream to the RGB stream. This message passing provides good features to initialize the message-passing stages.


Denote the image-level feature map sets for the RGB and flow streams by $I^{\text{RGB}}$ and $I^{\text{Flow}}$, respectively. They are used to extract the features $F^{\text{RGB}}_t$ and $F^{\text{Flow}}_t$ by ROI pooling at each stage $t$, as shown in Fig. 3.1. At Stage 1 and Stage 3, the ROI feature $F^{\text{Flow}}_t$ of the flow stream is used to help improve the ROI feature $F^{\text{RGB}}_t$ of the RGB stream, as illustrated in Fig. 3.1 (b). Specifically, the improved RGB feature $\hat{F}^{\text{RGB}}_t$ is obtained by applying the same method as in Eqn. (3.1). Similarly, at Stage 2 and Stage 4, the ROI feature $F^{\text{RGB}}_t$ of the RGB stream is also used to help improve the ROI feature $F^{\text{Flow}}_t$ of the flow stream, as illustrated in Fig. 3.1 (c). The message passing process between ROI features aims to provide better features for action box detection and regression, which benefits the next cross-stream cooperation stage.

3.2.4 Training Details

For better spatial localization at the frame level, we follow [26] to stack neighbouring frames to exploit temporal context and improve action detection performance for key frames. A key frame is a frame containing the ground-truth actions. Each training sample, which is used to train the RPN in our PCSC method, is composed of $k$ neighbouring frames with the key frame in the middle. The region proposals generated from the RPN are assigned positive labels when they have an intersection-over-union (IoU) overlap higher than 0.5 with any ground-truth bounding box, or negative labels if their IoU overlap is lower than 0.5 with all ground-truth boxes. This label assignment strategy also applies to the additional bounding boxes from the assistant stream during the region-proposal-level cooperation process.
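The label assignment rule can be written compactly as below (a NumPy sketch, assuming axis-aligned boxes in (x1, y1, x2, y2) format).

```python
import numpy as np

def assign_proposal_labels(proposals, gt_boxes, iou_thresh=0.5):
    """Label a proposal as positive (1) if its best IoU with any ground-truth box
    exceeds the threshold, and negative (0) otherwise."""
    proposals = np.asarray(proposals, dtype=np.float64)
    gt_boxes = np.asarray(gt_boxes, dtype=np.float64)
    if gt_boxes.size == 0:
        return np.zeros(len(proposals), dtype=np.int64)
    x1 = np.maximum(proposals[:, None, 0], gt_boxes[None, :, 0])
    y1 = np.maximum(proposals[:, None, 1], gt_boxes[None, :, 1])
    x2 = np.minimum(proposals[:, None, 2], gt_boxes[None, :, 2])
    y2 = np.minimum(proposals[:, None, 3], gt_boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_p = (proposals[:, 2] - proposals[:, 0]) * (proposals[:, 3] - proposals[:, 1])
    area_g = (gt_boxes[:, 2] - gt_boxes[:, 0]) * (gt_boxes[:, 3] - gt_boxes[:, 1])
    iou = inter / (area_p[:, None] + area_g[None, :] - inter)
    return (iou.max(axis=1) > iou_thresh).astype(np.int64)
```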

3.3 Temporal Boundary Refinement for Action Tubes

After the frame-level detection results are generated, we then build the action tubes by linking them. Here we use the same linking strategy as in [67], except that we do not apply temporal labeling. Although this


Figure 3.2: Overview of two modules in our temporal boundary refinement framework: (a) the Initial Segment Generation module and (b) the Temporal Progressive Cross-stream Cooperation (T-PCSC) module. The Initial Segment Generation module combines the outputs from the Actionness Detector and the Action Segment Detector to generate the initial segments for both the RGB and flow streams. Our T-PCSC framework takes these initial segments as the segment proposals and iteratively improves the temporal action detection results by leveraging complementary information of the two streams at both the feature and segment proposal levels (two stages are used as an example for better illustration). The segment proposals and features from the flow stream help improve action segment detection results for the RGB stream at Stage 1, while the segment proposals and features from the RGB stream help improve action detection results for the flow stream at Stage 2. Each stage comprises the cooperation module and the detection head. The segment-proposal-level cooperation module refines the segment proposals $P^i_t$ by combining the most recent segment proposals from the two streams, while the segment-feature-level cooperation module improves the features $F^i_t$, where the superscript $i \in \{\text{RGB}, \text{Flow}\}$ denotes the RGB/Flow stream and the subscript $t$ denotes the stage number. The detection head is used to estimate the action segment location and the actionness probability.


linking strategy is robust to missing detections, it is still difficult to accurately determine the start and the end of each action tube, which is a key factor that degrades the video-level performance of our action detection framework.

To address this problem, the temporal boundaries of the action tubes need to be refined. Our temporal boundary refinement method uses the features extracted from action tubes to refine the temporal boundaries of the action tubes. It is built upon the two-stream framework and consists of four modules: the Pre-processing module, the Initial Segment Generation module, the Temporal Progressive Cross-stream Cooperation (T-PCSC) module, and the Post-processing module. An overview of the two major modules, i.e., the Initial Segment Generation module and the T-PCSC module, is provided in Fig. 3.2. Specifically, in the Pre-processing module, we use each bounding box in the action tubes to extract the $7\times 7$ feature maps for the detected region on each frame by ROI-Pooling, and then temporally stack these feature maps to construct the features for the action tubes. These features are then used as the input to the Initial Segment Generation module. As shown in Fig. 3.2, we develop two segment generation methods in the Initial Segment Generation module: (1) a class-specific actionness detector to evaluate actionness (i.e., the happening of an action) for each frame, and (2) an action segment detector based on the two-stream Faster-RCNN framework to detect action segments. The outputs of the actionness detector and the action segment detector are combined to generate the initial segments for both the RGB and flow streams. Our T-PCSC module extends our PCSC framework from the spatial domain to the temporal domain for temporal localization, and it uses features from the action tubes and the initial segments generated by the Initial Segment Generation module to encourage the two streams to help each other at both the feature level and the segment proposal level to improve action segment detection. After multiple stages of cooperation, the output action segments from each stage are combined in the Post-processing module, in which NMS is also applied to remove redundancy and generate the final action segments.


3.3.1 Actionness Detector (AD)

We first develop a class-specific actionness detector to detect the actionness at each given location (spatially and temporally). Taking advantage of the predicted action class labels produced by frame-level detection, we construct our actionness detector by using N binary classifiers, where N is the number of action classes. Each classifier addresses the actionness of a specific class (i.e., whether an action of that class happens or not at a given location in a frame). This strategy is more robust than learning a general actionness classifier for all action classes. Specifically, after frame-level detection, each bounding box has a class label, based on which the bounding boxes from the same class are traced and linked to form action tubes [67]. To train the binary actionness classifier for each action class i, the bounding boxes that are within the action tubes and predicted as class i by the frame-level detector are used as the training samples. Each bounding box is labeled as 1 when its overlap with the ground-truth box is greater than 0.5, or as 0 otherwise. Note that the training samples may contain bounding boxes falsely detected near the temporal boundary of the action tubes. Therefore, these samples are the "confusing samples" and are useful for the actionness classifier to learn the subtle but critical features which better determine the beginning and the end of this action. The output of the actionness classifier is the probability of actionness belonging to class i.

At the testing stage, given an action tube formed by using [67], we apply the class-specific actionness detector to every frame-level bounding box in this action tube to predict its actionness probability (called the actionness score). Then a median filter over multiple frames is employed to smooth the actionness scores of all bounding boxes in this tube. If a bounding box has a smoothed score lower than a predefined threshold, it will be filtered out from this action tube, and then we can refine the action tubes so that they have more accurate temporal boundaries. Note that when a non-action region near a temporal boundary is falsely detected, it is included in the training set to train our class-specific actionness detector. Therefore, our approach takes advantage of the confusing samples across the temporal boundary to obtain better action tubes at the testing stage.
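The smoothing-and-filtering step can be sketched as follows; the median-filter window of 17 and the threshold of 0.01 are the values given later in Section 3.4.1.

```python
import numpy as np
from scipy.signal import medfilt

def filter_tube_by_actionness(actionness_scores, window=17, threshold=0.01):
    """Sketch of the actionness-based refinement for one action tube: smooth the
    per-frame actionness scores with a median filter and keep only the frames
    whose smoothed score exceeds the threshold."""
    scores = np.asarray(actionness_scores, dtype=np.float64)
    smoothed = medfilt(scores, kernel_size=window)   # window must be odd
    keep = smoothed > threshold                      # boolean mask over the tube's frames
    return keep
```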

3.3.2 Action Segment Detector (ASD)

Motivated by the two-stream temporal Faster-RCNN framework [5], we also build an action segment detector, which directly utilizes segment proposals to detect action segments. Our action segment detector takes both RGB and optical flow features from the action tubes as the input, and follows the framework in [5] to include a Segment Proposal Network (SPN) for temporal segment proposal generation and detection heads for segment proposal regression. Specifically, we extract the segment features by using these two-stream features from each action tube and the segment proposals produced by the SPN. The segment features are then sent to the corresponding detection head for regression to generate the initial detection of segments. This process is similar to the action detection method described in Section 3.2. A comparison between the single-stream action detection method and the single-stream action segment detector is shown in Figure 3.3, in which the modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

(a) SPN. In Faster-RCNN, anchors of different object sizes are pre-defined. Similarly, in our SPN, K types of action anchors with different durations (time lengths) are pre-defined. The boundaries of the action anchors are regressed by two layers of temporal convolution (called ConvNet in [5]) using RGB or flow features from the action tube [5]. Note that the object sizes in an image only vary in a relatively small range, but the temporal span of an action can dramatically change in a range from less than one second to more than one minute. Consequently, it is inappropriate for anchors with different scales to share the temporal features from the same receptive field. To address this issue, we use the multi-branch architecture and dilated temporal convolutions in [5] for each anchor prediction based on the anchor-specific features. Specifically, for each scale of anchors, a sub-network with the same architecture

Figure 3.3: Comparison of the single-stream frameworks of the action detection method (see Section 3.2) and our action segment detector (see Section 3.3.2). The modules within each blue dashed box share similar functions in the spatial domain and the temporal domain, respectively.

but a different dilation rate is employed to generate features with different receptive fields and decide the temporal boundary. In this way, our SPN consists of K sub-networks targeting different time lengths. In order to consider the context information for each anchor with scale $s$, we require the receptive field to cover the context information with temporal span $s/2$ before and after the start and the end of the anchor. We then regress the temporal boundaries and predict the actionness probabilities for the anchors by applying two parallel temporal convolutional layers with a kernel size of 1. The output segment proposals are ranked according to their proposal actionness probabilities, and NMS is applied to remove the redundant proposals. Please refer to [5] for more details.
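The sketch below shows one such anchor-specific branch: two stacked dilated temporal convolutions whose dilation is chosen so that the receptive field spans roughly twice the anchor scale (the anchor plus about s/2 of context on each side), followed by two parallel kernel-size-1 temporal convolutions for actionness and boundary regression. The hidden width and the exact dilation formula are assumptions, not the configuration used in [5].

```python
import torch
import torch.nn as nn

class SPNBranch(nn.Module):
    """Sketch of one anchor-specific SPN sub-network."""
    def __init__(self, in_channels, anchor_scale, hidden=512, kernel_size=3):
        super().__init__()
        # Two stacked kernel-3 convolutions with dilation d have a receptive field
        # of 1 + 4d frames; d ~= anchor_scale / 2 covers about 2 * anchor_scale frames.
        dilation = max(1, anchor_scale // 2)
        pad = dilation * (kernel_size - 1) // 2
        self.features = nn.Sequential(
            nn.Conv1d(in_channels, hidden, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
            nn.Conv1d(hidden, hidden, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(inplace=True),
        )
        self.actionness = nn.Conv1d(hidden, 1, kernel_size=1)   # anchor actionness score
        self.regression = nn.Conv1d(hidden, 2, kernel_size=1)   # start / end offsets

    def forward(self, tube_features):
        # tube_features: (batch, channels, temporal_length) features from an action tube.
        h = self.features(tube_features)
        return torch.sigmoid(self.actionness(h)), self.regression(h)
```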

(b) Segment feature extraction. After generating segment proposals, we construct segment features to refine the temporal boundaries and predict the segment actionness probabilities. For each segment proposal with length $l_s$, start point $t_{start}$ and end point $t_{end}$, we extract three types of features: $F_{center}$, $F_{start}$ and $F_{end}$. We use temporal ROI-Pooling on top of the segment proposal regions to pool action tube features and obtain the features $F_{center} \in \mathbb{R}^{D\times l_t}$, where $l_t$ is the temporal length and $D$ is the dimension of the features at each temporal location. As the context information immediately before and after the segment proposal is helpful to decide the temporal boundaries of an action, this information should be represented by our segment features. Therefore, we additionally use temporal ROI-Pooling to pool the features $F_{start}$ (resp. $F_{end}$) from the segments centered at $t_{start}$ (resp. $t_{end}$) with a length of $2l_s/5$ to represent the starting (resp. ending) information. We then concatenate $F_{start}$, $F_{center}$ and $F_{end}$ to construct the segment features $F_{seg}$.
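A simplified version of this segment-feature construction, using adaptive average pooling as a stand-in for temporal ROI-Pooling, could look as follows; the pooled length `out_len` is an assumption.

```python
import torch
import torch.nn.functional as F

def temporal_roi_pool(tube_feat, t0, t1, out_len):
    """Pool tube features of shape (C, T) over the interval [t0, t1] to a fixed length."""
    t0, t1 = max(int(t0), 0), min(int(t1), tube_feat.shape[1])
    t1 = max(t1, t0 + 1)
    return F.adaptive_avg_pool1d(tube_feat[None, :, t0:t1], out_len)[0]

def build_segment_feature(tube_feat, t_start, t_end, out_len=8):
    """Sketch of the segment feature F_seg = [F_start; F_center; F_end]."""
    ls = t_end - t_start
    ctx = 0.2 * ls                                   # 2*ls/5 window => +- ls/5 around each boundary
    f_center = temporal_roi_pool(tube_feat, t_start, t_end, out_len)
    f_start = temporal_roi_pool(tube_feat, t_start - ctx, t_start + ctx, out_len)
    f_end = temporal_roi_pool(tube_feat, t_end - ctx, t_end + ctx, out_len)
    return torch.cat([f_start, f_center, f_end], dim=1)   # shape (C, 3 * out_len)
```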

(c) Detection head. The extracted segment features are passed to the detection head to obtain the refined segment boundaries and the segment actionness probabilities. As the temporal length of the extracted temporal features is not fixed, similar to the SPN, the detection head module consists of M sub-networks corresponding to the M groups of temporal lengths of the segment proposals produced by the SPN. For each sub-network, a spatial fully-connected layer is first used to reduce the spatial resolution, and then two temporal convolution layers with a kernel size of 3 are applied to non-linearly combine temporal information. The output features are used to predict the actionness probabilities and regress the segment boundaries. For actionness classification, in addition to minimizing the segment-level actionness loss, we also minimize the frame-level actionness loss. We observe that it is beneficial to extract discriminative features by predicting actionness for each frame, which eventually helps refine the boundaries of segments.

3.3.3 Temporal PCSC (T-PCSC)

Our PCSC framework can leverage the complementary information at the feature level and the region proposal level to iteratively improve action detection performance in the spatial domain. Similarly, we propose a Temporal PCSC (T-PCSC) module that extends our PCSC framework from the spatial domain to the temporal domain for temporal localization. Specifically, we combine the outputs generated by the actionness detector and the action segment detector to obtain the initial segments for each stream. These initial segments are then used by our T-PCSC module as the initial segment proposals to generate action segments.

Similar to our PCSC method described in Section 3.2, our T-PCSC method consists of the feature cooperation module and the proposal cooperation module, as shown in Fig. 3.2. Note that for the message-passing block in the feature cooperation module, the two $1\times 1$ convolution layers are replaced by two temporal convolution layers with a kernel size of 1. We only apply T-PCSC for two stages in our action segment detector method due to memory constraints and the observation that we already achieve good results after using two stages. Similar to our PCSC framework shown in Figure 3.1, in the first stage of our T-PCSC framework, the initial segments from the RGB stream are combined with those from the flow stream to form a new segment proposal set by using the proposal cooperation module. This new segment proposal set is used to construct the segment features as described in Section 3.3.2 (b) for both the RGB and the flow streams, and then the feature cooperation module passes the messages from the flow features to the RGB features to refine the RGB features. The refined RGB features are then used to generate the output segments by the detection head described in Section 3.3.2 (c). A similar process takes place in the second stage, but this time the output segments from the flow stream and the first stage are used to form the segment proposal set, and the message passing process is conducted from the RGB features to the flow features to help refine the flow features. We then generate the final action segments by combining the output action segments of each stage and applying NMS to remove redundancy and obtain the final results.
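The post-processing step that merges the segments produced by the two stages relies on temporal (1D) NMS; a sketch is given below.

```python
import numpy as np

def temporal_nms(segments, scores, iou_thresh=0.5):
    """Sketch of 1D (temporal) NMS over (start, end) segments used to merge the
    outputs of the T-PCSC stages; `segments` is an (N, 2) array."""
    segments = np.asarray(segments, dtype=np.float64)
    scores = np.asarray(scores, dtype=np.float64)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        inter = (np.minimum(segments[i, 1], segments[rest, 1]) -
                 np.maximum(segments[i, 0], segments[rest, 0])).clip(min=0)
        union = ((segments[i, 1] - segments[i, 0]) +
                 (segments[rest, 1] - segments[rest, 0]) - inter)
        order = rest[inter / np.maximum(union, 1e-8) <= iou_thresh]
    return keep
```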


3.4 Experimental results

We introduce our experimental setup and datasets in Section 3.4.1, then compare our method with the state-of-the-art methods in Section 3.4.2, and conduct an ablation study in Section 3.4.3.

3.4.1 Experimental Setup

Datasets. We evaluate our PCSC method on two benchmarks: UCF-101-24 [68] and J-HMDB-21 [24]. UCF-101-24 contains 3,207 untrimmed videos from 24 sports classes, which is a subset of the UCF-101 dataset with spatio-temporal annotations provided by [67]. Following the common practice, we use the predefined "split 1" protocol to split the training and test sets and report the results based on this split. J-HMDB-21 contains 928 videos from 21 action classes. All the videos are trimmed to contain the actions only. We perform the experiments on three predefined training-test splits and report the average results on this dataset.

Metrics. We evaluate the action detection performance at both the frame level and the video level by mean Average Precision (mAP). To calculate mAP, a detection box is considered as a correct one when its overlap with a ground-truth box or tube is greater than a threshold δ. The overlap between our detection results and the ground truth is measured by the intersection-over-union (IoU) score at the frame level and the spatio-temporal tube overlap at the video level, respectively. In addition, we also report the results based on the COCO evaluation metric [40], which averages the mAPs over 10 different IoU thresholds from 0.5 to 0.95 with an interval of 0.05.

To evaluate the localization accuracy of our method, we also calculate the Mean Average Best Overlap (MABO). We compute the IoU scores between the bounding boxes from our method and the ground-truth boxes, and then, for each class, average the IoU scores of the bounding boxes with the largest overlap to the ground-truth boxes. The MABO is calculated by averaging the averaged IoU scores over all classes.
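In other words, MABO averages per-class Average Best Overlap (ABO) scores; a sketch of this computation, assuming a generic `iou_fn` between two boxes, is shown below.

```python
import numpy as np

def mean_average_best_overlap(gt_by_class, det_by_class, iou_fn):
    """Sketch of MABO: for every ground-truth box, take the largest IoU against the
    detected boxes of the same class, average these best overlaps per class (ABO),
    then average the ABOs over all classes."""
    abos = []
    for cls, gt_boxes in gt_by_class.items():
        dets = det_by_class.get(cls, [])
        best = [max((iou_fn(g, d) for d in dets), default=0.0) for g in gt_boxes]
        if best:
            abos.append(float(np.mean(best)))
    return float(np.mean(abos)) if abos else 0.0
```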


To evaluate the classification performance of our method, we report the average confidence score over different IoU thresholds θ. The average confidence score is computed by averaging the confidence scores of the bounding boxes from our method that have an IoU with the ground-truth boxes greater than θ.

Implementation Details. We use the I3D features [4] for both streams, and the I3D model is pretrained on Kinetics. The optical flow images are extracted by using FlowNet v2 [23]. The mini-batch sizes used to train the RPN and the detection heads in our action detection method are 256 and 512, respectively. Our PCSC based action detection method is trained for 6 epochs by using three 1080Ti GPUs. The initial learning rate is set as 0.01, which drops 10% at the 5th epoch and another 10% at the 6th epoch. For our actionness detector, we set the window size as 17 when using the median filter to smooth the actionness scores; we also use a predefined threshold to decide the starting and ending locations for action tubes, which is empirically set as 0.01. For segment generation, we apply the anchors with the following K scales (K = 11 in this work): {6, 12, 24, 42, 48, 60, 84, 92, 144, 180, 228}. The length ranges R, which are used to group the segment proposals, are set as follows: [20, 80], [80, 160], [160, 320], [320, ∞), i.e., we have M = 4, and the corresponding feature lengths $l_t$ for the length ranges are set to 15, 30, 60 and 120. The numbers of the scales and the lengths are empirically decided based on the frame numbers. In this work, we apply NMS to remove redundancy and combine the detection results from different stages with an overlap threshold of 0.5. The mini-batch sizes used to train the SPN and the detection heads in our temporal refinement method are set as 128 and 32, respectively. Our PCSC based temporal refinement method is trained for 5 epochs, and the initial learning rate is set as 0.001, which drops 10% at the 4th epoch and another 10% at the 5th epoch.

3.4.2 Comparison with the State-of-the-art Methods

We compare our PCSC method with the state-of-the-art methods. The results of the existing methods are quoted directly from their original papers. In addition, we also evaluate the object detection method (denoted as "Faster R-CNN + FPN") in [38] with the I3D network as its backbone network, in which we report the results obtained by only using the single-stream (i.e., RGB or Flow) features, as well as the result obtained by applying NMS to the union set of the two-stream bounding boxes. Specifically, the only difference between the method in [38] and our PCSC is that our PCSC method has the cross-stream cooperation modules while the work in [38] does not have them. Therefore, the comparison between the two approaches can better demonstrate the benefit of our cross-stream cooperation strategy.

Results on the UCF-101-24 Dataset

The results on the UCF-101-24 dataset are reported in Table 3.1 and Table 3.2.

Table 3.1: Comparison (mAPs at the frame level) of different methods on the UCF-101-24 dataset when using the IoU threshold δ at 0.5.

Method                                    Streams     mAPs
Weinzaepfel, Harchaoui and Schmid [83]    RGB+Flow    35.8
Peng and Schmid [54]                      RGB+Flow    65.7
Ye, Yang and Tian [94]                    RGB+Flow    67.0
Kalogeiton et al. [26]                    RGB+Flow    69.5
Gu et al. [19]                            RGB+Flow    76.3
Faster R-CNN + FPN (RGB) [38]             RGB         73.2
Faster R-CNN + FPN (Flow) [38]            Flow        64.7
Faster R-CNN + FPN [38]                   RGB+Flow    75.5
PCSC (Ours)                               RGB+Flow    79.2

Table 3.1 shows the mAPs of different methods at the frame level on the UCF-101-24 dataset. All mAPs are calculated based on the IoU threshold δ = 0.5. From Table 3.1, we observe that our PCSC method achieves an mAP of 79.2%, outperforming all the existing methods by a large margin. In particular, our PCSC performs better than Faster R-CNN + FPN [38] with an improvement of 3.7%. This improvement is fully due to the proposed cross-stream cooperation framework, which is the only difference between [38] and our PCSC. It is interesting to observe that both methods [26] and [19] additionally utilize the temporal context for


Table 3.2: Comparison (mAPs at the video level) of different methods on the UCF-101-24 dataset when using different IoU thresholds. Here "AD" is for actionness detector and "ASD" is for action segment detector.

IoU threshold δ                      Streams          0.2     0.5     0.75    0.5:0.95
Weinzaepfel et al. [83]              RGB+Flow         46.8    -       -       -
Zolfaghari et al. [105]              RGB+Flow+Pose    47.6    -       -       -
Peng and Schmid [54]                 RGB+Flow         73.5    32.1    2.7     7.3
Saha et al. [60]                     RGB+Flow         66.6    36.4    0.8     14.4
Yang, Gao and Nevatia [93]           RGB+Flow         73.5    37.8    -       -
Singh et al. [67]                    RGB+Flow         73.5    46.3    15.0    20.4
Ye, Yang and Tian [94]               RGB+Flow         76.2    -       -       -
Kalogeiton et al. [26]               RGB+Flow         76.5    49.2    19.7    23.4
Li et al. [31]                       RGB+Flow         77.9    -       -       -
Gu et al. [19]                       RGB+Flow         -       59.9    -       -
Faster R-CNN + FPN (RGB) [38]        RGB              77.5    51.3    14.2    21.9
Faster R-CNN + FPN (Flow) [38]       Flow             72.7    44.2    7.4     17.2
Faster R-CNN + FPN [38]              RGB+Flow         80.1    53.2    15.9    23.7
PCSC + AD [69]                       RGB+Flow         84.3    61.0    23.0    27.8
PCSC + ASD + AD + T-PCSC (Ours)      RGB+Flow         84.7    63.1    23.6    28.9

per frame-level action detection. However, both methods do not fully exploit the complementary information between the appearance and motion clues, and therefore they are worse than our method. Moreover, our method also outperforms [26] and [19] by 9.7% and 2.9%, respectively, in terms of frame-level mAPs.

Table 3.2 reports the video-level mAPs at various IoU thresholds (0.2, 0.5 and 0.75) on the UCF-101-24 dataset. The results based on the COCO evaluation metrics [40] are reported in the last column of Table 3.2. Our method is denoted as "PCSC + ASD + AD + T-PCSC" in Table 3.2, where the proposed temporal boundary refinement method consisting of three modules (i.e., the actionness detector module, the action segment detector module and the Temporal PCSC) is applied to refine the action tubes generated from our PCSC method. Our preliminary work [69], which only uses the actionness detector module as the temporal boundary refinement method, is denoted as "PCSC + AD". Consistent with the observations at the frame level, our method outperforms all the state-of-the-art methods under all evaluation metrics. When using the IoU threshold δ = 0.5, we achieve an mAP of 63.1% on the UCF-101-24 dataset. This result beats [26] and [19], which only achieve mAPs of 49.2% and 59.9%, respectively. Moreover, as the IoU threshold increases, the performance of our method drops less when compared with other state-of-the-art methods. The results demonstrate that our detection method achieves higher localization accuracy than other competitive methods.

Results on the J-HMDB Dataset

For the J-HMDB dataset, the frame-level mAPs and the video-level mAPs of different methods are reported in Table 3.3 and Table 3.4, respectively. Since the videos in the J-HMDB dataset are trimmed to only contain actions, the temporal refinement process is not required, so we do not apply our temporal boundary refinement module when generating action tubes on the J-HMDB dataset.

Table 3.3: Comparison (mAPs at the frame level) of different methods on the J-HMDB dataset when using the IoU threshold δ at 0.5.

Method                            Streams     mAPs
Peng and Schmid [54]              RGB+Flow    58.5
Ye, Yang and Tian [94]            RGB+Flow    63.2
Kalogeiton et al. [26]            RGB+Flow    65.7
Hou, Chen and Shah [21]           RGB         61.3
Gu et al. [19]                    RGB+Flow    73.3
Sun et al. [72]                   RGB+Flow    77.9
Faster R-CNN + FPN (RGB) [38]     RGB         64.7
Faster R-CNN + FPN (Flow) [38]    Flow        68.4
Faster R-CNN + FPN [38]           RGB+Flow    70.2
PCSC (Ours)                       RGB+Flow    74.8

We have similar observations as on the UCF-101-24 dataset. At the video level, our method is again the best performer under all evaluation metrics on the J-HMDB dataset (see Table 3.4). When using the IoU


Table 3.4: Comparison (mAPs at the video level) of different methods on the J-HMDB dataset when using different IoU thresholds.

IoU threshold δ                      Streams          0.2     0.5     0.75    0.5:0.95
Gkioxari and Malik [18]              RGB+Flow         -       53.3    -       -
Wang et al. [79]                     RGB+Flow         -       56.4    -       -
Weinzaepfel et al. [83]              RGB+Flow         63.1    60.7    -       -
Saha et al. [60]                     RGB+Flow         72.6    71.5    43.3    40.0
Peng and Schmid [54]                 RGB+Flow         74.1    73.1    -       -
Singh et al. [67]                    RGB+Flow         73.8    72.0    44.5    41.6
Kalogeiton et al. [26]               RGB+Flow         74.2    73.7    52.1    44.8
Ye, Yang and Tian [94]               RGB+Flow         75.8    73.8    -       -
Zolfaghari et al. [105]              RGB+Flow+Pose    78.2    73.4    -       -
Hou, Chen and Shah [21]              RGB              78.4    76.9    -       -
Gu et al. [19]                       RGB+Flow         -       78.6    -       -
Sun et al. [72]                      RGB+Flow         -       80.1    -       -
Faster R-CNN + FPN (RGB) [38]        RGB              74.5    73.9    54.1    44.2
Faster R-CNN + FPN (Flow) [38]       Flow             77.9    76.1    56.1    46.3
Faster R-CNN + FPN [38]              RGB+Flow         79.1    78.5    57.2    47.6
PCSC (Ours)                          RGB+Flow         82.6    82.2    63.1    52.8

threshold δ = 0.5, our PCSC method outperforms [19] and [72] by 3.6% and 2.1%, respectively.

At the frame level (see Table 3.3), our PCSC method performs the second best and is only worse than a very recent work [72]. However, the work in [72] uses S3D-G as the backbone network, which provides much stronger features when compared with the I3D features used in our method. In addition, please note that our method outperforms [72] in terms of mAPs at the video level (see Table 3.4), which demonstrates the promising performance of our PCSC method. Moreover, as a general framework, our PCSC method could also take advantage of the strong features provided by the S3D-G method to further improve the results, which will be explored in our future work.


3.4.3 Ablation Study

In this section, we take the UCF-101-24 dataset as an example to investigate the contributions of different components in our proposed method.

Table 3.5: Ablation study for our PCSC method at different training stages on the UCF-101-24 dataset.

Stage    PCSC w/o feature cooperation    PCSC
0        75.5                            78.1
1        76.1                            78.6
2        76.4                            78.8
3        76.7                            79.1
4        76.7                            79.2

Table 3.6: Ablation study for our temporal boundary refinement method on the UCF-101-24 dataset. Here "AD" is for the actionness detector while "ASD" is for the action segment detector.

Method                         Video mAP (%)
Faster R-CNN + FPN [38]        53.2
PCSC                           56.4
PCSC + AD                      61.0
PCSC + ASD (single branch)     58.4
PCSC + ASD                     60.7
PCSC + ASD + AD                61.9
PCSC + ASD + AD + T-PCSC       63.1

Table 3.7: Ablation study for our Temporal PCSC (T-PCSC) method at different training stages on the UCF-101-24 dataset.

Stage    Video mAP (%)
0        61.9
1        62.8
2        63.1

Progressive cross-stream cooperation. In Table 3.5, we report the results of an alternative approach to our PCSC (called "PCSC w/o feature cooperation"). In the second column of Table 3.5, we remove the feature-level cooperation module from Fig. 3.1 and only use the region-proposal-level cooperation module. As our PCSC is conducted in a progressive manner, we also report the performance at different stages to verify the benefit of this progressive strategy. It is worth mentioning that the output at each stage is obtained by combining the detected bounding boxes from both the current stage and all previous stages, and then we apply non-maximum suppression (NMS) to obtain the final bounding boxes. For example, the output at Stage 4 is obtained by applying NMS to the union set of the detected bounding boxes from Stages 0, 1, 2, 3 and 4. At Stage 0, the detection results from both the RGB and the flow streams are simply combined. From Table 3.5, we observe that the detection performance of our PCSC method with or without the feature-level cooperation module is improved as the number of stages increases. However, such improvement seems to become saturated when reaching Stage 4, as indicated by the marginal performance gain from Stage 3 to Stage 4. Meanwhile, when comparing our PCSC with the alternative approach "PCSC w/o feature cooperation" at every stage, we observe that both region-proposal-level and feature-level cooperation contribute to the performance improvement.

Action tube refinement. In Table 3.6, we investigate the effectiveness of our temporal boundary refinement method by switching on and off the key components in our temporal boundary refinement module, in which the video-level mAPs at the IoU threshold δ = 0.5 are reported. In Table 3.6, we investigate five configurations for temporal boundary refinement. Specifically, in "PCSC + AD", we use the PCSC method for spatial localization together with the actionness detector (AD) for temporal localization. In "PCSC + ASD", we use the PCSC method for spatial localization, while we use the action segment detection method for temporal localization. In "PCSC + ASD (single branch)", we use the "PCSC + ASD" configuration but replace the multi-branch architecture in the action segment detection method with the single-branch architecture. In "PCSC + ASD + AD", the PCSC method is used for spatial localization, together with both AD and ASD for temporal localization. In "PCSC + ASD + AD + T-PCSC", the PCSC method is used for spatial localization, both the actionness detector and the action segment detector are used to generate the initial segments, and our T-PCSC is finally used for temporal localization. By generating higher quality detection results at each frame, our PCSC method in the spatial domain outperforms the work


Table 3.8: The large overlap ratio (LoR) of the region proposals from the latest RGB stream with respect to the region proposals from the latest flow stream at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage    LoR (%)
1        61.1
2        71.9
3        95.2
4        95.4

in [38], which does not use the two-stream cooperation strategy in the spatial domain, by 3.2%. After applying our actionness detector module to further boost the video-level performance, we arrive at a video-level mAP of 61.0%. Meanwhile, our action segment detector without using T-PCSC has already been able to achieve a video-level mAP of 60.7%, which is comparable to the performance of the actionness detector module in terms of video-level mAPs. On the other hand, the video-level mAP drops by 2.3% after replacing the multi-branch architecture in the action segment detector with the single-branch architecture, which demonstrates the effectiveness of the multi-branch architecture. Note that the video-level mAP is further improved when combining the results from the actionness detector and the action segment detector without using the temporal PCSC module, which indicates that the frame-level and the segment-level actionness results could be complementary. When we further apply our temporal PCSC framework, the video-level mAP of our proposed method increases by 1.2% and arrives at the best performance of 63.1%, which demonstrates the effectiveness of our PCSC framework for temporal refinement. In total, we achieve about 7.0% improvement after applying our temporal boundary refinement module on top of our PCSC method, which indicates the necessity and benefits of the temporal boundary refinement process in determining the temporal boundaries of action tubes.

Temporal progressive cross-stream cooperation. In Table 3.7, we report the video-level mAPs at the IoU threshold δ = 0.5 by using our temporal PCSC method at different stages to verify the benefit of this progressive strategy in the temporal domain. Stage 0 is our "PCSC +


Table 3.9: The Mean Average Best Overlap (MABO) of the bounding boxes with the ground-truth boxes at each stage in our PCSC action detector method on the UCF-101-24 dataset.

Stage    MABO (%)
1        84.1
2        84.3
3        84.5
4        84.6

Table 3.10: The average confidence scores (%) of the bounding boxes with respect to the IoU threshold θ at different stages on the UCF-101-24 dataset.

θ       Stage 1    Stage 2    Stage 3    Stage 4
0.5     55.3       55.7       56.3       57.0
0.6     62.1       63.2       64.3       65.5
0.7     68.0       68.5       69.4       69.5
0.8     73.9       74.5       74.8       75.1
0.9     78.4       78.4       79.5       80.1

Table 3.11: Detection speed in frames per second (FPS) for our action detection module in the spatial domain on UCF-101-24 when using different numbers of stages.

Stage                     0     1     2     3     4
Detection speed (FPS)     31    29    28    26    24

Table 3.12: Detection speed in videos per second (VPS) for our temporal boundary refinement method for action tubes on UCF-101-24 when using different numbers of stages.

Stage                     0     1     2
Detection speed (VPS)     21    17    15

ASD + AD" method. Stage 1 and Stage 2 refer to our "PCSC + ASD + AD + T-PCSC" method using one-stage and two-stage temporal PCSC, respectively. The video-level performance after using our temporal PCSC method is improved as the number of stages increases, which demonstrates the effectiveness of the progressive strategy in the temporal domain.


3.4.4 Analysis of Our PCSC at Different Stages

In this section, we take the UCF-101-24 dataset as an example to further investigate our PCSC framework for spatial localization at different stages.

In order to investigate how the similarity of the proposals generated by the two streams changes over multiple stages, for each proposal from the RGB stream we compute its maximum IoU (mIoU) with respect to all the proposals from the flow stream. We define the large overlap ratio (LoR) as the percentage of the RGB-stream proposals whose mIoU with all the flow-stream proposals is larger than 0.7 over the total number of RGB proposals. Table 3.8 shows the LoR of the latest RGB stream with respect to the latest flow stream at every stage on the UCF-101-24 dataset. Specifically, at stage 1 or stage 3, LoR indicates the overlap degree of the RGB-stream proposals generated at the current stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the previous stage (i.e., stage 0 or stage 2, respectively); at stage 2 or stage 4, LoR indicates the overlap degree of the RGB-stream proposals generated at the previous stage (i.e., stage 1 or stage 3, respectively) with respect to the flow-stream proposals generated at the current stage (i.e., stage 2 or stage 4, respectively). We observe that the LoR increases from 61.1% at stage 1 to 95.4% at stage 4, which demonstrates that the similarity of the region proposals from these two streams increases over the stages. At stage 4, the region proposals from the different streams are very similar to each other, and therefore these two types of proposals lack complementary information. As a result, there is no further improvement after stage 4.
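For clarity, the LoR statistic can be computed as in the following sketch, assuming a generic `iou_fn` between two boxes.

```python
import numpy as np

def large_overlap_ratio(rgb_proposals, flow_proposals, iou_fn, thresh=0.7):
    """Sketch of LoR: the fraction of RGB-stream proposals whose maximum IoU (mIoU)
    over all flow-stream proposals exceeds the threshold."""
    if len(rgb_proposals) == 0:
        return 0.0
    miou = [max((iou_fn(p, q) for q in flow_proposals), default=0.0)
            for p in rgb_proposals]
    return float(np.mean(np.asarray(miou) > thresh))
```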

To further evaluate the improvement of our PCSC method over the stages, we report the MABO and the average confidence score to measure the localization and classification performance of our PCSC method over the stages. Table 3.9 and Table 3.10 show the MABO and the average confidence score of our PCSC method at every stage on the UCF-101-24 dataset. In terms of both metrics, the improvement after each stage can be observed clearly, which demonstrates the effectiveness of our PCSC method for both localization and classification.


Figure 3.4: An example of visualization results from different methods for one frame: (a) single-stream Faster-RCNN (RGB stream), (b) single-stream Faster-RCNN (flow stream), (c) a simple combination of two-stream results from Faster-RCNN [38], and (d) our PCSC. (Best viewed on screen.)

3.4.5 Runtime Analysis

We report the detection speed of our action detection method in the spatial domain and our temporal boundary refinement method for detecting action tubes when using different numbers of stages; the results are provided in Table 3.11 and Table 3.12, respectively. We observe that the detection speed of both our action detection method in the spatial domain and the temporal boundary refinement method for detecting action tubes slightly decreases as the number of stages increases (note that Stage 0 in Table 3.11 corresponds to the baseline method Faster-RCNN [38] and Stage 0 in Table 3.12 corresponds to the results from our Initial Segment Generation module). Therefore, there is a trade-off between the localization performance and the speed when choosing the number of stages.


RPN SPN

RGB or Optical FlowImages

RGB or Optical Flow Features from Action Tubes

Initial Proposals

ROI PoolingSegmentFeature

Extraction

Initial Segment Proposals

DetectionHead

DetectionHead

Region Features Segment Features

Detected Bounding Boxesfor RGB or Flow Stream

Detected Segmentsfor RGB or Flow Stream

(a) Action Detection Method (b) Action Segment Detector

Detected instances for the leftright ground-truth instance from Faster-RCNN

LeftRight ground-truth instance

Ground-truth Instance

Detected instances from our PCSCDetected Instances from two-stream Faster-RCNN

Detected instances from our PCSC that are eventually removed by our TBR

Detected Instances for the leftright ground-truth instance from our PCSC

RPN SPN

RGB or Optical FlowImages

RGB or Optical Flow Features from Action Tubes

Initial Proposals

ROI PoolingSegmentFeature

Extraction

Initial Segment Proposals

DetectionHead

DetectionHead

Region Features Segment Features

Detected Boxesfor RGB or Flow

Detected Segmentsfor RGB or Flow

Single-stream Action Detection Model

Single-stream Action Segment Detector

RPN SPN

RGB or Optical FlowImages

RGB or Optical Flow Features from Action Tubes

Initial Proposals

ROI PoolingSegmentFeature

Extraction

Initial Segment Proposals

DetectionHead

DetectionHead

Region Features Segment Features

Detected Boxesfor RGB or Flow

Detected Segmentsfor RGB or Flow

Single-stream Action Detection Model

Single-stream Action Segment Detector

(a) (b)

RPN SPN

RGB or Optical FlowImages

RGB or Optical Flow Features from Action Tubes

Initial Proposals

ROI PoolingSegmentFeature

Extraction

Initial Segment Proposals

DetectionHead

DetectionHead

Region Features Segment Features

Detected Boxesfor RGB or Flow

Detected Segmentsfor RGB or Flow

Single-stream Action Detection Model

Single-stream Action Segment Detector

RPN SPN

RGB or Optical FlowImages

RGB or Optical Flow Features from Action Tubes

Initial Proposals

ROI PoolingSegmentFeature

Extraction

Initial Segment Proposals

DetectionHead

DetectionHead

Region Features Segment Features

Detected Boxesfor RGB or Flow

Detected Segmentsfor RGB or Flow

Single-stream Action Detection Model

Single-stream Action Segment Detector

(c) (d)

Figure 3.5: Example results from our PCSC with temporal boundary refinement (TBR) and the two-stream Faster-RCNN method. There is one ground-truth instance in (a) and no ground-truth instance in (b-d). In (a), both our PCSC with TBR method and the two-stream Faster-RCNN method [38] successfully capture the ground-truth action instance "Tennis Swing". In (b) and (c), both our PCSC method and the two-stream Faster-RCNN method falsely detect bounding boxes with the class label "Tennis Swing" for (b) and "Soccer Juggling" for (c). However, our TBR method can remove the falsely detected bounding boxes from our PCSC in both (b) and (c). In (d), both our PCSC method and the two-stream Faster-RCNN method falsely detect action bounding boxes with the class label "Basketball", and our TBR method cannot remove the falsely detected bounding box from our PCSC in this failure case. (Best viewed on screen.)

3.4.6 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 3.4, we show the visualization results generated by using either the single RGB or flow stream, a simple combination of the two streams,


and our two-stream cooperation method. The detection method using only the RGB or flow stream leads to either missed detections (see Fig. 3.4(a)) or false detections (see Fig. 3.4(b)). As shown in Fig. 3.4(c), a simple combination of these two results can solve the missed detection problem but cannot avoid the false detection. In contrast, our action detection method can refine the detected bounding boxes produced by either stream and remove some falsely detected bounding boxes (see Fig. 3.4(d)).

Moreover, the effectiveness of our temporal refinement method is also demonstrated in Figure 3.5, where we compare the results from our method and the two-stream Faster-RCNN method [38]. Although both our PCSC method and the two-stream Faster-RCNN method can correctly detect the ground-truth action instance (see Fig. 3.5(a)), after the refinement process by using our temporal boundary refinement (TBR) module, the actionness score for the correctly detected bounding box from our PCSC is further increased. On the other hand, because of the similar scene and motion patterns in the foreground frame (see Fig. 3.5(a)) and the background frame (see Fig. 3.5(b)), both our PCSC method and the two-stream Faster-RCNN method generate bounding boxes in the background frame although there is no ground-truth action instance. However, for the falsely detected bounding box from our PCSC method, our TBR module significantly decreases the actionness score from 0.74 to 0.04, which eventually removes the falsely detected bounding box. A similar observation can be made in Fig. 3.5(c).

While our method can remove some falsely detected bounding boxes as shown in Fig. 3.5(b) and Fig. 3.5(c), there are still some challenging cases that our method cannot handle (see Fig. 3.5(d)). Although the actionness score of the falsely detected action instance can be reduced after the temporal refinement process by using our TBR module, the detection cannot be completely removed due to its relatively high score. We also observe that the state-of-the-art methods (e.g., two-stream Faster-RCNN [38]) cannot work well for this challenging case either. Therefore, how to reduce false detections from the background frames is still an open issue,


which will be studied in our future work.

3.5 Summary

In this chapter, we have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from the other stream (i.e., flow/RGB) at both the region proposal level and the feature level. The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.


Chapter 4

PCG-TAL: Progressive Cross-granularity Cooperation for Temporal Action Localization

There are two major lines of works, i.e., anchor-based and frame-based approaches, in the field of temporal action localization. But each line of works is inherently limited to a certain detection granularity and cannot simultaneously achieve high recall rates with accurate action boundaries. In this chapter, we propose a progressive cross-granularity cooperation (PCG-TAL) framework to effectively take advantage of the complementarity between the anchor-based and frame-based paradigms, as well as between two-view clues (i.e., appearance and motion). Specifically, our new Anchor-Frame Cooperation (AFC) module can effectively integrate both two-granularity and two-stream knowledge at the feature and proposal levels, both within each AFC module and across adjacent AFC modules. The RGB-stream AFC module and the flow-stream AFC module are stacked sequentially to form a progressive localization framework. The whole framework can be learned in an end-to-end fashion whilst the temporal action localization performance can be gradually boosted in a progressive manner. Our newly proposed framework outperforms the state-of-the-art methods on three benchmark datasets, THUMOS14, ActivityNet v1.3 and UCF-101-24, which clearly demonstrates the effectiveness of our framework.


4.1 Background

4.1.1 Action Recognition

Video action recognition is an important research topic in the community of video analysis. With the prominent development of deep learning in recent years, a large number of works [84, 12, 64, 81, 4, 77, 56] were proposed by applying convolutional neural networks to solve this classification problem. It is well-known that the appearance and motion clues are complementary to each other and discriminant for video action recognition. The two-stream networks [12, 64, 81] exploited the RGB frames and the optical flow maps to extract appearance and motion features, and then combined these two streams to exchange complementary information from both modalities. Additionally, the works in [84, 4, 77, 56] employed 3D convolutional layers to simultaneously model spatio-temporal information for videos. However, action recognition is limited to trimmed input videos, where one action spans the entire video. In addition, the action recognition models can also be applied to extract frame-level or snippet-level features from untrimmed videos. For example, our temporal action localization method employs the I3D model [4] for base feature extraction.

4.1.2 Spatio-temporal Action Localization

The spatio-temporal action localization task aims to generate action tubes for the predicted action categories with their spatial locations and temporal durations. Most works [67, 66, 55] used an object detection framework to spatially localize actions in individual frames and then temporally linked the detected boxes to form an action tube. Several state-of-the-art works [26, 54, 69] took advantage of the two-stream R-CNN framework to improve the spatio-temporal action localization performance by exploiting the complementary information between the appearance and motion clues.


4.1.3 Temporal Action Localization

Unlike spatio-temporal action localization, temporal action localization focuses on how to accurately detect the starting and ending points of a predicted action instance in the time domain of an untrimmed video. Early works [97, 50] used sliding windows to generate segment proposals and then applied classifiers to recognize the action in each proposal. Most recent works can be roughly categorized into two types: (1) the works in [49, 10, 29, 98, 103] used frame-level or snippet-level features to evaluate frame-wise or snippet-wise actionness by predicting the action starting or ending time, which was then used to determine the temporal boundaries of actions for proposal generation; (2) another set of works [9, 15, 85, 5] applied the two-stage object detection framework to generate temporal segment proposals and then utilized a boundary regressor for segment refinement and an action classifier for action prediction on each given proposal. In the first category, the segments generated based on frame-wise or snippet-wise labels have relatively accurate temporal boundaries, as they are determined at a finer level with a larger aperture of temporal details. However, these segments often have a low recall rate. In the second category, as the proposals are generated by predefined anchors with various time intervals over the untrimmed videos, they are able to cover most ground-truth instances. Unfortunately, they usually have lower performance in terms of temporal boundary accuracy.

Neither line of the aforementioned works can take advantage of the inherent merits of the methods in the other line, resulting in a trade-off between the boundary accuracy and the recall rate. Some recent attempts, such as CTAP [13] and MGG [46], were proposed to take the complementary merits of both lines of works into consideration. They either used several isolated components with a stage-wise training scheme, such as CTAP [13], or simply employed frame-wise or snippet-wise labels to filter the boundaries of anchor-based segment proposals [46], but without fully exploring the complementarity in a more systematic way. In our work, our Anchor-Frame Cooperation (AFC) module can be optimized in an end-to-end fashion and also allows rich feature-level and segment-level


interactions between the action semantics extracted from these two complementary lines of works, which offers a significant performance gain for our temporal action localization framework. Two-stream fusion is another cue to enhance the performance. Recent state-of-the-art methods [5, 99] used late fusion as the only operation to combine complementary information from the two modalities, where each stream is trained independently. In our work, each stream progressively and explicitly helps the other stream at the feature level and the segment proposal level. The two streams can also be trained collaboratively in an end-to-end fashion, which is thus beneficial for fully exploiting the complementarity between the appearance and motion clues.

4.2 Our Approach

In this section, we first briefly introduce our overall temporal action localization framework (Sec. 4.2.1) and then present how to perform progressive cross-granularity cooperation (Sec. 4.2.2) as well as the newly proposed Anchor-Frame Cooperation (AFC) module (Sec. 4.2.3). The training details and discussions are provided in Sec. 4.2.4 and Sec. 4.2.5, respectively.

4.2.1 Overall Framework

We denote an untrimmed video as $V = \{I_t \in \mathbb{R}^{H \times W \times 3}\}_{t=1}^{T}$, where $I_t$ is an RGB frame at time $t$ with height $H$ and width $W$. We seek to detect the action categories $\{y_i\}_{i=1}^{N}$ and their temporal durations $\{(t_{i,s}, t_{i,e})\}_{i=1}^{N}$ in this untrimmed video, where $N$ is the total number of detected action instances, and $t_{i,s}$ and $t_{i,e}$ indicate the starting and ending time of each detected action $y_i$.

The proposed progressive cross-granularity cooperation framework builds upon an effective cooperation strategy between anchor-based localization methods and frame-based schemes. We briefly introduce each component as follows.


Base Feature Extraction. For each video $V$, we extract its RGB and flow features $B^{RGB}$ and $B^{Flow}$ from a pre-trained and fixed feature extractor (e.g., the I3D feature extractor [4]). The size of each feature is $C \times T$, where $C$ indicates the number of channels of the feature.

Anchor-based Segment Proposal Generation. This module provides the anchor-based knowledge for the subsequent cooperation stage and adopts the same structure as that in TAL-Net [5]. This module is based on the base features $B^{RGB}$ and $B^{Flow}$, and each stream generates a set of temporal segment proposals of interest (called proposals in later text), namely $\mathcal{P}^{RGB} = \{p_i^{RGB} \mid p_i^{RGB} = (t_{i,s}^{RGB}, t_{i,e}^{RGB})\}_{i=1}^{N^{RGB}}$ and $\mathcal{P}^{Flow} = \{p_i^{Flow} \mid p_i^{Flow} = (t_{i,s}^{Flow}, t_{i,e}^{Flow})\}_{i=1}^{N^{Flow}}$, where $N^{RGB}$ and $N^{Flow}$ are the numbers of proposals for the two streams. We then extract the anchor-based features $P_0^{RGB}$, $P_0^{Flow}$ for each stream by using two temporal convolutional layers on top of $B^{RGB}$/$B^{Flow}$, which can be readily used as the input to a standalone "Detection Head" (described in Section 4.2.3) after a segment-of-interest pooling operation [5]. The outputs from the detection head are the refined action segments $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$ and the predicted action categories.

Frame-based Segment Proposal Generation. The frame-based knowledge comes from a frame-based proposal generation module. Based on the base features $B^{RGB}$ and $B^{Flow}$, we apply two 1D convolutional layers (similar to [46]) along the temporal dimension to extract the frame-based features for both streams, termed as $Q_0^{RGB}$ and $Q_0^{Flow}$, respectively. The frame-based feature in each stream is then processed by two parallel $1 \times 1$ convolutions together with the sigmoid activation functions, which output the probability distributions of the temporal starting and ending positions of an action, respectively. We then gather a set of temporal segment proposals for each stream, i.e., $\mathcal{Q}_0^{RGB}$ and $\mathcal{Q}_0^{Flow}$, respectively, after pairing highly confident starting and ending indices with valid durations and removing redundant pairs by non-maximum suppression (NMS).
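
The pairing-and-filtering step described above can be sketched as follows; this is a minimal illustration rather than the exact implementation, and the function names, the confidence threshold, the proposal score (product of the start and end probabilities) and the maximum duration are all assumptions.

```python
import numpy as np

def temporal_nms(segments, scores, iou_thresh=0.7):
    """Greedy non-maximum suppression over temporal segments (start, end)."""
    order = np.argsort(scores)[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(segments[i])
        s1, e1 = segments[i]
        ious = []
        for j in order[1:]:
            s2, e2 = segments[j]
            inter = max(0.0, min(e1, e2) - max(s1, s2))
            union = (e1 - s1) + (e2 - s2) - inter
            ious.append(inter / union if union > 0 else 0.0)
        order = order[1:][np.array(ious) <= iou_thresh]   # drop redundant segments
    return keep

def frame_based_proposals(start_probs, end_probs, prob_thresh=0.5, max_duration=100):
    """Pair highly confident starting/ending indices with valid durations."""
    starts = np.where(start_probs > prob_thresh)[0]
    ends = np.where(end_probs > prob_thresh)[0]
    candidates, scores = [], []
    for ts in starts:
        for te in ends:
            if 0 < te - ts <= max_duration:               # keep valid durations only
                candidates.append((int(ts), int(te)))
                scores.append(start_probs[ts] * end_probs[te])
    if not candidates:
        return []
    return temporal_nms(candidates, np.array(scores), iou_thresh=0.7)

# toy usage: snippet-level starting/ending probabilities of length T = 20
rng = np.random.default_rng(0)
print(frame_based_proposals(rng.random(20), rng.random(20)))
```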

Progressive Cross-granularity Cooperation Framework. Given the two-stream features and proposals produced by the preceding two proposal generation


Figure 4.1: (a) Overview of our Progressive Cross-granularity Cooperation framework. At each stage, our Anchor-Frame Cooperation (AFC) module focuses on only one stream (i.e., RGB/Flow). The RGB-stream AFC modules at Stages 1 and 3 (resp. the flow-stream AFC modules at Stages 2 and 4) generate anchor-based features and segments as well as frame-based features for the RGB stream (resp. the Flow stream). The inputs of each AFC module are the proposals and features generated at the previous stages. (b), (c) Details of our RGB-stream and Flow-stream AFC modules at stage $n$. Note that we use the RGB-stream AFC module in (b) when $n$ is an odd number and the Flow-stream AFC module in (c) when $n$ is an even number. For better presentation, when $n = 1$, the initial proposals $\mathcal{S}_{n-2}^{RGB}$ and features $P_{n-2}^{RGB}$, $Q_{n-2}^{Flow}$ are denoted as $\mathcal{S}_0^{RGB}$, $P_0^{RGB}$ and $Q_0^{Flow}$. The initial proposals and features from the anchor-based and frame-based branches are obtained as described in Section 4.2.3. In both (b) and (c), the frame-based and anchor-based features are improved by the cross-granularity and cross-stream message passing modules, respectively. The improved frame-based features are used to generate better frame-based proposals, which are then combined with the anchor-based proposals from both the RGB and flow streams. The newly combined proposals, together with the improved anchor-based features, are used as the input to the "Detection Head" to generate the updated segments.


methods, our method progressively uses the newly proposed Anchor-Frame Cooperation (AFC) module (described in Sec. 4.2.3) to output a series of gradually improved localization results. Note that, while each AFC module focuses on knowledge exchange within one stream, there are sufficient cross-granularity and cross-stream cooperations within each AFC module and between two adjacent AFC modules. After stacking a series of AFC modules, the final action localization results are obtained by performing the NMS operation on the union set of the output action segments across all stages.

4.2.2 Progressive Cross-Granularity Cooperation

The proposed progressive cross-granularity cooperation framework aims at encouraging rich feature-level and proposal-level collaboration simultaneously between the two-granularity schemes (i.e., the anchor-based and frame-based temporal localization methods) and between the two-stream appearance and motion clues (i.e., RGB and flow).

To be specific, this progressive cooperation process is performed by iteratively stacking multiple AFC modules, with each focusing on one individual stream. As shown in Fig. 4.1(a), the first AFC stage focuses on the RGB stream and the second AFC stage on the flow stream, which are respectively referred to as the RGB-stream AFC and the flow-stream AFC for better presentation. For example, the RGB-stream AFC at stage $n$ ($n$ is an odd number) employs the preceding two-stream detected action segments $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ as the input two-stream proposals, and uses the preceding two-stream anchor-based features $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ and the flow-stream frame-based feature $Q_{n-2}^{Flow}$ as the input features. This AFC stage outputs the RGB-stream action segments $\mathcal{S}_n^{RGB}$ and anchor-based feature $P_n^{RGB}$, and also updates the flow-stream frame-based features as $Q_n^{Flow}$. These outputs are used as the inputs to the subsequent AFC stages, thus enabling cross-granularity and cross-stream cooperation over multiple AFC stages.


In each AFC stage, which will be discussed in Sec. 4.2.3, we introduce a set of feature-level message passing and proposal-level cooperation strategies to effectively integrate the complementary knowledge from the anchor-based branch to the frame-based branch through "Cross-granularity Message Passing", and then back to the anchor-based branch. We argue that, within each AFC stage, this particular structure seamlessly encourages collaboration through "Proposal Combination" (see Sec. 4.2.3 for more details) by exploiting the cross-granularity and cross-stream complementarity at both the feature and proposal levels, as well as between the two temporal action localization schemes.

Since rich cooperations can be performed within and across adjacent AFC modules, the successes in detecting temporal segments at the earlier AFC stages are faithfully propagated to the subsequent AFC stages, and thus the proposed progressive localization strategy can gradually improve the localization performance.

4.2.3 Anchor-Frame Cooperation Module

The Anchor-Frame Cooperation (AFC) module is based on the intuition that the complementary knowledge from the two-stream anchor-based and frame-based temporal localization methods can not only improve the action boundary localization accuracy but also increase the recall rate of the detected action segments. Therefore, the AFC module is designed to mine the rich cross-granularity and cross-stream knowledge at both the feature and proposal levels, as well as between the aforementioned two lines of action localization methods. To be specific, there are RGB-stream and flow-stream AFC modules, named according to the main stream that an AFC module focuses on. In this section, we present the detailed structures of the two types of AFC modules, as shown in Fig. 4.1(b-c).

Initialization. As the AFC module is applied in a progressive manner, it uses the inputs from the preceding AFC stages. However, for the first stage (i.e., $n = 1$), some inputs do not have valid predecessors. Instead, we use the initial anchor-based and frame-based two-stream features $P_0^{RGB}$, $P_0^{Flow}$ and $Q_0^{RGB}$, $Q_0^{Flow}$, as well as the proposals $\mathcal{S}_0^{RGB}$, $\mathcal{S}_0^{Flow}$, as the input, as described in Sec. 4.2.1 and shown in Fig. 4.1(a).

Single-stream Anchor-to-Frame Message Passing (i.e., "Cross-granularity Message Passing" in Fig. 4.1(b-c)). We take the RGB-stream AFC module as an example, whose network structure is shown in Fig. 4.1(b). For the feature-level message passing, we use a simple message passing structure to propagate the feature-level message from the anchor-based features $P_{n-1}^{Flow}$ to the frame-based features $Q_{n-2}^{Flow}$, both from the flow stream. The message passing operation is defined as
$$Q_n^{Flow} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{Flow}) \oplus Q_{n-2}^{Flow}, \qquad (4.1)$$
where $\oplus$ is the element-wise addition operation and $\mathcal{M}_{anchor \rightarrow frame}(\bullet)$ is the flow-stream message passing function from the anchor-based branch to the frame-based branch, which is formulated as two stacked $1 \times 1$ convolutional layers with ReLU non-linear activation functions. The improved feature $Q_n^{Flow}$ can produce more accurate action starting and ending probability distributions (a.k.a. more accurate action durations), as it absorbs the temporal duration-aware anchor-based features, and can thus generate better frame-based proposals $\mathcal{Q}_n^{Flow}$, as described in Sec. 4.2.1. For the flow-stream AFC module (see Fig. 4.1(c)), the message passing process is similarly defined as $Q_n^{RGB} = \mathcal{M}_{anchor \rightarrow frame}(P_{n-1}^{RGB}) \oplus Q_{n-2}^{RGB}$, which flips the superscripts from Flow to RGB. $Q_n^{RGB}$ also leads to better frame-based RGB-stream proposals $\mathcal{Q}_n^{RGB}$.
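
A minimal PyTorch sketch of the message passing operation in Equation (4.1), assuming the anchor-based and frame-based features are stored as tensors of shape (batch, C, T); the channel number and batch size are illustrative assumptions, not the configuration used in the thesis.

```python
import torch
import torch.nn as nn

class AnchorToFrameMessagePassing(nn.Module):
    """M_{anchor->frame}: two stacked 1x1 convolutions with ReLU activations."""
    def __init__(self, channels=512):
        super().__init__()
        self.transform = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, anchor_feat, frame_feat):
        # Q_n = M(P_{n-1}) (+) Q_{n-2}, where (+) is element-wise addition
        return self.transform(anchor_feat) + frame_feat

# toy usage: 512 channels over 128 temporal positions
mp = AnchorToFrameMessagePassing(channels=512)
p_flow_prev = torch.randn(2, 512, 128)    # anchor-based feature P^Flow_{n-1}
q_flow_old = torch.randn(2, 512, 128)     # frame-based feature Q^Flow_{n-2}
print(mp(p_flow_prev, q_flow_old).shape)  # torch.Size([2, 512, 128])
```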

Cross-stream Frame-to-Anchor Proposal Cooperation (i.e., "Proposal Combination" in Fig. 4.1(b-c)). We again use the RGB-stream AFC module, shown in Fig. 4.1(b), as an example. To enable cross-granularity two-stream proposal-level cooperation, we gather sufficient valid frame-based proposals $\mathcal{Q}_n^{Flow}$ by using the "Frame-based Segment Proposal Generation" method described in Sec. 4.2.1 based on the improved frame-based features $Q_n^{Flow}$. We then combine $\mathcal{Q}_n^{Flow}$ together with the two-stream anchor-based proposals $\mathcal{S}_{n-1}^{Flow}$ and $\mathcal{S}_{n-2}^{RGB}$ to form an updated set of anchor-based proposals at stage $n$, i.e., $\mathcal{P}_n^{RGB} = \mathcal{S}_{n-2}^{RGB} \cup \mathcal{S}_{n-1}^{Flow} \cup \mathcal{Q}_n^{Flow}$. This set not only provides more comprehensive proposals but also guides our approach to pool more accurate proposal features, both of which significantly benefit the subsequent boundary refinement and action class prediction procedures in the detection head. The flow-stream AFC module performs a similar process, i.e., the anchor-based flow-stream proposals are updated as $\mathcal{P}_n^{Flow} = \mathcal{S}_{n-2}^{Flow} \cup \mathcal{S}_{n-1}^{RGB} \cup \mathcal{Q}_n^{RGB}$.

Cross-stream Anchor-to-Anchor Message Passing (i.e., "Cross-stream Message Passing" in Fig. 4.1(b-c)). In addition to the anchor-frame cooperation, we also include a two-stream feature cooperation module. The two-stream anchor-based features are updated similarly to Equation (4.1) by a message passing operation, i.e., $P_n^{RGB} = \mathcal{M}_{flow \rightarrow rgb}(P_{n-1}^{Flow}) \oplus P_{n-2}^{RGB}$ for the RGB stream and $P_n^{Flow} = \mathcal{M}_{rgb \rightarrow flow}(P_{n-1}^{RGB}) \oplus P_{n-2}^{Flow}$ for the flow stream.

Detection Head. The detection head aims at refining the segment proposals and predicting the action category for each proposal. Its inputs are the segment proposal features generated by segment-of-interest (SOI) pooling [5] with respect to the updated proposals $\mathcal{P}_n^{RGB}$, $\mathcal{P}_n^{Flow}$ and the updated anchor-based features $P_n^{RGB}$, $P_n^{Flow}$. To increase the perceptual aperture around the action boundaries, we add two more SOI pooling operations around the starting and ending indices of each proposal with a given interval (fixed to 10 in this work), and then concatenate these pooled features with the segment proposal features. The concatenated segment proposal features are fed to the segment regressor for segment refinement and to the action classifier for action prediction. We also use a GCN [99] to explore the proposal-to-proposal relations, which provides additional performance gains independent of our cooperation schemes.

4.2.4 Training Details

At each AFC stage, training losses are applied to both the anchor-based and frame-based branches. To be specific, the training loss in each anchor-based branch has two components: the cross-entropy loss for the action classification task and the smooth-L1 loss for the boundary regression task. On the other hand, the training loss in each frame-based branch is also a cross-entropy loss, which supervises the frame-wise action starting and ending positions. We sum up all losses and jointly train our model. We follow the strategy in [46] to assign the ground-truth labels for supervision at all AFC modules.
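
A hedged sketch of how the per-stage losses described above could be summed; the tensor shapes, the number of action classes and the equal loss weights are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def anchor_branch_loss(cls_logits, cls_labels, pred_offsets, gt_offsets):
    """Anchor-based branch: action classification + smooth-L1 boundary regression."""
    return (F.cross_entropy(cls_logits, cls_labels) +
            F.smooth_l1_loss(pred_offsets, gt_offsets))

def frame_branch_loss(start_logits, start_labels, end_logits, end_labels):
    """Frame-based branch: frame-wise starting/ending classification."""
    return (F.cross_entropy(start_logits, start_labels) +
            F.cross_entropy(end_logits, end_labels))

# toy usage with random tensors: 21 classes, 8 proposals, 100 frames
cls_logits, cls_labels = torch.randn(8, 21), torch.randint(0, 21, (8,))
pred_off, gt_off = torch.randn(8, 2), torch.randn(8, 2)
start_logits, start_labels = torch.randn(100, 2), torch.randint(0, 2, (100,))
end_logits, end_labels = torch.randn(100, 2), torch.randint(0, 2, (100,))
total_loss = (anchor_branch_loss(cls_logits, cls_labels, pred_off, gt_off) +
              frame_branch_loss(start_logits, start_labels, end_logits, end_labels))
print(float(total_loss))
```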

4.2.5 Discussions

Variants of the AFC Module. Besides using the aforementioned AFC module illustrated in Fig. 4.1(b-c), one can also simplify the structure by removing either the entire frame-based branch or the cross-stream cooperation module (termed as "w/o CG" and "w/o CS" in Sec. 4.3.3, respectively). In the first case ("w/o CG"), the variant essentially incorporates the progressive feature and proposal cooperation strategy within the two-stream anchor-based localization framework, and is thus to some extent similar to the method proposed in [69]. However, this variant operates entirely in the temporal domain for temporal action localization, while the work in [69] is conducted in the spatial domain for spatial action localization. We argue that our framework, after incorporating the newly proposed cross-granularity cooperation strategy, significantly improves upon this variant, since it enhances the boundary-sensitive frame-based features by leveraging the anchor-based features with more duration-aware characteristics, which in turn helps generate high-quality anchor-based proposals with better action boundaries and high recall rates. Thus, it is more likely that the final localization results capture accurate action segments. In the second case ("w/o CS"), the variant applies a simple two-stream fusion strategy by aggregating the RGB and flow features as lengthy feature vectors before the first stage, instead of using the "Cross-stream Message Passing" block to fuse the two-stream features. However, this variant suffers from early performance saturation (namely, the localization results become stable after the second stage, as shown in Table 4.5) and is also inferior to our AFC. We compare our method with these two variants in our experiments and show that our AFC framework outperforms both of them.

Similarity of Results Between Adjacent Stages. Note that even though each AFC module focuses on one stream, the rich cross-stream interactions eventually make the two-stream features and proposals similar to each other. Applying more AFC stages encourages more accurate proposals and more representative action features by gradually absorbing cross-stream information and cross-granularity knowledge, and thus significantly boosts the performance of temporal action localization. However, the performance gain becomes less significant after a certain number of stages. Empirically, we observe that a three-stage design achieves the best localization results.

Number of Proposals. The number of proposals does not rapidly increase when the number of stages increases, because the NMS operations applied in the previous detection heads remarkably reduce redundant segments while still preserving highly confident temporal segments. On the other hand, the number of proposals does not decrease either, as the proposed proposal cooperation strategy still absorbs a sufficient number of proposals from complementary sources.

4.3 Experiment

4.3.1 Datasets and Setup

Datasets. We evaluate our temporal action localization framework on three datasets: THUMOS14 [25], ActivityNet v1.3 [3] and UCF-101-24 [68].

- THUMOS14 contains 1010 and 1574 untrimmed videos from 20 action classes for validation and testing, respectively. We use the 200 temporally annotated videos in the validation set as the training samples and the 212 temporally annotated videos in the test set as the testing samples.

- ActivityNet v1.3 contains 19,994 videos from 200 different activities. The whole video set is divided into the training, validation and testing sets with a ratio of 2:1:1. We evaluate our model on the validation set, as the annotation of the testing set is not publicly available.

- UCF-101-24 consists of 3207 untrimmed videos from 24 action classes with spatio-temporal annotations provided by [67]. We use an existing spatial action detection method [69] to spatially locate actions at each frame and generate action tubes by linking the detection results across frames. We consider these action tubes as untrimmed videos and use our model to identify the temporal action boundaries. We report and compare the results based on split 1.

Implementation Details. We use a two-stream I3D [4] model pre-trained on Kinetics as the feature extractor to extract the base features. For training, we set the minibatch size to 256 for the anchor-based proposal generation network and to 64 for the detection head. The initial learning rate is 0.001 for both the THUMOS14 and UCF-101-24 datasets and 0.005 for ActivityNet v1.3. During training, we divide the learning rate by 10 every 12 epochs.
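
The schedule above (dividing the learning rate by 10 every 12 epochs) corresponds to a standard step decay; a minimal sketch with a placeholder model is shown below, and the optimizer choice is an assumption.

```python
import torch

model = torch.nn.Linear(1024, 21)        # placeholder model for illustration
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
# divide the learning rate by 10 every 12 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=12, gamma=0.1)

for epoch in range(36):
    # ... iterate over minibatches and call optimizer.step() here ...
    scheduler.step()
print(optimizer.param_groups[0]["lr"])   # 1e-6 after 36 epochs
```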

Evaluation Metrics. We evaluate our model by the mean average precision (mAP). A detected action segment or action tube is considered as correct if its overlap with a ground-truth instance is larger than a threshold δ and its action label is correctly predicted. To measure the overlap, we use the temporal intersection-over-union (tIoU) for temporal action localization on THUMOS14 and ActivityNet v1.3, and the spatio-temporal intersection-over-union (st-IoU) for spatio-temporal action localization on UCF-101-24.
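
A minimal sketch of the tIoU-based correctness criterion described above; the matching protocol is simplified (e.g., duplicate detections are not penalized), so it is illustrative rather than the official evaluation code.

```python
def temporal_iou(seg_a, seg_b):
    """tIoU between two temporal segments given as (start, end) in seconds."""
    (s1, e1), (s2, e2) = seg_a, seg_b
    inter = max(0.0, min(e1, e2) - max(s1, s2))
    union = (e1 - s1) + (e2 - s2) - inter
    return inter / union if union > 0 else 0.0

def is_correct(det_segment, det_label, gt_segments, gt_labels, delta=0.5):
    """A detection is correct if its tIoU with a same-class ground-truth
    instance is larger than the threshold delta."""
    return any(det_label == gl and temporal_iou(det_segment, gs) > delta
               for gs, gl in zip(gt_segments, gt_labels))

print(is_correct((10.0, 20.0), "ThrowDiscus",
                 [(11.0, 21.0)], ["ThrowDiscus"], delta=0.5))   # True
```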

4.3.2 Comparison with the State-of-the-art Methods

We compare our PCG-TAL framework with the state-of-the-art methods on THUMOS14, ActivityNet v1.3 and UCF-101-24. All results of the existing methods are quoted from their original papers.

THUMOS14. We report the results on THUMOS14 in Table 4.1. Our PCG-TAL outperforms the existing methods at all tIoU thresholds. Similar to our method, both TAL-Net [5] and P-GCN [99] use I3D features as their base features and apply a two-stream framework with a late fusion strategy to exploit the complementary information between the appearance and motion clues. When using the tIoU threshold δ = 0.5,


Table 4.1: Comparison (mAP %) on the THUMOS14 dataset using different tIoU thresholds.

tIoU thresh. δ      0.1    0.2    0.3    0.4    0.5
SCNN [63]           47.7   43.5   36.3   28.7   19.0
REINFORCE [95]      48.9   44.0   36.0   26.4   17.1
CDC [62]            -      -      40.1   29.4   23.3
SMS [98]            51.0   45.2   36.5   27.8   17.8
TURN [15]           54.0   50.9   44.1   34.9   25.6
R-C3D [85]          54.5   51.5   44.8   35.6   28.9
SSN [103]           66.0   59.4   51.9   41.0   29.8
BSN [37]            -      -      53.5   45.0   36.9
MGG [46]            -      -      53.9   46.8   37.4
BMN [35]            -      -      56.0   47.4   38.8
TAL-Net [5]         59.8   57.1   53.2   48.5   42.8
P-GCN [99]          69.5   67.8   63.6   57.8   49.1
Ours                71.2   68.9   65.1   59.5   51.2

our method outperforms TAL-Net and P-GCN by 8.4% and 2.1%, respectively. The improvement largely comes from our well-designed progressive cooperation framework that fully exploits the cross-granularity information between the anchor-based and frame-based methods. Moreover, the mAP of our method is 13.8% higher than that of MGG, possibly because MGG cannot well capture the complementary information between these two paradigms (i.e., the frame actionness producer and the segment proposal producer). It is worth mentioning that, following P-GCN [99], we also apply the GCN module in [99] to consider proposal-to-proposal relations in our detection head; our method still achieves an mAP of 48.37% without using the GCN module, which is comparable to the performance of P-GCN.

ActivityNet v1.3. We also compare the mAPs at different tIoU thresholds δ = {0.5, 0.75, 0.95} on the validation set of the ActivityNet v1.3 dataset. Besides, we report the average mAP as well, which is calculated by averaging the mAPs at multiple tIoU thresholds δ from 0.5 to 0.95 with an interval of 0.05. All the results are shown in Table 4.2. In terms of the average mAP, our method achieves 28.85%, which outperforms P-GCN [99], SSN [103] and TAL-Net [5] by 1.86%, 4.87% and 8.63%, respectively. In Table 4.3, it is observed that both BSN [37] and


Table 4.2: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. No extra external video labels are used.

tIoU threshold δ    0.5     0.75    0.95    Average
CDC [62]            45.30   26.00   0.20    23.80
TCN [9]             36.44   21.15   3.90    -
R-C3D [85]          26.80   -       -       -
SSN [103]           39.12   23.48   5.49    23.98
TAL-Net [5]         38.23   18.30   1.30    20.22
P-GCN [99]          42.90   28.14   2.47    26.99
Ours                44.31   29.85   5.47    28.85

Table 4.3: Comparison (mAPs %) on the ActivityNet v1.3 (val) dataset using different tIoU thresholds. The external video labels from UntrimmedNet [80] are applied.

tIoU threshold δ    0.5     0.75    0.95    Average
BSN [37]            46.45   29.96   8.02    30.03
P-GCN [99]          48.26   33.16   3.27    31.11
BMN [35]            50.07   34.78   8.29    33.85
Ours                52.04   35.92   7.97    34.91

BMN [35] are superior to the best result shown in Table 4.2. However, BSN and BMN have to rely on the extra external video-level action labels from UntrimmedNet [80]. Similarly, the performance of our method can be boosted by using the extra external labels as in [35] and the proposals from BMN, and the resulting average mAP (34.91%) is better than those of BSN and BMN.

UCF-101-24. To further evaluate our method, we conduct another comparison for the spatio-temporal action localization task on the UCF-101-24 dataset [68]. We report the mAPs at the st-IoU thresholds δ = {0.2, 0.5, 0.75} and the average mAP when δ is set between 0.5 and 0.95 with an interval of 0.05. The input action tubes before temporal refinement were provided by the authors of [69]. The results are reported in Table 4.4. As can be seen, our method achieves better results than the state-of-the-art methods over all st-IoU thresholds, which shows that our proposed method can also help improve the spatio-temporal action localization results. Note that our method and the temporal refinement module in [69] use the same action tubes, and the mAP of our method is 2.5% higher at the st-IoU threshold δ = 0.5.


Table 4.4: Comparison (mAPs %) on the UCF-101-24 dataset using different st-IoU thresholds.

st-IoU thresh. δ    0.2    0.5    0.75   0.5:0.95
LTT [83]            46.8   -      -      -
MR-TS R-CNN [54]    73.5   32.1   2.7    7.3
MSAT [60]           66.6   36.4   7.9    14.4
ROAD [67]           73.5   46.3   15.0   20.4
ACT [26]            76.5   49.2   19.7   23.4
TAD [19]            -      59.9   -      -
PCSC [69]           84.3   61.0   23.0   27.8
Ours                84.9   63.5   23.7   29.2

4.3.3 Ablation Study

We perform a set of ablation studies to investigate the contribution of each component in our proposed PCG-TAL framework.

Anchor-Frame Cooperation. We compare the performance of five AFC-related variants and our complete method: (1) w/o CS-MP: we remove the "Cross-stream Message Passing" block from our AFC module; (2) w/o CG-MP: we remove the "Cross-granularity Message Passing" block from our AFC module; (3) w/o PC: we remove the "Proposal Combination" block from our AFC module, in which $\mathcal{Q}_n^{Flow}$ and $\mathcal{Q}_n^{RGB}$ are directly used as the input to the "Detection Head" in the RGB-stream AFC module and the flow-stream AFC module, respectively; (4) w/o CS: we aggregate the two-stream features as lengthy feature vectors for both the frame-based and anchor-based branches before the first stage, in which the aggregated features are directly used as the input to the "Cross-granularity Message Passing" block and the "Detection Head" in Fig. 4.1(b-c); (5) w/o CG: we remove the entire frame-based branch, and in this case only the two proposal sets based on the RGB and flow features generated from the anchor-based scheme are combined in the "Proposal Combination" block in Fig. 4.1(b-c); (6) AFC: our complete method. Note that the two alternative methods "w/o CS" and "w/o CS-MP" have the same network structure in each AFC module, except that the input anchor-based features to the "Cross-granularity Message Passing" block and the "Detection Head" are respectively $P_{n-1}^{Flow}$ and $P_{n-2}^{RGB}$ (for the RGB-stream AFC module) and


Table 4.5: Ablation study among different AFC-related variants at different training stages on the THUMOS14 dataset (mAP % at tIoU threshold δ = 0.5).

Stage           0       1       2       3       4
AFC             44.75   46.92   48.14   48.37   48.22
w/o CS          44.21   44.96   45.47   45.48   45.37
w/o CG          43.55   45.95   46.71   46.65   46.67
w/o CS-MP       44.75   45.34   45.51   45.53   45.49
w/o CG-MP       44.75   46.32   46.97   46.85   46.79
w/o PC          42.14   43.21   45.14   45.74   45.77

$P_{n-1}^{RGB}$ and $P_{n-2}^{Flow}$ (for the flow-stream AFC module) in "w/o CS-MP", while the inputs are the same (i.e., the aggregated features) in "w/o CS". The discussions of the alternative methods "w/o CS" and "w/o CG" are provided in Sec. 4.2.5. We take the THUMOS14 dataset at the tIoU threshold δ = 0.5 as an example to report the mAPs of the different variants at different stages in Table 4.5. Note that we remove the GCN module from our method for a fair comparison.

In Table 4.5, the results at stage 0 are obtained based on the initial proposals generated by our complete method and the five alternative methods before applying our AFC module. The initial proposals for "w/o CS-MP" and "w/o CG-MP" are the same as those of our complete method, so the results of these three methods at stage 0 are the same. The initial proposals are $\mathcal{P}_0^{RGB} \cup \mathcal{P}_0^{Flow}$ for "w/o CG" and $\mathcal{Q}_0^{RGB} \cup \mathcal{Q}_0^{Flow}$ for "w/o PC". We have the following observations: (1) the two alternative methods "w/o CS" and "w/o CS-MP" achieve comparable results when the number of stages increases, as these two methods share the same network structure; (2) our complete method outperforms "w/o CS-MP", "w/o CG-MP" and "w/o PC", which demonstrates that it is useful to include the three cross-stream and/or cross-granularity operations at both the feature and proposal levels in each AFC module within our framework; (3) our complete method also outperforms the alternative method "w/o CG", which demonstrates that it is beneficial to encourage cross-granularity collaboration by passing the features from the anchor-based approach to the frame-based scheme and simultaneously providing the segment proposals from the frame-based scheme back to the "Detection Head" of the anchor-based approach; (4) at the first few stages, the results of all methods are generally improved


Table 4.6: LoR (%) scores of proposals from adjacent stages. This score is calculated for each video and averaged over all videos in the test set of the THUMOS14 dataset.

Stage   Comparison                                           LoR (%)
1       $\mathcal{P}_1^{RGB}$ vs. $\mathcal{S}_0^{Flow}$     67.1
2       $\mathcal{P}_2^{Flow}$ vs. $\mathcal{P}_1^{RGB}$     79.2
3       $\mathcal{P}_3^{RGB}$ vs. $\mathcal{P}_2^{Flow}$     89.5
4       $\mathcal{P}_4^{Flow}$ vs. $\mathcal{P}_3^{RGB}$     97.1

when the number of stages increases, and the results become stable after 2 or 3 stages.

Similarity of Proposals. To evaluate the similarity of the proposals from different stages of our method, we define a new metric named the Large-over-Ratio (LoR) score. Take the proposal sets $\mathcal{P}_1^{RGB}$ at Stage 1 and $\mathcal{P}_2^{Flow}$ at Stage 2 as an example. The LoR score at Stage 2 is the ratio of the similar proposals in $\mathcal{P}_2^{Flow}$ among all proposals in $\mathcal{P}_2^{Flow}$. A proposal is defined as a similar proposal if its maximum temporal intersection-over-union (max-tIoU) score with the preceding proposals $\mathcal{P}_1^{RGB}$ is higher than a pre-defined threshold of 0.7. The max-tIoU score for one proposal in the corresponding proposal set (say $\mathcal{P}_2^{Flow}$) is calculated by comparing it with all proposals in the other proposal set (say $\mathcal{P}_1^{RGB}$) and returning the maximum temporal intersection-over-union score. Therefore, a higher LoR score indicates that the two sets of proposals from different streams are more similar to each other.
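
A minimal sketch of how the LoR score could be computed from two proposal sets, with segments represented as (start, end) pairs; the helper names are assumptions.

```python
def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def lor_score(current_proposals, preceding_proposals, thresh=0.7):
    """Ratio of proposals in the current set whose max-tIoU against the
    preceding set is higher than the threshold (0.7 as in the text)."""
    if not current_proposals:
        return 0.0
    similar = sum(
        1 for p in current_proposals
        if max((tiou(p, q) for q in preceding_proposals), default=0.0) > thresh)
    return similar / len(current_proposals)

# toy usage: stage-2 flow-stream proposals vs. stage-1 RGB-stream proposals
p_flow_2 = [(2.0, 10.0), (15.0, 30.0), (40.0, 55.0)]
p_rgb_1 = [(2.5, 10.5), (16.0, 29.0)]
print(round(lor_score(p_flow_2, p_rgb_1), 3))   # 0.667
```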

Table 4.6 shows the LoR scores at different stages, where the LoR scores at Stage 1 and Stage 3 examine the similarities of the RGB-stream proposals against the preceding flow-stream proposals, and those at Stage 2 and Stage 4 indicate the similarities of the flow-stream proposals against the preceding RGB-stream proposals. In Table 4.6, we observe that the LoR score gradually increases from 67.1% at Stage 1 to 97.1% at Stage 4. The large LoR score at Stage 4 indicates that the proposals in $\mathcal{P}_4^{Flow}$ and those in $\mathcal{P}_3^{RGB}$ are very similar even though they are gathered for different streams. Thus, little complementary information can be further exploited between these two types of proposals. In this case, we cannot achieve further performance improvements by using our temporal


action localization method after Stage 3, which is consistent with the results in Table 4.5.

Quality of the Generated Action Segments. To further analyze our PCG-TAL framework, we evaluate the quality of the action segments generated by our proposed method. We follow the work in [37] to calculate the average recall (AR) under different average numbers of proposals (AN), which is referred to as AR@AN. On the THUMOS14 dataset [25], we compute the average recall by averaging eleven recall rates based on the tIoU thresholds between 0.5 and 1.0 with an interval of 0.05.
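
The AR@AN metric described above can be sketched as follows, using the tIoU thresholds from 0.5 to 1.0 with a step of 0.05; the simple any-match recall (without one-to-one assignment) is a simplifying assumption.

```python
import numpy as np

def tiou(a, b):
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at(proposals, ground_truths, thresh):
    """Fraction of ground-truth segments covered by at least one proposal."""
    hits = sum(1 for g in ground_truths
               if any(tiou(p, g) >= thresh for p in proposals))
    return hits / max(len(ground_truths), 1)

def ar_at_an(proposals, ground_truths, an=100):
    """AR@AN: average recall over eleven tIoU thresholds (0.5:0.05:1.0)
    using the top-AN proposals (assumed sorted by confidence)."""
    top = proposals[:an]
    thresholds = np.arange(0.5, 1.01, 0.05)
    return float(np.mean([recall_at(top, ground_truths, t) for t in thresholds]))

print(ar_at_an([(9.5, 20.5), (30.0, 41.0)], [(10.0, 20.0), (30.0, 40.0)]))
```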

In Fig. 4.2, we compare the AR@AN results of the action segments generated by the following baseline methods and by our PCG-TAL method at different numbers of stages: (1) FB: we use only the frame-based method to generate action segments; (2) AB: we use only the anchor-based method to generate action segments; (3) SC: we simply combine the segment proposals from the "FB" method and the "AB" method to generate action segments. Note that the baseline methods "FB", "AB" and "SC" are not based on the multi-stage framework; for better presentation, their average recall rates at stage 0 are shown as the same as their results at stages 1-4. The results of our PCG-TAL method at stage 0 are the results of the "SC" method. We observe that the action segments generated by the "AB" method are better than those from the "FB" method, as its average recall rates are higher. Additionally, a simple combination of action segments, as used in the "SC" method, leads to higher average recall rates, which demonstrates the complementarity between the frame-based and anchor-based temporal action localization methods. We also observe that the average recall rates of our PCG-TAL method improve as the number of stages increases. These results, together with the quantitative results shown in Table 4.5, demonstrate that our newly proposed PCG-TAL framework, which takes advantage of cross-granularity and cross-stream complementary information, can help progressively generate better action segments with higher recall rates and more accurate boundaries.


Figure 4.2: AR@AN (%) for the action segments generated by various methods at different numbers of stages on the THUMOS14 dataset: (a) AR@50, (b) AR@100, (c) AR@200, and (d) AR@500.


Figure 4.3: Qualitative comparison of the results from two baseline methods (i.e., the anchor-based method [5] and the frame-based method [37]) and our PCG-TAL method at different numbers of stages on the THUMOS14 dataset. More frames during the period from the 95.4-th second to the 97.7-th second of the video clip are shown in the blue boxes, where the actor slowly walks into the circle spot to start the action. The true action instance of the action class "Throw Discus", with the starting frame and the ending frame at the 97.7-th second and the 104.2-th second of the video clip, is marked in the red box. The frames during the period from the 104.2-th second to the 104.6-th second of the video clip are shown in the green boxes, where the actor has just finished the action and starts to straighten himself up. In this example, the anchor-based method detects the action segment with less accurate boundaries, while the frame-based method misses the true action instance. Our PCG-TAL method can progressively improve the temporal boundaries of the action segment.

4.3.4 Qualitative Results

In addition to the quantitative performance comparison, we also provide some visualization results to further demonstrate the performance of our method. In Fig. 4.3, we show the visualization results generated by using either the anchor-based method [5] or the frame-based method [37], and by our


PCG-TAL method at different numbers of stages. Although the anchor-based method can successfully capture the action instance, it generates action segments with less accurate temporal boundaries. Meanwhile, the frame-based method fails to detect the action instance due to the low predicted probabilities of the starting and ending positions during that period of time. On the other hand, our PCG-TAL method can take advantage of both the anchor-based and frame-based methods to progressively improve the temporal boundaries of the detected action segment and eventually generate the action segment with more accurate temporal boundaries, which demonstrates the effectiveness of our proposed AFC module.

4.4 Summary

In this chapter, we have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both the frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.


Chapter 5

STVGBert: A Visual-linguistic Transformer based Framework for Spatio-temporal Video Grounding

Spatio-temporal video grounding (STVG) aims to localize a spatio-temporal tube of a target object in an untrimmed video based on a query sentence. In this work, we propose a two-step visual-linguistic transformer based framework called STVGBert for the STVG task, which consists of a Spatial Visual Grounding network (SVG-net) and a Temporal Boundary Refinement network (TBR-net). In the first step, the SVG-net directly takes a video and a query sentence as the input and produces the initial tube prediction. In the second step, the TBR-net refines the temporal boundaries of the initial tube to further improve the spatio-temporal video grounding performance. As the key component in our SVG-net, the newly introduced cross-modal feature learning module improves ViLBERT by additionally preserving spatial information when extracting text-guided visual features. Besides, the TBR-net is a novel ViLBERT-based multi-stage temporal localization network, which exploits the complementarity of the anchor-based and frame-based temporal boundary refinement methods. Different from the existing works for the video grounding tasks, our proposed framework does not rely on any pre-trained object detector. Comprehensive experiments demonstrate that our


newly proposed framework outperforms the state-of-the-art methods on two benchmark datasets, VidSTG and HC-STVG.

5.1 Background

5.1.1 Vision-language Modelling

Transformer-based neural networks have been widely explored for various vision-language tasks [75, 32, 33, 47, 70], such as visual question answering, image captioning and image-text retrieval. Apart from these works designed for image-based visual-linguistic tasks, Sun et al. [71] proposed VideoBERT for the video captioning task by modeling temporal variations across multiple video frames. However, this work does not model spatial information within each frame, so it cannot be applied to the spatio-temporal video grounding task discussed in this work.

The aforementioned transformer-based neural networks take the features extracted from either Regions of Interest (RoIs) in images or video frames as the input features, but spatial information in the feature space cannot be preserved when transforming them into visual tokens. In this work, we propose an improved version of ViLBERT [47] to better model spatio-temporal information in videos and learn better cross-modal representations.

5.1.2 Visual Grounding in Images/Videos

Visual grounding in images/videos aims to localize the object of interest in an image/video based on a query sentence. In most existing methods [42, 96, 45, 82, 87, 88, 41, 86, 7, 102], a pre-trained object detector is often required to pre-generate object proposals, and the proposal that best matches the given input description is then selected as the final result. For the visual grounding tasks in images, some recent works [92, 34, 48, 91] proposed new one-stage grounding frameworks without using any pre-trained object detector. For example, Liao et al. [34] used the anchor-free object detection method [104] to localize the target objects based on the cross-modal representations. For the video grounding task, Zhang et


al. [102] proposed a new method (referred to as STGRN) that does not rely on pre-generated tube proposals. Unfortunately, this work [102] still requires a pre-trained object detector to first generate object proposals, since the output bounding boxes are retrieved from these candidate bounding boxes. The concurrent work STGVT [76], similar to our proposed framework, also adopted a visual-linguistic transformer to learn cross-modal representations for the spatio-temporal video grounding task. But this work [76] also needs to first generate tube proposals, as in most existing methods [7, 86]. Additionally, in the works [59, 30, 101, 90], pre-trained object detectors are also required to generate object proposals for object relationship modelling.

In contrast to these works [102, 76, 59, 30, 101, 90], we introduce a new framework with an improved ViLBERT module to generate spatio-temporal object tubes without requiring any pre-trained object detectors. We also propose a temporal boundary refinement network to further refine the temporal boundaries, which has not been explored in the existing methods.

5.2 Methodology

5.2.1 Overview

We denote an untrimmed video $V$ with $KT$ frames as a set of non-overlapping video clips, namely, we have $V = \{V_k^{clip}\}_{k=1}^{K}$, where $V_k^{clip}$ indicates the $k$-th video clip consisting of $T$ frames and $K$ is the number of video clips. We also denote a textual description as $S = \{s_n\}_{n=1}^{N}$, where $s_n$ indicates the $n$-th word in the description $S$ and $N$ is the total number of words. The STVG task aims to output the spatio-temporal tube $B = \{b_t\}_{t=t_s}^{t_e}$ containing the object of interest (i.e., the target object) between the $t_s$-th frame and the $t_e$-th frame, where $b_t$ is a 4-d vector indicating the top-left and bottom-right spatial coordinates of the target bounding box in the $t$-th frame, and $t_s$ and $t_e$ are the temporal starting and ending frame indexes of the object tube $B$.
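
A small sketch of the input/output structures defined above: splitting a video into K non-overlapping clips of T frames, and representing the predicted tube as per-frame boxes with temporal boundaries. The field names are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple
import numpy as np

def split_into_clips(frames: np.ndarray, clip_len: int) -> List[np.ndarray]:
    """Split a (K*T, H, W, 3) frame array into K non-overlapping clips of T frames."""
    assert frames.shape[0] % clip_len == 0, "expected K*T frames"
    return [frames[i:i + clip_len] for i in range(0, frames.shape[0], clip_len)]

@dataclass
class ObjectTube:
    """Spatio-temporal tube B = {b_t} for t in [t_s, t_e]."""
    t_start: int
    t_end: int
    boxes: Dict[int, Tuple[float, float, float, float]]  # frame index -> (x1, y1, x2, y2)

# toy usage: 24 frames split into clips of T = 8 frames
clips = split_into_clips(np.zeros((24, 112, 112, 3), dtype=np.uint8), clip_len=8)
tube = ObjectTube(t_start=4, t_end=6,
                  boxes={4: (10, 12, 50, 80), 5: (11, 12, 52, 81), 6: (12, 13, 53, 82)})
print(len(clips), tube.t_start, tube.t_end)   # 3 4 6
```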


Figure 5.1: (a) The overview of our spatio-temporal video grounding framework STVGBert. Our two-step framework consists of two sub-networks, the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net), which takes video and textual query pairs as the input to generate spatio-temporal object tubes containing the object of interest. In (b), the SVG-net generates the initial object tubes. In (c), the TBR-net progressively refines the temporal boundaries of the initial object tubes and generates the final spatio-temporal object tubes.


This task requires both spatial localization and temporal localization based on the query sentence. Therefore, we propose a two-step framework, which consists of two sub-networks: the Spatial Visual Grounding network (SVG-net) and the Temporal Boundary Refinement network (TBR-net). As shown in Fig. 5.1, the SVG-net first takes the untrimmed video and the textual description as the input and predicts the initial object tube, based on which the TBR-net progressively updates the temporal boundaries by using a newly proposed ViLBERT-based multi-stage framework to produce the final spatio-temporal object tube. We describe each step in detail in the following sections.

5.2.2 Spatial Visual Grounding

Unlike the previous methods [86, 7], which first output a set of tube proposals by linking pre-detected object bounding boxes, our SVG-net does not require any pre-trained object detector. In general, our SVG-net extracts the visual features from the video frames, produces the text-guided visual feature by using our newly introduced cross-modal feature learning module, and finally generates an initial object tube, including the bounding box $b_t$ for the target object in each frame and the indexes of the starting and ending frames.

Visual Feature and Textual Feature Encoding. We first use ResNet-101 [20] as the image encoder to extract the visual feature. The output from the 4th residual block is reshaped to the size of $HW \times C$, with $H$, $W$ and $C$ indicating the height, width and the number of channels of the feature map, respectively, which is then employed as the extracted visual feature. For the $k$-th video clip, we stack the extracted visual features from each frame in this video clip to construct the clip feature $F^{clip}_k \in \mathbb{R}^{T \times HW \times C}$, which is then fed into our cross-modal feature learning module to produce the multi-modal visual feature.
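For illustration, the following minimal sketch (in PyTorch, which we use for our implementation) shows one way to obtain such a clip feature from a torchvision ResNet-101; the exact layer cut, the toy input resolution and the variable names are illustrative assumptions rather than our actual code.

    import torch
    import torchvision

    # Minimal sketch (assumed details): extract features from a torchvision ResNet-101
    # and reshape them into a clip feature of size T x HW x C.
    backbone = torchvision.models.resnet101()   # ImageNet-pretrained weights are used in practice
    # Keep everything up to the 4th residual stage; the exact cut point is an assumption.
    feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-3])
    feature_extractor.eval()

    frames = torch.randn(4, 3, 256, 256)                 # T = 4 RGB frames of one clip (toy input)
    with torch.no_grad():
        fmap = feature_extractor(frames)                 # T x C x H x W
    T, C, H, W = fmap.shape
    clip_feature = fmap.flatten(2).permute(0, 2, 1)      # T x HW x C, i.e. the clip feature
    print(clip_feature.shape)                            # torch.Size([4, 256, 1024])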

For the textual descriptions, we use a word embedding module to map each word in the description to a word vector, and each word vector is considered as one textual input token. Additionally, we add two special tokens ['CLS'] and ['SEP'] before and after the textual input tokens of the description to construct the complete textual input tokens $E = \{e_n\}_{n=1}^{N+2}$, where $e_n$ is the $n$-th textual input token and $N$ is the number of words in the description.

Multi-modal Feature Learning. Given the visual input feature $F^{clip}_k$ and the textual input tokens $E$, we develop an improved ViLBERT module, called ST-ViLBERT, to learn the visual-linguistic representation. Following the structure of ViLBERT [47], our ST-ViLBERT module consists of a visual branch and a textual branch, where both branches adopt the multi-layer transformer encoder [78] structure. As shown in Figure 5.2(a), the visual branch interacts with the textual branch via a set of co-attention layers, which exchange information between the key-value pairs to generate the text-guided visual feature (or vice versa). Please refer to the work [47] for further details.

The work ViLBERT [47] takes the visual features extracted from all pre-generated proposals within an image as the visual input to learn the visual-linguistic representation (see Figure 5.2(a)). However, since these visual features are spatially pooled by using the average pooling operation, spatial information in the visual input feature space will be lost. While such information is important for bounding box prediction, this is not an issue for ViLBERT as it assumes the bounding boxes are generated by using pre-trained object detectors.

Our ST-ViLBERT module extends ViLBERT for the spatial localization task without requiring any pre-generated bounding boxes, so we need to preserve spatial information when performing cross-modal feature learning. Specifically, we introduce a Spatio-temporal Combination and Decomposition (STCD) module to replace the Multi-head Attention and Add & Norm modules for the visual branch in ViLBERT. As shown in Figure 5.2(b), our STCD module respectively applies the spatial and temporal average pooling operations on the input visual feature (i.e., the visual output from the last layer) to produce the initial temporal feature with the size of $T \times C$ and the initial spatial feature with the size of $HW \times C$, which are then concatenated to construct the combined visual feature with the size of $(T + HW) \times C$.


Figure 5.2: (a) The overview of one co-attention layer in ViLBERT [47]. The co-attention block, consisting of a visual branch and a textual branch, generates the visual-linguistic representations by exchanging the key-value pairs for the multi-head attention blocks. (b) The structure of our Spatio-temporal Combination and Decomposition (STCD) module, which replaces the "Multi-head Attention" and "Add & Norm" blocks in the visual branch of ViLBERT (marked in the green dotted box in (a)).


We then pass the combined visual feature to the textual branch, where it is used as the key and value in the multi-head attention block of the textual branch. Additionally, the combined visual feature is also fed to the multi-head attention block of the visual branch, together with the textual input (i.e., the textual output from the last layer), to generate the initial text-guided visual feature with the size of $(T + HW) \times C$, which is then decomposed into the text-guided temporal feature with the size of $T \times C$ and the text-guided spatial feature with the size of $HW \times C$. These two features are then respectively replicated $HW$ and $T$ times to match the dimension of the input visual feature. The replicated features and the input visual feature are added up and normalized to generate the intermediate visual feature with the size of $T \times HW \times C$. The remaining parts (including the textual branch) are the same as those in ViLBERT [47]. In our ST-ViLBERT, we take the output from the visual and textual branches in the last co-attention layer as the text-guided visual feature $F^{tv} \in \mathbb{R}^{T \times HW \times C}$ and the visual-guided textual feature, respectively.
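The tensor bookkeeping of the STCD module can be summarized by the sketch below, written with single-sample (batch-free) tensors and toy sizes; the multi-head attention applied between the combination and decomposition steps is omitted, and the function and variable names are our own illustrative choices.

    import torch
    import torch.nn.functional as F

    def stcd_combine(visual_feat):
        # visual_feat: (T, HW, C) visual output from the last co-attention layer
        temporal_feat = visual_feat.mean(dim=1)                   # (T, C): spatial average pooling
        spatial_feat = visual_feat.mean(dim=0)                    # (HW, C): temporal average pooling
        return torch.cat([temporal_feat, spatial_feat], dim=0)    # ((T + HW), C) combined feature

    def stcd_decompose(text_guided_feat, visual_feat):
        # text_guided_feat: ((T + HW), C), here the attention output would normally be used
        T, HW, C = visual_feat.shape
        tg_temporal = text_guided_feat[:T].unsqueeze(1).expand(T, HW, C)   # replicate HW times
        tg_spatial = text_guided_feat[T:].unsqueeze(0).expand(T, HW, C)    # replicate T times
        fused = visual_feat + tg_temporal + tg_spatial                     # add
        return F.layer_norm(fused, (C,))                                   # norm -> (T, HW, C)

    # Toy example, only to show the shapes involved.
    v = torch.randn(4, 256, 1024)
    combined = stcd_combine(v)                    # (260, 1024)
    intermediate = stcd_decompose(combined, v)    # (4, 256, 1024)
    print(combined.shape, intermediate.shape)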

Spatial Localization. Given the text-guided visual feature $F^{tv}$, our SVG-net predicts the bounding box $b_t$ for each frame. We first reshape $F^{tv}$ to the size of $T \times H \times W \times C$. Taking the feature from each individual frame (with the size of $H \times W \times C$) as the input of three deconvolution layers, we then upsample the spatial resolution by a factor of 8. Similar to CenterNet [104], the upsampled feature is used as the input of two parallel detection heads, with each head consisting of a $3 \times 3$ convolution layer for feature extraction and a $1 \times 1$ convolution layer for dimension reduction. The first detection head outputs a heatmap $\hat{A} \in \mathbb{R}^{8H \times 8W}$, where the value at each spatial location indicates the probability of this position being the bounding box center of the target object, and the second head regresses the size (i.e., the height and width) of the target bounding box at each position. In the heatmap, we select the spatial location with the highest probability as the predicted bounding box center and use the corresponding predicted height and width at this selected position to calculate the top-left and the bottom-right coordinates of the predicted bounding box.
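As a concrete illustration of this decoding step, the sketch below picks the most probable center from the heatmap and converts the regressed size into box corners; it is a simplified stand-in for our actual implementation (for example, the channel order of the size map and the omission of sub-pixel offsets are assumptions).

    import torch

    def decode_box(center_heatmap, size_map):
        # center_heatmap: (8H, 8W) center probabilities from the first head
        # size_map: (2, 8H, 8W) predicted (width, height) from the second head (assumed order)
        Hm, Wm = center_heatmap.shape
        idx = torch.argmax(center_heatmap)
        cy, cx = int(idx) // Wm, int(idx) % Wm                      # most probable center
        w, h = size_map[0, cy, cx].item(), size_map[1, cy, cx].item()
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)      # (x1, y1, x2, y2)
        return box, center_heatmap[cy, cx].item()

    # Toy example with 8H = 8W = 64.
    box, score = decode_box(torch.rand(64, 64), torch.rand(2, 64, 64) * 20)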

Initial Tube Generation Sub-module. In addition to bounding box prediction, the SVG-net also predicts the temporal starting and ending frame positions. As shown in Fig. 5.1(b), we apply spatial average pooling on the text-guided visual feature $F^{tv}$ to produce the global text-guided visual feature $F^{gtv} \in \mathbb{R}^{T \times C}$ for each input video clip with $T$ frames. We also feed the global textual feature (i.e., the visual-guided textual feature corresponding to the token ['CLS']) into an MLP layer to produce the intermediate textual feature in the $C$-dim common feature space, such that we can compute the initial relevance score for each frame between the corresponding visual feature from $F^{gtv}$ and the intermediate textual feature in the common feature space by using the correlation operation. After applying the Sigmoid activation function, we obtain the final relevance score $\hat{o}^{obj}$ for each frame in one video clip, which indicates how relevant each frame is to the textual description in the common feature space. We then combine the relevance scores of all frames from all video clips to construct a sequence of relevance scores for all $KT$ frames in the whole video, and then apply a median filter operation to smooth this score sequence. After that, the positions of the starting and ending frames can be determined by using a pre-defined threshold; namely, the bounding boxes with smoothed relevance scores less than the threshold will be removed from the initial tube. If more than one starting and ending frame pair is detected, we select the pair with the longest duration as our initial starting and ending frame prediction result $(t^0_s, t^0_e)$. The initial temporal boundaries $(t^0_s, t^0_e)$ and the predicted bounding boxes $b_t$ form the initial tube prediction result $B^{init} = \{b_t\}_{t=t^0_s}^{t^0_e}$.
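A small sketch of this post-processing step is given below, using SciPy's median filter and the window size and threshold reported in Section 5.3.1; the function name and the handling of edge cases are our own choices rather than the exact implementation.

    import numpy as np
    from scipy.signal import medfilt

    def initial_boundaries(relevance_scores, kernel_size=9, threshold=0.3):
        # relevance_scores: per-frame scores for the whole video (all KT frames)
        smoothed = medfilt(np.asarray(relevance_scores, dtype=np.float64), kernel_size)
        keep = smoothed >= threshold
        best, start = None, None
        for t, flag in enumerate(keep):
            if flag and start is None:
                start = t                                   # a candidate segment begins
            if start is not None and (not flag or t == len(keep) - 1):
                end = t if flag else t - 1                  # the segment ends
                if best is None or end - start > best[1] - best[0]:
                    best = (start, end)                     # keep the longest segment
                start = None
        return best                                         # (t_s, t_e), or None if nothing passes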

Loss Functions. We use a combination of a focal loss, an L1 loss and a cross-entropy loss to train the SVG-net. At the training stage, we randomly sample a set of video clips with $T$ consecutive frames from each input video, and then we select the video clips which contain at least one frame having a ground-truth bounding box as the training samples. Below we take one frame as an example to introduce the losses. Specifically, for each frame in a training video clip, we denote the center, width and height as well as the relevance label of the ground-truth bounding box as $(\tilde{x}, \tilde{y})$, $w$, $h$ and $o$, respectively. When the frame contains a ground-truth bounding box, we have $o = 1$; otherwise, we have $o = 0$. Based on the ground-truth bounding box, we follow [104] to generate a center heatmap $A$ for each frame by using the Gaussian kernel $a_{xy} = \exp\left(-\frac{(x-\tilde{x})^2+(y-\tilde{y})^2}{2\sigma^2}\right)$, where $a_{xy}$ is the value of $A \in \mathbb{R}^{8H \times 8W}$ at the spatial location $(x, y)$ and $\sigma$ is the bandwidth parameter, which is adaptively determined based on the object size [28]. For training our SVG-net, the objective function $L_{SVG}$ for each frame in the training video clip is defined as follows:

$L_{SVG} = \lambda_1 L_c(\hat{A}, A) + \lambda_2 L_{rel}(\hat{o}^{obj}, o) + \lambda_3 L_{size}(\hat{w}_{xy}, \hat{h}_{xy}, w, h)$   (5.1)

where $\hat{A} \in \mathbb{R}^{8H \times 8W}$ is the predicted heatmap, $\hat{o}^{obj}$ is the predicted relevance score, and $\hat{w}_{xy}$ and $\hat{h}_{xy}$ are the predicted width and height of the bounding box centered at the position $(x, y)$. $L_c$ is the focal loss [39] for predicting the bounding box center, $L_{size}$ is an L1 loss for regressing the size of the bounding box, and $L_{rel}$ is a cross-entropy loss for relevance score prediction. We empirically set the loss weights $\lambda_1 = \lambda_2 = 1$ and $\lambda_3 = 0.1$. We then average $L_{SVG}$ over all frames in each training video clip to produce the total loss for this training video clip. Please refer to the Supplementary for further details.
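The sketch below shows one way the per-frame objective in Eq. (5.1) could be assembled; the focal-loss form follows the CenterNet variant [104] with its common hyper-parameters, which is an assumption, and in practice the size term would be evaluated only at the ground-truth center location.

    import torch
    import torch.nn.functional as F

    def svg_loss(pred_heatmap, gt_heatmap, pred_rel, gt_rel, pred_wh, gt_wh,
                 weights=(1.0, 1.0, 0.1), alpha=2.0, beta=4.0):
        # Focal loss on the center heatmap (CenterNet-style form, assumed here).
        p = pred_heatmap.clamp(1e-6, 1.0 - 1e-6)
        pos = gt_heatmap.eq(1.0).float()
        pos_loss = -((1.0 - p) ** alpha) * torch.log(p) * pos
        neg_loss = -((1.0 - gt_heatmap) ** beta) * (p ** alpha) * torch.log(1.0 - p) * (1.0 - pos)
        l_c = (pos_loss.sum() + neg_loss.sum()) / pos.sum().clamp(min=1.0)
        l_rel = F.binary_cross_entropy(pred_rel, gt_rel)      # relevance score (cross-entropy)
        l_size = F.l1_loss(pred_wh, gt_wh)                    # bounding box width/height (L1)
        return weights[0] * l_c + weights[1] * l_rel + weights[2] * l_size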

5.2.3 Temporal Boundary Refinement

The accurate temporal boundaries of the generated tube play an important role in the spatio-temporal video grounding task. For the temporal localization task, we observe that the anchor-based methods can often produce temporal segments covering the entire ground-truth segments, but their boundaries may not be accurate, while the frame-based methods can accurately determine boundaries in local starting/ending regions but may falsely detect temporal segments because overall segment-level information is missing in these methods. To this end, we develop a multi-stage ViLBERT-based Temporal Boundary Refinement network (TBR-net), consisting of an anchor-based branch and a frame-based branch, to progressively refine the temporal boundaries of the initial tube.

As shown in Fig. 5.3, at the first stage, the output from the SVG-net (i.e., $B^{init}$) is used as an anchor and its temporal boundaries are first refined by using the anchor-based branch, which takes advantage of the overall feature of the anchor to produce the refined temporal boundaries $\hat{t}_s$ and $\hat{t}_e$.

Figure 5.3: Overview of our Temporal Boundary Refinement network (TBR-net), which is a multi-stage ViLBERT-based framework with an anchor-based branch and a frame-based branch (two stages are used for better illustration). The input starting and ending frame positions at each stage are those of the output object tube from the previous stage. At stage 1, the input object tube is generated by our SVG-net.


The frame-based branch then continues to refine the boundaries $\hat{t}_s$ and $\hat{t}_e$ within the local regions by leveraging the frame-level features. The improved temporal boundaries from the frame-based branch (i.e., $t_s$ and $t_e$) can be used as a better anchor for the anchor-based branch at the next stage. This iterative process is repeated until satisfactory results are obtained. We take the first stage of our TBR-net as an example to introduce the details of each branch; the details in the other stages are similar to those in the first stage.
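The alternation between the two branches can be summarized by the control flow below; the two callables and the default of three stages (the setting used in our experiments) stand in for the actual sub-networks.

    def refine_boundaries(t_s, t_e, anchor_branch, frame_branch, num_stages=3):
        # anchor_branch and frame_branch are placeholders for the two sub-networks;
        # each maps a (start, end) pair to a refined (start, end) pair.
        for _ in range(num_stages):
            t_s, t_e = anchor_branch(t_s, t_e)   # segment-level regression over the whole anchor
            t_s, t_e = frame_branch(t_s, t_e)    # frame-level re-localization in local regions
        return t_s, t_e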

Anchor-based branch. For the anchor-based branch, the core idea is to take advantage of the feature extracted from the whole anchor in order to regress the target boundaries by predicting the offsets.

Given an anchor with the temporal boundaries $(t^0_s, t^0_e)$ and the length $l = t^0_e - t^0_s$, we temporally extend the anchor before the starting frame and after the ending frame by $l/4$ to include more context information. We then uniformly sample 20 frames within the temporal duration $[t^0_s - l/4,\ t^0_e + l/4]$ and generate the anchor feature $F$ by stacking the corresponding features along the temporal dimension from the pooled video feature $F^{pool}$ at the 20 sampled frames, where the pooled video feature $F^{pool}$ is generated by stacking the spatially pooled features from all frames within each video along the temporal dimension.

The anchor feature $F$, together with the textual input tokens $E$, is used as the input to the co-attention based transformer module of ViLBERT to generate the text-guided anchor feature, based on which the average pooling operation is used to produce the pooled text-guided anchor feature. This feature is then fed into an MLP layer to regress the relative positions with respect to $t^0_s$ and $t^0_e$ and generate the new starting and ending frame positions $\hat{t}^1_s$ and $\hat{t}^1_e$.
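The sampling of the anchor feature can be sketched as follows; the feature dimension, the clamping at the video borders and the function name are illustrative assumptions.

    import torch

    def build_anchor_feature(pooled_video_feature, t_s, t_e, num_samples=20):
        # pooled_video_feature: (num_frames, C) spatially pooled feature of the whole video
        length = t_e - t_s
        lo, hi = t_s - length / 4.0, t_e + length / 4.0          # extend the anchor by l/4
        idx = torch.linspace(lo, hi, steps=num_samples).round().long()
        idx = idx.clamp(0, pooled_video_feature.shape[0] - 1)    # stay inside the video
        return pooled_video_feature[idx]                         # (num_samples, C) anchor feature

    # Toy example: a 200-frame video with 512-d pooled features and an anchor [40, 120].
    anchor_feature = build_anchor_feature(torch.randn(200, 512), 40, 120)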

Frame-based branch. For the frame-based branch, we extract the feature from each frame within the starting/ending duration and predict the probability of this frame being a starting/ending frame. Below, we take the starting position as an example. Given the refined starting position $\hat{t}^1_s$ generated from the previous anchor-based branch, we define the starting duration as $D_s = [\hat{t}^1_s - D/2 + 1,\ \hat{t}^1_s + D/2]$, where $D$ is the length of the starting duration. We then construct the frame-level feature $F_s$ by stacking the corresponding features of all frames within the starting duration from the pooled video feature $F^{pool}$. We then take the frame-level feature $F_s$ and the textual input tokens $E$ as the input and adopt the ViLBERT module to produce the text-guided frame-level feature, which is then used as the input to a 1D convolution operation to predict the probability of each frame within the starting duration being the starting frame. We then select the frame with the highest probability as the starting frame position $t^1_s$. Accordingly, we can produce the ending frame position $t^1_e$. Note that the kernel size of the 1D convolution operations is 1, and the parameters related to the operations for generating the starting and ending frame positions are not shared. Once the new starting and ending frame positions $t^1_s$ and $t^1_e$ are generated, they can be used as the input for the anchor-based branch at the next stage.
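A shape-level sketch of the frame-based scoring is shown below; the convolution is freshly initialized here purely to illustrate the data flow, the duration length is an illustrative value, and an analogous (separately parameterized) scorer would be used for the ending position.

    import torch

    def refine_start_position(frame_features, t_s_hat, D=16):
        # frame_features: (num_frames, C) text-guided frame-level features (stand-in for the
        # ViLBERT output); D is the length of the starting duration (illustrative value).
        C = frame_features.shape[1]
        lo = max(0, t_s_hat - D // 2 + 1)
        hi = min(frame_features.shape[0], t_s_hat + D // 2 + 1)
        window = frame_features[lo:hi]                            # frames in the starting duration
        scorer = torch.nn.Conv1d(C, 1, kernel_size=1)             # kernel size 1, as in the text
        probs = torch.sigmoid(scorer(window.t().unsqueeze(0))).squeeze()
        return lo + int(torch.argmax(probs))                      # refined starting frame position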

Loss Functions. To train our TBR-net, we define the training objective function $L_{TBR}$ as follows:

$L_{TBR} = \lambda_4 L_{anc}(\hat{\Delta t}, \Delta t) + \lambda_5 L_s(\hat{p}_s, p_s) + \lambda_6 L_e(\hat{p}_e, p_e)$   (5.2)

where $L_{anc}$ is the regression loss used in [14] for regressing the temporal offsets, and $L_s$ and $L_e$ are the cross-entropy losses for predicting the starting and the ending frames, respectively. In $L_{anc}$, $\hat{\Delta t}$ and $\Delta t$ are the predicted offsets and the ground-truth offsets, respectively. In $L_s$, $\hat{p}_s$ is the predicted probability of being the starting frame for the frames within the starting duration, while $p_s$ is the ground-truth label for starting frame position prediction. Similarly, we use $\hat{p}_e$ and $p_e$ for the loss $L_e$. In this work, we empirically set the loss weights $\lambda_4 = \lambda_5 = \lambda_6 = 1$.

5.3 Experiment

5.3.1 Experiment Setup

Datasets. We evaluate our proposed framework on the VidSTG [102] dataset and the HC-STVG [76] dataset.


- VidSTG. This dataset consists of 99,943 sentence descriptions, with 44,808 declarative sentences and 55,135 interrogative sentences describing 79 types of objects appearing in the untrimmed videos. Following [102], we divide the sentence descriptions into the training set, the validation set and the testing set with 36,202 (resp. 44,482), 3,996 (resp. 4,960) and 4,610 (resp. 5,693) declarative (resp. interrogative) sentences. The described objects in the untrimmed videos are annotated with spatio-temporal tubes.

- HC-STVG. This dataset consists of 5,660 video-description pairs, and all videos are untrimmed. This dataset is human-centric, since all videos are captured in multi-person scenes and the descriptions contain rich expressions related to human attributes and actions. This dataset is divided into the training set and the testing set with 4,500 and 1,160 video-sentence pairs, respectively. All target persons are annotated with spatio-temporal tubes.

Implementation details. We use the ResNet-101 [20] network pretrained on ImageNet [11] as our image encoder to extract the visual features from the RGB frames of the input videos. For the ST-ViLBERT module in our SVG-net and the ViLBERT module in our TBR-net, we employ the ViLBERT model pretrained on the Conceptual Captions dataset [61] for initialization. Following [102], we sample the input videos at a frame rate of 5 fps. The batch size and the initial learning rate are set to 6 and 0.00001, respectively. After training our models (for both SVG-net and TBR-net) for 50 epochs, we decrease the learning rate by a factor of 10 and then train the models for another 10 epochs. The window size of the median filter, the pre-defined threshold $\theta_p$ and the temporal length $T$ of each clip are set as 9, 0.3 and 4, respectively. We apply three stages in the temporal boundary refinement step, as the results are already saturated. Our method is implemented by using PyTorch on a machine with a single GTX 1080Ti GPU.
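For reference, the training schedule above can be expressed as a short sketch; the optimizer type and the placeholder model are assumptions (the thesis text does not state the optimizer here), while the learning rate, decay factor and epoch counts follow the settings just described.

    import torch

    model = torch.nn.Linear(512, 4)                    # placeholder for SVG-net / TBR-net
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)       # assumed optimizer
    scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50], gamma=0.1)
    for epoch in range(60):
        # ... iterate over training batches (batch size 6) and call optimizer.step() ...
        scheduler.step()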

Evaluation metrics. We follow [102] to use m_vIoU and vIoU@R as our evaluation criteria. The vIoU is calculated as $vIoU = \frac{1}{|S_U|}\sum_{t \in S_I} r_t$, where $r_t$ is the IoU between the detected bounding box and the ground-truth bounding box at frame $t$, the set $S_I$ contains the intersected frames between the detected tubes and the ground-truth tubes (i.e., the intersection set between the frames from both tubes), and $S_U$ is the set of frames from either the detected tubes or the ground-truth tubes. The m_vIoU score is defined as the average vIoU score over all testing samples, and vIoU@R refers to the ratio of the testing samples with vIoU > R over all the testing samples.
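For clarity, the metric can be computed as in the sketch below, where a tube is represented as a per-frame dictionary of boxes; this is our own minimal formulation of the definition above.

    def viou(pred_tube, gt_tube):
        # Tubes are dictionaries mapping a frame index to a box (x1, y1, x2, y2).
        def iou(a, b):
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
            union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
            return inter / union if union > 0 else 0.0

        s_i = set(pred_tube) & set(gt_tube)          # S_I: frames present in both tubes
        s_u = set(pred_tube) | set(gt_tube)          # S_U: frames present in either tube
        return sum(iou(pred_tube[t], gt_tube[t]) for t in s_i) / max(len(s_u), 1)

    def viou_at_r(viou_values, r):
        # Fraction of testing samples whose vIoU exceeds the threshold R.
        return sum(v > r for v in viou_values) / len(viou_values)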

Baseline Methods

- STGRN [102] is the state-of-the-art method on the VidSTG dataset. Although this method does not require pre-generated tube proposals, it still requires a pre-trained detector to produce the bounding box proposals in each frame, which are then used to build the spatial relation graph and the temporal dynamic graph, and the final bounding boxes are selected from these proposals. Therefore, its performance is highly dependent on the quality of the pre-generated proposals.

- STGVT [76] is the state-of-the-art method on the HC-STVG dataset. Similar to our SVG-net, it adopts a visual-linguistic transformer module based on ViLBERT [47] to learn the cross-modal representations. However, STGVT relies on a pre-trained object detector and a linking algorithm to generate the tube proposals, while our framework does not require any pre-generated tubes.

- Vanilla (ours). In our proposed SVG-net, the key part is the ST-ViLBERT module. Different from the original ViLBERT [47], which does not model spatial information for the input visual features, the ST-ViLBERT in our SVG-net can preserve spatial information from the input visual features and model spatio-temporal information. To evaluate the effectiveness of the ST-ViLBERT, we introduce this baseline by replacing the ST-ViLBERT with the original ViLBERT.

- SVG-net (ours). As the first step of our proposed framework, the SVG-net can predict the initial tube. Here, we treat it as a baseline method. By comparing the performance of this baseline method and our two-step framework, we can evaluate the effectiveness of our TBR-net.


5.3.2 Comparison with the State-of-the-art Methods

We compare our proposed method with the state-of-the-art methods on both the VidSTG and HC-STVG datasets. The results on these two datasets are shown in Table 5.1 and Table 5.2. From the results, we observe that: 1) Our proposed method outperforms the state-of-the-art methods by a large margin on both datasets in terms of all evaluation metrics. 2) SVG-net (ours) achieves better performance when compared with the Vanilla model, indicating that it is more effective to use our ST-ViLBERT module to learn cross-modal representations than the original ViLBERT module. 3) Our proposed temporal boundary refinement method can further boost the performance: when compared with the initial tube generation results from our SVG-net, we achieve more than 2% improvement in terms of all evaluation metrics after applying our TBR-net to refine the temporal boundaries.

5.3.3 Ablation Study

In this section, we take the VidSTG dataset as an example to conduct the ablation study and investigate the contributions of different components in our proposed method.

Effectiveness of the ST-ViLBERT Module. As shown in Table 5.1 and Table 5.2, the video visual grounding results after using our SVG-net are much better than those of our baseline method Vanilla, which demonstrates that it is useful to additionally preserve spatial information when modeling spatio-temporal information in the spatial visual grounding step.

It is also interesting to observe that the performance improvement between our SVG-net and our baseline method Vanilla on the HC-STVG dataset is larger than that on the VidSTG dataset, which indicates that it is more useful to employ the proposed ST-ViLBERT module on the HC-STVG dataset. A possible explanation is that the HC-STVG dataset contains many complex multi-person scenes, which makes the spatial localization task more challenging than on the VidSTG dataset.

Table 5.1: Results of different methods on the VidSTG dataset. "*" indicates the results are quoted from its original work.

                          Declarative Sentence Grounding            Interrogative Sentence Grounding
Method                    m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)       m_vIoU(%)  vIoU@0.3(%)  vIoU@0.5(%)
STGRN [102]*              19.75      25.77        14.60             18.32      21.10        12.83
Vanilla (ours)            21.49      27.69        15.77             20.22      23.17        13.82
SVG-net (ours)            22.87      28.31        16.31             21.57      24.09        14.33
SVG-net + TBR-net (ours)  25.21      30.99        18.21             23.67      26.26        16.01


Table 5.2: Results of different methods on the HC-STVG dataset. "*" indicates the results are quoted from the original work.

Method                    m_vIoU   vIoU@0.3   vIoU@0.5
STGVT [76]*               18.15    26.81      9.48
Vanilla (ours)            17.94    25.59      8.75
SVG-net (ours)            19.45    27.78      10.12
SVG-net + TBR-net (ours)  22.41    30.31      12.07

Without pre-generating the tube proposals, our ST-ViLBERT can learn a more representative cross-modal feature than the original ViLBERT, because spatial information is preserved from the input visual features in our ST-ViLBERT.

Effectiveness of the TBR-net. Our temporal boundary refinement method takes advantage of the complementary information between the anchor-based and frame-based temporal boundary refinement methods to progressively refine the temporal boundary locations of the initial object tubes from our SVG-net. To verify the effectiveness of combining these two lines of temporal localization methods and the design of our multi-stage structure, we conduct experiments by using different variants for the temporal boundary refinement module. The results are reported in Table 5.3. For all methods, the initial tubes are generated by using our SVG-net, so these methods are referred to as SVG-net, SVG-net + IFB-net, SVG-net + IAB-net and SVG-net + TBR-net, respectively.

Through the results, we have the following observations: (1) For the iterative frame-based method (IFB-net) and the iterative anchor-based method (IAB-net), the spatio-temporal video grounding performance can be slightly improved after increasing the number of stages. (2) The performance of both SVG-net + IFB-net and SVG-net + IAB-net at any stage is generally better than that of the baseline method SVG-net, indicating that either the frame-based or the anchor-based method can refine the temporal boundaries of the initial object tubes. (3) The results also show that the IFB-net brings more improvement in terms of vIoU@0.5, while the IAB-net performs better in terms of vIoU@0.3.

Table 5.3: Results of our method and its variants when using different numbers of stages on the VidSTG dataset.

                                   Declarative Sentence Grounding      Interrogative Sentence Grounding
Methods              Stages        m_vIoU  vIoU@0.3  vIoU@0.5          m_vIoU  vIoU@0.3  vIoU@0.5
SVG-net              -             22.87   28.31     16.31             21.57   24.09     14.33
SVG-net + IFB-net    1             23.41   28.84     16.69             22.12   24.48     14.57
SVG-net + IFB-net    2             23.82   28.97     16.85             22.55   24.69     14.71
SVG-net + IFB-net    3             23.77   28.91     16.87             22.61   24.72     14.79
SVG-net + IAB-net    1             23.27   29.57     16.39             21.78   24.93     14.31
SVG-net + IAB-net    2             23.74   29.98     16.52             22.21   25.42     14.59
SVG-net + IAB-net    3             23.85   30.15     16.57             22.43   25.71     14.47
SVG-net + TBR-net    1             24.32   29.92     17.22             22.69   25.37     15.29
SVG-net + TBR-net    2             24.95   30.75     17.02             23.34   26.03     15.83
SVG-net + TBR-net    3             25.21   30.99     18.21             23.67   26.26     16.01


(4) Our full model SVG-net + TBR-net outperforms the two alternative methods SVG-net + IFB-net and SVG-net + IAB-net, and achieves the best results in terms of all evaluation metrics, which demonstrates that our TBR-net can take advantage of the merits of these two alternative methods.

5.4 Summary

In this chapter, we have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, to produce spatio-temporal object tubes for a given query sentence, which consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.


Chapter 6

Conclusions and Future Work

With the increasing accessibility of personal camera devices and the rapid development of network technologies, we have witnessed an increasing number of videos on the Internet. It can be foreseen that the number of videos will continue to grow in the future, and the learning techniques for video localization will attract more and more attention from both academia and industry. In this thesis, we have investigated three major vision tasks for video localization: spatio-temporal action localization, temporal action localization and spatio-temporal visual grounding, and we have developed new algorithms for each of these problems. In this chapter, we conclude the contributions of our work and discuss potential directions for video localization in the future.

6.1 Conclusions

The contributions of this thesis are summarized as follows:

• We have proposed the Progressive Cross-stream Cooperation (PCSC) framework to improve the spatio-temporal action localization results, in which we first use our PCSC framework for spatial localization at the frame level and then apply our temporal PCSC framework for temporal localization at the action tube level. Our PCSC framework consists of several iterative stages. At each stage, we progressively improve the action localization results for one stream (i.e., RGB/flow) by leveraging the information from another stream (i.e., flow/RGB) at both the region proposal level and the feature level.


The effectiveness of our newly proposed approaches is demonstrated by extensive experiments on both the UCF-101-24 and J-HMDB datasets.

• We have proposed a progressive cross-granularity cooperation (PCG-TAL) framework to gradually improve the temporal action localization performance. Our PCG-TAL framework consists of an Anchor-Frame Cooperation (AFC) module to take advantage of both frame-based and anchor-based proposal generation schemes, along with a two-stream cooperation strategy to encourage collaboration between the complementary appearance and motion clues. In our framework, the cooperation mechanisms are conducted in a progressive fashion at both the feature level and the segment proposal level by stacking multiple AFC modules over different stages. Comprehensive experiments and ablation studies on the THUMOS14, ActivityNet v1.3 and UCF-101-24 datasets show the effectiveness of our proposed framework.

• We have proposed a novel two-step spatio-temporal video grounding framework, STVGBert, based on the visual-linguistic transformer, to produce spatio-temporal object tubes for a given query sentence, which consists of the SVG-net and the TBR-net. Besides, we have introduced a novel cross-modal feature learning module, ST-ViLBERT, in our SVG-net. Our SVG-net can produce bounding boxes without requiring any pre-trained object detector. We have also designed a new Temporal Boundary Refinement network, which leverages the complementarity of the frame-based and the anchor-based methods to iteratively refine the temporal boundaries. Comprehensive experiments on two benchmark datasets, VidSTG and HC-STVG, demonstrate the effectiveness of our newly proposed framework for spatio-temporal video grounding.

6.2 Future Work

A potential direction for video localization in the future is to unify spatial video localization and temporal video localization into one framework. As described in Chapter 3, most of the existing works generate the spatial and temporal action localization results by two separate networks, respectively. It is a non-trivial task to integrate these two networks into one framework and optimize it as a whole. Similarly, while we have made progress on handling the spatial visual grounding task and the temporal visual grounding task with our STVGBert introduced in Chapter 5, these two tasks are still performed by two separate networks. We believe that more effort is still needed in the future to combine these two branches to generate the spatial and the temporal visual grounding results simultaneously.


Bibliography

[1] L. Anne Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. "Localizing moments in video with natural language". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5803–5812.

[2] A. Blum and T. Mitchell. "Combining Labeled and Unlabeled Data with Co-training". In: Proceedings of the Eleventh Annual Conference on Computational Learning Theory. COLT' 98. Madison, Wisconsin, USA: ACM, 1998, pp. 92–100.

[3] F. Caba Heilbron, V. Escorcia, B. Ghanem, and J. Carlos Niebles. "ActivityNet: A large-scale video benchmark for human activity understanding". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 961–970.

[4] J. Carreira and A. Zisserman. "Quo vadis, action recognition? A new model and the Kinetics dataset". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 6299–6308.

[5] Y.-W. Chao, S. Vijayanarasimhan, B. Seybold, D. A. Ross, J. Deng, and R. Sukthankar. "Rethinking the Faster R-CNN architecture for temporal action localization". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1130–1139.

[6] S. Chen and Y.-G. Jiang. "Semantic proposal for activity localization in videos via sentence query". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 8199–8206.

[7] Z. Chen, L. Ma, W. Luo, and K.-Y. Wong. "Weakly-Supervised Spatio-Temporally Grounding Natural Sentence in Video". In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. 2019, pp. 1884–1894.

[8] G. Chéron, A. Osokin, I. Laptev, and C. Schmid. "Modeling Spatio-Temporal Human Track Structure for Action Localization". In: CoRR abs/1806.11008 (2018). arXiv: 1806.11008. URL: http://arxiv.org/abs/1806.11008.

[9] X. Dai, B. Singh, G. Zhang, L. S. Davis, and Y. Qiu Chen. "Temporal context network for activity localization in videos". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5793–5802.

[10] A. Dave, O. Russakovsky, and D. Ramanan. "Predictive-corrective networks for action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 981–990.

[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. "ImageNet: A large-scale hierarchical image database". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2009, pp. 248–255.

[12] C. Feichtenhofer, A. Pinz, and A. Zisserman. "Convolutional two-stream network fusion for video action recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1933–1941.

[13] J. Gao, K. Chen, and R. Nevatia. "CTAP: Complementary temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 68–83.

[14] J. Gao, C. Sun, Z. Yang, and R. Nevatia. "TALL: Temporal activity localization via language query". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5267–5275.

[15] J. Gao, Z. Yang, K. Chen, C. Sun, and R. Nevatia. "TURN TAP: Temporal unit regression network for temporal action proposals". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3628–3636.

[16] R. Girshick. "Fast R-CNN". In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). 2015, pp. 1440–1448.

[17] R. Girshick, J. Donahue, T. Darrell, and J. Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.

[18] G. Gkioxari and J. Malik. "Finding action tubes". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2015.

[19] C. Gu, C. Sun, D. A. Ross, C. Vondrick, C. Pantofaru, Y. Li, S. Vijayanarasimhan, G. Toderici, S. Ricco, R. Sukthankar, et al. "AVA: A video dataset of spatio-temporally localized atomic visual actions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 6047–6056.

[20] K. He, X. Zhang, S. Ren, and J. Sun. "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 770–778.

[21] R. Hou, C. Chen, and M. Shah. "Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos". In: The IEEE International Conference on Computer Vision (ICCV). 2017.

[22] G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger. "Densely connected convolutional networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4700–4708.

[23] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. "FlowNet 2.0: Evolution of optical flow estimation with deep networks". In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Vol. 2. 2017, p. 6.

[24] H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. J. Black. "Towards understanding action recognition". In: Proceedings of the IEEE International Conference on Computer Vision. 2013, pp. 3192–3199.

[25] Y.-G. Jiang, J. Liu, A. R. Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. THUMOS challenge: Action recognition with a large number of classes. 2014.

[26] V. Kalogeiton, P. Weinzaepfel, V. Ferrari, and C. Schmid. "Action tubelet detector for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4405–4413.

[27] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet classification with deep convolutional neural networks". In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.

[28] H. Law and J. Deng. "CornerNet: Detecting objects as paired keypoints". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 734–750.

[29] C. Lea, M. D. Flynn, R. Vidal, A. Reiter, and G. D. Hager. "Temporal convolutional networks for action segmentation and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 156–165.

[30] J. Lei, L. Yu, T. L. Berg, and M. Bansal. "TVQA+: Spatio-temporal grounding for video question answering". In: arXiv preprint arXiv:1904.11574 (2019).

[31] D. Li, Z. Qiu, Q. Dai, T. Yao, and T. Mei. "Recurrent tubelet proposal and recognition networks for action detection". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 303–318.

[32] G. Li, N. Duan, Y. Fang, M. Gong, D. Jiang, and M. Zhou. "Unicoder-VL: A Universal Encoder for Vision and Language by Cross-Modal Pre-Training". In: AAAI. 2020, pp. 11336–11344.

[33] L. H. Li, M. Yatskar, D. Yin, C.-J. Hsieh, and K.-W. Chang. "VisualBERT: A Simple and Performant Baseline for Vision and Language". In: arXiv (2019).

[34] Y. Liao, S. Liu, G. Li, F. Wang, Y. Chen, C. Qian, and B. Li. "A Real-Time Cross-modality Correlation Filtering Method for Referring Expression Comprehension". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10880–10889.

[35] T. Lin, X. Liu, X. Li, E. Ding, and S. Wen. "BMN: Boundary-Matching Network for Temporal Action Proposal Generation". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 3889–3898.

[36] T. Lin, X. Zhao, and Z. Shou. "Single shot temporal action detection". In: Proceedings of the 25th ACM International Conference on Multimedia. 2017, pp. 988–996.

[37] T. Lin, X. Zhao, H. Su, C. Wang, and M. Yang. "BSN: Boundary sensitive network for temporal action proposal generation". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 3–19.

[38] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. "Feature Pyramid Networks for Object Detection". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2017.

[39] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. "Focal loss for dense object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2980–2988.

[40] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. "Microsoft COCO: Common objects in context". In: European Conference on Computer Vision. Springer. 2014, pp. 740–755.

[41] D. Liu, H. Zhang, F. Wu, and Z.-J. Zha. "Learning to assemble neural module tree networks for visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4673–4682.

[42] J. Liu, L. Wang, and M.-H. Yang. "Referring expression generation and comprehension via attributes". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 4856–4864.

[43] L. Liu, H. Wang, G. Li, W. Ouyang, and L. Lin. "Crowd counting using deep recurrent spatial-aware network". In: Proceedings of the 27th International Joint Conference on Artificial Intelligence. AAAI Press. 2018, pp. 849–855.

[44] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. "SSD: Single shot multibox detector". In: European Conference on Computer Vision. Springer. 2016, pp. 21–37.

[45] X. Liu, Z. Wang, J. Shao, X. Wang, and H. Li. "Improving referring expression grounding with cross-modal attention-guided erasing". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1950–1959.

[46] Y. Liu, L. Ma, Y. Zhang, W. Liu, and S.-F. Chang. "Multi-granularity Generator for Temporal Action Proposal". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 3604–3613.

[47] J. Lu, D. Batra, D. Parikh, and S. Lee. "ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks". In: Advances in Neural Information Processing Systems. 2019, pp. 13–23.

[48] G. Luo, Y. Zhou, X. Sun, L. Cao, C. Wu, C. Deng, and R. Ji. "Multi-task Collaborative Network for Joint Referring Expression Comprehension and Segmentation". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10034–10043.

[49] S. Ma, L. Sigal, and S. Sclaroff. "Learning activity progression in LSTMs for activity detection and early detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1942–1950.

[50] B. Ni, X. Yang, and S. Gao. "Progressively parsing interactional objects for fine grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1020–1028.

[51] W. Ouyang, K. Wang, X. Zhu, and X. Wang. "Chained cascade network for object detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1938–1946.

[52] W. Ouyang and X. Wang. "Single-pedestrian detection aided by multi-pedestrian detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2013, pp. 3198–3205.

[53] W. Ouyang, X. Zeng, X. Wang, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, H. Li, et al. "DeepID-Net: Object detection with deformable part based convolutional neural networks". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 39.7 (2017), pp. 1320–1334.

[54] X. Peng and C. Schmid. "Multi-region two-stream R-CNN for action detection". In: European Conference on Computer Vision. Springer. 2016, pp. 744–759.

[55] L. Pigou, A. van den Oord, S. Dieleman, M. V. Herreweghe, and J. Dambre. "Beyond Temporal Pooling: Recurrence and Temporal Convolutions for Gesture Recognition in Video". In: International Journal of Computer Vision 126.2-4 (2018), pp. 430–439.

[56] Z. Qiu, T. Yao, and T. Mei. "Learning spatio-temporal representation with pseudo-3D residual networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5533–5541.

[57] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. "You only look once: Unified, real-time object detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 779–788.

[58] S. Ren, K. He, R. Girshick, and J. Sun. "Faster R-CNN: Towards real-time object detection with region proposal networks". In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.

[59] A. Sadhu, K. Chen, and R. Nevatia. "Video Object Grounding using Semantic Roles in Language Description". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10417–10427.

[60] S. Saha, G. Singh, M. Sapienza, P. H. S. Torr, and F. Cuzzolin. "Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos". In: Proceedings of the British Machine Vision Conference 2016, BMVC 2016, York, UK, September 19-22, 2016. 2016.

[61] P. Sharma, N. Ding, S. Goodman, and R. Soricut. "Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning". In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018, pp. 2556–2565.

[62] Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. "CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 5734–5743.

[63] Z. Shou, D. Wang, and S.-F. Chang. "Temporal action localization in untrimmed videos via multi-stage CNNs". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1049–1058.

[64] K. Simonyan and A. Zisserman. "Two-Stream Convolutional Networks for Action Recognition in Videos". In: Advances in Neural Information Processing Systems 27. 2014, pp. 568–576.

[65] K. Simonyan and A. Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).

[66] B. Singh, T. K. Marks, M. Jones, O. Tuzel, and M. Shao. "A multi-stream bi-directional recurrent neural network for fine-grained action detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 1961–1970.

[67] G. Singh, S. Saha, M. Sapienza, P. H. Torr, and F. Cuzzolin. "Online real-time multiple spatiotemporal action localisation and prediction". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 3637–3646.

[68] K. Soomro, A. R. Zamir, and M. Shah. "UCF101: A dataset of 101 human actions classes from videos in the wild". In: arXiv preprint arXiv:1212.0402 (2012).

[69] R. Su, W. Ouyang, L. Zhou, and D. Xu. "Improving Action Localization by Progressive Cross-stream Cooperation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 12016–12025.

[70] W. Su, X. Zhu, Y. Cao, B. Li, L. Lu, F. Wei, and J. Dai. "VL-BERT: Pre-training of Generic Visual-Linguistic Representations". In: International Conference on Learning Representations. 2020.

[71] C. Sun, A. Myers, C. Vondrick, K. Murphy, and C. Schmid. "VideoBERT: A joint model for video and language representation learning". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7464–7473.

[72] C. Sun, A. Shrivastava, C. Vondrick, K. Murphy, R. Sukthankar, and C. Schmid. "Actor-centric Relation Network". In: The European Conference on Computer Vision (ECCV). 2018.

[73] S. Sun, J. Pang, J. Shi, S. Yi, and W. Ouyang. "FishNet: A versatile backbone for image, region and pixel level prediction". In: Advances in Neural Information Processing Systems. 2018, pp. 762–772.

[74] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. "Going deeper with convolutions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 1–9.

[75] H. Tan and M. Bansal. "LXMERT: Learning Cross-Modality Encoder Representations from Transformers". In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2019.

[76] Z. Tang, Y. Liao, S. Liu, G. Li, X. Jin, H. Jiang, Q. Yu, and D. Xu. "Human-centric Spatio-Temporal Video Grounding With Visual Transformers". In: arXiv preprint arXiv:2011.05049 (2020).

[77] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. "Learning spatiotemporal features with 3D convolutional networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 4489–4497.

[78] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. "Attention is all you need". In: Advances in Neural Information Processing Systems. 2017, pp. 5998–6008.

[79] L. Wang, Y. Qiao, X. Tang, and L. Van Gool. "Actionness Estimation Using Hybrid Fully Convolutional Networks". In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). 2016, pp. 2708–2717.

[80] L. Wang, Y. Xiong, D. Lin, and L. Van Gool. "UntrimmedNets for weakly supervised action recognition and detection". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 4325–4334.

[81] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. "Temporal segment networks: Towards good practices for deep action recognition". In: European Conference on Computer Vision. Springer. 2016, pp. 20–36.

[82] P. Wang, Q. Wu, J. Cao, C. Shen, L. Gao, and A. v. d. Hengel. "Neighbourhood watch: Referring expression comprehension via language-guided graph attention networks". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 1960–1968.

[83] P. Weinzaepfel, Z. Harchaoui, and C. Schmid. "Learning to track for spatio-temporal action localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 3164–3172.

[84] S. Xie, C. Sun, J. Huang, Z. Tu, and K. Murphy. "Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification". In: Proceedings of the European Conference on Computer Vision (ECCV). 2018, pp. 305–321.

[85] H. Xu, A. Das, and K. Saenko. "R-C3D: Region convolutional 3D network for temporal activity detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 5783–5792.

[86] M. Yamaguchi, K. Saito, Y. Ushiku, and T. Harada. "Spatio-temporal person retrieval via natural language queries". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 1453–1462.

[87] S. Yang, G. Li, and Y. Yu. "Cross-modal relationship inference for grounding referring expressions". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2019, pp. 4145–4154.

[88] S. Yang, G. Li, and Y. Yu. "Dynamic graph attention for referring expression comprehension". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4644–4653.

[89] X. Yang, X. Yang, M.-Y. Liu, F. Xiao, L. S. Davis, and J. Kautz. "STEP: Spatio-Temporal Progressive Learning for Video Action Detection". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2019.

[90] X. Yang, X. Liu, M. Jian, X. Gao, and M. Wang. "Weakly-Supervised Video Object Grounding by Exploring Spatio-Temporal Contexts". In: Proceedings of the 28th ACM International Conference on Multimedia. 2020, pp. 1939–1947.

[91] Z. Yang, T. Chen, L. Wang, and J. Luo. "Improving One-stage Visual Grounding by Recursive Sub-query Construction". In: arXiv preprint arXiv:2008.01059 (2020).

[92] Z. Yang, B. Gong, L. Wang, W. Huang, D. Yu, and J. Luo. "A fast and accurate one-stage approach to visual grounding". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 4683–4693.

[93] Z. Yang, J. Gao, and R. Nevatia. "Spatio-temporal action detection with cascade proposal and location anticipation". In: arXiv preprint arXiv:1708.00042 (2017).

[94] Y. Ye, X. Yang, and Y. Tian. "Discovering spatio-temporal action tubes". In: Journal of Visual Communication and Image Representation 58 (2019), pp. 515–524.

[95] S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. "End-to-end learning of action detection from frame glimpses in videos". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 2678–2687.

[96] L. Yu, Z. Lin, X. Shen, J. Yang, X. Lu, M. Bansal, and T. L. Berg. "MAttNet: Modular attention network for referring expression comprehension". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 1307–1315.

[97] J. Yuan, B. Ni, X. Yang, and A. A. Kassim. "Temporal action localization with pyramid of score distribution features". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, pp. 3093–3102.

[98] Z. Yuan, J. C. Stroud, T. Lu, and J. Deng. "Temporal action localization by structured maximal sums". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017, pp. 3684–3692.

[99] R. Zeng, W. Huang, M. Tan, Y. Rong, P. Zhao, J. Huang, and C. Gan. "Graph Convolutional Networks for Temporal Action Localization". In: Proceedings of the IEEE International Conference on Computer Vision. 2019, pp. 7094–7103.

[100] W. Zhang, W. Ouyang, W. Li, and D. Xu. "Collaborative and Adversarial Network for Unsupervised Domain Adaptation". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018, pp. 3801–3809.

[101] Z. Zhang, Z. Zhao, Z. Lin, B. Huai, and J. Yuan. "Object-Aware Multi-Branch Relation Networks for Spatio-Temporal Video Grounding". In: Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence. 2020, pp. 1069–1075.

[102] Z. Zhang, Z. Zhao, Y. Zhao, Q. Wang, H. Liu, and L. Gao. "Where Does It Exist? Spatio-Temporal Video Grounding for Multi-Form Sentences". In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020, pp. 10668–10677.

[103] Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. "Temporal action detection with structured segment networks". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2914–2923.

[104] X. Zhou, D. Wang, and P. Krähenbühl. "Objects as Points". In: arXiv preprint arXiv:1904.07850 (2019).

[105] M. Zolfaghari, G. L. Oliveira, N. Sedaghat, and T. Brox. "Chained multi-stream networks exploiting pose, motion, and appearance for action classification and detection". In: Proceedings of the IEEE International Conference on Computer Vision. 2017, pp. 2904–2913.
