
Video Re-localization

Yang Feng‡⋆  Lin Ma†  Wei Liu†  Tong Zhang†  Jiebo Luo‡

†Tencent AI Lab   ‡University of Rochester
{yfeng23,jluo}@cs.rochester.edu, [email protected], [email protected], [email protected]

Abstract. Many methods have been developed to help people find the video contents they want efficiently. However, there are still some unsolved problems in this area. For example, given a query video and a reference video, how to accurately localize a segment in the reference video such that the segment semantically corresponds to the query video? We define a distinctively new task, namely video re-localization, to address this scenario. Video re-localization is an important emerging technology implicating many applications, such as fast seeking in videos, video copy detection, video surveillance, etc. Meanwhile, it is also a challenging research task because the visual appearance of a semantic concept in videos can have large variations. The first hurdle to clear for the video re-localization task is the lack of existing datasets. It is labor expensive to collect pairs of videos with semantic coherence or correspondence and label the corresponding segments. We first exploit and reorganize the videos in ActivityNet to form a new dataset for video re-localization research, which consists of about 10,000 videos of diverse visual appearances associated with localized boundary information. Subsequently, we propose an innovative cross gated bilinear matching model such that every time step in the reference video is matched against the attentively weighted query video. Consequently, the prediction of the starting and ending time is formulated as a classification problem based on the matching results. Extensive experimental results show that the proposed method outperforms the competing methods. Our code is available at: https://github.com/fengyang0317/video_reloc.

Keywords: Video Re-localization · Cross Gating · Bilinear Matching

1 Introduction

A great number of videos are generated every day. To effectively access the videos, several kinds of methods have been developed. The most common and mature one is searching by keywords. However, keyword-based search largely depends on user tagging. The tags of a video are user specified, and it is unlikely for a user to tag all the content in a complex video. Content-based video retrieval (CBVR) [3,22,11] has emerged to circumvent the tagging issue. Given a query video, CBVR systems analyze the content in it and retrieve videos with relevant contents to the query video.

⋆ This work was done while Yang Feng was a Research Intern with Tencent AI Lab.


Fig. 1. The top video is a clip of an action performed by two characters. The middle video is a whole episode which contains the same action happening in a different environment (marked by a green rectangle). The bottom is a video containing the same action but performed by two real persons. Given the top query video, video re-localization aims to accurately detect the starting and ending points of the green segments in the middle and bottom videos, respectively. Such segments semantically correspond to the given query video.

After retrieving videos, the user will have many videos in hand. It is time-consuming to watch all the videos from the beginning to the end to determine the relevance. Thus, video summarization methods [32,21] have been proposed to create a brief synopsis of a long video. Users are able to get the general idea of a long video quickly with the help of video summarization. Similar to video summarization, video captioning [29,30] aims to summarize a video using one or more sentences. Researchers have also developed localization methods to help users quickly seek some video clips in a long video. The localization methods mainly focus on localizing video clips belonging to a list of pre-defined classes, for example, actions [26,13]. Recently, localization methods with natural language queries have been developed [1,7].

Although existing video retrieval techniques are powerful, there still remain some unsolved problems. Let us consider the following scenario: when a user is watching YouTube, he or she finds a very interesting video clip as shown in the top row of Fig. 1. This clip shows an action performed by two boy characters in a cartoon named "Dragon Ball Z". What should the user do if he/she wants to find when such an action also happens in that cartoon? Simply finding exactly the same content using copy detection methods [12] would fail in most cases, as the content variations across videos can be large. As shown in the middle video of Fig. 1, the action takes place in a different environment. Copy detection methods cannot handle such complicated scenarios. An alternative approach is relying on a proper action localization method. However, action localization methods usually localize pre-defined actions. When the action within


the video clip, as shown in Fig. 1, has not been pre-defined or seen in the training dataset, action localization methods will not work. Therefore, an intuitive way to solve this problem is to crop the segment of interest as the query video and design a new model to localize the semantically matched segments in full episodes.

Motivated by this example, we define a distinctively new task called video re-localization, which aims at localizing a segment in a reference video such that this segment semantically corresponds to a query video. Specifically, the inputs to the new task are one query video and one reference video. The query video is a short clip which users are interested in. The reference video contains at least one segment semantically corresponding to the content in the query video. Then, video re-localization aims at accurately detecting the starting and ending points of the segment, which semantically corresponds to the query video.

Video re-localization implicates many real applications. With a query clip, a user can quickly find the content he/she is interested in by video re-localization, thus avoiding seeking in a long video manually. Video re-localization can also be applied to video surveillance and video-based person re-identification [19,20].

Video re-localization is a very challenging task. First, the appearances of the query and reference videos may be quite different due to environment, subject, and viewpoint variances, even though they express the same visual concept. Second, determining the accurate starting and ending points is very challenging. There may be no obvious boundaries at the starting and ending points. Another key obstacle to video re-localization is the lack of video datasets that contain pairs of query and reference videos as well as the associated localization information.

In order to tackle the video re-localization problem, we create a new dataset by reorganizing the videos in ActivityNet [6]. When building the dataset, we assume that the action segments belonging to the same class semantically correspond to each other. The query video is the segment that contains one action. The paired reference video contains one segment of the same type of action and the background information before and after the segment. We randomly split the 200 action classes into three parts: 160 action classes are used for training, 20 action classes are used for validation, and the remaining 20 action classes are used for testing. Such a split guarantees that the action class of a video used for testing is unseen during training. Therefore, if the performance of a video re-localization model is good on the testing set, it should be able to generalize to other unseen actions as well.

To address the technical challenges of video re-localization, we propose a cross gated bilinear matching model with three recurrent layers. First, local video features are extracted from both the query and reference videos. The feature extraction is performed considering only a short period of video frames. The first recurrent layer is used to aggregate the extracted features and generate a new video feature considering the context information. Based on the aggregated representations, we perform matching between the query and reference videos. The feature at every time step of the reference video is matched with the attentively weighted query video. In each matching step, the reference video feature and the query video feature are


processed by factorized bilinear matching to generate their interaction results. Since not all the parts in the reference video are equally relevant to the query video, a cross gating strategy is stacked before bilinear matching to preserve the most relevant information while gating out the irrelevant information. The obtained interaction results are fed into the second recurrent layer to generate a query-aware reference video representation. The third recurrent layer is used to perform localization, where the prediction of the starting and ending positions is formulated as a classification problem. For each time step, the recurrent unit outputs the probability that the time step belongs to one of the four classes: starting point, ending point, inside the segment, and outside the segment. The final prediction result is the segment with the highest joint probability in the reference video.

In summary, our contributions are four-fold:

1. We introduce a novel task, namely video re-localization, which aims at localizing a segment in the reference video such that the segment semantically corresponds to the given query video.
2. We reorganize the videos in ActivityNet [6] to form a new dataset to facilitate the research on video re-localization.
3. We propose a cross gated bilinear matching model, with the video re-localization task formulated as a classification problem, which can comprehensively capture the interactions between the query and reference videos.
4. We validate the effectiveness of our proposed model on the new dataset and achieve favorable performance, better than the competing methods.

2 Related Work

CBVR systems [3,22,11] have evolved for two decades. Modern CBVR systems support various types of queries, such as query by example, query by object, query by keyword, and query by natural language. Given a query, CBVR systems can retrieve a list of entire videos related to the query. Some of the retrieved videos will inevitably contain contents irrelevant to the query. Users may still need to manually seek the part of interest in a retrieved video, which is time-consuming. Video re-localization introduced in this paper is different from CBVR in the sense that the former can locate the exact starting and ending points of the semantically coherent segment in a long reference video.

Action localization [17,16] is related to video re-localization in the sense that both are intended to find the starting and ending points of a segment in a long video. The difference is that action localization methods merely focus on certain pre-defined action classes. Some attempts were made to go beyond pre-defined classes. Seo et al. [25] proposed a one-shot action recognition method that does not require prior knowledge about actions. Soomro and Shah [27] moved one step further by introducing unsupervised action discovery and localization. In contrast, video re-localization is more general than one-shot or unsupervised action localization in the sense that video re-localization can be applied to many other concepts besides actions.



Fig. 2. The architecture of our proposed model for video re-localization. Local video features are first extracted for both query and reference videos and then aggregated by LSTMs. The proposed cross gated bilinear matching scheme exploits the complicated interactions between the aggregated query and reference video features. The localization layer, relying on the matching results, detects the starting and ending points of a segment in the reference video by performing classification on the hidden state of each time step. The four possible classes are Starting, Ending, Inside, and Outside. Ⓐ denotes the attention mechanism described in Sec. 3; ⊙ and ⊗ denote inner and outer products, respectively.

Recently, Hendricks et al. [1] proposed to retrieve a specific temporal segment from a video by a natural language query. Gao et al. [7] focused on temporal localization of actions in untrimmed videos using natural language queries. Compared to existing action localization methods, they have the advantage of localizing more complex actions than those in a pre-defined list. Our method is different in the sense that we directly match the query and reference video segments in a single video modality.

3 Methodology

Given a query video clip and a reference video, we design a novel model to address the video re-localization task by exploiting their complicated interactions and predicting the starting and ending points of the matched segment. As shown in Fig. 2, our model consists of three components: aggregation, matching, and localization.


3.1 Video Feature Aggregation

In order to effectively represent video contents, we need to choose one or several kinds of video features depending on what kind of semantics we intend to capture. For our video re-localization task, global video features are not considered, as we need to rely on local information to perform segment localization.

After performing feature extraction, two lists of local features with a temporal order are obtained for the query and reference videos, respectively. The query video features are denoted by a matrix Q ∈ R^{d×q}, where d is the feature dimension and q is the number of features in the query video, which is related to the video length. Similarly, the reference video is denoted by a matrix R ∈ R^{d×r}, where r is the number of features in the reference video. As aforementioned, feature extraction only considers video characteristics within a short range. In order to incorporate contextual information within a longer range, we employ two long short-term memory (LSTM) [10] units to aggregate the extracted features:

    h_i^q = \text{LSTM}(q_i, h_{i-1}^q),
    h_i^r = \text{LSTM}(r_i, h_{i-1}^r),                                                 (1)

where q_i and r_i are the i-th columns in Q and R, respectively, and h_i^q, h_i^r ∈ R^{l×1} are the hidden states at the i-th time step of the two LSTMs, with l denoting the dimensionality of the hidden state. Note that the parameters of the two LSTMs are shared to reduce the model size. The yielded hidden states of the LSTMs are regarded as new video representations. Due to the natural characteristics and behaviors of LSTMs, the hidden states can encode and aggregate the previous contextual information.
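For illustration, this aggregation step could be sketched as follows (assuming PyTorch; the tensor shapes and variable names are illustrative and not taken from the released code). A single LSTM module processes both sequences, which realizes the parameter sharing described above:

    import torch
    import torch.nn as nn

    d, l = 500, 128                      # C3D+PCA feature size and hidden size (see Sec. 5.2)
    lstm = nn.LSTM(input_size=d, hidden_size=l, batch_first=True)  # one LSTM, shared weights

    query = torch.randn(1, 20, d)        # toy query features Q with q = 20 time steps
    reference = torch.randn(1, 60, d)    # toy reference features R with r = 60 time steps

    h_q, _ = lstm(query)                 # (1, 20, l): hidden states h^q_i
    h_r, _ = lstm(reference)             # (1, 60, l): hidden states h^r_i, same parameters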

3.2 Cross Gated Bilinear Matching

At each time step, we perform matching of the query and reference videos based on the aggregated video representations h_i^q and h_i^r. Our proposed cross gated bilinear matching scheme consists of four modules, i.e., the generation of the attention weighted query, cross gating, bilinear matching, and matching aggregation.

Attention Weighted Query. For video re-localization, the segment corresponding to the query clip can potentially be anywhere in the reference video. Therefore, every feature from the reference video needs to be matched against the query video to capture their semantic correspondence. Meanwhile, the query video may be quite long, so only some parts in the query video actually correspond to one feature in the reference video. Motivated by the machine comprehension method in [31], an attention mechanism is leveraged to select which part in the query video is to be matched with the feature in the reference video. At the i-th time step of the reference video, the query video is weighted by the


attention mechanism:

    e_{i,j} = \tanh(W^q h_j^q + W^r h_i^r + W^m h_{i-1}^f + b^m),
    \alpha_{i,j} = \frac{\exp(w^\top e_{i,j} + b)}{\sum_k \exp(w^\top e_{i,k} + b)},
    \bar{h}_i^q = \sum_j \alpha_{i,j} h_j^q,                                             (2)

where W^q, W^r, W^m ∈ R^{l×l} and w ∈ R^{l×1} are the weight parameters in our attention model, with b^m ∈ R^{l×1} and b ∈ R denoting the bias terms. It can be observed that the attention weight \alpha_{i,j} relies not only on the current representation h_i^r of the reference video but also on the matching result h_{i-1}^f ∈ R^{l×1} from the previous stage, which can be obtained by Eq. (7) and will be introduced later. The attention mechanism tries to find the h_j^q most relevant to h_i^r and uses these relevant h_j^q to generate the query representation \bar{h}_i^q, which is believed to better match h_i^r for the video re-localization task.
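A minimal sketch of this attention step, assuming PyTorch, is given below; the module and parameter names follow the symbols in Eq. (2), but implementation details (such as folding the bias terms into nn.Linear) are illustrative:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QueryAttention(nn.Module):
        def __init__(self, l):
            super().__init__()
            self.W_q = nn.Linear(l, l, bias=False)   # W^q
            self.W_r = nn.Linear(l, l, bias=False)   # W^r
            self.W_m = nn.Linear(l, l)               # W^m, bias acts as b^m
            self.w = nn.Linear(l, 1)                 # w, bias acts as b

        def forward(self, h_q, h_r_i, h_f_prev):
            # h_q: (q, l) aggregated query; h_r_i, h_f_prev: (l,) vectors at step i
            e = torch.tanh(self.W_q(h_q) + self.W_r(h_r_i) + self.W_m(h_f_prev))  # (q, l)
            alpha = F.softmax(self.w(e).squeeze(-1), dim=0)                        # (q,)
            return alpha @ h_q                                                     # attention weighted query

    attn = QueryAttention(l=128)
    bar_h_q = attn(torch.randn(20, 128), torch.randn(128), torch.zeros(128))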

Cross Gating. Based on the attention weighted query representation \bar{h}_i^q and the reference representation h_i^r, we propose a cross gating mechanism to gate out the irrelevant reference parts and emphasize the relevant parts. In cross gating, the gate for the reference video feature depends on the query video. Meanwhile, the query video features are also gated by the current reference video feature. The cross gating mechanism can be expressed by the following equation:

    g_i^r = \sigma(W_r^g h_i^r + b_r^g), \quad \tilde{h}_i^q = \bar{h}_i^q \odot g_i^r,
    g_i^q = \sigma(W_q^g \bar{h}_i^q + b_q^g), \quad \tilde{h}_i^r = h_i^r \odot g_i^q,  (3)

where W_r^g, W_q^g ∈ R^{l×l} and b_r^g, b_q^g ∈ R^{l×1} denote the learnable parameters, and σ denotes the non-linear sigmoid function. If the reference feature h_i^r is irrelevant to the query video, both the reference feature h_i^r and the query representation \bar{h}_i^q are filtered to reduce their effects on the subsequent layers. If h_i^r closely relates to \bar{h}_i^q, the cross gating strategy is expected to further enhance their interactions.
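The cross gating step could be sketched as follows (assuming PyTorch; whether the second gate is computed from the gated or the ungated query is an implementation detail, and here both gates are derived from the ungated representations):

    import torch
    import torch.nn as nn

    class CrossGating(nn.Module):
        def __init__(self, l):
            super().__init__()
            self.gate_from_r = nn.Linear(l, l)   # W^g_r, b^g_r
            self.gate_from_q = nn.Linear(l, l)   # W^g_q, b^g_q

        def forward(self, bar_h_q_i, h_r_i):
            g_r = torch.sigmoid(self.gate_from_r(h_r_i))      # gate computed from the reference
            g_q = torch.sigmoid(self.gate_from_q(bar_h_q_i))  # gate computed from the query
            tilde_h_q_i = bar_h_q_i * g_r                      # query gated by the reference
            tilde_h_r_i = h_r_i * g_q                          # reference gated by the query
            return tilde_h_q_i, tilde_h_r_i

    gating = CrossGating(l=128)
    t_q, t_r = gating(torch.randn(128), torch.randn(128))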

Bilinear Matching. Motivated by bilinear CNN [18], we propose a bilinear matching method to further exploit the interactions between \tilde{h}_i^q and \tilde{h}_i^r, which can be written as:

    t_{ij} = \tilde{h}_i^{q\top} W_j^b \tilde{h}_i^r + b_j^b,                            (4)

where t_{ij} is the j-th dimension of the bilinear matching result t_i = [t_{i1}, t_{i2}, ..., t_{il}]^\top, and W_j^b ∈ R^{l×l} and b_j^b ∈ R are the learnable parameters used to calculate t_{ij}.

The bilinear matching model in Eq. (4) introduces too many parameters, thus making the model difficult to learn. Normally, to generate an l-dimensional bilinear output, the number of parameters introduced would be l^3 + l. In order


to reduce the number of parameters, we factorize the bilinear matching model as:

    \hat{h}_i^q = F_j \tilde{h}_i^q + b_j^f,
    \hat{h}_i^r = F_j \tilde{h}_i^r + b_j^f,
    t_{ij} = \hat{h}_i^{q\top} \hat{h}_i^r,                                              (5)

where F_j ∈ R^{k×l} and b_j^f ∈ R^{k×1} are the parameters to be learned, and k is a hyper-parameter much smaller than l. Therefore, only k × l × (l + 1) parameters are introduced by the factorized bilinear matching model.
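A sketch of the factorized bilinear matching, assuming PyTorch, is given below; stacking the per-dimension projections F_j into a single (l, k, l) tensor is an illustrative implementation choice, not necessarily how the released code is organized:

    import torch
    import torch.nn as nn

    class FactorizedBilinear(nn.Module):
        def __init__(self, l, k):
            super().__init__()
            self.F = nn.Parameter(0.01 * torch.randn(l, k, l))  # F_j stacked over output index j
            self.b = nn.Parameter(torch.zeros(l, k))             # b^f_j stacked over j

        def forward(self, tilde_h_q_i, tilde_h_r_i):
            proj_q = torch.einsum('jkl,l->jk', self.F, tilde_h_q_i) + self.b  # F_j h^q + b^f_j, (l, k)
            proj_r = torch.einsum('jkl,l->jk', self.F, tilde_h_r_i) + self.b  # F_j h^r + b^f_j, (l, k)
            return (proj_q * proj_r).sum(dim=-1)                              # t_i, with t_ij an inner product

    fb = FactorizedBilinear(l=128, k=8)
    t_i = fb(torch.randn(128), torch.randn(128))   # (128,) matching vector at step i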

The factorized bilinear matching scheme captures the relationships between the query and reference representations. By expanding Eq. (5), we have the following equation:

    t_{ij} = \underbrace{\tilde{h}_i^{q\top} F_j^\top F_j \tilde{h}_i^r}_{\text{quadratic term}} + \underbrace{b_j^{f\top} F_j (\tilde{h}_i^q + \tilde{h}_i^r)}_{\text{linear term}} + \underbrace{b_j^{f\top} b_j^f}_{\text{bias term}}.    (6)

Each t_{ij} consists of a quadratic term, a linear term, and a bias term, with the quadratic term capable of capturing the complex dynamics between \tilde{h}_i^q and \tilde{h}_i^r.

Matching Aggregation. Our obtained matching result t_i captures the complicated interactions between the query and reference videos from a local viewpoint. Therefore, an LSTM unit is used to further aggregate the matching context:

    h_i^f = \text{LSTM}(t_i, h_{i-1}^f).                                                 (7)

Following the idea of bidirectional RNNs [24], we also use another LSTM unit to aggregate the matching results in the reverse direction. Let h_i^b denote the hidden state of the LSTM in the reverse direction. By concatenating h_i^f together with h_i^b, the aggregated hidden state h_i^m is generated.
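In isolation, the forward and backward aggregation can be sketched with a bidirectional LSTM (assuming PyTorch); note, however, that in the full model the forward state h^f_{i-1} also feeds the attention of Eq. (2), so matching and aggregation are actually interleaved step by step rather than run as one batched pass:

    import torch
    import torch.nn as nn

    l = 128
    match_lstm = nn.LSTM(input_size=l, hidden_size=l, batch_first=True, bidirectional=True)

    t = torch.randn(1, 60, l)     # toy matching vectors t_i for r = 60 reference steps
    h_m, _ = match_lstm(t)        # (1, 60, 2*l): h^f_i and h^b_i concatenated into h^m_i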

3.3 Localization

The output of the matching layer, h_i^m, indicates whether the content at the i-th time step in the reference video matches well with the query clip. We rely on h_i^m to predict the starting and ending points of the matching segment. Specifically, we formulate the localization task as a classification problem. As illustrated in Fig. 2, at each time step in the reference video, the localization layer predicts the probability that this time step belongs to one of the four classes: starting point, ending point, inside point, and outside point. The localization layer is given by:

    h_i^l = \text{LSTM}(h_i^m, h_{i-1}^l),
    p_i = \text{softmax}(W^l h_i^l + b^l),                                               (8)

where W^l ∈ R^{4×l} and b^l ∈ R^{4×1} are the parameters in the softmax layer. p_i is the predicted probability for time step i. It has four dimensions, p_i^1, p_i^2, p_i^3, and p_i^4, denoting the probabilities of starting, ending, inside, and outside, respectively.
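A sketch of this localization layer, assuming PyTorch, follows; the 2l input size reflects the concatenated bidirectional matching states h^m_i, and all shapes are illustrative:

    import torch
    import torch.nn as nn

    l = 128
    loc_lstm = nn.LSTM(input_size=2 * l, hidden_size=l, batch_first=True)
    classifier = nn.Linear(l, 4)                     # W^l, b^l

    h_m = torch.randn(1, 60, 2 * l)                  # toy matching representation, r = 60 steps
    h_l, _ = loc_lstm(h_m)                           # (1, 60, l)
    p = torch.softmax(classifier(h_l), dim=-1)       # (1, 60, 4): [start, end, inside, outside] per step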


3.4 Training

We train our model using the weighted cross-entropy loss. We generate a label vector for the reference video at each time step. For a reference video with a ground-truth segment [s, e], we assume 1 ≤ s ≤ e ≤ r. The time steps belonging to [1, s) and (e, r] are outside the ground-truth segment, and the generated label probabilities for them are g_i = [0, 0, 0, 1]. The s-th time step is the starting time step, which is assigned the label probabilities g_i = [1/2, 0, 1/2, 0]. Similarly, the label probabilities at the e-th time step are g_i = [0, 1/2, 1/2, 0]. The time steps in the segment (s, e) are labeled as g_i = [0, 0, 1, 0]. When the segment is very short and falls into only one time step, s will be equal to e. In that case, the label probabilities for that time step would be [1/3, 1/3, 1/3, 0]. The cross-entropy loss for one sample pair is given by:

    \text{loss} = -\frac{1}{r} \sum_{i=1}^{r} \sum_{n=1}^{4} g_i^n \log(p_i^n),           (9)

where g_i^n is the n-th dimension of g_i.

One issue of using the above loss for training is that the predicted probabilities of the starting and ending points would be orders of magnitude smaller than the probabilities of the other two classes. The reason is that the positive samples for the starting and ending points are much fewer than those of the other two classes. For one reference video, there is only one starting point and one ending point. In contrast, all the other positions are either inside or outside the segment. Hence, we decide to pay more attention to the losses at the starting and ending positions via a dynamic weighting strategy:

    w_i = \begin{cases} c_w, & \text{if } g_i^1 + g_i^2 > 0, \\ 1, & \text{otherwise}, \end{cases}    (10)

where c_w is a constant. Thus, the weighted loss used for training can be further formulated as:

    \text{loss}_w = -\frac{1}{r} \sum_{i=1}^{r} w_i \sum_{n=1}^{4} g_i^n \log(p_i^n).    (11)
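Taken together, the label construction and the weighted loss of Eqs. (9)-(11) amount to only a few lines. The sketch below assumes NumPy, uses 1-indexed s and e as in the text, and sets c_w = 10 as in Sec. 5.2; the helper names are illustrative:

    import numpy as np

    def make_labels(s, e, r):
        """Per-step target probabilities over [start, end, inside, outside]; s, e are 1-indexed."""
        g = np.zeros((r, 4))
        g[:, 3] = 1.0                        # default: outside the ground-truth segment
        if s == e:                           # segment falls into a single time step
            g[s - 1] = [1/3, 1/3, 1/3, 0]
        else:
            g[s - 1] = [0.5, 0, 0.5, 0]      # starting time step
            g[e - 1] = [0, 0.5, 0.5, 0]      # ending time step
            g[s:e - 1] = [0, 0, 1, 0]        # time steps strictly inside (s, e)
        return g

    def weighted_loss(p, g, c_w=10.0):
        """p, g: (r, 4) predicted and target probabilities; implements Eq. (11)."""
        w = np.where(g[:, 0] + g[:, 1] > 0, c_w, 1.0)            # Eq. (10): up-weight start/end steps
        ce = -(g * np.log(np.clip(p, 1e-12, 1.0))).sum(axis=1)   # per-step cross entropy
        return float((w * ce).mean())

    g = make_labels(s=3, e=6, r=8)
    p = np.full((8, 4), 0.25)                # uniform predictions as a toy example
    print(weighted_loss(p, g))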

3.5 Inference

After the model is properly trained, we can perform video re-localization on a pair of query and reference videos. We localize the segment with the largest joint probability in the reference video, which is given by:

    (s, e) = \arg\max_{s, e}\; p_s^1\, p_e^2 \Big( \prod_{i=s}^{e} p_i^3 \Big)^{\frac{1}{e - s + 1}},    (12)

where s and e are the predicted time steps of the starting and ending points, respectively. As shown in Eq. (12), the geometric mean of all the probabilities inside the segment is used so that the joint probability is not affected by the length of the segment.
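A brute-force sketch of this inference rule, assuming NumPy, is shown below; working in log space turns the geometric mean of the inside probabilities into a simple average, and the optional max_len argument mirrors the segment-length cap used in the experiments:

    import numpy as np

    def localize(p, max_len=None):
        """p: (r, 4) per-step probabilities over [start, end, inside, outside]."""
        r = p.shape[0]
        log_p = np.log(np.clip(p, 1e-12, 1.0))
        best, best_score = (0, 0), -np.inf
        for s in range(r):
            stop = min(r, s + max_len) if max_len else r
            for e in range(s, stop):
                inside = log_p[s:e + 1, 2].mean()           # geometric mean of p^3 in log space
                score = log_p[s, 0] + log_p[e, 1] + inside  # log of the joint probability in Eq. (12)
                if score > best_score:
                    best, best_score = (s, e), score
        return best                                         # 0-indexed (start, end)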


[Fig. 3 shows example videos for the classes "Tennis serve with ball bouncing" and "Ballet".]

Fig. 3. Several video samples in our dataset. The segments containing different actions are marked by green rectangles.

4 The Video Re-localization Dataset

Existing video datasets are usually created for classification [14,8], temporal localization [6], captioning [4], or video summarization [9]. None of them can be directly used for the video re-localization task. To train our video re-localization model, we need pairs of query videos and reference videos, where the segment in the reference video semantically corresponding to the query video should be annotated with its localization information, specifically the starting and ending points. It would be labor expensive to manually collect query and reference videos and localize the segments sharing the same semantics as the query video.

Therefore, in this work, we create a new dataset based on ActivityNet [6] for video re-localization. ActivityNet is a large-scale action localization dataset with segment-level action annotations. We reorganize the video sequences in ActivityNet aiming to relocalize the actions in one video sequence given another video segment of the same action. There are 200 classes in ActivityNet, and the videos of each class are split into training, validation, and testing subsets. This split is not suitable for our video re-localization problem, because we expect a video re-localization method to be able to relocalize more actions than those defined in ActivityNet. Therefore, we split the dataset by action classes. Specifically, we randomly select 160 classes for training, 20 classes for validation, and the remaining 20 classes for testing. This split guarantees that the action classes used for validation and testing will not be seen during training. The video re-localization model is thus required to relocalize unknown actions during testing. If it works well on the testing set, it should be able to generalize well to other unseen actions.

Many videos in ActivityNet are untrimmed and contain multiple action segments. First, we filter out the videos with two overlapped segments that are annotated with different action classes. Second, we merge the overlapped segments of the same action class. Third, we also remove the segments that are longer than 512 frames. After such preprocessing, we obtain 9,530 video segments. Fig. 3 illustrates several video samples in the dataset. It can be observed that some video sequences contain more than one segment. One video segment can be regarded as a query video clip, while its paired reference video can be selected or cropped from the video sequence to contain only one segment with the same action label as the query video clip. During our training process, the query video and reference video are randomly paired, while the pairs are fixed for validation


and testing. In the future, we will release the constructed dataset to the public and continuously enhance the dataset.
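The three-step preprocessing described above can be sketched as follows (plain-Python pseudocode under the assumption that each video comes with a list of (start_frame, end_frame, label) annotations; the function name and data layout are illustrative):

    def preprocess(segments, max_len=512):
        """segments: list of (start_frame, end_frame, label) tuples for one video."""
        # 1) discard the video if two overlapping segments carry different labels
        for i, (s1, e1, c1) in enumerate(segments):
            for s2, e2, c2 in segments[i + 1:]:
                if c1 != c2 and s1 <= e2 and s2 <= e1:
                    return None
        # 2) merge overlapping segments of the same class
        merged = []
        for s, e, c in sorted(segments):
            if merged and merged[-1][2] == c and s <= merged[-1][1]:
                merged[-1] = (merged[-1][0], max(merged[-1][1], e), c)
            else:
                merged.append((s, e, c))
        # 3) drop segments longer than 512 frames
        return [(s, e, c) for s, e, c in merged if e - s + 1 <= max_len]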

5 Experiments

In this section, we conduct several experiments to verify our proposed model. First, three baseline methods are designed and introduced. Then we introduce our experimental settings, including the evaluation criteria and implementation details. Finally, we demonstrate the effectiveness of our proposed model through performance comparisons and ablation studies.

5.1 Baseline Models

Currently, there is no model specifically designed for video re-localization. We design three baseline models, performing frame-level comparison, video-level comparison, and action proposal generation, respectively.

Frame-level Baseline. We design a frame-level baseline motivated by the back-tracking table and diagonal blocks described in [5]. We first normalize the features of the query and reference videos. Then we calculate a distance table D ∈ R^{q×r} by D_{ij} = ||h_i^q − h_j^r||_2. The diagonal block with the smallest average distance is found by dynamic programming. The output of this method is the segment in which the diagonal block lies. Similar to [5], we also allow horizontal and vertical movements so that the length of the output segment can be flexible. Please note that no training is needed for this baseline.
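For intuition, a simplified version of this baseline without the horizontal and vertical moves of [5] (i.e., forcing the output segment to have the same length as the query) can be sketched as follows, assuming NumPy and L2-normalized features:

    import numpy as np

    def frame_level_baseline(h_q, h_r):
        """h_q: (q, d), h_r: (r, d) L2-normalized features, r >= q; returns (start, end)."""
        q = h_q.shape[0]
        D = np.linalg.norm(h_q[:, None, :] - h_r[None, :, :], axis=-1)   # distance table, (q, r)
        best, best_cost = (0, q - 1), np.inf
        for offset in range(h_r.shape[0] - q + 1):                       # slide a length-q diagonal
            cost = D[np.arange(q), offset + np.arange(q)].mean()
            if cost < best_cost:
                best, best_cost = (offset, offset + q - 1), cost
        return best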

Video-level Baseline. In this baseline, each video segment is encoded as a vector by an LSTM. The L2-normalized last hidden state of the LSTM is selected as the video representation. To train this model, we use the triplet loss in [23], which enforces the anchor-positive distance to be smaller than the anchor-negative distance by a margin. The query video is regarded as the anchor. Positive samples are generated by sampling a segment in the reference video having temporal overlap (tIoU) over 0.8 with the ground-truth segment, while negative samples are obtained by sampling a segment with tIoU less than 0.2. At test time, we perform an exhaustive search to select the segment most similar to the query video.

Action Proposal Baseline. We train the SST [2] model on our training set and perform the evaluation on the testing set. The output of the model is the proposal with the largest confidence score.


5.2 Experimental Settings

We use the C3D [28] features released by the ActivityNet Challenge 2016¹. The features are extracted by the publicly available pre-trained C3D model, which has a temporal resolution of 16 frames. The values of the second fully-connected layer (fc7) are projected to 500 dimensions by PCA. We temporally downsample the provided features by a factor of two so that they do not overlap with each other. Adam [15] is used as the optimization method. The parameters of the Adam optimizer are left at their defaults: β1 = 0.9 and β2 = 0.999. The learning rate, the dimension of the hidden state l, the loss weight c_w, and the factorized matrix rank k are set to 0.001, 128, 10, and 8, respectively. We manually limit the maximum allowed length of the predicted segment to 1024 frames.

Following the action localization task, we report the average top-1 mAP computed with tIoU thresholds between 0.5 and 0.9, with a step size of 0.1.
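For reference, the temporal IoU between a predicted segment and the ground truth can be computed as below (a plain-Python sketch; whether the endpoints are treated as inclusive frame indices or continuous timestamps is an evaluation detail not fixed here):

    def tiou(pred, gt):
        """pred, gt: (start, end) segments with start <= end."""
        inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
        union = max(pred[1], gt[1]) - min(pred[0], gt[0])
        return inter / union if union > 0 else 0.0

    print(tiou((17, 27), (17, 27)))   # 1.0 for an exact match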

Table 1. Performance comparisons on our constructed dataset. The top entry in each column is highlighted in boldface.

mAP@1 (tIoU)            0.5    0.6    0.7    0.8    0.9   Average
Chance                 16.2   11.0    5.4    2.9    1.2    7.3
Frame-level baseline   18.8   13.9    9.6    5.0    2.3    9.9
Video-level baseline   24.3   17.4   12.0    5.9    2.2   12.4
SST [2]                33.2   24.7   17.2    7.8    2.7   17.1
Our model              43.5   35.1   27.3   16.2    6.5   25.7

5.3 Performance Comparisons

Table 1 shows the results of our method and the baseline methods. According to the results, we have several observations. The frame-level baseline performs better than random guessing, which suggests that the C3D features preserve the similarity between videos. The result of the frame-level baseline is significantly inferior to that of our model. The reason may be attributed to the fact that no training process is involved in the frame-level baseline.

The performance of the video-level baseline is slightly better than the frame-level baseline, which suggests that the LSTM used in the video-level baseline learns to project corresponding videos to similar representations. However, the LSTM encodes the two video segments independently without considering their complicated interactions. Therefore, it cannot accurately predict the starting and ending points. Additionally, the video-level baseline is very inefficient during the inference process because the reference video needs to be encoded multiple times for an exhaustive search.

¹ http://activity-net.org/challenges/2016/download.html


[Fig. 4 shows two examples. Shuffleboard: query length 14, reference length 49; ground truth 17-27, Ours 17-27, Frame 15-23, Video 17-19. Pole vault: query length 15, reference length 49; ground truth 10-37, Ours 10-38, Frame 20-32, Video 17-38.]

Fig. 4. Qualitative results. The segment corresponding to the query is marked by a green rectangle. Our model can accurately localize the segment semantically corresponding to the query video in the reference video.

Fig. 5. Visualization of the attention mechanism. The top video is the query, while the bottom video is the reference. Color intensities of the blue lines indicate the attention strengths: the darker the colors, the higher the attention weights. Note that only the connections with high attention weights are shown.

Our method is substantially better than the three baseline methods. The good results of our method indicate that the cross gated bilinear matching scheme indeed helps to capture the interactions between the query and the reference videos. The starting and ending points can be accurately detected, demonstrating its effectiveness for the video re-localization task.

Some qualitative results from the testing set are shown in Fig. 4. It can be observed that the query and reference videos are of great visual difference, even though they express the same semantic meaning. Although our model has not seen these actions during the training process, it can effectively measure their semantic similarities, and consequently localizes the segments correctly in the reference videos.

5.4 Ablation Study

Contributions of Different Components. To verify the contribution of each part of our proposed cross gated bilinear matching model, we perform three ablation studies. In the first ablation study, we create a base model by removing the cross gating part and replacing the bilinear part with the concatenation of the two feature vectors. The second and third studies are designed by adding cross gating and bilinear matching to the base model, respectively. Table 2 lists the results of the aforementioned ablation studies. It can be observed that both bilinear matching and cross gating are helpful for the video re-localization task. Cross gating helps filter out the irrelevant information while enhancing the meaningful interactions between the query and reference videos. Bilinear matching fully exploits the interactions between the reference and query videos, leading to better results than the base model. Our full model, consisting of both cross gating and bilinear matching, achieves the best results.

Table 2. Performance comparisons of the ablation study. The top entry in each column is highlighted in boldface.

mAP@1 (tIoU)            0.5    0.6    0.7    0.8    0.9   Average
Base                   40.8   32.4   22.8   15.9    6.4   23.7
Base + cross gating    40.5   33.5   25.1   16.2    6.1   24.3
Base + bilinear        42.3   34.9   25.7   15.4    6.5   25.0
Ours                   43.5   35.1   27.3   16.2    6.5   25.7

Attention. In Fig. 5, we visualize the attention values for a query and reference video pair. The top video is the query, while the bottom video is the reference. Both videos contain some parts of "hurling" and "talking". It is clear that the "hurling" parts in the reference video interact strongly with the "hurling" parts in the query, receiving larger attention weights.

6 Conclusions

In this paper, we first defined a distinctively new task called video re-localization, which aims at localizing a segment in the reference video such that the segment semantically corresponds to the query video. Video re-localization implicates many real-world applications, such as finding interesting moments in videos, video surveillance, and person re-identification. To facilitate the new video re-localization task, we created a new dataset by reorganizing the videos in ActivityNet [6]. Furthermore, we proposed a novel cross gated bilinear matching method, which effectively performs the matching between the query and reference videos. Based on the matching results, an LSTM was applied to localize the query video in the reference video. Extensive experimental results show that our proposed model is effective and outperforms the baseline methods.

Acknowledgement

We would like to thank the support of New York State through the Goergen Institute for Data Science and NSF Award #1722847.


References

1. Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV (2017)
2. Buch, S., Escorcia, V., Shen, C., Ghanem, B., Niebles, J.C.: SST: Single-stream temporal action proposals. In: CVPR (2017)
3. Chang, S.F., Chen, W., Meng, H.J., Sundaram, H., Zhong, D.: A fully automated content-based video search engine supporting spatiotemporal queries. CSVT 8(5) (1998)
4. Chen, D.L., Dolan, W.B.: Collecting highly parallel data for paraphrase evaluation. In: ACL (2011)
5. Chou, C.L., Chen, H.T., Lee, S.Y.: Pattern-based near-duplicate video retrieval and localization on web-scale videos. IEEE Transactions on Multimedia 17(3) (2015)
6. Caba Heilbron, F., Escorcia, V., Ghanem, B., Niebles, J.C.: ActivityNet: A large-scale video benchmark for human activity understanding. In: CVPR (2015)
7. Gao, J., Sun, C., Yang, Z., Nevatia, R.: TALL: Temporal activity localization via language query. In: ICCV (2017)
8. Gorban, A., Idrees, H., Jiang, Y.G., Roshan Zamir, A., Laptev, I., Shah, M., Sukthankar, R.: THUMOS challenge: Action recognition with a large number of classes. http://www.thumos.info/ (2015)
9. Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: ECCV (2014)
10. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8) (1997)
11. Hu, W., Xie, N., Li, L., Zeng, X., Maybank, S.: A survey on visual content-based video indexing and retrieval. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 41(6) (2011)
12. Jiang, Y.G., Wang, J.: Partial copy detection in videos: A benchmark and an evaluation of popular methods. IEEE Transactions on Big Data 2(1) (2016)
13. Kalogeiton, V., Weinzaepfel, P., Ferrari, V., Schmid, C.: Action tubelet detector for spatio-temporal action localization. In: ICCV (2017)
14. Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The Kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
15. Kingma, D., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
16. Klaser, A., Marszałek, M., Schmid, C., Zisserman, A.: Human focused action localization in video. In: ECCV (2010)
17. Lan, T., Wang, Y., Mori, G.: Discriminative figure-centric models for joint action localization and recognition. In: ICCV (2011)
18. Lin, T.Y., RoyChowdhury, A., Maji, S.: Bilinear CNN models for fine-grained visual recognition. In: ICCV (2015)
19. Liu, H., Feng, J., Jie, Z., Karlekar, J., Zhao, B., Qi, M., Jiang, J., Yan, S.: Neural person search machines. In: ICCV (2017)
20. Liu, H., Jie, Z., Jayashree, K., Qi, M., Jiang, J., Yan, S., Feng, J.: Video-based person re-identification with accumulative motion context. CSVT (2017)
21. Plummer, B.A., Brown, M., Lazebnik, S.: Enhancing video summarization via vision-language embedding. In: CVPR (2017)
22. Ren, W., Singh, S., Singh, M., Zhu, Y.S.: State-of-the-art on spatio-temporal information-based video retrieval. Pattern Recognition 42(2) (2009)


23. Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: A unified embedding for face recognition and clustering. In: CVPR (2015)
24. Schuster, M., Paliwal, K.K.: Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45(11) (1997)
25. Seo, H.J., Milanfar, P.: Action recognition from one example. PAMI 33(5) (2011)
26. Shou, Z., Chan, J., Zareian, A., Miyazawa, K., Chang, S.F.: CDC: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos. In: CVPR (2017)
27. Soomro, K., Shah, M.: Unsupervised action discovery and localization in videos. In: CVPR (2017)
28. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3D convolutional networks. In: ICCV (2015)
29. Wang, B., Ma, L., Zhang, W., Liu, W.: Reconstruction network for video captioning. In: CVPR (2018)
30. Wang, J., Jiang, W., Ma, L., Liu, W., Xu, Y.: Bidirectional attentive fusion with context gating for dense video captioning. In: CVPR (2018)
31. Wang, S., Jiang, J.: Machine comprehension using match-LSTM and answer pointer. arXiv preprint arXiv:1608.07905 (2016)
32. Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: ECCV (2016)