www.axes-project.eu

Project acronym AXES

Project full title Access to Audiovisual Archives

Project No 269980

Large-Scale Integrating project (IP)

Deliverable D7.9

Report on annual participation in international benchmarks

December 2014

SEVENTH FRAMEWORK PROGRAMME

Objective ICT-2009.4.1: Digital Libraries and Digital Preservation


PROJECT DELIVERABLE REPORT

Project

Grant Agreement number 269980

Project acronym: AXES

Project title: Access to Audiovisual Archives

Funding Scheme: Large-Scale Integrating project (IP)

Date of latest version of Annex I against which the assessment will be made:

24 September 2010

Document

Deliverable number: D7.9

Deliverable title: Report on annual participation in international benchmarks

Contractual Date of Delivery: 31/12/2014

Actual Date of Delivery: 18/12/2014

Author (s): Kevin McGuinness

Reviewer (s): UO

Work package no.: WP7

Work package title: Experiencing Digital Libraries

Work package leader: DCU

Version/Revision: 1.2

Draft/Final: Final

Total number of pages (including cover): 26


CHANGE LOG

Reason for change Issue Revision Date

Template preparation - 1.0 26/11/2014
Initial revision integrating partner contributions - 1.1 10/12/2014
Integrated revisions from UO - 1.2 18/12/2014


DISCLAIMER

This document contains a description of the AXES project work and findings.

The authors of this document have taken any available measure in order for its content to be accurate, consistent and lawful. However, neither the project consortium as a whole nor the individual partners that implicitly or explicitly participated in the creation and publication of this document hold any responsibility for actions that might occur as a result of using its content.


This publication has been produced with the assistance of the European Union. The content of this publication is the sole responsibility of the AXES project and can in no way be taken to reflect the views of the European Union.

The European Union is established in accordance with the Treaty on European Union (Maastricht). There are currently 27 Member States of the Union. It is based on the European Communities and the member states cooperation in the fields of Common Foreign and Security Policy and Justice and Home Affairs. The five main institutions of the European Union are the European Parliament, the Council of Ministers, the European Commission, the Court of Justice and the Court of Auditors. (http://europa.eu.int/)

AXES is a project partly funded by the European Union.


TABLE OF CONTENTS

Change Log
Disclaimer
Table of Contents
1 Summary
2 Introduction
3 THUMOS 2014
   3.1 System Description
   3.2 Classification
   3.3 Localization
   3.4 Results
   3.5 Conclusion
4 MEDIAEVAL 2014: Search and Hyperlinking Task
   4.1 Experimental Dataset
   4.2 Audio Analysis
   4.3 Video Analysis
   4.4 Test Set Annotation
   4.5 Required runs and evaluation procedure for the search and linking sub-tasks
   4.6 Participation and Results
5 TRECVID 2014: Multimedia Event Detection Task
   5.1 Features
   5.2 Classification
   5.3 Experiments
   5.4 Conclusions
6 TRECVID 2014: Instance Search Task
   6.1 System Overview
   6.2 User Interface
   6.3 Experiments
   6.4 Conclusion
7 ImageNet 2014: Large Scale Visual Recognition Challenge
8 Conclusion


1 SUMMARY

Participation in international benchmarking activities is an important part of the AXES project. This deliverable describes the AXES group's participation in international benchmarking activities for 2014.


2 INTRODUCTION

Participation in international benchmarking activities is an important part of the AXES project. It provides AXES with a means of functionally validating developed systems and their components whilst also providing an avenue for useful dissemination within the research domain. AXES participated in several international benchmarking activities in 2014: THUMOS, MediaEval, TRECVid, and ImageNet LSVRC.

Automatically recognizing and localizing a large number of action categories from videos in the wild is of significant importance for video understanding and multimedia event detection. The THUMOS workshop and challenge aims at exploring new challenges and approaches for large-scale action recognition with a large number of classes from open-source videos in a realistic setting.

MediaEval is a benchmarking initiative dedicated to evaluating new algorithms for multimedia access and retrieval. It emphasizes the 'multi' in multimedia and focuses on human and social aspects of multimedia tasks. MediaEval attracts participants who are interested in multimodal approaches to multimedia involving, e.g., speech recognition, multimedia content analysis, user-contributed information (tags, tweets), viewer affective response, social networks, temporal and geo-coordinates.

TRECVid is an annual benchmarking activity organized by the National Institute of Standards and Technology in the US that aims to encourage research in information retrieval by providing a large test collection, uniform scoring procedures, and a forum for organizations interested in comparing their results. Each year, dozens of academic institutions and companies that are interested in multimedia information retrieval research participate in TRECVid by using their systems to carry out a set of predefined tasks on a common dataset. This allows the participating institutions to compare their systems and validate ideas and research in a more realistic setting than available in the lab.

The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions.

The remainder of this deliverable describes our participation in these activities.


3 THUMOS 2014

This section describes the AXES INRIA entry in the THUMOS Challenge 2014. The goal of the THUMOS Challenge is to evaluate action recognition approaches in realistic conditions. In particular, the test data consists of untrimmed videos, where the action may be short compared to the video length, and multiple instances can be present in each video. For full details on the definition of the challenge, task, and datasets, we refer to the challenge website (http://crcv.ucf.edu/THUMOS14/).

Our entry in 2014 uses dense trajectory features (DTF) encoded using Fisher vector (FV) encoding, which we also used in our 2013 submission. This year's submission additionally incorporates static-image features (SIFT, Color, and CNN) and audio features (ASR and MFCC) for the classification task. For the detection task, we combine scores from the classification task with FV-DTF features extracted from video slices. We found that these additional visual and audio features significantly improve the classification results. For localization we found that using the classification scores as a contextual feature besides local motion features leads to significant improvements.

3.1 System Description

We first describe our classification system to recognize untrimmed action videos in Section 3.2. The localization system presented in Section 3.3 is similar, but trained to recognize temporally cropped actions instead of complete untrimmed videos. The detection system also exploits the classification scores obtained for complete videos as a contextual feature.

3.2 Classification

For our classification system we build upon our winning entry in the THUMOS 2013 challenge. It is based on Fisher vector (FV) encoding of improved dense trajectory features. As last year, we use a vocabulary of size 256, rescale the videos to be at most 320 pixels wide, and skip every second frame when decoding the video.

Feature extraction This year, we have added several new features that complement the motion-based features. We add static visual appearance information through the following features:

1. SIFT: we extract SIFT features on a dense multi-scale grid, and encode these in a FV using a vocabulary of size 1024. We extract SIFT on one frame out of 60, and aggregate all descriptors in a single FV.

2. Color: we extract color features based on local mean and variance of the color channels [9] every 60th frame, and encode them in a single FV with a vocabulary of size 1024.

3. CNN: we extract a 4K dimensional feature using a convolutional network trained on the ImageNet 2010 Challenge data. We use the Caffe implementation [16], and retain the layer six activations after applying the linear rectification (which clips negative values to zero). We also experimented with using layer seven or eight, but found worse performance. We extract CNN features in every 10th frame, and average them into a single video-wide feature vector (a sketch of this aggregation is shown below).
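
The following is a minimal sketch of the per-frame CNN aggregation described in item 3; extract_fc6 is a hypothetical helper returning the layer-six activation of one frame, whereas the actual system uses a Caffe network trained on ImageNet 2010 [16].

    import numpy as np

    def video_cnn_feature(frames, extract_fc6, stride=10):
        """Average rectified layer-six CNN activations over every `stride`-th
        frame into a single video-wide feature vector."""
        sampled = frames[::stride]
        # Linear rectification clips negative activations to zero before averaging.
        feats = [np.maximum(extract_fc6(f), 0.0) for f in sampled]
        return np.mean(feats, axis=0)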

In addition to the visual features, we also extract features from the audio stream:


1. MFCC: we downsample the original audio track to 16 kHz with 16 bit resolution and then compute Mel-frequency cepstral coefficients (MFCC) with a window size of 25 ms and a step size of 10 ms, keeping the first 12 coefficients of the final cosine transformation plus the energy of the signal. We enhance the MFCCs with their first and second order derivatives. The MFCC features are then aggregated into a FV with a vocabulary size of 256 (a sketch of this front-end follows after this list).

2. ASR: For ASR we used state-of-the-art speech transcription systems available for 16 languages [19, 18]. The files were processed by first performing speaker diarization, followed by language identification (LID) and then transcription. The system for the identified language was used if the LID confidence score was above 0.7, otherwise an English system was used. The vast majority of documents were in English, with a number in Spanish, German, Russian, and French, as well as a few in 8 other languages. Therefore, we only used the English transcripts, and represent them using a bag-of-words encoding of 110K words.
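
A minimal sketch of the MFCC front-end from item 1, using librosa as an assumed implementation (the report does not name the tool) and treating the zeroth cepstral coefficient as a stand-in for the energy term:

    import numpy as np
    import librosa

    def mfcc_with_deltas(path):
        """16 kHz mono audio, 25 ms windows, 10 ms steps, 13 coefficients
        (12 cepstral + C0 as an energy proxy), plus first and second order
        derivatives; the frames are aggregated into a Fisher vector downstream."""
        y, sr = librosa.load(path, sr=16000, mono=True)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                    n_fft=400, win_length=400,  # 25 ms at 16 kHz
                                    hop_length=160)             # 10 ms step
        d1 = librosa.feature.delta(mfcc)
        d2 = librosa.feature.delta(mfcc, order=2)
        return np.vstack([mfcc, d1, d2]).T  # shape: (n_frames, 39)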

Classifier training To train the action classification models, we train SVM classifiers in a 1-vs-rest approach. We perform early fusion of the dense trajectory features by concatenating the FVs for the MBH, HOG, and HOF channels. Similarly, we early fuse the two local image features, SIFT and color. We then learn a per-class late fusion of the SVM classifiers trained on the early fusion channels and the CNN, MFCC, and ASR features (a sketch of this scheme follows).
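
A minimal sketch of this early/late fusion scheme with scikit-learn, assuming each entry of `channels` is already an early-fused matrix (e.g. the concatenated MBH+HOG+HOF FVs, the concatenated SIFT+color FVs, and the CNN, MFCC, and ASR features); the late-fusion weights, which are learned per class in the real system, are simply passed in here:

    from sklearn.svm import LinearSVC

    def train_fusion(channels, labels, late_weights=None):
        """Train one linear 1-vs-rest SVM per feature channel and return a
        scoring function that late-fuses the per-channel decision values."""
        models = {name: LinearSVC(C=1.0).fit(X, labels) for name, X in channels.items()}

        def score(test_channels):
            names = list(models)
            w = late_weights or {n: 1.0 / len(names) for n in names}
            # Weighted sum of per-channel SVM scores (shape: n_samples x n_classes).
            return sum(w[n] * models[n].decision_function(test_channels[n]) for n in names)

        return models, score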

3.3 Localization

To assess our performance we split the 1010 videos from the Validation split into two equal parts; we used one of them as a train split and the other one as a test split.

For the temporal action localization task we only use the dense trajectory features, since the remaining features are more likely to capture contextual information rather than information that can be used for precise action localization.

We train 1-vs-rest SVM classifiers, albeit using only trimmed action examples from the Train and Validation sets as positives. As negatives we use (i) all examples from other classes of the Train part of the data, (ii) all untrimmed videos in the Background part of the data, (iii) all untrimmed videos of other classes in the Validation part of the data, and (iv) all trimmed examples of other classes in the Validation part of the data. In addition we performed one round of hard-negative mining on the Validation set, based on a preliminary version of the detector, and used these as additional negatives.

For testing we use temporal detection windows with a duration of 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, and 150 frames, which we slide with a stride of 10 frames over the video. After scoring the windows, we apply non-maximum suppression to enforce that none of the retained windows overlap.
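
A minimal sketch of the window generation and the greedy temporal non-maximum suppression described above:

    import numpy as np

    DURATIONS = (10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 150)

    def candidate_windows(n_frames, durations=DURATIONS, stride=10):
        """All (start, end) windows of the listed durations, slid over the
        video with a stride of 10 frames."""
        return [(s, s + d) for d in durations for s in range(0, n_frames - d + 1, stride)]

    def temporal_nms(windows, scores):
        """Keep windows in decreasing score order, discarding any window that
        overlaps one already kept."""
        kept = []
        for i in np.argsort(scores)[::-1]:
            s, e = windows[i]
            if all(e <= ks or s >= ke for ks, ke in kept):  # no temporal overlap
                kept.append((s, e))
        return kept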

Following [25], we re-score the detection windows by multiplying the detection score by the duration of the window. This avoids a bias towards detecting video fragments that are too small. In addition, we experimented with a class-specific duration prior, estimated from the training data.
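
As a sketch, the class-specific duration prior can be estimated with a simple histogram over the annotated training durations; the binning below is an assumption, since the report does not give the estimator settings:

    import numpy as np

    def duration_prior(train_durations, bins=20):
        """Histogram estimate of the duration distribution of one action class;
        detection windows can then be re-scored by the prior evaluated at their
        duration instead of by the plain duration factor."""
        hist, edges = np.histogram(train_durations, bins=bins, density=True)
        def prior(duration):
            i = np.clip(np.searchsorted(edges, duration) - 1, 0, len(hist) - 1)
            return hist[i]
        return prior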

Finally, we combine the window's detection score with the video's classification score for the same action class. This pulls in additional contextual information from the complete video that is not available in the temporal window features. We take a weighted average of these scores; the weight is determined using the Validation set.


(a)

Feature   mAP
MBH       52.02 ± 2.4
HOF       50.38 ± 1.9
HOG       48.79 ± 2.3
CNN       48.42 ± 2.0
Color     37.36 ± 1.7
SIFT      37.17 ± 1.8
ASR       20.77 ± 1.0
MFCC      18.97 ± 1.5

(b)

Early fusion                          mAP
EF1: MBH + HOF + HOG                  64.35 ± 2.3
EF2: SIFT + Color                     45.78 ± 2.3
Late fusion
LF1: EF1 + EF2                        69.62 ± 2.18
LF2: EF1 + EF2 + CNN                  71.06 ± 2.00
LF3: EF1 + EF2 + CNN + MFCC           73.65 ± 1.90
LF4: EF1 + EF2 + CNN + ASR            76.26 ± 1.85
LF5: EF1 + EF2 + CNN + MFCC + ASR     77.84 ± 1.70

Table 1: Evaluation of individual features (a) and combinations (b) for the classification task.

Training data used                LF5 mAP
Validation                        70.40 ± 1.6
Train                             68.74 ± 2.2
Train + Validation                77.84 ± 1.7
Validation + Background           67.94 ± 1.9
Train + Background                67.90 ± 2.2
Train + Validation + Background   77.70 ± 1.8

Table 2: Evaluation of different parts of the training data for the classification task.

3.4 Results

In this section we present experimental results obtained on the Validation set.

Classification results For the classification task we split the Validation set into 30 train/test folds. For each training fold we select 7 samples from each class, with the test fold containing the remaining 3 samples. We report the mean and the standard deviation of the mAP score across these 30 folds.
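
A minimal sketch of this per-class fold construction (the random seed and sampling details are assumptions):

    import numpy as np

    def make_folds(labels, n_folds=30, n_train_per_class=7, seed=0):
        """For each fold, put 7 examples per class in the training part and
        the remaining 3 in the test part, as described above."""
        labels = np.asarray(labels)
        rng = np.random.default_rng(seed)
        folds = []
        for _ in range(n_folds):
            train_idx, test_idx = [], []
            for c in np.unique(labels):
                idx = rng.permutation(np.where(labels == c)[0])
                train_idx.extend(idx[:n_train_per_class])
                test_idx.extend(idx[n_train_per_class:])
            folds.append((np.array(train_idx), np.array(test_idx)))
        return folds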

Table 1 presents an evaluation of the individual features. The results show that the visual features are the strongest, in particular the motion features. Combining features significantly improves the results, e.g., from 52.02% mAP for MBH, to 64.35% for MBH + HOF + HOG. When combining all features, we obtain 77.84% mAP. Interestingly, the high-level ASR feature brings more than 4% mAP improvement when all other features are already included.

Next, we evaluate the effect of using different parts of the training data and test on the held-out part of the validation set, see the above description of the cross-validation procedure. The results in Table 2 clearly show the importance of using both the trimmed (in Train) and untrimmed (in Validation) examples; untrimmed videos are important since these are representative of the test set, and the trimmed examples are important because there are roughly 10 times more of them. The videos in the Background set were not useful, probably because there are enough negative samples across the Train and Validation datasets. In conclusion, we used the Train and full Validation sets in our submitted classification results.

Localization results For our localization system we have to compute features and scores for many temporal windows, and this is much more costly than the classification of entire videos. Therefore, we first evaluated the effect of using only MBH or all three trajectory features, and the impact of using a smaller vocabulary of size 64 vs. using the one of size 256 used for classification.


System             Rescoring                          Remarks                   mAP
D1                 clip duration                      K=64, MBH                 12.56
D2                 clip duration                      K=64, MBH + HOF + HOG     14.58
D3                 clip duration                      K=256, MBH + HOF + HOG    19.17
D3+C, λ = 0.2      clip duration                      Run #3                    21.63
D3+C, λ = 0.2      class specific prior, Train+Val.                             21.57
D3+C, λ = 0.25     class specific prior, Validation   Run #1                    26.57
D3+C∗, λ = 0.25    class specific prior, Validation   Run #2, C∗ visual-only    26.52
D3                 class specific prior, Validation                             24.43

Table 3: Evaluation of action localization using the detection (D) and classification (C) system. The combined score is a weighted average which weights the detection score by λ and the classification score by (1 − λ).

In these experiments we follow [25], and rescore the windows using their duration. The first three rows of Table 3 show that the performance drops significantly if we use a smaller vocabulary, or use only MBH features. Therefore, we keep all trajectory features and the vocabulary of size 256 in all remaining experiments.

In the remaining experiments in Table 3 we consider the benefit of including the classification score as a contextual feature to improve the localization performance. The trade-off between the classification and detection score is determined by cross-validation. The classification and detection scores are first normalized to be zero-mean and unit-variance so that the scores are comparable, and the combination weight has a natural interpretation. In the first experiment (row 4) we combine the best detector D3 (with mAP 19.17%) with the classification model using all our channels, which leads to an improved mAP of 21.63%. This is the system submitted as Run #3.
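
A minimal sketch of the score combination just described:

    import numpy as np

    def combine_scores(det_scores, cls_scores, lam):
        """Z-normalize the window-level detection scores and the video-level
        classification scores so they are comparable, then average them with
        weight lam on detection and (1 - lam) on classification."""
        z = lambda x: (x - np.mean(x)) / (np.std(x) + 1e-8)
        return lam * z(det_scores) + (1.0 - lam) * z(cls_scores)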

Instead of rescoring with the clip duration, we also considered rescoring with a class-specific prior on the duration (obtained using a histogram estimate). This leads to a similar performance of 21.57% mAP.

We observed a difference in the duration distribution of positive action instances in the Train and Validation parts of the data. This difference is explained by the different annotation protocols and teams used to annotate these parts of the data. Therefore, we also considered using a prior estimate based on the validation data only. This leads to a significantly improved localization mAP of 26.57%. This is the system we submitted as Run #1.

Finally, submitted Run #2 is similar to Run #1, but is a vision-only run that excludes the MFCC and ASR audio features in the classification model. The system corresponding to Run #2 obtains a performance of 26.52% mAP on our test split. Interestingly, in this case the audio features do not have a significant impact. To verify that the detection still benefits from the classifier when using the stronger prior, we also include a last run that uses this prior without the classification score (last row). This leads to a reduction in performance to 24.43%, showing that global video context is useful in the localization task, even when using the strong prior on duration.


3.5 Conclusion

The INRIA team submissions achieved the second-best result in the action classification task and the top three results in the temporal localization task (see http://crcv.ucf.edu/THUMOS14/results.html).


4 MEDIAEVAL 2014: SEARCH AND HYPERLINKING TASK

To evaluate our search and hyperlinking components against international competitors, AXES organized the third edition of the Search and Hyperlinking Task at the Media Evaluation Workshop (MediaEval) 2014. The task envisions the following scenario: a user searches for relevant segments within a video collection using a query. If the user finds a relevant segment, he or she may want to follow hyperlinks to other video segments in the collection that are related to the relevant segment. AXES was responsible for test set annotation, feature provisioning, the evaluation framework, and overall organization of the task.

The task framework asks participants to create systems that support the search and linking aspects of the task. The use scenario is the same as in the Search and Hyperlinking task 2013 [10], with the main difference being that the search sub-task has changed from known-item searches with only one relevant segment to ad-hoc searches with potentially many relevant segments. The following describes the experimental data set provided to task participants for MediaEval 2014, details of the two sub-tasks, and our participation.

4.1 Experimental Dataset

The dataset for both subtasks was a collection of 4021 hours of videos provided by the BBC, which we split into a development set of 1335 hours, which coincided with the test collection used in the 2013 edition of this task, and a test set of 2686 hours. The average length of a video was roughly 45 minutes, and most videos were in the English language. The collection consisted of broadcast content spanning 01.04.2008 – 11.05.2008 for the development set and 12.05.2008 – 31.07.2008 for the test set. The BBC kindly provided human-generated textual metadata and manual transcripts for each video. Participants were also provided with the output of several content analysis methods, which we describe in the following subsections.

4.2 Audio Analysis

The audio was extracted from the video stream using the ffmpeg software toolbox (sample rate = 16,000 Hz, no. of channels = 1). Based on this data, the transcripts were created using the following ASR approaches and provided to participants:

(i) LIMSI-CNRS/Vocapia (http://www.vocapia.com/), which uses the VoxSigma vrbs trans system (version eng-usa 4.0) [13]. Compared to the transcripts created for the 2013 edition of this task, the system's models had been updated with partial support from the Quaero program [12].

(ii) The LIUM system (http://www-lium.univ-lemans.fr/en/content/language-and-speech-technology-lst) [26] is based on the CMU Sphinx project. The LIUM system provided three output formats: (1) one-best transcripts in NIST CTM format, (2) word lattices in SLF (HTK) format, following a 4-gram topology, and (3) confusion networks in a format similar to ATT FSM.

(iii) The NST/Sheffield system (http://www.natural-speech-technology.org) is trained on multi-genre sets of BBC data that do not overlap with the collection used for the task, and uses deep neural networks [20]. The ASR transcript contains speaker diarization, similar to the LIMSI-CNRS/Vocapia transcripts.


Additionally, prosodic features were extracted using the OpenSMILE tool version 2.0 rc1 [11] (http://opensmile.sourceforge.net/). The following prosodic features were calculated over sliding windows of 10 milliseconds: root mean squared (RMS) energy, loudness, probability of voicing, fundamental frequency (F0), harmonics-to-noise ratio (HNR), voice quality, and pitch direction (classes falling, flat, rising, and direction score). Prosodic information was provided for the first time in 2014 to encourage participants to explore its potential value for the Search and Hyperlinking sub-tasks.

4.3 Video Analysis

The computer vision groups at the University of Leuven (KUL) and the University of Oxford (OXU) kindly provided the output of concept detectors for 1537 concepts from ImageNet (http://image-net.org/popularity_percentile_readme.html), using the following training approaches: the approach by KUL uses examples from ImageNet as positive examples [30], while OXU uses an on-the-fly concept detection approach, which downloads training examples through Google image search [8].

4.4 Test Set Annotation

To create realistic queries and anchors, we conducted a study with 28 users aged between 18 and 30 from the general public around London, U.K. The study was similar to our previous study carried out for MediaEval 2013 [6], with the main difference being the focus on information needs with multiple relevant segments. As the study focused on a home user scenario, the participants used tablet computers and an early version of the AXES Home system [22] to search and browse within the video collection. The user study consisted of the following steps: i) a participant defined an information need using natural language, ii) they searched the test set with a query they might use to search e.g. the YouTube video repository, iii) after selecting several possibly relevant segments, they defined anchor points or regions within each segment and stated what kind of links they would expect for this anchor. Participants were asked to define queries that they expected to have more than one relevant video segment in the collection. The study resulted in 36 ad-hoc search queries for the test set.

Subsequently, we asked the participants to mark so-called anchors for which they would like to see links within some of the segments that they marked relevant to a search query. The reader can find a more elaborate description of this user study design in [6].

4.5 Required runs and evaluation procedure for the search and linking sub-tasks

For the 2014 task, we were interested in a cross-comparison of methods using different types of metadata. Thus, we allowed participants to submit up to 5 different approaches. Groups that based their methods on video features only could submit this type of run in addition.

To evaluate the submissions to the search and linking sub-tasks, a pooling method was used to select submitted segments and link targets for relevance assessment. The top-10 ranks of all submitted runs were annotated using crowdsourcing technologies. We report precision-oriented metrics, such as precision at various cutoffs and mean average precision (MAP), using different approaches to take into account segment overlap, as described in [5].


(a) Search Results using Precision at 10 (bar chart of P@10 per submitted run, grouped by team)

(b) Linking Results using Precision at 10 (bar chart of P@10 per submitted run, grouped by team)

Figure 1: Comparison of results.


4.6 Participation and Results

Figure 1 shows a summary of the results for the search task and the hyperlinking task. In total there were 68 and 78 submitted runs for the search and the hyperlinking task respectively. The participation by UT represented a baseline approach.

Because UT was involved in the creation of the annotation tool, DCU was the only team from AXES that participated in the Search and Hyperlinking Task. Our research concentrated on a new strategy to improve the identification of target segments within the hyperlinking framework, and on applying learning algorithms to estimate the linear fusion weights for multimodal features.

Our previous participation in MediaEval showed that state-of-the-art IR techniques can be applied to multimodal hyperlinking. However, identifying effective target segments is still an open issue. Existing research has used a sliding window to cover a variety of target segments to achieve reasonable performance. This year, we propose a simple and efficient target segment identification method that uses the speaker information from the LIMSI transcripts. The motivation is that a target segment is a collection of multimodal features with time stamps, and those containing relevant information should be allocated a higher rank. We identify target segments as follows. First, we separate the video into a number of clips. The separation criterion is specified according to the distribution of speaker information. Second, each clip is defined as a seed segment and these are expanded by merging adjacent segments. Merging stops when the length of the newly merged segment reaches the target segment length according to the MediaEval 2014 definition. We repeat this step until all the seed segments have been expanded. Finally, we identify target segments from each of these expanded segments (a sketch of this procedure follows).
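
A minimal sketch of the speaker-based segment expansion, assuming the speaker turns from the LIMSI transcript are available as a time-ordered list of (start, end) pairs in seconds; the exact merging rules are not fully specified in this report, so this is only an illustration:

    def expand_seed_segments(speaker_turns, max_len):
        """Each speaker turn seeds a segment, which is grown by merging the
        following adjacent turns until the target segment length `max_len`
        would be exceeded."""
        segments = []
        for i, (start, _) in enumerate(speaker_turns):
            j = i
            while j + 1 < len(speaker_turns) and speaker_turns[j + 1][1] - start <= max_len:
                j += 1
            segments.append((start, speaker_turns[j][1]))
        return segments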

In MediaEval 2013, we applied metadata and concept detection as video-level features to improve hyperlinking results using late fusion. In MediaEval 2014, we attempted to improve on this by using Fisher Linear Discriminant Analysis to estimate the fusion weights. LDA has the advantage of requiring no hyperparameter estimation.


It uses the intra- and inter-class variance to predict the best linear separation. A detailed description is presented in our workshop paper.
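
A minimal sketch of using Fisher LDA to obtain linear fusion weights with scikit-learn; the score matrix layout, the binary relevance labels, and the weight normalization are assumptions for illustration:

    import numpy as np
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    def lda_fusion_weights(feature_scores, relevance):
        """feature_scores: (n_candidates, n_features) matrix of per-feature
        similarity scores (e.g. metadata, transcripts, visual concepts);
        relevance: binary relevance label per candidate. The LDA projection,
        which maximizes between-class over within-class variance, is used as
        the set of linear late-fusion weights."""
        lda = LinearDiscriminantAnalysis().fit(feature_scores, relevance)
        w = lda.coef_.ravel()
        return w / (np.abs(w).sum() + 1e-8)  # normalized fusion weights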

Evaluation shows that the LDA algorithm can improve hyperlinking performance by estimating the multimodal fusion weights: mAP increases from 0.0430 to 0.0791.


5 TRECVID 2014: MULTIMEDIA EVENT DETECTION TASK

As with last year, INRIA participated on behalf of AXES in this year's TRECVid Multimedia Event Detection (MED) task. The goal of the MED task is to assemble core detection technologies into a system that can search multimedia recordings for user-defined events based on pre-computed metadata. The MED evaluation defines events via an event kit, which consists of an event name, definition, explication (textual exposition of the terms and concepts), evidential descriptions, and illustrative video exemplars. The objective is to develop systems that can create sufficiently powerful but general descriptions of multimedia to allow for recognition of these, often complex, events.

The task is of particular relevance to AXES, since recognizing such events is one of the stated objectives of the project. Event detectors that were developed as part of our participation in the MED task have already been integrated into AXES RESEARCH and AXES HOME.

Our system this year is based on a collection of local visual and audio descriptors, which are aggregated into global descriptors, one for each type of low-level descriptor, using Fisher vectors. Besides these features, we use two features based on convolutional networks: one for the visual channel, and one for the audio channel. Additional high-level features are extracted using ASR and OCR. Finally, we used mid-level attribute features based on object and action detectors trained on external datasets. The remainder of this section summarizes the AXES participation in the task. More detailed information and results are available in the corresponding TRECVid notebook paper [2].

5.1 Features

We extract a collection of features for each video, which provide a set of high-dimensional signatures of the video. The features belong to three categories:

• Low-level descriptors that do not rely on supervised training.

• Mid-level attribute descriptors that use a supervised training stage to obtain a signature in terms of confidence scores for various concepts (e.g., for object presence, or action detection).

• High-level textual descriptors based on optical character recognition (OCR) and automatic speech recognition (ASR) that output semantically meaningful text features, rather than low-level signatures or mid-level visual concept detections.

Low-level features The low-level features we used in 2014 are similar to the ones from 2013 [3]. For each type of low-level feature, we aggregate the local descriptors into a global signature by means of a Fisher vector (FV) [28]. The number of Gaussians chosen for the FVs is a trade-off between the accuracy of the representation and computational constraints. Visual frame-based FVs are averaged together to produce a signature for the complete video.

As low-level visual features, we use dense trajectories [31], SIFT features [21], and color features based on local mean and variance of the color channels [9]. As low-level audio features, we use Fisher vector encoded MFCC coefficients, and scattering coefficients [7].

Mid-level features To cope with the restricted positive training data, we implemented mid-level representations. These representations rely on detectors trained for a set of object and action classes that are not directly related to the MED events.


The mid-level feature vector of a video clip is built from the confidence scores of the clip for each of the chosen classes. In the case of the CNN features described below, we do not directly use the detection confidences, but rather an internal representation that is used by the convolutional network to detect object classes.

The three mid-level representations we used are:

• HMDB51 attributes: HMDB51 [17] is a dataset of 7,000 video clips of 51 basic action classes (like “dive”, “jump”). We compute dense trajectory features, aggregated with a FV, and classify them with an SVM to produce the attribute scores.

• ImageNet attributes: The training set of ImageNet 2010 [27] is a dataset of 1.2 million images, each prominently representing one object from a total of 1,000 object classes (such as “bear”, “hook”, “restaurant”). We used the classification scores produced as output of the 8th layer of a convolutional neural network trained on this data with the Caffe software package [16]. This results in a 1,000 dimensional feature vector representing class confidences.

• CNN: We extract another 4,096 dimensional feature using the same convolutional network trained on the ImageNet 2010 data. This feature is obtained from layer six of the network, after applying the linear rectification which clips negative values to zero. We extract these CNN features in every 10th frame, and average them into a single video-wide feature vector.

High-level features We used high-level features that temporally localize words in the video based on optical character recognition and automatic speech recognition. Details of the OCR and ASR implementations used can be found in the TRECVid notebook paper.

5.2 Classification

Each of the feature vectors (low, mid, or high level) is used to train a linear SVM classifier. To determine the hyper-parameters of the SVM we used different strategies, depending on the number of training examples. For the 100 example case, we used the same classification approach as in 2013 [3]: 10-fold cross-validation to estimate the SVM's regularization parameter C and the weighting factor for the positive samples. For the 10 example case, we observed that cross-validation per event and per channel resulted in unstable parameters. Therefore, we globally cross-validated these parameters across events and channels, which led to an SVM cost parameter of C = 9 and a weight for the positive training samples of 1/16 × N_neg/(N_neg + N_pos).
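
A minimal sketch of the 10-example training setup with scikit-learn; mapping the quoted positive-sample weight onto the class_weight parameter (with the negative class weight fixed at 1) is an assumption, not the report's own implementation:

    from sklearn.svm import LinearSVC

    def train_10ex_classifier(X, y, n_pos, n_neg):
        """Linear SVM with the globally cross-validated parameters described
        above: cost C = 9 and positive-class weight (1/16) * Nneg / (Nneg + Npos).
        Labels in y are assumed to be 1 (positive) and 0 (negative)."""
        w_pos = (1.0 / 16.0) * n_neg / (n_neg + n_pos)
        return LinearSVC(C=9.0, class_weight={1: w_pos, 0: 1.0}).fit(X, y)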

5.3 Experiments

We conducted several experiments on the TRECVid MED 2011 dataset to assess the effectiveness of different features, and to gain insight into the differences between the regimes with 100 and 10 training examples. The following summarizes the findings. Further details can be found in the notebook paper.

In our first set of experiments we evaluated the performance of individual features. The best feature channels were found to be the dense trajectories, CNN, and SIFT features, together with the 1,051 dimensional attribute features derived from them. Going from 100 to 10 training examples reduces the mAP performance by a factor of 1.6× to 2.1×. Attribute descriptors are the ones that resist this reduction best, probably because they encode higher-level information.


The dense trajectory features also demonstrate relatively robust performance.

Our second set of experiments compared early fusion and late fusion strategies. We found that early fusion of pairs of feature types, combined with late fusion of the other channels, does not improve performance significantly. Early fusion of multiple pairs of features does, however, improve over the late fusion baseline.

Our third set of experiments considered the handling of near-miss examples. This year, many of the training examples were near-miss rather than positive examples. We found that, in the 10Ex case, it is often beneficial to use near-miss examples as positive training data. In contrast, for 100Ex it is never optimal to use them as positives.

Several other experiments were also performed examining the effects of user annotations and late fusion weights in the 10Ex case. Further details can be found in the notebook paper.

5.4 Conclusions

In the final evaluation, results were submitted by twelve teams for the 10 and 100 example tasks, in the pre-specified and ad-hoc tracks. The results for the INRIA-LIM-VocR submission (which was identical to the AXES system, only using a different ASR engine) were 0.5 to 1 mAP point above those for the AXES submission. The INRIA-LIM-VocR submission ranked first with CMU on the 10 example ad-hoc track. For the other tasks, our submissions ranked in 4th and 5th place, behind CMU, BBNVISOR, and Sesame.


6 TRECVID 2014: INSTANCE SEARCH TASK

Interactive instance search is an integral and heavily used component in all three AXES systems. As such, the AXES project has participated in the interactive instance search task at TRECVid 2011 [23], 2012 [24], and 2013 [4], so as to benchmark our implementation against the state of the art. In 2012 and 2013, we used our AXES PRO and AXES RESEARCH systems directly for participation in the task. This approach allowed our experiment participants to take advantage of various other tools, like text search on metadata and on-the-fly visual concept classification, in addition to our instance search feature. We observed that although users did occasionally use some of these features, direct instance search using the provided examples, combined with instance search as a form of relevance feedback, was by far the most popular way of interacting with the system.

As we had already evaluated the PRO and RESEARCH interfaces in previous years, we decided this year to use a custom interface tailored to the TRECVid INS task that uses only our instance search technology. This technology, which is based on INRIA's BigImbaz engine [15], is exactly the same as we used in 2013. The idea was to check if a custom-tailored TRECVid interface, combined with some other optimizations like pseudo-relevance feedback and subsequent result list expansion, could improve our performance over previous years.

The remainder of this section gives a brief overview of our participation in the task. Further details can be found in the TRECVid notebook paper [1].

6.1 System Overview

We divide the INS search into two steps to implement query expansion. The first step is initial retrieval. Using the example images provided by the topic description, instance search is implemented by BigImbaz by extracting interest points from keyframes using a Hessian detector and computing a CS-LBP descriptor for each keyframe. It refines the search using a quantization index, Hamming embedding, and burstiness [14]. For each run, all sample keyframes are used as queries. Moreover, we use the mask information provided by TRECVid to extract an expanded mask keyframe. The mask image indicates the rectangular area in the sample keyframe query, and the expanded mask keyframe is created according to the corresponding area. Both the sample keyframe and the expanded mask keyframe are used as run queries. The initial retrieval stage therefore involves multiple queries for each run. The retrieval list is created by fusing all ranked lists.

The second step is query expansion on the initial result list. We select the top-N results, which are assumed to be relevant to the current topic, as additional queries. The INS system uses these top-N results to retrieve a new group of scored lists. The submission results are created using the query expansion retrieval list. The query expansion strategy is used in both the automatic and interactive tasks: the automatic task determines the input queries from the sample videos and the top-N results in the initial retrieval list, while the interactive task additionally allows the user to select query images (a sketch of the two-step retrieval follows).
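
A minimal sketch of this two-step retrieval; `search` is a hypothetical callable wrapping the BigImbaz engine that maps a query keyframe (image or keyframe identifier) to a ranked list of (shot_id, score) pairs, and score-sum fusion of the ranked lists is an assumption:

    def query_expansion_retrieval(search, initial_queries, top_n=10):
        """Step 1: fuse the ranked lists of all initial queries (sample and
        expanded mask keyframes). Step 2: take the top-N shots as additional
        queries and fuse again to form the final retrieval list."""
        def fuse(ranked_lists):
            scores = {}
            for lst in ranked_lists:
                for shot, s in lst:
                    scores[shot] = scores.get(shot, 0.0) + s
            return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

        initial = fuse([search(q) for q in initial_queries])
        expansion = [shot for shot, _ in initial[:top_n]]
        return fuse([search(q) for q in initial_queries + expansion])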

6.2 User Interface

The user interface used for the INS interactive task was developed using the Python Django framework (https://www.djangoproject.com/). The user interface is implemented using HTML5, CSS3, and JavaScript, and the client side is hosted using nginx (http://nginx.org/).


Figure 2: AXES INS User Interface

Figure 2 shows a screenshot of the AXES INS interactive interface. It is composed of four panels: the information panel, the query panel, the retrieved result panel, and the saved result panel.

Each participant in the INS interactive task is automatically guided to the next available topic. When a new topic starts, the interface reads pre-processed automatic INS results from the server and presents all ranked keyframes in the retrieved result panel. Users can then continue searching until the time elapses, or manually submit the saved results. Further details on the user interface can be found in the notebook paper [1].

6.3 Experiments

The instance search experiments were carried out at Dublin City University. A total of 12 people participated in the experiments. Participants were primarily research assistants, students, and postdoctoral researchers. Each participant had 15 minutes to finish each topic and was assigned 6 topics in total. Participants were briefed on the purpose of the experiment and shown how to operate the user interface. A total of three runs were submitted:

1. F_D_AXES_2, a run for automatic search evaluation;

2. I_D_AXES_1, the first run for interactive search evaluation;

3. I_D_AXES_3, the second run for interactive search evaluation.

Our interactive results this year (mAP 0.108 and 0.099) did not improve significantly over last year's (mAP 0.135, 0.086, 0.079), which suggests that our simple query expansion mechanism, result list supplementation, and tailored user interface do not improve results much, at least in terms of mAP.


The variation across runs is due to differences in user skill alone. Interestingly, our automatic run this year (mAP 0.075) performed about as well as our worst user group did last year in the interactive task.

6.4 Conclusion

This year we submitted two interactive runs and one automatic run. We used the same instance search engine as last year with a custom UI tailored for instance search, and augmented the results with pseudo-relevance feedback and result list supplementation. Unfortunately, these changes did not improve our results significantly over previous years. However, this indicates that our more general AXES RESEARCH interface, despite not being tailored for the TRECVid instance search task, performs approximately as well as an interface that is custom-built for the task.


7 IMAGENET 2014: LARGE SCALE VISUAL RECOGNITION CHALLENGE

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) evaluates algorithms for object detection and image classification at large scale. One high-level motivation is to allow researchers to compare progress in detection across a wider variety of objects, taking advantage of the quite expensive labeling effort. Another motivation is to measure the progress of computer vision for large-scale image indexing for retrieval and annotation.

Oxford University participated in the ImageNet Large Scale Visual Recognition Challenge 2014 in task 2 (a): classification and localization with provided training data. Their approach was based on using an ensemble of very deep convolutional neural networks containing 16 and 19 layers. The team found accuracy could be improved by using larger numbers of convolutional layers with very small aperture (convolution kernel) sizes. Details of the experiments with different architectures and the final configuration and results can be found in [29]. The team achieved the best result in 2014 when ordered by localization error, achieving an error of 0.253231, and the second best result (after Google's GoogLeNet) when ordered by classification error (top-5 error 7.325%; see http://www.image-net.org/challenges/LSVRC/2014/results). The trained CNN models in Caffe format (http://caffe.berkeleyvision.org) have been made available online at http://www.robots.ox.ac.uk/~vgg/research/very_deep/. Oxford further improved their results post-competition, obtaining a top-5 error rate of 7.0%.

Oxford University also used their pre-trained ImageNet model in combination with a Support Vector Machine to achieve the top ranked result in the PASCAL VOC 2012 action classification challenge, achieving a mAP of 0.84 (see http://host.robots.ox.ac.uk:8080/leaderboard/displaylb_noeq.php?challengeid=11&compid=10).


8 CONCLUSION

This deliverable described the AXES participation in international benchmarking activities in 2014. AXES participated in four international benchmarks this year: the THUMOS action recognition challenge, the MediaEval search and hyperlinking tasks, the TRECVid multimedia event detection (MED) and instance search (INS) tasks, and the ImageNet large scale visual recognition challenge. In addition to participating, AXES also organized the MediaEval search and hyperlinking task, providing video data and features to participants.

In the THUMOS benchmark, INRIA achieved the second-best result in the action classification task and the top three results in the temporal localization task. At MediaEval, DCU ranked in the top three runs for the search results using the P@10 metric. In the TRECVid MED benchmark, the INRIA team ranked first with CMU on the 10 example ad-hoc track. In the TRECVid INS benchmark, AXES achieved similar results to last year. Oxford achieved the top-ranked result in task 2 (a) of the ImageNet LSVRC when ordered by localization error, and the second-best result when ordered by classification error.

REFERENCES

[1] AXES TRECVid 2014: Instance search. In Proceedings of the 2014 TRECVid Workshop, 2014.

[2] The INRIA-LIM-VocR and AXES submissions to TRECVid 2014 Multimedia Event Detection. In Proceedings of the 2014 TRECVid Workshop, 2014.

[3] R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. McGuinness, N. O'Connor, D. Oneata, O. Parkhi, D. Potapov, J. Revaud, C. Schmid, J. Schwenninger, D. Scott, T. Tuytelaars, J. V. Jakob, H. Wang, and A. Zisserman. The AXES submissions at TRECVid 2013. In TRECVID Workshop, 2013.

[4] R. Aly, R. Arandjelovic, K. Chatfield, M. Douze, B. Fernando, Z. Harchaoui, K. McGuinness, N. E. O'Connor, D. Oneata, O. M. Parkhi, et al. The AXES Submissions at TRECVid 2013. 2013.

[5] R. Aly, M. Eskevich, R. Ordelman, and G. J. F. Jones. Adapting binary information retrieval evaluation metrics for segment-based retrieval tasks. Technical Report 1312.1913, ArXiv e-prints, 2013.

[6] R. Aly, R. Ordelman, M. Eskevich, G. J. F. Jones, and S. Chen. Linking inside a video collection: what and how to measure? In WWW (Companion Volume), pages 457–460, 2013.

[7] J. Anden and S. Mallat. Multiscale scattering for audio classification. In ISMIR, 2011.

[8] K. Chatfield and A. Zisserman. Visor: Towards on-the-fly large-scale object category retrieval. In Computer Vision – ACCV 2012, pages 432–446. Springer, 2013.

[9] S. Clinchant, J.-M. Renders, and G. Csurka. Trans-media pseudo-relevance feedback methods in multimedia retrieval. In Advances in Multilingual and Multimodal Information Retrieval, 2008.

[10] M. Eskevich, R. Aly, R. Ordelman, S. Chen, and G. J. F. Jones. The Search and Hyperlinking Task at MediaEval 2013. In Proceedings of the MediaEval 2013 Workshop, volume 1043 of CEUR Workshop Proceedings, Barcelona, Spain, 2013.

[11] F. Eyben, F. Weninger, F. Gross, and B. Schuller. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of ACM Multimedia 2013, MM '13, pages 835–838, Barcelona, Spain, 2013. ACM.

[12] J.-L. Gauvain. The Quaero Program: Multilingual and Multimedia Technologies. IWSLT 2010, 2010.

[13] J.-L. Gauvain, L. Lamel, and G. Adda. The LIMSI Broadcast News transcription system. Speech Communication, 37(1-2):89–108, 2002.

[14] H. Jegou, M. Douze, and C. Schmid. On the burstiness of visual elements. In CVPR, June 2009.

[15] H. Jegou, M. Douze, and C. Schmid. Improving bag-of-features for large scale image search. International Journal of Computer Vision, 87(3):316–336, February 2010.

[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. In ACM Multimedia, 2014.

[17] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. HMDB: a large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision, 2011.

[18] L. Lamel. Multilingual Speech Processing Activities in Quaero: Application to Multimedia Search in Unstructured Data. In The Fifth International Conference: Human Language Technologies - The Baltic Perspective, pages 1–8, Tartu, Estonia, October 4-5 2012.

[19] L. Lamel and J.-L. Gauvain. Speech processing for audio indexing. In Advances in Natural Language Processing, 2008.

[20] P. Lanchantin, P. Bell, M. J. F. Gales, T. Hain, X. Liu, Y. Long, J. Quinnell, S. Renals, O. Saz, M. S. Seigel, P. Swietojanski, and P. C. Woodland. Automatic transcription of multi-genre media archives. In Proceedings of the First Workshop on Speech, Language and Audio in Multimedia (SLAM@INTERSPEECH), volume 1012 of CEUR Workshop Proceedings, pages 26–31. CEUR-WS.org, 2013.

[21] D. Lowe. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2):91–110, 2004.

[22] K. McGuinness, R. Aly, K. Chatfield, O. Parkhi, R. Arandjelovic, M. Douze, M. Kemman, M. Kleppe, P. van der Kreeft, K. Macquarrie, A. Ozerov, N. E. O'Connor, F. De Jong, A. Zisserman, C. Schmid, and P. Perez. The AXES research video search system. In Proceedings of the IEEE ICASSP 2014, Florence, Italy, 2014.

[23] K. McGuinness, R. Aly, S. Chen, M. Frappier, K. Martijn, H. Lee, R. Ordelman, R. Arandjelovic, M. Juneja, C. Jawahar, et al. AXES at TRECVid 2011. 2011.

[24] D. Oneata, M. Douze, J. Revaud, S. Jochen, D. Potapov, H. Wang, Z. Harchaoui, J. Verbeek, C. Schmid, R. Aly, K. McGuinness, S. Chen, N. O'Connor, K. Chatfield, O. Parkhi, R. Arandjelovic, A. Zisserman, F. Basura, and T. Tuytelaars. AXES at TRECVid 2012: KIS, INS, and MED. In Proceedings of TRECVID 2012. NIST, USA, Nov 2012.

[25] D. Oneata, J. Verbeek, and C. Schmid. Action and event recognition with Fisher vectors on a compact feature set. In Proceedings of the IEEE International Conference on Computer Vision, 2013.

[26] A. Rousseau, P. Deleglise, and Y. Esteve. Enhancing the TED-LIUM corpus with selected data for language modeling and more TED talks. In The 9th edition of the Language Resources and Evaluation Conference (LREC 2014), Reykjavik, Iceland, May 2014.

[27] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. arXiv:1409.0575, 2014.

[28] J. Sanchez, F. Perronnin, T. Mensink, and J. Verbeek. Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3):222–245, 2013.

[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.

[30] T. Tommasi, T. Tuytelaars, and B. Caputo. A testbed for cross-dataset analysis. CoRR, abs/1402.5923, 2014.

[31] H. Wang and C. Schmid. Action recognition with improved trajectories. In Proceedings of the IEEE International Conference on Computer Vision, 2013.