
IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 9, SEPTEMBER 2017

HNIP: Compact Deep Invariant Representations for Video Matching, Localization, and Retrieval

Jie Lin, Ling-Yu Duan, Member, IEEE, Shiqi Wang, Yan Bai, Yihang Lou, Vijay Chandrasekhar, Tiejun Huang, Alex Kot, Fellow, IEEE, and Wen Gao, Fellow, IEEE

Abstract—With the emerging demand for large-scale video analysis, MPEG initiated the Compact Descriptors for Video Analysis (CDVA) standardization in 2014. Beyond the handcrafted descriptors adopted by the current MPEG-CDVA reference model, we study the problem of deep learned global descriptors for video matching, localization, and retrieval. First, inspired by a recent invariance theory, we propose a nested invariance pooling (NIP) method to derive compact deep global descriptors from convolutional neural networks (CNNs), by progressively encoding translation, scale, and rotation invariances into the pooled descriptors. Second, our empirical studies have shown that a sequence of well designed pooling moments (e.g., max or average) may drastically impact video matching performance, which motivates us to design hybrid pooling operations via NIP (HNIP). HNIP further improves the discriminability of deep global descriptors. Third, the technical merits and performance improvements obtained by combining deep and handcrafted descriptors are provided to better investigate their complementary effects. We evaluate the effectiveness of HNIP within the well-established MPEG-CDVA evaluation framework. Extensive experiments demonstrate that HNIP outperforms state-of-the-art deep and canonical handcrafted descriptors with significant mAP gains of 5.5% and 4.7%, respectively. In particular, the combination of HNIP-incorporated CNN descriptors and handcrafted global descriptors significantly boosts the performance of CDVA core techniques with comparable descriptor size.

Index Terms—Convolutional neural networks (CNNs), deep global descriptors, handcrafted descriptors, hybrid nested invariance pooling, MPEG CDVA, MPEG CDVS.

Manuscript received December 5, 2016; revised April 3, 2017 and May 30, 2017; accepted May 30, 2017. Date of publication June 8, 2017; date of current version August 12, 2017. This work was supported in part by the National Hightech R&D Program of China under Grant 2015AA016302, in part by the National Natural Science Foundation of China under Grant U1611461 and Grant 61661146005, and in part by the PKU-NTU Joint Research Institute (JRI) sponsored by a donation from the Ng Teng Fong Charitable Foundation. The guest editor coordinating the review of this manuscript and approving it for publication was Dr. Cees Snoek. (Corresponding author: Ling-Yu Duan.)

J. Lin is with the Institute for Infocomm Research, A*STAR, Singapore 138632 (e-mail: [email protected]).

L.-Y. Duan, Y. Bai, Y. Lou, T. Huang, and W. Gao are with the Institute of Digital Media, Peking University, Beijing 100080, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected]; [email protected]).

S. Wang is with the Department of Computer Science, City University of Hong Kong, Hong Kong, China (e-mail: [email protected]).

V. Chandrasekhar is with the Institute for Infocomm Research, A*STAR, Singapore 138632, and also with the Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

A. Kot is with the Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TMM.2017.2713410

    I. INTRODUCTION

RECENT years have witnessed a remarkable growth of interest in video retrieval, which refers to searching for videos representing the same object or scene as the one depicted in a query video. Such a capability can facilitate a variety of applications including mobile augmented reality (MAR), automotive, surveillance, media entertainment, etc. [1]. In the rapid evolution of video retrieval, both great promises and new challenges arising from real applications have been perceived [2]. Typically, video retrieval is performed at the server end, which requires the transmission of visual data via the network [3], [4]. Instead of directly sending huge volumes of video data, developing compact and robust video feature representations is highly desirable, as it enables low-latency transmission over bandwidth-constrained networks.

To this end, in 2009, MPEG started the standardization of Compact Descriptors for Visual Search (CDVS) [5], which came up with a normative bitstream of standardized descriptors for mobile visual search and augmented reality applications. In September 2015, MPEG published the CDVS standard [6]. Very recently, towards large-scale video analysis, MPEG has moved forward to standardize Compact Descriptors for Video Analysis (CDVA) [7]. To deal with content redundancy in the temporal dimension, the latest CDVA Experimental Model (CXM) [8] casts video retrieval as a keyframe-based image retrieval task, in which the keyframe-level matching results are combined for video matching. The keyframe-level representation avoids descriptor extraction on dense frames in videos, which largely reduces the computational complexity (e.g., CDVA only extracts descriptors from the 1–2% of frames detected from raw videos).

In CDVS, handcrafted local and global descriptors have been successfully standardized in a compact and scalable manner (e.g., from 512 B to 16 KB), where local descriptors capture the invariant characteristics of local image patches and global descriptors like Fisher Vectors (FV) [9] and VLAD [10] reflect the aggregated statistics of local descriptors. Though handcrafted descriptors have achieved great success in the CDVS standard [5] and the CDVA experimental model, how to leverage promising deep learned global descriptors remains an open issue in the MPEG CDVA Ad-hoc group. Many recent works [11]–[18] have shown the advantages of deep global descriptors for image retrieval, which may be attributed to the remarkable success of Convolutional Neural Networks (CNN) [19], [20]. In particular, the state-of-the-art deep global descriptor R-MAC [18] computes the max over a set of Regions of Interest (ROIs) cropped from



the feature maps output by an intermediate convolutional layer, followed by averaging these regional max-pooled features. Results show that R-MAC offers remarkable improvements over other deep global descriptors like MAC [18] and SPoC [16], while maintaining the same dimensionality.

In the context of compact descriptors for video retrieval, there exist important practical issues with CNN-based deep global descriptors. First, one major drawback of CNNs is the lack of invariance to geometric transformations of the input image, such as rotations. The performance of deep global descriptors quickly degrades when the objects in query and database videos are rotated differently. Second, different from CNN features, handcrafted descriptors are robust against scale and rotation changes in the 2D plane, because of the local invariant feature detectors. As such, more insight should be provided on whether there is great complementarity between CNN and conventional handcrafted descriptors for better video matching and retrieval performance.

To tackle the above issues, we make the following contributions in this work:

1) We propose a Nested Invariance Pooling (NIP) method to produce compact global descriptors from CNNs by progressively encoding translation, scale, and rotation invariances. NIP is inspired by a recent invariance theory, which provides a practically useful and mathematically proven approach to computing invariant representations with feedforward neural networks. Both quantitative and qualitative evaluations of the invariance properties are provided.

2) We further improve the discriminability of deep global descriptors by incorporating hybrid pooling moments into NIP. The proposed HNIP outperforms state-of-the-art deep and handcrafted descriptors by a large margin with comparable descriptor size in the well-established video retrieval evaluation framework.

3) We analyze the complementary nature of deep features and handcrafted descriptors over diverse datasets (landmarks, scenes, and common objects). By fusing the strengths of both deep and handcrafted global descriptors, we further improve the video matching and retrieval performance, without incurring extra cost in descriptor size as per the CDVA evaluation framework.

4) Due to its superior performance, HNIP has been adopted by the CDVA Ad-hoc group as a technical reference to set up new core experiments [21], which opens up future exploration of CNN techniques in the development of standardized video descriptors. The latest core experiments involve compact deep feature representation, CNN model compression, etc.

    II. RELATED WORK

Handcrafted descriptors: Handcrafted local descriptors [22], [23], such as SIFT based on the Difference of Gaussians (DoG) detector [22], have been successfully and widely employed for image matching and localization tasks due to their robustness to scale and rotation changes. Building on local image descriptors, global image representations aim to provide statistical summaries of high-level image properties and to facilitate fast large-scale image search. In particular, for global image descriptors, the most prominent ones include Bag-of-Words (BoW) [24], [25], Fisher Vector (FV) [9], VLAD [10], Triangulation Embedding [26], and the Robust Visual Descriptor with Whitening (RVDW) [27], with which fast comparisons against a large-scale database become practical.

Given the fact that raw descriptors such as SIFT and FV may consume an extraordinarily large number of bits for transmission and storage, many compact descriptors have been developed. For example, numerous strategies have been proposed to compress SIFT using hashing [28], [29], transform coding [30], lattice coding [31], and vector quantization [32]. On the other hand, binary descriptors including BRIEF [33], ORB [34], BRISK [35], and the Ultrashort Binary Descriptor (USB) [36] were proposed, which support fast Hamming distance matching. For compact global descriptors, efforts have also been made to reduce the descriptor size, such as a tree-structured quantizer [37] for BoW histograms, locality sensitive hashing [38], dimensionality reduction and vector quantization for FV [9], and simple sign binarization for VLAD-like descriptors [9], [39].

Deep descriptors: Deep learned descriptors have been extensively applied to image retrieval [11]–[18]. Initial studies [11], [13] proposed to use representations directly extracted from the fully connected layers of a CNN. More compact global descriptors [14]–[16] can be extracted by performing either global max or average pooling (e.g., SPoC in [16]) over the feature maps output by intermediate layers. Further improvements are obtained by spatial or channel-wise weighting of the pooled descriptors [17]. Very recently, inspired by the R-CNN approach [40] used for object detection, Tolias et al. [18] proposed ROI-based pooling on deep convolutional features, the Regional Maximum Activation of Convolutions (R-MAC), which significantly improves on global pooling approaches. Though R-MAC is scale invariant to some extent, it suffers from a lack of rotation invariance. These regional deep features can also be aggregated into global descriptors with VLAD [12].

In a number of recent works [13], [41]–[43], CNNs pre-trained for image classification are repurposed for the image retrieval task by fine-tuning them with specific loss functions (e.g., Siamese or triplet networks) over carefully constructed matching and non-matching training image sets. There is considerable performance improvement when the training and test datasets are in similar domains (e.g., buildings). In this work, we aim to explicitly construct invariant deep global descriptors from the perspective of better leveraging state-of-the-art or classical CNN architectures, rather than further optimizing the learning of invariant deep descriptors.

Video descriptors: A video is typically composed of a number of moving frames. Therefore, a straightforward method for video descriptor representation is to extract feature descriptors at the frame level and then reduce the temporal redundancies of these descriptors for compact representation. For local descriptors, Baroffio et al. [44] proposed both intra- and inter-feature coding methods for SIFT in the context of visual sensor networks, and an additional mode decision scheme based on rate-distortion optimization was designed to further improve the feature coding efficiency. In [45], [46], studies have been conducted to encode binary features such as BRISK [35].


Makar et al. [47] presented a temporally coherent keypoint detector to allow efficient interframe coding of canonical patches, corresponding feature descriptors, and locations towards mobile augmented reality applications. Chao et al. [48] developed a key-point encoding technique, where locations, scales, and orientations extracted from the original videos are encoded and transmitted along with the compressed video to the server. Recently, the temporal dependencies of global descriptors have also been exploited. For BoW extracted from a video sequence, Baroffio et al. [49] proposed an intra-frame coding method with uniform scalar quantization, as well as an inter-frame technique with arithmetic coding of the quantized symbols. Chen et al. [50], [51] proposed an encoding scheme for scalable residual-based global signatures, given the fact that the REVVs [39] of adjacent frames share most codewords and residual vectors.

Besides the frame-level approaches, aggregations of local descriptors over video slots and of global descriptors over scenes have also been intensively explored [52]–[55]. In [54], temporal aggregation strategies for large-scale video retrieval were experimentally studied and evaluated with the CDVS global descriptors [56]. Four aggregation modes, including local feature, global signature, tracking-based, and independent frame-based aggregation schemes, were investigated. For video-level CNN representation, in [57] the authors applied FV and VLAD aggregation techniques over dense local features of CNN activation maps for video event detection.

    III. MPEG CDVA

MPEG CDVA [7] aims to standardize the bitstream of compact video descriptors for large-scale video analysis. The CDVA standardization imposes two main technical requirements on the dedicated descriptors: compactness and robustness. On the one hand, compact representation is an efficient way to economize transmission bandwidth, storage space, and computational cost. On the other hand, representations that are robust to geometric transformations such as rotation and scale variations are particularly required. To this end, in the 115th MPEG meeting, the CXM [8] was released, which mainly relies on the MPEG CDVS reference software TM14.2 [6] for keyframe-level compact and robust handcrafted descriptor representation based on scale and rotation invariant local features.

    A. CDVS-Based Handcrafted Descriptors

The MPEG CDVS [5] standardized descriptors serve as the fundamental infrastructure to represent video keyframes. The normative blocks of the CDVS standard are illustrated in Fig. 1(b), mainly involving the extraction of compressed local and compact global descriptors. First, scale and rotation invariant interest key points are detected from the image, and a subset of reliable key points is retained, followed by the computation of handcrafted local SIFT features. The compressed local descriptors are formed by applying a low-complexity transform coding to the local SIFT features. The compact global descriptors are Fisher vectors aggregated from the selected local features, followed by scalable descriptor compression with simple sign binarization. Basically, pairwise image matching is accomplished by first comparing compact global descriptors, then further performing geometric

consistency checking (GCC) with compressed local descriptors. The CDVS handcrafted descriptors have a very low memory footprint, while preserving competitive matching and retrieval accuracy. The standard supports operating points ranging from 512 B to 16 kB, specified for different bandwidth constraints. Overall, the 4 kB operating point achieves a good tradeoff between accuracy and complexity (e.g., transmission bitrate, search time). Thus, the CDVA CXM adopts the 4 kB descriptors for the keyframe-level representation, in which compressed local descriptors and compact global descriptors each occupy ∼2 kB per keyframe.
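To make the global-descriptor compression step concrete, the sketch below shows the simple sign binarization mentioned above applied to an aggregated global vector, followed by Hamming-distance comparison of the packed bits. This is a schematic illustration in numpy, not the normative CDVS encoder; function names are ours.

```python
import numpy as np

def sign_binarize(global_descriptor):
    """Keep only the sign of each dimension of an aggregated global
    descriptor (e.g., a Fisher vector) and pack the bits into bytes.
    Schematic only; the CDVS standard defines its own scalable encoding."""
    bits = (np.asarray(global_descriptor) > 0).astype(np.uint8)
    return np.packbits(bits)

def hamming_distance(packed_a, packed_b):
    """Fast comparison of two sign-binarized descriptors."""
    return int(np.unpackbits(np.bitwise_xor(packed_a, packed_b)).sum())
```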

    B. CDVA Evaluation Framework

Fig. 1(a) shows the evaluation framework of CDVA, including keyframe detection, CDVS descriptor extraction, transmission, and video retrieval and matching. At the client side, color histogram comparison is applied to identify keyframes in video clips. The standardized CDVS descriptors are extracted from these keyframes, which can be further packed to form CDVA descriptors [58]. Keyframe detection largely reduces the temporal redundancy in videos, resulting in low-bitrate query descriptor transmission. At the server side, the same keyframe detection and CDVS descriptor extraction are applied to the database videos. Formally, we denote the query video X = {x_1, ..., x_{N_x}} and the reference video Y = {y_1, ..., y_{N_y}}, where x and y denote keyframes, and N_x and N_y are the numbers of detected keyframes in the query and reference videos, respectively. The start and end timestamps of the keyframes are recorded, e.g., [T^s_{x_n}, T^e_{x_n}] for query keyframe x_n. Here, we briefly describe the pipeline of pairwise video matching, localization, and video retrieval with CDVA descriptors.

Pairwise video matching and localization: Pairwise video matching is performed by comparing the CDVA descriptors of the video keyframe pair (X, Y). Each keyframe in X is compared with all keyframes in Y. The video-level similarity K(X, Y) is defined as the largest matching score among all keyframe-level similarities. For example, if we consider video matching with CDVS global descriptors only,

K(X, Y) = max_{x ∈ X, y ∈ Y} k(f(x), f(y))    (1)

where k(·, ·) denotes a matching function (e.g., cosine similarity) and f(x) denotes the CDVS global descriptor of keyframe x.1
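A minimal sketch of (1), assuming each keyframe is already represented by an L2-normalized global descriptor stored as a numpy vector (so the dot product equals the cosine similarity); the function name is illustrative.

```python
import numpy as np

def video_similarity(query_frames, ref_frames):
    """Video-level similarity K(X, Y) as in (1): the maximum keyframe-level
    matching score over all query/reference keyframe pairs.
    Each element of the input lists is an L2-normalized 1-D numpy array."""
    best = -1.0
    for fx in query_frames:
        for fy in ref_frames:
            best = max(best, float(np.dot(fx, fy)))  # cosine similarity
    return best
```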

Following the matching pipeline in CDVS, if k(·, ·) exceeds a pre-defined threshold, GCC with CDVS local descriptors is subsequently applied to verify true positive keyframe matches. The keyframe-level similarity is then finally determined as the multiplication of the matching scores from both the CDVS global and local descriptors. Correspondingly, K(X, Y) in (1) is refined as the maximum of their combined similarities.

The matched keyframe timestamps between query and reference videos are recorded for evaluating the temporal localization task, i.e., locating the video segment containing the item of interest. In particular, if the multiplication of the CDVS global and local matching scores exceeds a predefined threshold, the corresponding keyframe timestamps are recorded. Assuming there are τ (τ ≤ N_x) keyframes satisfying this criterion in a query video, the matching video segment is defined as T′_start = min{T^s_{x_n}} and T′_end = max{T^e_{x_n}}, where 1 ≤ n ≤ τ. As such, we can obtain the predicted matching video segment by descriptor matching, as shown in Fig. 1(c).

1 We use the same notation for deep global descriptors in the following section.

Fig. 1. (a) Illustration of MPEG CDVA evaluation framework. (b) Descriptor extraction pipeline for MPEG CDVS. (c) Temporal localization of item of interest between video pair.

Video retrieval: Video retrieval differs from pairwise video matching in that the former is one-to-many matching, while the latter is one-to-one matching. Thus, video retrieval shares a similar matching pipeline with pairwise video matching, except for the following differences: 1) For each query keyframe, the top K_g candidate keyframes are retrieved from the database by comparing CDVS global descriptors. Subsequently, GCC reranking with CDVS local descriptors is performed between the query keyframe and each candidate, and the top K_l database keyframes are recorded. The default choices for K_g and K_l are 500 and 100, respectively. 2) For each query video, all returned database keyframes are merged into candidate database videos according to their video indices. Then, the video-level similarity between the query and each candidate database video is obtained following the same principle as pairwise video matching. Finally, the top-ranked candidate database videos are returned.
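The following sketch outlines the shortlist-and-rerank step for one query keyframe described above. It is a simplified illustration: `gcc_verify` is a hypothetical stand-in for the CDVS local-descriptor geometric consistency check, and the default K_g and K_l follow the text.

```python
import numpy as np

def retrieve_keyframes(q_desc, db_descs, gcc_verify, Kg=500, Kl=100):
    """Shortlist-and-rerank retrieval for one query keyframe.
    q_desc   : L2-normalized global descriptor of the query keyframe.
    db_descs : (N, D) matrix of L2-normalized database keyframe descriptors.
    gcc_verify: callable(db_index) -> geometric-consistency score
                (stand-in for CDVS local-descriptor matching)."""
    scores = db_descs @ q_desc                    # cosine similarities
    shortlist = np.argsort(-scores)[:Kg]          # top-Kg by global descriptor
    reranked = sorted(shortlist, key=lambda i: -gcc_verify(i))
    return reranked[:Kl]                          # top-Kl after GCC reranking
```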

    IV. METHOD

    A. Hybrid Nested Invariance Pooling

Fig. 2(a) shows the extraction pipeline of our compact deep invariant global descriptors with HNIP. Given a video keyframe x, we rotate it R times (with step size θ°). By forwarding each rotated image through a pre-trained deep CNN, the feature maps output by an intermediate layer (e.g., a convolutional layer) are represented by a cube W × H × C, where W and H denote the width and height of each feature map, respectively, and C is the number of channels in the feature maps. Subsequently, we extract a set of ROIs from the cube using an overlapping sliding window, with window size W′ ≤ W and H′ ≤ H. The window size is adjusted to incorporate ROIs with different scales (e.g., 5 × 5, 10 × 10). Here, we denote the number of scales as S. Finally, a 5-D data structure γ_x(G_t, G_s, G_r, C) ∈ R^{W′×H′×S×R×C} is derived, which encodes the translations G_t (i.e., spatial locations W′ × H′), scales G_s, and rotations G_r of the input keyframe x.

HNIP aims to aggregate the 5-D data into a compact deep invariant global descriptor. In particular, it first performs pooling over translations (W′ × H′), then scales (S), and finally rotations (R) in a nested way, resulting in a C-dimensional global descriptor. Formally, for the c-th feature map, n_t-norm pooling over translations G_t is given by

γ_x(G_s, G_r, c) = ( (1 / (W′ × H′)) Σ_{j=1}^{W′×H′} γ_x(j, G_s, G_r, c)^{n_t} )^{1/n_t}    (2)

where the pooling operation has a parameter n_t defining the statistical moment, e.g., n_t = 1 is first order (i.e., average pooling), n_t → +∞ on the other extreme is of infinite order (i.e., max pooling), and n_t = 2 is second order (i.e., square-root pooling). Equation (2) leads to the 3-D data γ_x(G_s, G_r, C) ∈ R^{S×R×C}. Analogously, n_s-norm pooling over the scale transformations G_s and the subsequent n_r-norm pooling over the rotation transformations G_r are

γ_x(G_r, c) = ( (1/S) Σ_{j=1}^{S} γ_x(j, G_r, c)^{n_s} )^{1/n_s},    (3)

γ_x(c) = ( (1/R) Σ_{j=1}^{R} γ_x(j, c)^{n_r} )^{1/n_r}.    (4)
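A minimal numpy sketch of the nested n-norm pooling in (2)–(4), assuming the ROI-cropped activations have already been arranged into a (W′, H′, S, R, C) array; the helper names are ours, not the authors' code, and the defaults reproduce the hybrid HNIP moments (Squ-Avg-Max) discussed later.

```python
import numpy as np

def norm_pool(x, n, axis):
    """Generalized-mean (n-norm) pooling along one axis:
    n = 1 -> average, n = 2 -> square-root, n = inf -> max."""
    if np.isinf(n):
        return x.max(axis=axis)
    return np.mean(np.abs(x) ** n, axis=axis) ** (1.0 / n)

def nip(rois, nt=2, ns=1, nr=np.inf):
    """Nested invariance pooling over a 5-D tensor of shape (W', H', S, R, C):
    translations -> scales -> rotations, following (2)-(4)."""
    wp, hp, s, r, c = rois.shape
    x = rois.reshape(wp * hp, s, r, c)
    x = norm_pool(x, nt, axis=0)   # (2): pool over translations -> (S, R, C)
    x = norm_pool(x, ns, axis=0)   # (3): pool over scales       -> (R, C)
    x = norm_pool(x, nr, axis=0)   # (4): pool over rotations    -> (C,)
    return x                       # C-dimensional global descriptor
```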


Fig. 2. (a) Nested invariance pooling (NIP) on feature maps extracted from an intermediate layer of the CNN architecture. (b) A single convolution-pooling operation of a CNN, schematized for a single input layer and a single output neuron. The parallel with the invariance theory shows that the universal building block of CNNs is compatible with the incorporation of invariance to local translations of the input.

The corresponding global descriptor is obtained by concatenating γ_x(c) over all feature maps,

f(x) = {γ_x(c)}_{0 ≤ c < C}.    (5)

The similarity between two keyframes x and y is then measured as

k(f(x), f(y)) = β(x) β(y) Σ_{c=1}^{C} ⟨γ_x(c), γ_y(c)⟩    (6)

where β(·) is a normalization term computed by β(x) = (Σ_{c=1}^{C} ⟨γ_x(c), γ_x(c)⟩)^{-1/2}. Equation (6) refers to the cosine similarity obtained by accumulating the scalar products of the normalized pooled features for each feature map. HNIP descriptors can be further improved by post-processing techniques such as PCA whitening [16], [18]. In this work, the global descriptor is first L2 normalized, followed by PCA projection and whitening with a pre-trained PCA matrix. The whitened vectors are L2 normalized and compared with (6).
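The post-processing chain just described (L2 normalization, PCA projection and whitening with a pre-trained matrix, L2 normalization, then a dot-product comparison equivalent to (6)) can be sketched as follows; the shapes and parameter names are illustrative, with the PCA statistics assumed to be learned offline (the paper samples 40K distractor frames for this purpose).

```python
import numpy as np

def l2n(v):
    """L2-normalize a descriptor."""
    return v / (np.linalg.norm(v) + 1e-12)

def postprocess(desc, pca_mean, pca_eigvecs, pca_eigvals):
    """L2-normalize, PCA-whiten with pre-trained statistics, L2-normalize.
    pca_eigvecs: (D, d) matrix whose columns are principal directions.
    The whitened outputs are compared with a dot product (cosine similarity)."""
    v = l2n(desc)
    v = (pca_eigvecs.T @ (v - pca_mean)) / np.sqrt(pca_eigvals + 1e-12)
    return l2n(v)
```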

In the following, we investigate HNIP in more detail. In Section IV-B, inspired by a recent invariance theory [59], HNIP is shown to be approximately invariant to translation, scale, and rotation transformations, independently of the statistical moments chosen in the nested pooling stages. In Section IV-C, both quantitative and qualitative evaluations are presented to illustrate the invariance properties. Moreover, we observe that the statistical moments in HNIP drastically affect video matching performance. Our empirical results show that the optimal nested pooling moments are: n_t = 2 (second order), n_s = 1 (first order), and n_r of infinite order.

    B. Theory on Transformation Invariance

Invariance theory in a nutshell: Recently, Anselmi and Poggio [59] proposed an invariance theory exploring how signal (e.g., image) representations can be made invariant to various transformations. Denoting f(x) as the representation of image x, f(x) is invariant to a transformation g (e.g., rotation) if f(x) = f(g · x) holds ∀g ∈ G, where we define the orbit of x under a transformation group G as O_x = {g · x | g ∈ G}. It can easily be shown that O_x is globally invariant to the action of any element of G, and thus any descriptor computed directly from O_x will be globally invariant to G.

More specifically, the invariant descriptor f(x) can be derived in two stages. First, given a predefined template t (e.g., a convolutional filter in a CNN), we compute the dot products of t over the orbit: D_{x,t} = {⟨g · x, t⟩ ∈ R | g ∈ G}. Second, the extracted invariant descriptor should be a histogram representation of the distribution D_{x,t} with a specific bin configuration, for example, the statistical moments (e.g., mean, max, standard deviation) derived from D_{x,t}. Such a representation is mathematically proven to have the proper invariance property for transformations such as translations (G_t), scales (G_s), and rotations (G_r). One may note that the transformation g can be applied either to the image or to the template indifferently, i.e., ⟨g · x, t⟩ = ⟨x, g · t⟩ for all g ∈ G. Recent work on face verification [60] and music classification [61] successfully applied this theory to practical applications. More details about the invariance theory can be found in [59].
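A toy sketch of the two-stage recipe above: project the (finite) orbit of an input onto a template and summarize the resulting distribution D_{x,t} with statistical moments. The function and its arguments are illustrative only.

```python
import numpy as np

def invariant_signature(image, template, transforms, moments=("mean", "max")):
    """Toy illustration of the invariance theory.
    transforms: list of callables g(.) forming a finite approximation of a
    transformation group; image and g(image) must have the template's shape."""
    dots = np.array([np.sum(g(image) * template) for g in transforms])  # D_{x,t}
    stats = {"mean": dots.mean(), "max": dots.max(), "std": dots.std()}
    return np.array([stats[m] for m in moments])

# Example: the group of 90-degree rotations of a square image. Rotating the
# input merely permutes the orbit, so the returned moments do not change.
rot90_group = [lambda im, k=k: np.rot90(im, k) for k in range(4)]
```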

An example: translation invariance of CNN: The convolution-pooling operations in CNNs are compliant with the invariance theory. Existing well-known CNN architectures like AlexNet [19] and VGG16 [20] share a common building block: a succession of convolution and pooling operations, which in fact provides a way to incorporate local translation invariance. As shown in Fig. 2(b), the convolution operation on translated image patches (i.e., sliding windows) is equivalent to ⟨g · x, t⟩, and the max pooling operation is in line with the statistical histogram computation over the distribution of the dot products (i.e., feature maps). For instance, considering a convolutional filter that has learned a "cat face" pattern, the filter will respond to an image depicting a cat face regardless of where the face is located in the image. Subsequently, max pooling over the activation feature maps captures the most salient feature of the cat face, which is naturally invariant to object translation.

Incorporating scale and rotation invariances into CNN: Based on the already locally translation invariant feature maps (e.g., the last pooling layer, pool5), we propose to further improve


the invariance of the pool5 CNN descriptors by incorporating global invariance to several transformation groups. The specific transformation groups considered in this study are scales G_S and rotations G_R. As one can see, it is impractical to generate all transformations g · x for the orbit O_x. In addition, the computational complexity of the feedforward pass in a CNN increases linearly with the number of transformed versions of the input x. For practical consideration, we simplify the orbit to a finite set of transformations (e.g., number of rotations R = 4, number of scales S = 3). This yields HNIP descriptors that are approximately invariant to the transformations, without a huge increase in feature extraction time.

An interesting aspect of the invariance theory is the possibility in practice to chain multiple types of group invariances one after the other, as already demonstrated in [61]. In this study, we construct descriptors invariant to several transformation groups by progressively applying the method to different transformation groups, as shown in (2)–(4).

Discussions: While there are theoretical guarantees of scale and rotation invariance for handcrafted local feature detectors such as DoG, classical CNN architectures lack invariance to these geometric transformations [62]. Many works have proposed to encode transformation invariances into both handcrafted (e.g., BoW built on densely sampled SIFT [63]) and CNN representations [64] by explicitly augmenting input images with rotation and scale transformations. Our HNIP adopts a similar idea of image augmentation, but with several significant differences. First, rather than a single pooling (max or average) layer over all transformations, HNIP progressively pools features across different transformations with different moments, which is essential for significantly improving the quality of the pooled CNN descriptors. Second, unlike previous empirical studies, we have attempted to mathematically show that the design of nested pooling ensures that HNIP is approximately invariant to multiple transformations, inspired by the recently developed invariance theory. Third, to the best of our knowledge, this work is the first to comprehensively analyze the invariance properties of CNN descriptors in the context of large-scale video matching and retrieval.

    C. Quantitative and Qualitative Evaluations

Transformation invariance: In this section, we propose a database-side data augmentation strategy for image retrieval to study the rotation and scale invariance properties, respectively. For simplicity, we represent an image as a 4096-dimensional descriptor extracted from the first fully connected layer (fc6) of VGG16 [20] pre-trained on the ImageNet dataset. We report retrieval results in terms of mean Average Precision (mAP) on the INRIA Holidays dataset [65] (500 query images, 991 reference images).

Fig. 3(a) investigates the invariance property with respect to query rotations. First, we observe that the retrieval performance drops significantly as we rotate the query images while fixing the reference images (the red curve). To gain invariance to query rotations, we rotate each reference image within a range of −180° to 180°, with a step of 10°. The fc6 features of its 36 rotated images are pooled together into one common global descriptor representation, with either max or average pooling.
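The database-side augmentation just described can be sketched as below. It is only an illustration: `extract_fc6` is a hypothetical stand-in for the VGG16 fc6 feature extractor, and scipy is used here for the image rotation.

```python
import numpy as np
from scipy.ndimage import rotate

def pooled_reference_descriptor(image, extract_fc6, step_deg=10, pool="avg"):
    """Pool fc6 descriptors of rotated copies of a reference image
    (-180 to 180 degrees in 10-degree steps, i.e., 36 copies) into a single
    rotation-robust global descriptor."""
    feats = []
    for angle in range(-180, 180, step_deg):
        feats.append(extract_fc6(rotate(image, angle, reshape=False)))
    feats = np.stack(feats)
    return feats.max(axis=0) if pool == "max" else feats.mean(axis=0)
```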

Fig. 3. Comparison of pooled descriptors invariant to (a) rotation and (b) scale changes of query images, measured by retrieval accuracy (mAP) on the Holidays dataset. The fc6 layer of VGG16 [20] pretrained on the ImageNet dataset is used.

Fig. 4. Distances for three matching pairs from the MPEG CDVA dataset (see Section VI-A for more details). For each pair, three pairwise distances (L2 normalized) are computed by progressively encoding translations (G_t), scales (G_t + G_s), and rotations (G_t + G_s + G_r) into the nested pooling stages. Average pooling is used for all transformations. Feature maps are extracted from the last pooling layer of pretrained VGG16.

We observe that the performance is relatively consistent (blue for max pooling, green for average pooling) against the rotation of query images. Moreover, performance in terms of variations of the query image scale is plotted in Fig. 3(b). It is observed that database-side augmentation by max or average pooling over scale changes (scale ratios of 0.75, 0.5, 0.375, 0.25, 0.2, and 0.125) can improve the performance when the query scale is small (e.g., 0.125).

Nesting multiple transformations: We further analyze nested pooling over multiple transformations. Fig. 4 provides insight into how progressively adding different types of transformations affects the matching distance on different image matching pairs. We can see the reduction in matching distance with the incorporation of each new transformation group. Fig. 5 takes a closer look at pairwise similarity maps between the local deep features of query keyframes and the global deep descriptors of reference keyframes, which explicitly reflect the regions of the query keyframe contributing significantly to the similarity measurement. We compare our HNIP (the third heatmap in each row) to the state-of-the-art deep descriptors MAC [18] and R-MAC [18]. Because of the introduction of scale and rotation transformations, HNIP is able to locate the query object of interest responsible for the similarity measure more precisely than MAC and R-MAC, even though there are scale and rotation changes between the query-reference pairs.

  • 1974 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 19, NO. 9, SEPTEMBER 2017

Fig. 5. Example similarity maps between local deep features of query keyframes and the global deep descriptors of reference keyframes, using off-the-shelf VGG16. For the query (left image) and reference (right image) keyframe pair in each row, the middle three similarity maps are from MAC, R-MAC, and HNIP (from left to right), respectively. Each similarity map is generated by the cosine similarity between the query local features at each feature map location and the pooled global descriptor of the reference keyframe (i.e., MAC, R-MAC, or HNIP), which allows locating the regions of the query keyframe contributing most to the pairwise similarity.

TABLE I
PAIRWISE VIDEO MATCHING BETWEEN MATCHING AND NON-MATCHING VIDEO DATASETS, WITH POOLING CARDINALITY INCREASED BY PROGRESSIVELY ENCODING TRANSLATION, SCALE, AND ROTATION TRANSFORMATIONS, FOR DIFFERENT POOLING STRATEGIES, I.E., MAX-MAX-MAX, AVG-AVG-AVG, AND OUR HNIP (SQU-AVG-MAX)

Gt          | Gt-Gs            | Gt-Gs-Gr
Max   71.9  | Max-Max   72.8   | Max-Max-Max   73.9
Avg   76.9  | Avg-Avg   79.2   | Avg-Avg-Avg   82.2
Squ   81.6  | Squ-Avg   82.7   | Squ-Avg-Max   84.3

Moreover, as shown in Table I, quantitative results on video matching by HNIP with progressively encoded multiple transformations provide additional positive evidence for the nested invariance property.

Pooling moments: In Fig. 3, it is interesting to note that any choice of pooling moment n in the pooling stage can produce invariant descriptors. However, the discriminability of NIP with varied pooling moments could be quite different. For video retrieval, we empirically observe that pooling with hybrid moments works well for NIP, e.g., starting with square-root pooling (n_t = 2) for translations and average pooling (n_s = 1) for scales, and ending with max pooling (n_r → +∞) for rotations. Here, we present an empirical analysis of how pooling moments affect pairwise video matching performance.

We construct matching and non-matching video sets from the MPEG CDVA dataset. Both sets contain 4690 video pairs. With an input video keyframe size of 640 × 480, feature maps of size 20 × 15 are extracted from the last pooling layer of VGG16 [20] pre-trained on the ImageNet dataset. For transformations, we consider nested pooling by progressively adding transformations: translations (G_t), scales (G_t-G_s), and rotations (G_t-G_s-G_r). For pooling moments, we evaluate Max-Max-Max, Avg-Avg-Avg, and our HNIP (i.e., Squ-Avg-Max). Finally, the video similarity is computed using (1) with the pooled features.

Table I reports pairwise matching performance in terms of the True Positive Rate with the False Positive Rate set to 1%, with transformations switching from G_t to G_t-G_s-G_r, for Max-Max-Max, Avg-Avg-Avg, and HNIP.

TABLE II
STATISTICS ON THE NUMBER OF RELEVANT DATABASE VIDEOS RETURNED IN THE TOP 100 LIST (I.E., RECALL@100) FOR ALL QUERY VIDEOS IN THE MPEG CDVA DATASET (SEE SECTION VI-A FOR MORE DETAILS)

             Landmarks   Scenes   Objects
HNIP \ CXM   8143        1477     1218
CXM \ HNIP   1052        105      1834

"A \ B" denotes relevant instances successfully retrieved by method A but missed in the list generated by method B. The last pooling layer of pretrained VGG16 is used for HNIP.

As more transformations are nested in, the separability between the matching and non-matching video sets becomes larger, regardless of the pooling moments used. More importantly, HNIP performs best compared to Max-Max-Max and Avg-Avg-Avg, while Max-Max-Max is the worst. For instance, HNIP outperforms Avg-Avg-Avg, i.e., 84.3% vs. 82.2%.
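For reference, the sketch below shows one common way to compute the TPR-at-1%-FPR figure used in Table I from raw matching and non-matching scores; it captures the spirit of the protocol rather than the exact MPEG evaluation code.

```python
import numpy as np

def tpr_at_fpr(matching_scores, nonmatching_scores, target_fpr=0.01):
    """Pick the score threshold at which `target_fpr` of non-matching pairs
    are accepted, then report the fraction of matching pairs above it."""
    neg = np.sort(np.asarray(nonmatching_scores))[::-1]   # descending
    k = max(int(np.floor(target_fpr * len(neg))), 1)
    threshold = neg[k - 1]
    return float(np.mean(np.asarray(matching_scores) > threshold))
```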

    V. COMBINING DEEP AND HANDCRAFTED DESCRIPTORS

In this section, we analyze the strengths and weaknesses of deep features in the context of video retrieval and matching, compared to state-of-the-art handcrafted descriptors built upon local invariant features (SIFT). To this end, we compute statistics of HNIP and of the CDVA handcrafted descriptors (CXM) by retrieving different types of video data. In particular, we focus on videos depicting landmarks, scenes, and common objects, collected by MPEG CDVA. Here we describe how the statistics are computed. First, for each query video, we retrieve the top 100 most similar database videos using HNIP and CXM, respectively. Second, for all queries of each type of video data, we accumulate the number of relevant database videos (1) retrieved by HNIP but missed by CXM (denoted as HNIP \ CXM), and (2) retrieved by CXM but missed by HNIP (CXM \ HNIP). The statistics are presented in Table II. As one can see, compared to the handcrafted descriptors of CXM, HNIP is able to identify many more relevant landmark and scene videos on which CXM fails. On the other hand, CXM recalls more videos depicting common objects than HNIP. Fig. 6 shows qualitative examples


Fig. 6. Examples of keyframe pairs which (a) HNIP determines as matching but CXM as non-matching (HNIP \ CXM) and (b) CXM determines as matching but HNIP as non-matching (CXM \ HNIP).

of keyframe pairs corresponding to HNIP \ CXM and CXM \ HNIP, respectively.

Fig. 7 further visualizes intermediate keyframe matching results produced by handcrafted and deep features, respectively. Despite the viewpoint change of the landmark images in Fig. 7(a), the salient features fired in their activation maps are spatially consistent. Similar observations hold for the indoor scene images in Fig. 7(b). These observations are probably attributable to the fact that deep descriptors excel at characterizing global salient features. On the other hand, handcrafted descriptors work on local patches detected at sparse interest points, which prefer richly textured blobs [Fig. 7(c)] rather than lower-textured ones [Fig. 7(a) and 7(b)]. This may explain why more inlier matches are found by GCC for the product images in Fig. 7(c). Finally, compared to the approximate scale and rotation invariances provided by HNIP analyzed in the previous section, handcrafted local features have built-in mechanisms that ensure nearly exact invariance to these transformations of rigid objects in the 2D plane; examples can be found in Fig. 6(b).

In summary, these observations reveal that deep learning features may not always outperform handcrafted features. There may exist complementary effects between CNN deep descriptors and handcrafted descriptors. Therefore, we propose to leverage the benefits of both deep and handcrafted descriptors. Considering that handcrafted descriptors are categorized into local and global ones, we investigate the combination of deep descriptors with either handcrafted local or global descriptors, respectively.

Combining HNIP with handcrafted local descriptors: For matching, if the HNIP matching score exceeds a threshold, we then use the handcrafted local descriptors for verification. For retrieval, the HNIP matching score is used to select the list of the top 500 candidates, and the handcrafted local descriptors are then used for reranking.

Combining HNIP and handcrafted global descriptors: Instead of simply concatenating the HNIP-derived deep descriptors and the handcrafted descriptors, for both matching and retrieval the similarity score is defined as the weighted sum of the matching scores of HNIP and of the handcrafted global descriptors,

k̂(x, y) = α · k_c(x, y) + (1 − α) · k_h(x, y)    (7)

where α is the weighting factor, and k_c and k_h represent the matching scores of HNIP and of the handcrafted descriptors, respectively. In this work, α is empirically set to 0.75.
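A direct implementation of the late fusion in (7); the only assumption is that the two keyframe-level scores are already computed (e.g., as dot products of the respective L2-normalized global descriptors).

```python
def fused_similarity(k_cnn, k_hand, alpha=0.75):
    """Weighted late fusion of the HNIP (deep) and handcrafted global
    descriptor matching scores, as in (7); alpha = 0.75 follows the paper."""
    return alpha * k_cnn + (1.0 - alpha) * k_hand
```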

    VI. EXPERIMENTAL RESULTS

    A. Datasets and Evaluation Metrics

Datasets: The MPEG CDVA ad-hoc group collects a large-scale, diverse video dataset to evaluate the effectiveness of video descriptors for video matching, localization, and retrieval applications, under resource constraints including descriptor size, extraction time, and matching complexity. This CDVA dataset2

is diversified to contain views of 1) stationary large objects, e.g., buildings and landmarks (most likely background objects, possibly partially occluded or a close-up); 2) generally smaller items (e.g., paintings, books, CD covers, products), which typically

2 The MPEG CDVA dataset and evaluation framework are available upon request at http://www.cldatlas.com/cdva/dataset.html. CDVA standard documents are available at http://mpeg.chiariglione.org/standards/exploration/compact-descriptors-video-analysis.


Fig. 7. Keyframe matching examples which illustrate the strengths and weaknesses of CNN-based deep descriptors and handcrafted descriptors. In (a) and (b), deep descriptors perform well but handcrafted ones fail, while (c) is the opposite.

TABLE III
STATISTICS ON THE MPEG CDVA BENCHMARK DATASETS

IoI: items of interest. q.v.: query videos. r.v.: reference videos.

appear in front of background scenes, possibly occluded; and 3) scenes (e.g., interior scenes, natural scenes, multi-camera shots, etc.). The CDVA dataset also comprises planar or non-planar, rigid or partially rigid, and textured or partially textured objects (scenes), which are captured from different viewpoints with different camera parameters and lighting conditions.

Specifically, the MPEG CDVA dataset contains 9974 query and 5127 reference videos (denoted as All), depicting 796 items of interest: 489 large landmarks (e.g., buildings), 71 scenes (e.g., interior or natural scenes), and 236 small common objects (e.g., paintings, books, products). The videos have durations from 1 s to over 1 min. To evaluate video retrieval on different types of video data, we categorize the query videos into Landmarks (5224 queries), Scenes (915 queries), and Objects (3835 queries). Table III summarizes the numbers of items of interest and their instances for each category. Fig. 8 shows some example video clips from the three categories.

To evaluate the performance of large-scale video retrieval, we combine the reference videos with a set of user-generated and broadcast videos as distractors, whose content is unrelated to the items of interest. There are 14537 distractor videos with more than 1000 hours of data.

Moreover, to evaluate pairwise video matching and temporal localization, 4693 matching video pairs and 46911 non-matching video pairs are constructed from the query and reference videos. The temporal location of the item of interest within each video pair is annotated as the ground truth.

We also evaluate our method on image retrieval benchmark datasets. The INRIA Holidays dataset [65] is composed of 1491 high-resolution (e.g., 2048 × 1536) scene-centric images, 500 of which are queries. This dataset includes a large variety of outdoor scene/object types: natural, man-made, water and fire effects. We evaluate the rotated version of Holidays [13], in which all images have up-right orientation. Oxford5k [66] is a buildings dataset consisting of 5062 images, mainly of size 1024 × 768. There are 55 queries composed of 11 landmarks, each represented by 5 queries. We use the provided bounding boxes to crop the query images. The University of Kentucky Benchmark (UKBench) [25] consists of 10200 VGA-size images, organized into 2550 groups of common objects, each object represented by 4 images. All 10200 images serve as queries.

Evaluation metrics: Retrieval performance is evaluated by mean Average Precision (mAP) and precision at a given cut-off rank R for query videos (Precision@R), and we set R = 100 following the MPEG CDVA standard. Pairwise video matching performance is evaluated by the Receiver Operating Characteristic (ROC) curve. We also report pairwise matching results in terms of the True Positive Rate (TPR), given a False Positive Rate (FPR) equal to 1%. In case a video pair is predicted as a match, the temporal location of the item of interest is further identified within the video pair. The localization accuracy is measured by the Jaccard Index

|[T_start, T_end] ∩ [T′_start, T′_end]| / |[T_start, T_end] ∪ [T′_start, T′_end]|

where [T_start, T_end] denotes the ground-truth and [T′_start, T′_end] the predicted start and end frame timestamps.

Besides these accuracy measurements, we also measure the complexity of the algorithms, including descriptor size, transmission bit rate, extraction time, and search time. In particular, the transmission bit rate is measured as (# query keyframes) × (descriptor size in bytes) / (query duration in seconds).
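The two quantities defined above translate directly into code; the following sketch assumes segments are given as (start, end) timestamp pairs.

```python
def temporal_jaccard(gt, pred):
    """Jaccard index between the ground-truth segment gt = (T_start, T_end)
    and the predicted segment pred = (T'_start, T'_end)."""
    inter = max(0.0, min(gt[1], pred[1]) - max(gt[0], pred[0]))
    union = (gt[1] - gt[0]) + (pred[1] - pred[0]) - inter
    return inter / union if union > 0 else 0.0

def bitrate_bps(num_query_keyframes, descriptor_bytes, duration_seconds):
    """Descriptor transmission bit rate in bytes per second, as defined above."""
    return num_query_keyframes * descriptor_bytes / duration_seconds
```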

    B. Implementation Details

In this work, we build HNIP descriptors with two widely used CNN architectures: AlexNet [19] and VGG16 [20]. We test off-the-shelf networks pre-trained on the ImageNet ILSVRC classification dataset. In particular, we crop the networks at the last pooling layer (i.e., pool5).


    Fig. 8. Example video clips from the CDVA dataset.

TABLE IV
VIDEO RETRIEVAL COMPARISON (MAP) BY PROGRESSIVELY ADDING TRANSFORMATIONS (TRANSLATION, SCALE, ROTATION) INTO NIP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Transf.     size (kB/k.f.)   Landmarks   Scenes   Objects   All
Gt          2                64.0        82.9     64.8      66.0
Gt-Gs       2                65.3        82.4     67.3      67.6
Gt-Gs-Gr    2                64.6        82.7     72.2      69.2

Average pooling is applied to all transformations. No PCA whitening is performed. kB/k.f.: descriptor size per keyframe. The best results are highlighted in bold.

We resize all video keyframes to 640 × 480 and Holidays (Oxford5k) images to 1024 × 768 as the inputs to the CNN for descriptor extraction. Finally, post-processing can be applied to pooled descriptors such as HNIP and R-MAC [18]. Following standard practice, we choose PCA whitening in this work. We randomly sample 40K frames from the distractor videos for PCA learning. These experimental setups are applied to both HNIP and the state-of-the-art deep pooled descriptors MAC [18], SPoC [16], CroW [17], and R-MAC [18].
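The feature-map extraction step described above (an off-the-shelf network truncated at pool5, fed with resized keyframes) could look as follows. The paper's implementation is based on MatConvNet; the torchvision code here is only an illustrative stand-in under that assumption.

```python
import torch
from torchvision import models, transforms
from PIL import Image

# ImageNet-pretrained VGG16 truncated at the last pooling layer (pool5).
vgg16_pool5 = models.vgg16(pretrained=True).features.eval()

preprocess = transforms.Compose([
    transforms.Resize((480, 640)),                     # keyframes resized to 640x480 (W x H)
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

def pool5_feature_maps(image_path):
    """Return the pool5 feature maps (512 x H' x W') of one keyframe image."""
    x = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg16_pool5(x).squeeze(0)
```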

We also compare HNIP with the MPEG CXM, which represents the current state-of-the-art handcrafted compact descriptors for video analysis. Following the practice in the CDVA standardization, we employ OpenMP to perform parallel retrieval for both CXM and the deep global descriptors. Experiments are conducted on the Tianhe HPC platform, where each node is equipped with 2 processors (24 cores, Xeon E5-2692) @ 2.2 GHz and 64 GB RAM. For CNN feature map extraction, we use an NVIDIA Tesla K80 GPU.

C. Evaluations on HNIP Variants

We perform video retrieval experiments to assess the effect of transformations and pooling moments in the HNIP pipeline, using off-the-shelf VGG16.

Transformations: Table IV studies the influence of pooling cardinalities by progressively adding transformations into the nested pooling stages. We simply apply average pooling to all transformations. First, the dimension of all NIP variants is 512 for VGG16, resulting in a descriptor size of 2 kB per keyframe for floating-point vectors (4 bytes per element). Second, overall, retrieval performance (mAP) increases as more transformations are nested into the pooled descriptors, e.g., from 66.0% for G_t to 69.2% for G_t-G_s-G_r on the full test dataset (All). Also, we observe that G_t-G_s-G_r outperforms G_t-G_s and G_t by a large margin on Objects, while achieving comparable performance on Landmarks and Scenes.

TABLE V
VIDEO RETRIEVAL COMPARISON (MAP) OF NIP WITH DIFFERENT POOLING MOMENTS (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Pool Op.        size (kB/k.f.)   Landmarks   Scenes   Objects   All
Max-Max-Max     2                63.2        82.9     69.9      67.6
Avg-Avg-Avg     2                64.6        82.7     72.2      69.2
Squ-Squ-Squ     2                54.2        65.6     66.5      60.0
Max-Squ-Avg     2                60.6        80.7     74.4      67.8
Hybrid (HNIP)   2                70.0        88.2     80.9      75.9

Transformations are Gt-Gs-Gr for all experiments. No PCA whitening is performed. The best results are highlighted in bold.

Revisiting the analysis of rotation-invariant pooling on the scene-centric Holidays dataset in Fig. 3(a), although invariance to query rotation changes can be gained by database-side augmented pooling, one may note that its retrieval performance is comparable to that obtained without rotating the query and reference images (i.e., the peak value of the red curve). These observations are probably because there are relatively limited rotation (scale) changes for videos depicting large landmarks or scenes, compared to small common objects. More examples can be found in Fig. 8.

Hybrid pooling moments: Table V explores the effect of pooling moments within NIP. The transformations are fixed to G_t-G_s-G_r. There are 3^3 = 27 possible combinations of pooling moments in HNIP. For simplicity, we compare our hybrid NIP (i.e., Squ-Avg-Max) to two widely used pooling strategies (i.e., max or average pooling across all transformations) and to two other schemes: square-root pooling across all transformations, and Max-Squ-Avg, which decreases the pooling moment along the way. First, for uniform pooling, Avg-Avg-Avg is overall superior to Max-Max-Max and Squ-Squ-Squ, while Squ-Squ-Squ performs much worse than the other two. Second, HNIP outperforms the best uniform pooling, Avg-Avg-Avg, by a large margin. For instance, the gains over Avg-Avg-Avg are +5.4%, +5.5%, and +8.7% on Landmarks, Scenes, and Objects, respectively. Finally, for hybrid pooling, HNIP performs significantly better than Max-Squ-Avg on all test datasets. We observe similar trends when comparing HNIP to other hybrid pooling combinations.

D. Comparisons Between HNIP and State-of-the-Art Deep Descriptors

Previous experiments show that the integration of transformations and hybrid pooling moments offers remarkable video


TABLE VI
VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART DEEP DESCRIPTORS IN TERMS OF MAP (OFF-THE-SHELF VGG16 IS USED FOR ALL TEST DATASETS)

Each dataset column reports mAP without / with PCA whitening (w/o PCAW / w/ PCAW).

method        size (kB/k.f.)   extra. time (s/k.f.)   Landmarks     Scenes        Objects       All
MAC [18]      2                0.32                   57.8 / 61.9   77.4 / 76.2   70.0 / 71.8   64.3 / 67.0
SPoC [16]     2                0.32                   64.0 / 69.1   82.9 / 84.0   64.8 / 70.3   66.0 / 70.9
CroW [17]     2                0.32                   62.3 / 63.9   79.2 / 78.4   71.9 / 72.0   67.5 / 68.3
R-MAC [18]    2                0.32                   69.4 / 74.6   84.4 / 87.3   73.8 / 78.2   72.5 / 77.1
HNIP (Ours)   2                0.96                   70.0 / 74.8   88.2 / 90.1   80.9 / 85.0   75.9 / 80.1

We implement MAC [18], SPoC [16], CroW [17] and R-MAC [18] based on the source codes released by the authors, while following the same experimental setups as our HNIP. The best results are highlighted in bold.

TABLE VII
EFFECT OF THE NUMBER OF DETECTED VIDEO KEYFRAMES ON DESCRIPTOR TRANSMISSION BIT RATE, RETRIEVAL PERFORMANCE (MAP), AND SEARCH TIME, ON THE FULL TEST DATASET (ALL)

# query k.f.    # DB k.f.       method         size (kB/k.f.)   bit rate (Bps)   mAP    search time (s/q.v.)
∼140K (1.6%)    ∼105K (2.4%)    CXM            ∼4               2840             73.6   12.4
                                AlexNet-HNIP   1                459              71.4   1.6
                                VGG16-HNIP     2                918              80.1   2.3
∼175K (2.0%)    ∼132K (3.0%)    CXM            ∼4               3463             74.3   16.6
                                AlexNet-HNIP   1                571              71.9   2.0
                                VGG16-HNIP     2                1143             80.6   2.8
∼231K (2.7%)    ∼176K (3.9%)    CXM            ∼4               4494             74.6   21.0
                                AlexNet-HNIP   1                759              71.9   2.2
                                VGG16-HNIP     2                1518             80.7   3.1

We report performance of state-of-the-art handcrafted descriptors (CXM), and PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. Numbers in brackets denote the percentage of detected keyframes from raw videos. Bps: bytes per second. s/q.v.: seconds per query video.

retrieval performance improvements. Here, we conduct another round of video retrieval experiments to validate the effectiveness of our optimal reported HNIP, compared to state-of-the-art deep descriptors [16]-[18].
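As a reminder of how the mAP numbers in Tables VI-VIII can be read, a minimal sketch of average precision over a ranked shortlist follows. One common AP definition is used and the top-100 shortlist mentioned later in this section is assumed; the official CDVA evaluation tool may normalise differently.

import numpy as np

def average_precision(ranked_relevance, shortlist=100):
    """AP of one query from a binary relevance list of its ranked results,
    truncated to a shortlist; mAP is the mean of AP over all queries.
    The normalisation (dividing by the number of relevant items found in the
    shortlist) is an assumption of this sketch."""
    rel = np.asarray(ranked_relevance[:shortlist], dtype=float)
    if rel.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rel) / (np.arange(rel.size) + 1)
    return float((prec_at_k * rel).sum() / rel.sum())

# e.g. average_precision([1, 0, 1, 0]) == (1/1 + 2/3) / 2, roughly 0.83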

Effect of PCA whitening: Table VI studies the effect of PCA whitening on the video retrieval performance (mAP) of different deep descriptors, using off-the-shelf VGG16. Overall, PCA whitened descriptors perform better than their counterparts without PCA whitening. More specifically, the improvements on SPoC, R-MAC and our HNIP are much larger than those on MAC and CroW. In view of this, we apply PCA whitening to HNIP in the following sections.
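A minimal numpy sketch of the PCA whitening step is shown below, with the transform learned from descriptors of the 40K sampled distractor frames mentioned earlier. The L2 re-normalisation after projection is an assumption of common practice rather than a detail confirmed by the paper.

import numpy as np

def fit_pca_whitening(train_desc, out_dim=512, eps=1e-6):
    """Learn a PCA-whitening transform from a (N, D) matrix of training
    descriptors (e.g. pooled descriptors of sampled distractor frames)."""
    mu = train_desc.mean(axis=0)
    cov = np.cov(train_desc - mu, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:out_dim]          # keep leading components
    proj = eigvecs[:, order] / np.sqrt(eigvals[order] + eps)
    return mu, proj

def apply_pca_whitening(desc, mu, proj):
    z = (desc - mu) @ proj
    return z / (np.linalg.norm(z) + 1e-12)               # assumed L2 re-normalisation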

HNIP versus MAC, SPoC, CroW and R-MAC: Table VI presents the comparison of HNIP against state-of-the-art deep descriptors. We observe that HNIP obtains consistently better performance than the other approaches on all test datasets, at the cost of extra extraction time.3 HNIP significantly improves the retrieval performance over MAC [18], SPoC [16] and CroW [17], e.g., by over 10% in mAP on the full test dataset (All). Compared

3The extraction time of deep descriptors is mainly decomposed into 1) the feed-forward pass for extracting feature maps and 2) pooling over feature maps followed by post-processing such as PCA whitening. In our implementation based on MatConvNet, the first stage takes 0.21 seconds per keyframe (VGA-size input image to VGG16 executed on an NVIDIA Tesla K80 GPU); HNIP is four times slower (∼0.84) as there are four rotations for each keyframe. The second stage is ∼0.11 seconds for MAC, SPoC and CroW, ∼0.115 seconds for R-MAC and ∼0.12 seconds for HNIP. Therefore, the extraction time of HNIP is roughly three times that of the others.

TABLE VIII
VIDEO RETRIEVAL COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM), FOR ALL TEST DATASETS

method          Landmarks    Scenes       Objects      All
CXM             61.4/60.9    63.0/61.9    92.6/91.2    73.6/72.6
AlexNet-HNIP    65.2/62.3    77.6/74.1    78.4/74.5    71.4/68.1
VGG16-HNIP      74.8/71.6    90.1/86.6    85.0/81.3    80.1/76.7

The former (latter) number in each cell represents performance in terms of mAP (Precision@R). We report performance of PCA whitened HNIP with both off-the-shelf AlexNet and VGG16. The best results are highlighted in bold.

with the state-of-the-art R-MAC [18], +7% mAP improvement is achieved on Objects, which is mainly attributed to the improved robustness against the rotation changes in videos (the keyframes capture small objects from different angles).

    E. Comparisons Between HNIP and Handcrafted Descriptors

In this section, we first study the influence of the number of detected video keyframes on descriptor transmission bit rate, retrieval performance and search time. Then, with the keyframes fixed, we compare HNIP to the state-of-the-art compact handcrafted descriptors (CXM), which currently obtain the optimal video retrieval performance on the MPEG CDVA datasets.

Effect of the number of detected video keyframes: As shown in Table VII, we generate three keyframe detection configurations by varying the detection parameters.


Fig. 9. Pairwise video matching comparison of HNIP with state-of-the-art handcrafted descriptors (CXM) in terms of ROC curves, for all test datasets. Experimental settings are identical with those in Table VIII.

Also, we test their retrieval performance and complexity for CXM descriptors (∼4 kB per keyframe) and PCA whitened HNIP with both off-the-shelf AlexNet (1 kB per keyframe) and VGG16 (2 kB per keyframe), on the full test dataset (All). It is easy to see that descriptor transmission bit rate and search time increase proportionally with the number of detected keyframes. However, there is little retrieval performance gain for any descriptor, i.e., less than 1% in mAP. Thus, we adopt the first configuration throughout this work, which achieves a good tradeoff between accuracy and complexity. For instance, VGG16-HNIP reaches 80.1% mAP with a descriptor transmission bit rate of only 918 Bytes per second (see Table VII).
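The bit rates in Table VII couple descriptor size with keyframe density. A hedged sketch of the relation is given below; the formula (total descriptor bytes divided by video duration) is an assumption about how such numbers are derived, and the example values are hypothetical rather than taken from the CDVA dataset.

def descriptor_bit_rate(num_keyframes, bytes_per_keyframe, video_seconds):
    """Bytes per second needed to transmit per-keyframe descriptors.
    Assumed relation for illustration; not the official CDVA accounting."""
    return num_keyframes * bytes_per_keyframe / video_seconds

# e.g. 12 keyframes of a 2 kB descriptor in a 30 s query clip:
# descriptor_bit_rate(12, 2048, 30.0) == 819.2 Bps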

Video retrieval: Table VIII shows the video retrieval comparison of HNIP with the handcrafted descriptors CXM on all test datasets. Overall, AlexNet-HNIP is inferior to CXM, while VGG16-HNIP performs the best. Second, HNIP with both AlexNet and VGG16 outperforms CXM on Landmarks and Scenes. The performance gap between HNIP and CXM becomes larger as the network goes deeper from AlexNet to VGG16, e.g., AlexNet-HNIP and VGG16-HNIP improve over CXM by 3.8% and 13.4% in mAP on Landmarks, respectively. Third, we observe that AlexNet-HNIP performs much worse than CXM on Objects (e.g., 74.5% vs. 91.2% in Precision@R). VGG16-HNIP reduces the gap, but still underperforms CXM. This is reasonable, as handcrafted descriptors based on SIFT are more robust to scale and rotation changes of rigid objects in the 2D plane.

Video pairwise matching and localization: Fig. 9 and Table IX further show the pairwise video matching and temporal localization performance of HNIP and CXM on all test datasets, respectively. For pairwise video matching, VGG16-HNIP and AlexNet-HNIP consistently outperform CXM in terms of TPR at varied FPR on Landmarks and Scenes. In Table IX, we observe that the performance trends of temporal localization are roughly consistent with those of pairwise video matching.
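For reference, the TPR-at-fixed-FPR operating points used in Fig. 9 and Fig. 10 can be computed as in the generic sketch below; the two score arrays (similarities of matching and non-matching video pairs) are hypothetical inputs for illustration.

import numpy as np

def tpr_at_fpr(match_scores, nonmatch_scores, target_fpr=0.01):
    """TPR of a pairwise matching score at the threshold whose false positive
    rate equals `target_fpr` (e.g. the FPR = 1% point reported in Fig. 10)."""
    neg = np.asarray(nonmatch_scores, dtype=float)
    thresh = np.quantile(neg, 1.0 - target_fpr)   # fraction of negatives above the threshold = target_fpr
    return float((np.asarray(match_scores) > thresh).mean())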

One may note that the localization accuracy of CXM is worse than that of HNIP on Objects (see Table IX), although CXM obtains much better video retrieval mAP than HNIP on Objects (see Table VIII). First, given a query-reference video pair, video retrieval tries to identify the most similar keyframe pair, whereas temporal localization aims to locate multiple keyframe pairs by comparing against a predefined threshold. Second, as shown in Fig. 9, CXM achieves better TPR (Recall) than both

TABLE IX
VIDEO LOCALIZATION COMPARISON OF HNIP WITH STATE-OF-THE-ART HANDCRAFTED DESCRIPTORS (CXM) IN TERMS OF JACCARD INDEX, FOR ALL TEST DATASETS

method          Landmarks   Scenes   Objects   All
CXM             45.5        45.9     68.8      54.4
AlexNet-HNIP    48.9        63.0     67.3      57.1
VGG16-HNIP      50.8        63.8     71.2      59.7

Experimental settings are the same as in Table VIII. The best results are highlighted in bold.

TABLE X
IMAGE RETRIEVAL COMPARISON (MAP) OF HNIP WITH STATE-OF-THE-ART DEEP AND HANDCRAFTED DESCRIPTORS (CXM), ON HOLIDAYS, OXFORD5K, AND UKBENCH

method        Holidays   Oxford5k   UKbench
CXM           71.2       43.5       3.46
MAC [18]      78.3       56.1       3.65
SPoC [16]     84.5       68.6       3.68
R-MAC [18]    87.2       67.6       3.73
HNIP (Ours)   88.9       69.3       3.90

We report performance of PCA whitened deep descriptors with off-the-shelf VGG16. The best results are highlighted in bold.

VGG16-HNIP and AlexNet-HNIP on Objects when the FPR is small (e.g., FPR = 1%), while its TPR falls behind as the FPR increases. This implies that 1) CXM ranks relevant videos and keyframes higher than HNIP in the retrieved list for object queries, which leads to better mAP on Objects when retrieval performance is evaluated over a small shortlist (100 in our experiments); and 2) VGG16-HNIP gains higher Recall than CXM when the FPR becomes large, which leads to higher localization accuracy on Objects. In other words, towards better temporal localization, we choose a small threshold (the corresponding FPR = 14.3% in our experiments) in order to recall as many relevant keyframes as possible.
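The Jaccard index used for temporal localization in Table IX is an intersection-over-union of temporal extents. A minimal sketch for a single predicted/ground-truth interval pair is given below; the actual CDVA protocol aggregates over the matched keyframe pairs of each video and may differ in detail.

def temporal_jaccard(pred, truth):
    """Jaccard (intersection-over-union) of two temporal intervals,
    each given as (start_time, end_time) in seconds."""
    inter = max(0.0, min(pred[1], truth[1]) - max(pred[0], truth[0]))
    union = (pred[1] - pred[0]) + (truth[1] - truth[0]) - inter
    return inter / union if union > 0 else 0.0

# e.g. temporal_jaccard((10, 30), (15, 40)) == 15 / 30 == 0.5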

Image retrieval: To further verify the effectiveness of HNIP, we conduct image instance retrieval experiments on the scene-centric Holidays, landmark-centric Oxford5k and object-centric UKbench datasets. Table X shows the comparison of HNIP with MAC, SPoC, R-MAC, and the handcrafted descriptors from the MPEG CDVA evaluation framework. First, we observe that HNIP outperforms


Fig. 10. (a) and (b) Video retrieval, (c) pairwise video matching, and (d) localization performance of the optimal reported HNIP (i.e., VGG16-HNIP) combined with either CXM local or CXM global descriptors, for all test datasets. For simplicity, we report the pairwise video matching performance in terms of TPR given FPR = 1%.

TABLE XI
VIDEO RETRIEVAL COMPARISON OF HNIP WITH CXM AND THE COMBINATION OF HNIP WITH CXM-LOCAL AND CXM-GLOBAL, RESPECTIVELY, ON THE FULL TEST DATASET (ALL), WITHOUT ("W/O D") OR WITH ("W/ D") COMBINATION OF THE LARGE SCALE DISTRACTOR VIDEOS

Each pair of numbers gives the value w/o D / w/ D.

method                     size (kB/k.f.)   mAP           Precision@R   # query k.f.   # DB k.f.          search time (s/q.v.)
CXM                        ∼4               73.6 / 72.1   72.6 / 71.2   ∼140K          ∼105K / ∼1.25M     12.4 / 38.6
VGG16-HNIP                 2                80.1 / 76.8   76.7 / 73.6                                     2.3 / 9.2
VGG16-HNIP + CXM-Local     ∼4               75.7 / 75.4   74.4 / 74.1                                     12.9 / 17.8
VGG16-HNIP + CXM-Global    ∼4               84.9 / 82.6   82.4 / 80.3                                     4.9 / 39.5

the handcrafted descriptors by a large margin on all datasets. Second, HNIP performs significantly better than the state-of-the-art deep descriptor R-MAC on UKbench, though it shows only marginally better performance on Holidays. This performance advantage of HNIP over R-MAC is consistent with the video retrieval results on CDVA-Scenes and CDVA-Objects in Table VI. It is again demonstrated that HNIP tends to be more effective on object-centric datasets than on scene- and landmark-centric datasets, as object-centric datasets usually exhibit more rotation and scale distortions.

    F. Combination of HNIP and Handcrafted Descriptors

CXM contains both compressed local descriptors (∼2 kB/frame) and compact global descriptors (∼2 kB/frame) aggregated from the local ones. Following the combination strategies designed in Section V, Fig. 10 shows the effectiveness of combining VGG16-HNIP with either CXM-Global or CXM-Local descriptors,4 for video retrieval (a) & (b), matching (c) and localization (d). First, we observe that combining VGG16-HNIP with either CXM-Global or CXM-Local consistently improves CXM across all tasks on all test datasets. In this regard, the improvements of VGG16-HNIP + CXM-Global are much larger than those of VGG16-HNIP + CXM-Local, especially for Landmarks and Scenes. Second, VGG16-HNIP + CXM-Global performs better on all test datasets in video retrieval, matching and localization (except for localization accuracy on Landmarks).

4Here, we did not introduce the more complicated combination of VGG-HNIP + CXM-Global + CXM-Local, because its performance is very close to that of VGG-HNIP + CXM-Local, and moreover it further increases descriptor size and search time compared to VGG-HNIP + CXM-Local.

In particular, VGG16-HNIP + CXM-Global significantly improves VGG16-HNIP on Objects in terms of mAP and Precision@R (+10%). This leads us to the conclusion that the deep descriptor VGG16-HNIP and the handcrafted descriptor CXM-Global are complementary to each other. Third, we observe that VGG16-HNIP + CXM-Local significantly degrades the performance of VGG16-HNIP on Landmarks and Scenes, e.g., there is a drop of ∼10% in mAP on Landmarks. This is due to the fact that matching pairs retrieved by HNIP (but missed by the handcrafted features) cannot pass the GCC step, i.e., the number of inliers (patch-level matching pairs) is insufficient. For instance, in Fig. 7, the landmark pair is determined to be a match by VGG16-HNIP, but the subsequent GCC step considers it a non-match because there are only 2 matched patch pairs. More examples can be found in Fig. 6(a).
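The two combination behaviours discussed here can be sketched as follows. This is only a hedged illustration of the ideas (GCC inlier gating for CXM-Local, score-level fusion for CXM-Global); the actual strategies are defined in Section V, and the thresholds and fusion weight below are hypothetical.

def combine_with_local(hnip_score, num_inliers, inlier_threshold=5, hnip_threshold=0.6):
    """Sketch of deep + local combination: a keyframe pair is accepted only
    when both the HNIP similarity and the number of geometrically consistent
    local matches (GCC inliers) are high enough. Thresholds are hypothetical."""
    return hnip_score >= hnip_threshold and num_inliers >= inlier_threshold

def combine_with_global(hnip_score, cxm_global_score, w=0.5):
    """Sketch of score-level fusion of HNIP and CXM-Global similarities;
    the weight is hypothetical."""
    return w * hnip_score + (1.0 - w) * cxm_global_score

Under such a scheme, a pair that HNIP ranks highly but that yields only 2 GCC inliers (as in the Fig. 7 example) is rejected by the local combination while remaining a strong candidate for the global fusion, which matches the trends observed in Fig. 10.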

    G. Large Scale Video Retrieval

Table XI studies the video retrieval performance of CXM, VGG16-HNIP, and their combinations VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, on the full test dataset (All) without and with the large scale distractor video set. By combining the reference videos with the large scale distractor set, the number of database keyframes increases from ∼105K to ∼1.25M, making the search significantly slower. For example, HNIP is ∼5 times slower with the 512-D Euclidean distance computation. Further compressing HNIP into extremely compact codes (e.g., 256 bits) for ultra-fast Hamming distance computation is highly desirable, provided it does not incur considerable performance loss; we will study this in future work. Second, the performance ordering of the approaches remains the same in the large scale experiments, i.e., VGG16-HNIP + CXM-Global performs the best, followed by VGG16-HNIP, VGG16-HNIP + CXM-Local and CXM. Finally, when increasing the database size by 10×, we observe that the performance loss is relatively small, e.g., −1.5%, −3.3%, −0.3% and −2.3% in mAP for CXM, VGG16-HNIP, VGG16-HNIP + CXM-Local and VGG16-HNIP + CXM-Global, respectively.
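To make the future-work direction concrete, a minimal sketch of sign binarization with packed Hamming distance is given below. This is not part of the proposed pipeline; the zero thresholds and the 512-bit code length (one bit per whitened dimension) are assumptions for illustration, and a 256-bit code would correspond to keeping fewer dimensions.

import numpy as np

def binarize(desc, thresholds=None):
    """Sign-binarise a (whitened) descriptor into a packed binary code;
    zero thresholds are assumed here for illustration."""
    t = np.zeros(desc.shape[-1]) if thresholds is None else thresholds
    bits = (desc > t).astype(np.uint8)
    return np.packbits(bits, axis=-1)          # 512 dims -> 64 bytes

def hamming_distance(code_a, code_b):
    """Number of differing bits between two packed binary codes."""
    return int(np.unpackbits(np.bitwise_xor(code_a, code_b)).sum())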

    VII. CONCLUSION AND DISCUSSIONS

In this work, we propose a compact and discriminative CNN descriptor, HNIP, for video retrieval, matching and localization. Based on the invariance theory, HNIP is shown to be robust to multiple geometric transformations. More importantly, our empirical studies show that the statistical moments in HNIP dramatically affect video matching performance, which leads us to the design of hybrid pooling moments within HNIP. In addition, we study the complementary nature of deep learned and handcrafted descriptors, and then propose a strategy to combine the two. Experimental results demonstrate that the HNIP descriptor significantly outperforms state-of-the-art deep and handcrafted descriptors, with comparable or even smaller descriptor size. Furthermore, the combination of HNIP and handcrafted descriptors offers the optimal performance.

This work provides valuable insights for the ongoing CDVA standardization efforts. During the recent 116th MPEG meeting in Oct. 2016, the MPEG CDVA Ad-hoc group adopted the proposed HNIP into core experiments [21] for investigating practical issues when dealing with deep learned descriptors in the well-established CDVA evaluation framework. There are several directions for future work. First, an in-depth theoretical analysis of how pooling moments affect video matching performance would further reveal and clarify the mechanism of hybrid pooling, which may contribute to the invariance theory. Second, it is interesting to study how to further improve retrieval performance by optimizing deep features, e.g., fine-tuning a CNN tailored for the video retrieval task, instead of the off-the-shelf CNNs used in this work. Third, to accelerate search, further compressing deep descriptors into extremely compact codes (e.g., dozens of bits) while still preserving retrieval accuracy is worth investigating. Last but not least, as a CNN incurs a huge number of network model parameters (over 10 million), how to effectively and efficiently compress CNN models is a promising direction.

    REFERENCES

[1] Compact Descriptors for Video Analysis: Objectives, Applications and Use Cases, ISO/IEC JTC1/SC29/WG11/N14507, 2014.

[2] Compact Descriptors for Video Analysis: Requirements for Search Applications, ISO/IEC JTC1/SC29/WG11/N15040, 2014.

[3] B. Girod et al., "Mobile visual search," IEEE Signal Process. Mag., vol. 28, no. 4, pp. 61–76, Jul. 2011.

[4] R. Ji et al., "Learning compact visual descriptor for low bit rate mobile landmark search," in Proc. Int. Joint Conf. Art. Int., Jul. 2011, vol. 22, no. 3, pp. 2456–2463.

[5] L.-Y. Duan et al., "Overview of the MPEG-CDVS standard," IEEE Trans. Image Process., vol. 25, no. 1, pp. 179–194, Jan. 2016.

[6] Test Model 14: Compact Descriptors for Visual Search, ISO/IEC JTC1/SC29/WG11/W15372, 2011.

[7] Call for Proposals for Compact Descriptors for Video Analysis (CDVA) - Search and Retrieval, ISO/IEC JTC1/SC29/WG11/N15339, 2015.

[8] CDVA Experimentation Model (CXM) 0.2, ISO/IEC JTC1/SC29/WG11/W16274, 2015.

[9] F. Perronnin, Y. Liu, J. Sánchez, and H. Poirier, "Large-scale image retrieval with compressed Fisher vectors," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3384–3391.

[10] H. Jégou, M. Douze, C. Schmid, and P. Pérez, "Aggregating local descriptors into a compact image representation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2010, pp. 3304–3311.

[11] A. Sharif Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: An astounding baseline for recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, Jun. 2014, pp. 512–519.

[12] Y. Gong, L. Wang, R. Guo, and S. Lazebnik, "Multi-scale orderless pooling of deep convolutional activation features," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 392–407.

[13] A. Babenko, A. Slesarev, A. Chigorin, and V. Lempitsky, "Neural codes for image retrieval," in Proc. Eur. Conf. Comput. Vis., 2014, pp. 584–599.

[14] A. S. Razavian, J. Sullivan, A. Maki, and S. Carlsson, "A baseline for visual instance retrieval with deep convolutional networks," CoRR, 2014. [Online]. Available: http://arxiv.org/abs/1412.6574

[15] H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, "From generic to specific deep representations for visual recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recog. Workshops, 2015, pp. 36–45.

[16] A. Babenko and V. Lempitsky, "Aggregating local deep features for image retrieval," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2015, pp. 1269–1277.

[17] Y. Kalantidis, C. Mellina, and S. Osindero, "Cross-dimensional weighting for aggregated deep convolutional features," in Proc. Eur. Conf. Comput. Vis. Workshops, 2016, pp. 685–701.

[18] G. Tolias, R. Sicre, and H. Jégou, "Particular object retrieval with integral max-pooling of CNN activations," in Proc. Int. Conf. Learn. Rep., 2016.

[19] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. Adv. Neural Inf. Process. Syst., 2012, pp. 1097–1105.

[20] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Rep., 2015.

[21] Description of Core Experiments in CDVA, ISO/IEC JTC1/SC29/WG11/W16510, 2016.

[22] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Int. J. Comput. Vis., vol. 60, no. 2, pp. 91–110, 2004.

[23] H. Bay, T. Tuytelaars, and L. Van Gool, "SURF: Speeded up robust features," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 404–417.

[24] J. Sivic and A. Zisserman, "Video Google: A text retrieval approach to object matching in videos," in Proc. IEEE Int. Conf. Comput. Vis., Oct. 2003, vol. 2, pp. 1470–1477.

[25] D. Nistér and H. Stewénius, "Scalable recognition with a vocabulary tree," in Proc. Comput. Vis. Pattern Recog., 2006, pp. 2161–2168.

[26] H. Jégou and A. Zisserman, "Triangulation embedding and democratic aggregation for image search," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 3310–3317.

[27] S. S. Husain and M. Bober, "Improving large-scale image retrieval through robust aggregation of local descriptors," IEEE Trans. Pattern Anal. Mach. Intell., to be published.

[28] M. Raginsky and S. Lazebnik, "Locality-sensitive binary codes from shift-invariant kernels," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1509–1517.

[29] Y. Weiss, A. Torralba, and R. Fergus, "Spectral hashing," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1753–1760.

[30] V. Chandrasekhar et al., "Transform coding of image feature descriptors," in Proc. IS&T/SPIE Electron. Imag., 2009.

[31] V. Chandrasekhar et al., "CHoG: Compressed histogram of gradients a low bit-rate feature descriptor," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2009, pp. 2504–2511.

[32] H. Jegou, M. Douze, and C. Schmid, "Product quantization for nearest neighbor search," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 1, pp. 117–128, Jan. 2011.

[33] M. Calonder et al., "BRIEF: Computing a local binary descriptor very fast," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1281–1298, Jul. 2012.

[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.


[35] S. Leutenegger, M. Chli, and R. Y. Siegwart, "BRISK: Binary robust invariant scalable keypoints," in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2548–2555.

[36] S. Zhang, Q. Tian, Q. Huang, W. Gao, and Y. Rui, "USB: Ultrashort binary descriptor for fast visual matching and retrieval," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3671–3683, Aug. 2014.

[37] D. M. Chen et al., "Tree histogram coding for mobile image matching," in Proc. Data Compression Conf., 2009, pp. 143–152.

[38] B. Kulis and K. Grauman, "Kernelized locality-sensitive hashing for scalable image search," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep.-Oct. 2009, pp. 2130–2137.

[39] D. Chen et al., "Residual enhanced visual vector as a compact signature for mobile visual search," Signal Process., vol. 93, no. 8, pp. 2316–2327, 2013.

[40] R. Girshick, J. Donahue, T. Darrell, and J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2014, pp. 580–587.

[41] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. Comput. Vis. Pattern Recog., Jun. 2016, pp. 5297–5307.

[42] F. Radenović, G. Tolias, and O. Chum, "CNN image retrieval learns from BoW: Unsupervised fine-tuning with hard examples," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 3–20.

[43] A. Gordo, J. Almazán, J. Revaud, and D. Larlus, "Deep image retrieval: Learning global representations for image search," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 241–257.

[44] L. Baroffio, M. Cesana, A. Redondi, M. Tagliasacchi, and S. Tubaro, "Coding visual features extracted from video sequences," IEEE Trans. Image Process., vol. 23, no. 5, pp. 2262–2276, May 2014.

[45] A. Redondi, L. Baroffio, M. Cesana, and M. Tagliasacchi, "Compress-then-analyze vs. analyze-then-compress: Two paradigms for image analysis in visual sensor networks," in Proc. IEEE Int. Workshop Multimedia Signal Process., Sep.-Oct. 2013, pp. 278–282.

[46] L. Baroffio, J. Ascenso, M. Cesana, A. Redondi, and M. Tagliasacchi, "Coding binary local features extracted from video sequences," in Proc. IEEE Int. Conf. Image Process., Oct. 2014, pp. 2794–2798.

[47] M. Makar, V. Chandrasekhar, S. Tsai, D. Chen, and B. Girod, "Interframe coding of feature descriptors for mobile augmented reality," IEEE Trans. Image Process., vol. 23, no. 8, pp. 3352–3367, Aug. 2014.

[48] J. Chao and E. G. Steinbach, "Keypoint encoding for improved feature extraction from compressed video at low bitrates," IEEE Trans. Multimedia, vol. 18, no. 1, pp. 25–39, Jan. 2016.

[49] L. Baroffio et al., "Coding local and global binary visual features extracted from video sequences," IEEE Trans. Image Process., vol. 24, no. 11, pp. 3546–3560, Nov. 2015.

    [51] D. M. Chen and B. Girod, “A hybrid mobile visual search system withcompact global signatures,” IEEE Trans. Multimedia, vol. 17, no. 7,pp. 1019–1030, Jul. 2015.

    [52] C.-Z. Zhu, H. Jégou, and S. Satoh, “Nii team: Query-adaptive asymmetri-cal dissimilarities for instance search,” in Proc. TRECVID 2013 Workshop,Gaithersburg, USA, 2013, pp. 1705–1712.

    [53] N. Ballas et al., “Irim at trecvid 2014: Semantic indexing and instancesearch,” in Proc. TRECVID 2014 Workshop, 2014.

    [54] A. Araujo, J. Chaves, R. Angst, and B. Girod, “Temporal aggregationfor large-scale query-by-image video retrieval,” in Proc. IEEE Int. Conf.Image Process., Sep. 2015, pp. 1519–1522.

    [55] M. Shi, T. Furon, and H. Jégou, “A group testing framework for simi-larity search in high-dimensional spaces,” in Proc. 22nd ACM Int. Conf.Multimedia. 2014, pp. 407–416.

    [56] J. Lin et al., “Rate-adaptive compact fisher codes for mobile vi-sual search,” IEEE Signal Process. Lett., vol. 21, no. 2, pp. 195–198,Feb. 2014.

    [57] Z. Xu, Y. Yang, and A. G. Hauptmann, “A discriminative CNN videorepresentation for event detection,” in Proc. IEEE Conf. Comput. Vis.Pattern Recog., Jun. 2015, pp. 1798–1807.

    [58] L.-Y. et al., “Compact Descriptors for Video Analysis: TheEmerging MPEG Standard,” CoRR, 2017. [Online]. Available:http://arxiv.org/abs/1704.08141

    [59] F. Anselmi and T. Poggio, “Representation learning in sensory cortex: Atheory,” in Proc. Center Brains, Minds Mach., 2014.

    [60] Q. Liao, J. Z. Leibo, and T. Poggio, “Learning invariant representations andapplications to face verification,” in Proc. Advances Neural Inf. Process.Syst., Lake Tahoe, NV, 2013, pp. 3057–3065.

    [61] C. Zhang, G. Evangelopoulos, S. Voinea, L. Rosasco, and T. Poggio, “Adeep representation for invariance and music classification,” in Proc. IEEEInt. Conf. Acoust., Speech, Signal Process., May 2014, pp. 6984–6988.

    [62] K. Lenc and A. Vedaldi, “Understanding image representations by mea-suring their equivariance and equivalence,” in Proc. IEEE Conf. Comput.Vis. Pattern Recog., Jun. 2015, pp. 991–995.

    [63] J. R. R. Uijlings, A. W. M. Smeulders, and R. J. H. Scha, “Real-timevisual concept classification.” IEEE Trans. Multimedia, vol. 12, no. 7,pp. 665–681, Nov. 2010.

    [64] M. Cimpoi, S. Maji, and A. Vedaldi, “Deep filter banks for texture recogni-tion and segmentation.” in Proc. IEEE Conf. Comput. Vis. Pattern Recog.,Jun. 2015, pp. 3828–3836.

    [65] H. Jégou, M. Douze, and C. Schmid, “Hamming embedding and weakgeometric consistency for large scale image search,” in Proc. Eur. Conf.Comput. Vis., 2008, pp. 304–317.

    [66] J. Philbin, O. Chum, M. Isard, J. Sivic, and A. Zisserman, “Object retrievalwith large vocabularies and fast spatial matching,” in Proc. IEEE Conf.Comput. Vis. Pattern Recog., Jun. 2007, pp. 1–8.

Jie Lin received the B.S. and Ph.D. degrees from the School of Computer Science and Technology, Beijing Jiaotong University, Beijing, China, in 2006 and 2014, respectively.

He is currently a Research Scientist with the Institute for Infocomm Research, A*STAR, Singapore. He was previously a visiting student in the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, and the Institute of Digital Media, Peking University, Beijing, China, from 2011 to 2014. His research interests include deep learning, feature coding and large-scale image/video retrieval. His work on image feature coding has been recognized as a core contribution to the MPEG-7 Compact Descriptors for Visual Search (CDVS) standard.

Ling-Yu Duan (M'09) received the M.Sc. degree in automation from the University of Science and Technology of China, Hefei, China, in 1999, the M.Sc. degree in computer science from the National University of Singapore (NUS), Singapore, in 2002, and the Ph.D. degree in information technology from The University of Newcastle, Callaghan, Australia, in 2008.

He is currently a Full Professor with the National Engineering Laboratory of Video Technology, School of Electronics Engineering and Computer Science, Peking University (PKU), Beijing, China. He has been the Associate Director of the Rapid-Rich Object Search Laboratory, a joint lab between Nanyang Technological University, Singapore, and PKU, since 2012. Before he joined PKU, he was a Research Scientist with the Institute for Infocomm Research, Singapore, from Mar. 2003 to Aug. 2008. He has authored or coauthored more than 130 research papers in international journals and conferences. His research interests include multimedia indexing, search, and retrieval, mobile visual search, visual feature coding, and video analytics. Prior to 2010, his research was mainly focused on multimedia (semantic) content analysis, especially in the domains of broadcast sports videos and TV commercial videos.

Prof. Duan was the recipient of the EURASIP Journal on Image and Video Processing Best Paper Award in 2015, and the Ministry of Education Technology Invention Award (First Prize) in 2016. He was a co-editor of the MPEG Compact Descriptor for Visual Search (CDVS) standard (ISO/IEC 15938-13), and is a Co-Chair of MPEG Compact Descriptor for Video Analytics (CDVA). His recent major achievements focus on the compact representation of visual features and high-performance image search. He made significant contributions to the completed MPEG-CDVS standard. The suite of CDVS technologies has been successfully deployed, impacting visual search products/services of leading Internet companies such as Tencent (WeChat) and Baidu (Image Search Engine).

  • LIN et al.: HNIP: COMPACT DEEP INVARIANT REPRESENTATIONS FOR VIDEO MATCHING, LOCALIZATION, AND RETRIEVAL 1983

Shiqi Wang received the B.S. degree in computer science from the Harbin Institute of Technology, Harbin, China, in 2008, and the Ph.D. degree in computer application technology from Peking University, Beijing, China, in 2014.

From March 2014 to March 2016, he was a Postdoc Fellow with the Department of Electrical and Computer Engineering, University of Waterloo, Waterloo, ON, Canada. From April 2016 to April 2017, he was with the Rapid-Rich Object Search Laboratory, Nanyang Technological University, Singapore, as a Research Fellow. He is currently an Assistant Professor in the Department of Computer Science, City University of Hong Kong, Hong Kong. He has proposed more than 30 technical proposals to ISO/MPEG, ITU-T, and AVS standards. His research interests include image/video compression, analysis and quality assessment.

Yan Bai received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electrical Engineering and Computer Science, Peking University, Beijing, China.

Her research interests include large-scale video retrieval and fine-grained visual recognition.

Yihang Lou received the B.S. degree in software engineering from Dalian University of Technology, Liaoning, China, in 2015, and is currently working toward the M.S. degree at the School of Electrical Engineering and Computer Science, Peking University, Beijing, China.

His current interests include large-scale video retrieval and object detection.

Vijay Chandrasekhar received the B.S. and M.S. degrees from Carnegie Mellon University, Pittsburgh, PA, USA, in 2002 and 2005, respectively, and the Ph.D. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2013.

He has authored or coauthored more than 80 papers/MPEG contributions in a wide range of top-tier journals/conferences such as the International Journal of Computer Vision, ICCV, CVPR, the IEEE Signal Processing Magazine, ACM Multimedia, the IEEE TRANSACTIONS ON IMAGE PROCESSING, Designs, Codes and Cryptography, the International Society of Music Information Retrieval, and MPEG-CDVS, and has filed 7 U.S. patents (one granted, six pending). His research interests include mobile audio and visual search, large-scale image and video retrieval, machine learning, and data compression. His Ph.D. work on feature compression led to the MPEG-CDVS (Compact Descriptors for Visual Search) standard, to which he actively contributed from 2010 to 2013.

Dr. Chandrasekhar was the recipient of the A*STAR National Science Scholarship (NSS) in 2002.

Tiejun Huang received the B.S. and M.S. degrees in computer science from the Wuhan University of Technology, Wuhan, China, in 1992 and 1995, respectively, and the Ph.D. degree in pattern recognition and intelligent systems from the Huazhong (Central China) University of Science and Technology, Wuhan, China, in 1998.

He is a Professor and the Chair of the Department of Computer Science, School of Electronic Engineering and Computer Science, Peking University, Beijing, China. His research areas include video coding, image understanding, and neuromorphic computing.

Prof. Huang is a Member of the Board of the Chinese Institute of Electronics and the Advisory Board of IEEE Computing Now. He was the recipient of the National Science Fund for Distinguished Young Scholars of China in 2014, and was awarded the Distinguished Professor of the Chang Jiang Scholars Program by the Ministry of Education in 2015.

Alex Kot (S'85-M'89-SM'98-F'06) has been with Nanyang Technological University, Singapore, since 1991. He headed the Division of Information Engineering, School of Electrical and Electronic Engineering, for eight years, and was an Associate Chair/Research and the Vice Dean Research of the School of Electrical and Electronic Engineering. He is currently a Professor with the College of Engineering and the Director of the Rapid-Rich Object Search Laboratory. He has authored or coauthored extensively in the areas of signal processing for communication, biometrics, data-hiding, image forensics, and information security.

Prof. Kot is a Member of the IEEE Fellow Evaluation Committee and a Fellow of the Academy of Engineering, Singapore. He was the recipient of the Best Teacher of the Year Award and is a coauthor of several Best Paper Awards, including ICPR, IEEE WIFS, ICEC, and IWDW. He has served the IEEE Signal Processing Society in various capacities, such as the General Co-Chair of the 2004 IEEE International Conference on Image Processing, a Chair of the worldwide SPS Chapter Chairs, and the Distinguished Lecturer Program. He is the Vice President of the IEEE Signal Processing Society. He was a Guest Editor for Special Issues of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY and the EURASIP Journal on Advances in Signal Processing. He is also an Editor of the EURASIP Journal on Advances in Signal Processing. He is an IEEE SPS Distinguished Lecturer. He was an Associate Editor of the IEEE TRANSACTIONS ON IMAGE PROCESSING, the IEEE TRANSACTIONS ON SIGNAL PROCESSING, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE SIGNAL PROCESSING LETTERS, the IEEE SIGNAL PROCESSING MAGAZINE, the IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS I, and the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II.

Wen Gao (S'87-M'88-SM'05-F'09) received the Ph.D. degree in electronics engineering from the University of Tokyo, Tokyo, Japan, in 1991.

He was a Professor of Computer Science with the Harbin Institute of Technology, Harbin, China, from 1991 to 1995, and a Professor with the Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China. He is currently a Professor of computer science with the Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University, Beijing, China.

Prof. Gao has served on the editorial boards of several journals, such as the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, the IEEE TRANSACTIONS ON MULTIMEDIA, the IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, the EURASIP Journal of Image Communications, and the Journal of Visual Communication and Image Representation. He has chaired a number of prestigious international conferences on multimedia and video signal processing, such as IEEE ICME and ACM Multimedia, and has also served on the advisory and technical committees of numerous professional organizations.
