Fine granularity Semantic Video Annotation: an approach based on Automatic Shot level Concept Detection and Object Recognition

Vanessa El-Khoury, Martin Jergler, Getnet Abebe, David Coquil, Harald Kosch
University of Passau
Chair of Distributed Information Systems
94032 Passau, Germany
[email protected]


ABSTRACT
Fine-grained video content indexing, retrieval, and adaptation require accurate metadata describing the video structure and semantics down to the lowest granularity, i.e., the object level. We address these requirements by proposing the Semantic Video Content Annotation Tool (SVCAT) for structural and high-level semantic annotation. SVCAT is a semi-automatic, MPEG-7 standard compliant annotation tool, which produces metadata according to a new object-based video content model. Videos are temporally segmented into shots, and shot-level concepts are detected automatically using ImageNet as background knowledge. These concepts are used as a guide to easily locate and select objects of interest, which can then be tracked automatically. The integration of shot-based concept detection with object localization and tracking drastically alleviates the task of an annotator. As such, SVCAT enables the easy generation of selective and fine-grained metadata, which are vital for user-centric, object-level semantic video operations such as product placement or obscene material removal. Experimental results show that SVCAT is able to provide accurate object-level video metadata.

Keywords: Structural and Semantic Annotation, Video Content Analysis, Object Contour Tracking, MPEG-7, Shot Annotation, Concept Detection, Keyframe Classification

INTRODUCTION

In recent years, the rapid evolution of technology has contributed to the distribution of huge amounts of video data over the web. A critical prerequisite to effectively index, query, retrieve, adapt and consume such masses of information is the availability of video content annotation tools providing semantic and structural metadata at different levels of granularity. In several scenarios, applications can benefit from accurate and rich metadata, in particular from metadata related to objects and their spatial properties. Typical examples include the following:

• Object-based video mining (e.g., video indexing or video summarization), which is arguably more efficient when relying on the semantics of the objects present in the video (Weber, Lefevre, & Gancarski, 2010).

• Spatial semantic adaptation of video, such that a high priority is attached to regions of interest to maximize the quality of the adapted content while targeting Universal Multimedia Access (Bruyne et al., 2011).

• Automatic fine-grained personalization of video content, for instance using an adaptation decision-taking engine that employs a utility-based approach. In such an approach, the utility of adaptation options is evaluated by considering several quality parameters derived from the metadata describing the video content (El-Khoury, Coquil, Bennani, & Brunie, 2012).

To realize fine-grained video annotation, the video must first be segmented. Afterwards, a thorough video analysis is required to accurately identify and describe objects, events, and the temporal and spatial properties of the objects. The extracted information should then be represented in an interoperable format that enables its exploitation by a large number of applications. To this end, the definition of an expressive video content model is required. Finally, to store all the above information, an annotation language capable of covering the video content model must be used. Many video annotation tools have been proposed in the literature (Dasiopoulou, Giannakidou, Litos, Malasioti, & Kompatsiaris, 2011). With respect to the production of fine-grained structural and semantic metadata, these tools have a number of limitations, which are mainly related to (i) their annotation model and metadata format, (ii) the accuracy of object-level annotation and (iii) their degree of automation.

Annotation model and metadata format: Many tools are based on video models that lack expressiveness along the spatial and semantic dimensions. Indeed, they do not integrate the object layer in their model of the video structure. They just rely on the metadata extracted from the object to describe shots and scenes, thus bridging the gap only for the temporal structure. Moreover, they store the generated metadata in custom formats rather than standardized description documents. This reduces their adoption, as exploiting the produced data in applications requires learning these specific formats and developing ad-hoc parsers. In contrast, standardized formats such as MPEG-7 increase interoperability since they provide a well-defined, documented structure.

Accuracy of object-level annotation: the support for video annotation at the object level by the existing tools is rather poor. The few that support it at all often lack precision in the specification of the spatial properties of the object: its selection relies on imprecise drawing tools such as bounding boxes and polygons. Even for the tools that offer this functionality with acceptable accuracy, it remains impractical for relatively large videos due to its limited degree of automation.

Degree of automation: semantic and structural annotation of videos includes several processes that are very time consuming if they have to be executed manually. This is in particular the case for identifying semantic temporal structures like scenes and localizing the salient objects in each frame. Furthermore, as enormously challenging as it is to locate segments and objects of interest in a video by hand, existing systems do not provide generic and sufficient support for it. Although several tools enable automatic shot detection as well as localization and annotation of the object in one frame, none of them supports the propagation of object descriptions to the next frames without human intervention. For instance, the propagation is done by dragging the object descriptions or by copying them with one mouse click from frame to frame. We argue that automation can significantly increase the performance of the video analysis and is indispensable to process large videos efficiently.

To overcome these limitations, we propose SVCAT for structural and high-level semantic annotation. Moreover, we present a new video data model, which captures the low-level feature, high-level concept and structural information generated by SVCAT. SVCAT is a highly automated (i.e., semi-automatic), standard compliant (i.e., MPEG-7) and very accurate (i.e., object-level granularity at pixel precision) annotation tool that generates a fine granularity semantic description in two phases. The first phase deals with the identification of semantic entities in the temporal dimension based on concept detection in shots, which is achieved by classifying keyframes of the shot into ImageNet large-scale hierarchical image categories. This produces a very high level description of the shots. Using this description as a guide, users can easily navigate to and select objects of interest, which the system then tracks automatically, using a contour tracking algorithm, to produce the finer descriptions. This two-level arrangement makes SVCAT easily customizable to varying application scenarios and reduces the effort required by annotators to locate objects of interest.

Besides structural annotation (scenes, shots, frames, objects), SVCAT provides a mechanism to attach semantic descriptions to the segments. Indeed, descriptive keywords derived from MPEG-7 classification schemes (CS) are attached to them. These CS define controlled vocabularies, which are necessary to specify semantics distinctly and render the generated descriptions interoperable. Fundamentally, SVCAT is an attempt to interconnect research in concept detection and object tracking to facilitate automated video annotation.

The remainder of this paper is structured as follows. In the next section, we analyze the requirements of a video annotation tool. Then, we present our object-based video model and describe in detail the architecture and functionalities of SVCAT. Following that, evaluation results regarding the accuracy and performance of SVCAT are discussed. Finally, we overview existing video annotation tools and position SVCAT among them before giving conclusions and future perspectives.

VIDEO CONTENT ANNOTATION REQUIREMENTS

In this section, we examine the main requirements for an efficient video content annotation tool that enables the production of structural and semantic metadata at different levels of granularity. These requirements are considered in light of the criteria discussed in the previous section: interoperability, high degree of automation, and accuracy. To increase the applicability of the tool, we also stress that it should be bound neither to a specific type of videos (domain independence) nor restricted to a specific type of objects. In line with how SVCAT operates, we categorize these requirements into three groups, namely automatic shot annotation, object annotation, and video model/metadata requirements, and describe them in more detail in the following.

Automatic shot annotation requirements

Automatic shot annotation consists of temporal video segmentation followed by concept detection. The segmentation consists of decomposing the video content into shots and detecting keyframes within the shots. Shots are defined as sequences of images taken without interruption by a single camera. The problem of automatic shot boundary detection has attracted much attention, enabling state-of-the-art shot segmentation techniques to reach satisfying levels of performance, as demonstrated by the TREC Video Retrieval Evaluation track (TRECVID) (Smeaton, Over, & Doherty, 2010). Because shots possess temporal redundancies, they are customarily abstracted with keyframes. Keyframes are typical frames (static images) that contain the salient objects and events of the shot and contain little redundancy or overlapped content. Given a shot S having n frames, i.e., S = (f_1, f_2, ..., f_n), let d_j, j ∈ {1, 2, ..., n}, be the difference of feature values between consecutive frames. The set of keyframes is then a collection of frames K = {f_k1, f_k2, ..., f_km} ⊂ S, where m < n and each selected d_j is a local maximum with respect to the differences at the preceding and following frames.
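For illustration, the following Python sketch selects keyframes as local maxima of the consecutive-frame feature differences just defined. It assumes each frame has already been reduced to a feature vector (e.g., a color histogram) and is only one straightforward reading of the criterion, not the exact selection procedure implemented in SVCAT.

import numpy as np

def keyframes_from_features(features):
    # features: one feature vector per frame of the shot
    feats = np.asarray(features, dtype=float)
    # d[j] = feature difference between frame j+1 and frame j
    d = np.linalg.norm(feats[1:] - feats[:-1], axis=1)
    keyframes = []
    for j in range(1, len(d) - 1):
        if d[j] > d[j - 1] and d[j] > d[j + 1]:  # local maximum
            keyframes.append(j + 1)              # index of the frame ending that transition
    return keyframes

# Toy example: six frames described by 3-bin histograms; the content changes at frame 3.
frames = [[1, 0, 0], [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 1, 0]]
print(keyframes_from_features(frames))  # -> [3]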

Once temporal video units such as shots are identified, the next step in video annotation is to detect semantic entities, i.e., people, objects, settings and events appearing in these units, and to label them accordingly. Given the size of videos, doing this manually is difficult and can sensibly be used only to produce short video-level descriptions. For a complex content like video, this means insufficient content description. Social tagging methods, enabled by the proliferation of Web 2.0 applications like YouTube and Vimeo, are also known to produce ambiguous, overly personalized and limited annotations (Ulges, Schulze, Koch, & Breuel, 2010). A feasible alternative is automatic annotation that derives semantic labels for videos based on automatic concept detection - a process of inferring the presence of semantic concepts in the video stream (Snoek & Worring, 2009).

Concept detection is considered as a classification problem where a finite set of concept detectors is trained over low-level features of the video. It consists of training and testing phases. During the training phase, annotated visual concept lexicons are chosen and discriminative methods like support vector machines (SVMs), k-nearest neighbor (KNN) classifiers, or decision trees are used to train on positive and negative examples of each concept. In the testing phase, a probability value indicating the existence of concepts is assigned to an input video (Snoek & Worring, 2009). Applied on shots, concept detection deals with the extraction of appropriate features from the shot to estimate concept scores indicating the probability of a certain concept's presence in the shot. This involves several tasks, as depicted in Figure 1. The figure gives a scheme of automatic shot annotation based on the bag of visual features (BoVF) image representation model, where the shots are abstracted with keyframes. Visual classification using BoVF has four basic steps: feature extraction, codebook generation, image encoding and image classification (Csurka, Dance, Fan, Willamowski, & Bray, 2004). In the following sections, we discuss each of these tasks concisely and explain important parameters along the way.

Feature extraction: identifies salient regions in the image and describes their characteristics. A good quality annotation necessitates these features to be highly discriminative. In the last decade, several local descriptors such as histograms of oriented gradients (HOG), gradient location and orientation histogram (GLOH), speeded up robust features (SURF), and the scale invariant feature transform (SIFT) have been used for various purposes (Linderberg, 2013). However, SIFT has proven to be successful in image matching, object recognition and image classification research. SIFT descriptors are based on grayscale gradient orientations of keypoints obtained from scale-space extrema of differences-of-Gaussians (DoG). They are invariant to translation, rotation and scaling transformations.
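As a concrete illustration of this step, the sketch below computes SIFT descriptors on a dense regular grid with OpenCV. It stands in for the dense SIFT/VLFeat setup used later in the paper; the grid step and patch size are arbitrary placeholders, and cv2.SIFT_create requires a reasonably recent OpenCV build.

import cv2
import numpy as np

def dense_sift(gray, step=8, size=8):
    # Place keypoints on a regular grid and describe each with a 128-D SIFT descriptor.
    sift = cv2.SIFT_create()
    h, w = gray.shape
    grid = [cv2.KeyPoint(float(x), float(y), float(size))
            for y in range(step, h - step, step)
            for x in range(step, w - step, step)]
    _, descriptors = sift.compute(gray, grid)
    return descriptors

gray = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
print(dense_sift(gray).shape)  # (number of grid points, 128)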

Figure 1. BoVF based shot annotation pipeline

Codebook generation: with the BoVF representation, images are described as a normalized histogram of visual features over a visual vocabulary (codebook). A codebook is generated by applying clustering algorithms such as k-means or a Gaussian mixture model (GMM) to the features extracted from a training set. The cluster centers form the codewords, which in essence are the most representative patterns of the image set. New features extracted from images are labeled with these codewords. The number of codewords (codebook length) plays an important role in the accuracy of the image representation; it has been shown that larger codebooks give better results in image classification (Linderberg, 2013).
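The codebook step can be sketched as follows in Python, using scikit-learn's mini-batch k-means as one of the clustering choices mentioned above; the codebook size of 1000 is an arbitrary placeholder, not a value used in our experiments.

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_codebook(descriptor_sets, k=1000, seed=0):
    # Stack the descriptors of all training images and cluster them;
    # the k cluster centres are the codewords of the visual vocabulary.
    all_descriptors = np.vstack(descriptor_sets)
    kmeans = MiniBatchKMeans(n_clusters=k, random_state=seed, n_init=3)
    kmeans.fit(all_descriptors)
    return kmeans.cluster_centers_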

Image encoding: assigns the detected features to the codewords using hard or soft techniques (Chatfield, Lempitsky, Vedaldi, & Zisserman, 2011). Hard-assignment encoding (also called vector encoding) assigns an extracted feature to only one best matching codeword. Let the codewords in the codebook be cw_1, ..., cw_K, the set of descriptors sampled from an image be d_1, ..., d_N, and ψ_i the assignment of descriptor d_i to a codeword. Hard-assignment encoding assigns each extracted feature to its best matching codeword by optimizing

ψ_i = argmin_m ‖ d_i − cw_m ‖²

when, for instance, the Euclidean distance is used (‖ · ‖ represents the l2 distance). The image representation then becomes the non-negative vector f_hist ∈ R^K such that [f_hist]_m = |{i : ψ_i = m}|. On the other hand, soft-assignment methods (also known as kernel codebook encoding techniques) represent images by estimating the posterior probability of features for each codeword. Besides, sparse and locality coding schemes, which approximate a local feature by an optimized linear combination of a few visual words and code it with the optimized coefficients, have recently been introduced. A typical example is locality-constrained linear coding (LLC), which projects each image descriptor d_i onto a local linear subspace spanned by a small number of visual words close to d_i. This encoding requires first setting the number of nearest visual words to be used. Assuming that a small set of nearest codewords is used to encode d_i, its code will be a K-dimensional vector containing the weighted linear approximation of d_i over these words and zeros for the other words. The image-level description is then obtained by sum or max pooling (Chatfield et al., 2011).
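A minimal sketch of the hard-assignment variant, the simplest of the encodings above: each descriptor votes for its nearest codeword and the votes are accumulated into the normalized histogram f_hist (soft assignment and LLC are not shown).

import numpy as np

def bovf_histogram(descriptors, codebook):
    # Squared Euclidean distances between every descriptor and every codeword.
    d2 = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)                 # psi_i = argmin_m ||d_i - cw_m||^2
    hist = np.bincount(nearest, minlength=len(codebook)).astype(float)
    return hist / max(hist.sum(), 1.0)          # L1-normalized image representation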

Image classification: this phase analyzes the image descriptors to organize images into distinct and exclusive classes. In the past, SVM and Random Forest based classification schemes have been shown to perform well on high-dimensional data like image data (Caruana, Karampatziakis, & Yessenalina, 2008). SVM classifiers map training data to a higher dimensional space using kernel functions such as linear, polynomial, radial basis function, or sigmoid kernels and find a maximal margin hyperplane separating the classes of the data. New data are then classified according to the side of the hyperplane they fall on. Random Forest (RF) classifiers use a group of unpruned decision trees whose leaf nodes are labeled by estimates of the posterior distribution over the image classes. These trees are built on randomly selected subspaces of the training data. Different works have demonstrated that RF classifiers can achieve good accuracy in classifying multi-class high-dimensional data with lower computational requirements than their SVM counterparts.
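Both classifier families can be exercised on BoVF histograms with a few lines of scikit-learn; the sketch below uses synthetic data and placeholder hyperparameters purely to illustrate the interface, and says nothing about the settings used by SVCAT.

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 keyframes described by 1000-bin BoVF histograms, 5 concepts.
rng = np.random.default_rng(0)
X = rng.random((300, 1000))
y = rng.integers(0, 5, 300)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
svm = LinearSVC(C=1.0).fit(X_tr, y_tr)
print(rf.score(X_te, y_te), svm.score(X_te, y_te))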

Annotation refinement: the accuracy of automatic concept detection suffers from the high variability of visual characteristics; hence, it is common practice to minimize the effect of errors by refining the obtained annotations. Several works use a popular strategy called Content-Based Concept Fusion (CBCF) (Zhong & Miao, 2012) to achieve this. In its simplest form, CBCF uses some common concept co-occurrence reference to evaluate the quality of candidate annotations within shots. However, this is not sufficient, as using the same concept co-occurrence set for every shot or video does not work well due to differences in video contexts. In addition, the relationship between consecutive shots can give an important clue about an annotation and is worth considering. Therefore, approaches that use within-shot correlation as well as temporal correlation are adopted. For instance, the authors of (Zhong & Miao, 2012) propose the following method. Given a shot x_t, t = 1, ..., T, classified by concept detectors C_i, i = 1, ..., J, with an initial detection score of p(C_i | x_t), the refined score p̂(C_i | x_t) based on a temporal refinement term p_t(C_i | x_t) and a spatial refinement term p_s(C_i | x_t) is given as:

p̂(C_i | x_t) = λ p(C_i | x_t) + (1 − λ) (ω p_t(C_i | x_t) + (1 − ω) p_s(C_i | x_t))     (1)

where λ (0 ≤ λ ≤ 1) is a factor that tunes the influence of the initial result versus the refinement terms, and ω (0 ≤ ω ≤ 1) balances the contributions made by the two refinement terms.
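Applying equation (1) is a one-liner once the three scores are available; in the sketch below the weights λ = ω = 0.5 are arbitrary defaults, not tuned values.

def refine_score(p_init, p_temporal, p_spatial, lam=0.5, omega=0.5):
    # Refined score of equation (1): blend the initial detection score with
    # the temporal and spatial refinement terms.
    return lam * p_init + (1 - lam) * (omega * p_temporal + (1 - omega) * p_spatial)

print(refine_score(0.8, 0.6, 0.9))  # -> 0.775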

We summarize this section by restating that automatic shot annotation requires temporal video decomposition followed by effective concept detection in the segments obtained. This necessitates efficient shot boundary detection and keyframe selection techniques, a good quality visual concept lexicon for training, and a cleverly designed concept detection approach. Furthermore, the annotations obtained must be refined to enhance their quality.

Object annotation requirements

The preceding section describes the requisites to obtain shot-level annotations, which are the high-level descriptions of the contents of the shots. In this section, we explain the essentials of localization and annotation of objects to get finer descriptions.

Object Representation. In order to annotate the spatial properties of video objects, an object representation model must first be selected. Several models have been proposed in the literature: the objects can be represented as sets of points, simple geometric shapes (e.g., rectangle, ellipse), or by using articulated shape models, skeletal models and contour or silhouette representations. The contour corresponds to the set of pixels forming the boundary of an object, whereas the silhouette is the region inside the contour. For the purpose of SVCAT, we choose a combination of the contour and silhouette representation models. Indeed, the former is the most appropriate one to fulfil our goal of having an accurate representation, while the latter facilitates the computation of low-level features over the whole object. This representation also has the advantage of being able to support a huge set of object deformations, which facilitates the representation of complex, non-rigid objects (e.g., pedestrians) at pixel accuracy.

Such a combined representation can be implemented using level sets (Osher & Sethian, 1988). The level set method uses a closed curve Γ in the two-dimensional space to represent the contour. Γ is implicitly represented using an auxiliary function ϕ : R² × R → R on a fixed Cartesian grid. This function is called the level set function. The values of ϕ are the Euclidean distances from the contour Γ, which is represented by the zero level set of ϕ:

Γ = {(x, y) | ϕ(x, y) = 0}     (2)

The inside of the region delimited by Γ (i.e., the silhouette) is given negative values ϕ(x, y) < 0 and the outside of the region positive values ϕ(x, y) > 0. The level set methodology provides some nice features. Unlike other representations (e.g., splines), it can handle topological changes in the object appearance like, for instance, the splitting and merging of regions. Additionally, intrinsic geometrical properties can be derived directly from the level set.
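As an illustration of this representation, the sketch below builds a signed-distance level set function from a binary silhouette mask with SciPy, following the sign convention above (negative inside, positive outside); it is an illustrative construction, not the SVCAT implementation.

import numpy as np
from scipy.ndimage import distance_transform_edt

def level_set_from_mask(mask):
    # phi < 0 inside the silhouette, phi > 0 outside; the contour is the zero level set.
    mask = mask.astype(bool)
    dist_to_object = distance_transform_edt(~mask)      # measured on background pixels
    dist_to_background = distance_transform_edt(mask)   # measured on object pixels
    return dist_to_object - dist_to_background

mask = np.pad(np.ones((3, 3), dtype=bool), 2)  # a 3x3 square object in a 7x7 frame
phi = level_set_from_mask(mask)
print(phi[3, :])  # negative across the square, positive in the background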

Object Selection. Object selection approaches for video annotation tools may be categorized into fully automatic, semi-automatic and manual methods. Automatic selection approaches are based on machine learning (e.g., supervised learning as in Adaptive Boosting (Levin, Viola, & Freund, 2003)) and require the use of training data. This has the disadvantage of restricting the application to a specific domain for which prior data is available, and means that only the types of objects that appear in the provided data may be discovered. We consider these limitations excessive for our case. On the other hand, manual object selection is a very tedious task, which is also prone to errors. Thus, a tool that restricts itself to these methods cannot realistically be used to annotate large video collections. Therefore, we argue that a semi-automatic approach is best suited. The idea is that the annotator provides an initial region selection, and that an image segmentation algorithm uses it as an input to automatically compute an exact contour.

Image segmentation is a problem that has been extensively researched in the image processing community. Three major techniques have emerged in recent years: Mean-Shift clustering (Comaniciu, Meer, & Member, 2002), segmentation based on graph cuts, like the GrowCut algorithm (Vezhnevets, 2004), and active contour techniques, also known as snakes (Kass, Witkin, & Terzopoulos, 1988). Mean-Shift clustering is not suited to our purposes. Indeed, it is highly dependent on the number of regions for segmentation, frequently resulting in over- or under-segmentation compared to human perception of objects. In a graph cut approach, the user labels a number of pixels either as belonging to the object or to the background. Based on this input, the algorithm iteratively assigns labels to all other pixels of the image. In order to decide whether a pixel belongs to the object or the background, the method examines its similarities to neighbouring pixels that have already been labelled. The process is repeated until all pixels have been processed. An active contour approach starts from a first selection of the contour of the object provided by the user. The algorithm expands this contour line until it tightly encloses the intended contour. Contour evolution is governed by minimizing an energy functional.

Both graph cut and active contour approaches are appropriate for our requirements. Thus, we have conducted experiments in order to evaluate their performance in the context of our tool (see the section on the object selection approach), from which we concluded that an active contour approach is the best choice.

Object Tracking. Though an object selection functionality as described above facilitates the exact definition of a region of interest corresponding to an object in a frame, doing so in each frame in which an object appears is a very cumbersome task. In order to automate this process, a tracking approach can be used to re-detect the object in all subsequent frames based on an initial selection provided by the annotator. Many object tracking methods have been proposed in the literature (Yilmaz, Javed, & Shah, 2006). To select an appropriate approach, the following requirements must be considered:

1. Object representation: From the requirements regarding object representation detailed above, we can conclude that the object tracker should make use of a silhouette or contour for object representation.

2. User input: The initial object selection provided by the annotator can be considered as reliable. The tracker should be able to make use of this information as much as possible and require no further user input.

3. Generic algorithm: Salient objects may have various characteristics (different motion, non-rigid, complex shapes, etc.). The tracker should be generic enough to cope with all these types of objects. Moreover, the tracker should not introduce further constraints on the properties of the object (features, motion speed and degree of similarity of objects between frames).

4. No assumption of previous data: To keep the tool generic, the selected tracker cannot rely on any previous data other than the initial selection of the object region, such as training data in which objects have been identified.

Due to the first requirement, we restrict ourselves to silhouette trackers, excluding trackers that use other object representations. This category comprises shape-matching and contour evolution approaches. The former approaches try to iteratively match a representation of the object in each consecutive frame. They are not appropriate in our case because they cannot deal with non-rigid objects. The latter category of approaches comprises two sub-categories, based either on state space models or on direct minimization of an energy functional. State space models define a model of the object's state, containing shape and motion parameters of the contour. Tracking is achieved by updating this model so that the posterior probability of the contour is maximized. This probability depends on the model state in the current frame and on a likelihood describing the distance of the contour from observed edges. Direct minimization techniques implement tracking by trying to evolve an initial contour in each frame until a contour energy function is minimized. Methods in this category differ in their minimization method (greedy method or gradient descent) and their contour energy function, which is defined with respect to temporal information either by means of a temporal gradient (optical flow) or of appearance statistics computed from the object and the background.

Approaches based on state space models require training data, and thus are not appropriate. Many direct minimization approaches from the literature have to be excluded as well because they are not generic enough and require training data or additional user input. This led us to narrow our study to the tracking methods proposed by Yilmaz et al. (Yilmaz, Li, & Shah, 2004) and by Shi and Karl (Shi & Karl, 2005).

Yilmaz et al. evolve the contour using color and texture features within a band around the object's contour. Through this band, they aim to combine region-based and boundary-based contour tracking methods in a single approach. Objects are represented by level sets. Tracking of multiple objects is possible as well. A disadvantage of the algorithm is its explicit handling of occlusion. Even if an object is occluded by another object, its position is estimated by the tracker. This means that the tracker would calculate a contour for an object even if it is not visible, and hence this region would erroneously be annotated as an object region.

Similar to this approach, Shi and Karl propose a tracking method based on a novel implementation of the level set representation and the idea of region competition (Zhu & Yuille, 1996). They use color and texture information to model the object and background regions. Contour evolution is achieved by applying simple operations on the level set, like switching elements between lists. Switching decisions are obtained by estimating the likelihood that pixels around the zero level set belong to a particular region (region competition). The approach requires no training and uses a simple tracking model, which computes the contour of the object in the current frame based on the information from the last frame. It can be extended to track multiple objects.

We conclude this analysis by opting for Shi and Karl's method, which combines satisfactory tracking accuracy with sound performance and does not have problems with occlusion. The implementation of this approach in our tool is described in the section on the object tracking approach.

Video model and metadata format requirements

In order to achieve interoperable and machine understandable annotations, there is a need to formalize and clearly define the semantics of the annotation vocabulary. To organize this information, the tool must be based on a video annotation model. To enable fine-grained video annotation, this model must be expressive, especially in order to properly link the semantic descriptions and the structural elements of the video. Moreover, the model must be implemented in a metadata file format. In this regard, to make the tool interoperable, it is appropriate to use the multimedia content description standard MPEG-7 and a predefined controlled vocabulary, using for instance MPEG-7 classification schemes.

SVCAT

In this section, we describe the characteristics and the implementation of SVCAT (http://www.dimis.fim.uni-passau.de/MDPS/index.php/en/research/projects/SVCAT.html), which aims to fulfill the requirements described in the previous section. SVCAT extends another tool developed by our research group, VAnalyzer (Stegmaier, Doeller, Coquil, El-Khoury, & Kosch, 2010). The implementation uses the Java Media Framework (JMF) to access the video content and MPEG-7 description schemes to annotate the structure and semantics of videos.


Figure 2. The five-level hierarchical structure of the classical video model (left) and the object-based structural model (right)

We start the description of SVCAT with a formalization of the underlying video model and provide a definition of the associated semantics. Next, we present the tool's architecture and outline its functionalities and workflow. We then describe the video segmentation, shot annotation, object selection and object tracking processes of the tool. Finally, we present how structural and semantic object annotation information is represented using MPEG-7 descriptors.

Video Model

SVCAT is based on a video model that extends the typical video structure hierarchy comprising scenes, shots, and frames, with a region-based level. In particular, this enables the representation of low- and high-level annotation information regarding objects. The model is represented in Figure 2. The components of the model are defined in the next subsections.

Structural representation of video data. Region: A region inside a frame is a group of connected pixels satisfying a homogeneity condition; for SVCAT, this condition is that the connected pixels share similar texture and color intensity. A region is represented as a 2-tuple r = (boundary, area), such that boundary is the contour of the region and area is its surface in pixels.

Frame: A frame is the smallest temporal unit in the video structure. It is represented as a 2-tuple f = (frame_nb, R_f), where frame_nb corresponds to the position of the frame in the sequence of the video and R_f is the set of regions constituting the whole frame. We denote by F the set of all the frames of a given video.

Frame sequence: A frame sequence (FS) is a generic concept corresponding to any finite sequence of consecutive frames in the video. It is defined as FS = (FS_fs, FS_length), where FS_fs and FS_length are the first frame and the duration of the FS, respectively. Formally, it is defined as FS = (f_i, f_{i+1}, ..., f_{j-1}, f_j) s.t. i, j ∈ N and i < j. The length of an FS, denoted |FS|, refers to the number of frames in FS.

Shot: A shot is a frame sequence sharing a similar set of features. It is defined as sh = (sh_fs, sh_length), where sh_fs and sh_length are the first frame and the duration of the shot, respectively. The set of shots of a given video is denoted as SH = {sh_i | sh_i = (f_j, f_{j+1}, ..., f_{k-1}, f_k) and i, j, k ∈ N}.


The shots in SH cannot overlap; their union constitutes the whole video.

Figure 3. Object frame structure

Scene: A scene is a sequence of consecutive shots that are semantically related. It is represented as a 3-tuple scene = (scene_fs, scene_length, SH′), where scene_fs and scene_length are respectively the first frame and the duration of the scene, and SH′ is the set of shots existing in the scene. The set of scenes of a given video is denoted as SC = {sc_i | sc_i = (sh_j, ..., sh_k) and i, j, k ∈ N}. The elements of SC cannot overlap; their union constitutes the whole video.

Video: A video is a finite sequence of consecutive scenes. It is represented as a tuple v = (video_length, width, height, SC), where video_length is the duration of the video expressed in number of frames, width and height are respectively the width and height of the video frame in pixels, and SC is the set of all scenes in the video.
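To make the structural part of the model concrete, the following Python sketch transcribes the definitions above into plain data types; the field names mirror the tuples in the text and are illustrative only, not taken from the SVCAT source code.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Region:
    boundary: List[Tuple[int, int]]  # contour pixels of the region
    area: int                        # surface in pixels

@dataclass
class Frame:
    frame_nb: int                    # position of the frame in the video
    regions: List[Region]            # R_f

@dataclass
class Shot:
    first_frame: int                 # sh_fs
    length: int                      # sh_length, in frames

@dataclass
class Scene:
    first_frame: int                 # scene_fs
    length: int                      # scene_length
    shots: List[Shot]                # SH'

@dataclass
class Video:
    length: int                      # duration in frames
    width: int                       # frame width in pixels
    height: int                      # frame height in pixels
    scenes: List[Scene]              # SC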

Object-based semantic and structural representation of video data. Term: A term t is a description associated with an object representing a real-world entity (e.g., car). A term is defined as a 3-tuple t = (term_id, name, definition), where term_id, name and definition correspond respectively to the identifier, name, and definition of the term. A term can be related to another term with the Is-a relation (e.g., Audi Is-a car).

Classification Scheme: A classification scheme CS is a set of standard terms for a specific domain d. A classification scheme is defined as a 2-tuple CS_d = (T_d, Is-a), where T_d is the set of terms for domain d and Is-a is the relation between terms. For a given video, all terms used in the annotation must belong to a single CS.

Object: An object is a subset of regions R_o ⊂ R_f constituting a component that can be recognized without ambiguity as a real-world object by a human observer. It is defined as a 3-tuple o = (boundary_Ro, area_Ro, T_o), such that boundary_Ro and area_Ro are respectively the contour and surface of the merged regions R_o forming the object, and T_o is the set of terms associated with the object to describe its semantics. We denote by O the set of all objects of a given video.

Object frame: An object frame is a frame containing an object o. The frame is represented in the model as a 2-tuple f = (frame_nb, O_f), where frame_nb corresponds to the position of the frame in the sequence of the video and O_f ⊂ O is the set of objects contained in the frame.

Figure 3 illustrates the frame level and shows how the semantics of the salient object are embedded at this level.

Object frame sequence: An object frame sequence (denoted ofs_o) is a finite sequence of consecutive object frames in a shot. The boundaries of each object frame sequence coincide with the appearance and disappearance of the object within a shot. If an object o appears in consecutive shots (e.g., o_2), different ofs are distinguished. An object frame sequence is defined as a 3-tuple ofs_o = (ofs_fs, ofs_length, o), such that o ∈ O, ofs_fs is the first frame where the object o appears and ofs_length is the duration of its appearance. Formally, the object frame sequence is defined as: ofs_o = (f_i, f_{i+1}, ..., f_{j-1}, f_j) s.t. for i, j, k ∈ N, ∃ sh ∈ SH : ofs_o ⊂ sh, ∀k : i ≤ k ≤ j : o ⊂ f_k, and o ⊄ f_{i-1}, o ⊄ f_{j+1}.


Figure 4. Object-based video model structure

Shot object frame sequence: A shot object frame sequence is a finite set of ofs_o related to a specific object o in the shot sh. It is denoted as SOFS_{o,sh} = {ofs_{o,k} | ofs_{o,k} ⊂ sh and k ∈ N}. As depicted in Figure 4, SOFS_{o3,sh_{i+1}} is the set of object frame sequences {ofs_{o3,1}, ofs_{o3,2}} related to the object o_3 in the shot sh_{i+1}.

Scene object frame sequence: A scene object frame sequence, denoted ScOFS_{o,sc} = {SOFS_{o,sh_i} | ∀i ∈ N, sh_i ⊂ sc}, is the finite set of all the SOFS_{o,sh} related to a specific object o in the scene sc.

Video object frame sequence: A video object frame sequence, denoted VOFS_{o,v} = {SOFS_{o,sh_i} | ∀i ∈ N, sh_i ⊂ v}, is the finite set of all the SOFS_{o,sh} related to a specific object o in the video.

Priority: A priority is an attribute that is assigned to a shot during the annotation process. It quantitatively evaluates the "semantic importance" of the shot as a numerical value between 0 and 1. We define the function ρ : SH → [0, 1] such that ρ(sh) = 1 and ρ(sh) = 0 correspond to top priority and no priority, respectively. A shot with a high priority means that the semantic meaning of the video would be severely altered if this shot were deleted. An example would be a shot exposing an important development of the story in a narrative video, as opposed to a non-informative transition shot. By default, we consider that all the shots in a video are of high priority.

A scene is also associated with a priority value, which is inferred from the priority values of its shots: the priority value of the scene is calculated by applying a simple logical disjunction (OR) to the priority values of its shots.

Figure 5. Conceptual architecture of SVCAT
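Read over priorities in [0, 1], the disjunction can be interpreted as taking the maximum over the shots of the scene; the sketch below encodes that reading, which is one plausible interpretation rather than necessarily the exact rule implemented in SVCAT.

def scene_priority(shot_priorities):
    # Fuzzy OR over the shot priorities: the scene is as important as its most important shot.
    return max(shot_priorities) if shot_priorities else 0.0

print(scene_priority([0.0, 1.0, 0.0]))  # -> 1.0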

Architecture & Functionalities

In this section, we describe the main functionalities and workflow of SVCAT. As shown in Figure 5, SVCAT consists of two autonomous modules: semantic shot annotation and object annotation (El Khoury, Jergler, Coquil, & Kosch, 2012). These modules can work together or independently, depending on the situation. When working independently, the semantic shot annotation module decomposes the video into shots, detects keyframes and then automatically generates shot annotations on the basis of keyframe classification into ImageNet categories. The object annotation module, on the other hand, decomposes the video into shots and enables manual object selection, scene composition and semantic annotation. When both modules work together, SVCAT becomes a versatile tool where the semantic shot annotation module serves as a preprocessor for the object annotation module by generating shot-level annotations, which assist annotators in locating objects of interest easily.

This design choice makes SVCAT usable in two completely different situations, i.e., when the user knows the set of objects to annotate and when he does not. When the objects to be annotated are known in advance, the semantic shot annotation module can be customized to detect such concepts and provide an important input, in the form of "concept X appears in segments S_1, S_2, ..., S_n", to the object annotation module, thus helping the annotator to focus on certain areas of the video rather than searching the entire video. When the objects to be annotated are not known ahead of time, it just temporally decomposes the video and gives video summary information in the form of keyframes. This can be used by the annotator as a starting point for further analysis. The shot annotation module generates an MPEG-7 standard compliant description, as depicted in Listing 1, where the FreeTextAnnotation element indicates the detected concepts (line 6). The annotations with their respective frame numbers are shown to the user on the SVCAT GUI, and the finer-level annotation process starts with the analysis and description of the temporal structure. This process is followed by the localization and annotation of the objects of interest in a frame, and their propagation to the consecutive frames that enclose them. In the following subsections, we describe each module in detail and explain our design choices. Since the novelty of SVCAT lies in its semantic object annotation, we particularly focus on this aspect.

Listing 1: Semantic shot annotation
1  <MediaSourceDecomposition criteria="modalities">
2    <VideoSegment id="VSID_1">
3      <TemporalDecomposition criteria="visual shots">
4        <VideoSegment id="SHID_1" xsi:type="ShotType">
5          <TextAnnotation type="content">
6            <FreeTextAnnotation>man dog river boat</FreeTextAnnotation>
7          </TextAnnotation>
8          <MediaTime>
9            <MediaTimePoint>T00:00:00:0F30000</MediaTimePoint>
10           <MediaDuration>P0DT0H0M11S8338N30000F</MediaDuration>
11         </MediaTime>
12         <!-- ... -->
13
14   <!-- ... -->
15 </MediaSourceDecomposition>

Temporal decomposition

For the semantic shot annotation module, MPEG-7 scalable color descriptors and MPEG-7 edge histograms are used to detect shots. Motion attention information within the shots is used to detect keyframes. The object annotation module benefits from VAnalyzer, which performs an automatic detection of the shot boundaries based on a Canny edge detector and motion compensation. In both cases, an overview of the detection result is displayed to the annotator, enabling him/her to refine the result of the detection by splitting and merging shots. Based on the shot detection, the annotator constructs the scenes by manually grouping shots that share similar semantic concepts. Once the shot/scene segmentation is validated, an MPEG-7 video metadata description is generated. In addition, the temporal structure of the video is displayed such that the top part enables the annotator to navigate through the scenes, while their shot organization is displayed in the bottom part. Moreover, this interface allows the annotator to assign priorities and free-text annotations to the shots.

Listing 2: Annotation of temporal structure
1  <ns1:TemporalDecomposition>
2    <ns1:VideoSegment id="Scene_1">
3      <ns1:SemanticRef href="urn_priorityCS_No"/>
4      <ns1:MediaTime>
5        <ns1:MediaTimePoint>2012-01-29T00:00:00:000F1000</ns1:MediaTimePoint>
6        <ns1:MediaDuration>PT00H00M00S01N800F</ns1:MediaDuration>
7      </ns1:MediaTime>
8      <ns1:TemporalDecomposition>
9        <ns1:VideoSegment id="Shot_0">
10         <ns1:SemanticRef href="urn_priorityCS_No"/>
11         <ns1:MediaTime>...</ns1:MediaTime>
12       </ns1:VideoSegment>
13       <!-- further shots of Scene_1 ... -->
14     </ns1:TemporalDecomposition>
15     <!-- further scenes ... -->
16   </ns1:VideoSegment>
17 </ns1:TemporalDecomposition>

Listing 2 shows an excerpt of the generated MPEG-7 description. Scenes and shots are represented by VideoSegments, which are hierarchically structured by nested TemporalDecompositions. The outer one represents the decomposition of the video into scenes (line 1) and each of the inner ones represents the decomposition of a scene into its shots (line 9). The semantic annotation is realized by references to the CS (lines 3 and 10).


Automatic shot annotation

The automatic shot annotation is a flexible process which can be adapted to application contexts. This implies that an appropriate choice between the setups discussed in the requirements section is based on the case at hand, and a concrete discussion is best attached to an application scenario. Therefore, let us consider a user-centric adaptation application scenario where users do not want to see soda cans or bottles in the videos delivered to them. To meet this requirement, the adaptation system needs an appropriate annotation of the soda cans or bottles in the video. This scenario requires the full functionality of the SVCAT system. First we have to be able to detect instances of soda bottles/cans in the shots, and then localize them spatially to get detailed information about their size and shape, which is needed to execute adaptation operations.

To do this, we have to train our concept detectors (classifiers) with positive and negative images of soda. For that, we use ImageNet as a visual lexicon. ImageNet is a relatively new yet popular image dataset that contains images collected from the web and organized based on the WordNet lexical database (Deng, Berg, Li, & Fei-Fei, 2010). Currently, it contains over 14,192,122 images organized into 21,841 categories via a human-based verification process. There are over 500 images per category on average. We chose ImageNet because of its richness in concept categories. The large number of categories facilitates the customization of our framework to a wide set of application requirements.

Dense SIFT features are extracted from the training images as well as from the keyframes. With dense SIFT, descriptors are extracted on dense regular grids instead of at sparse interest points. We have chosen dense SIFT over the standard SIFT because it has shown better performance in classification-related tasks (Linderberg, 2013). Descriptors are extracted using the VLFeat toolbox version 0.9.16 (http://www.vlfeat.org/download.html) after all images are resized to a maximum of 500 × 500 pixels.

The parameters for codebook generation and encoding are not fixed. They are left open to be adjusted by the system user depending on the application requirements. Before applying the system in a certain environment, users can experiment with different parameter values, as demonstrated in the experimental evaluation section for the above-mentioned application scenario, and set whichever suits them.

The core operation in the automatic shot annotation pipeline is the classification task. For now, a linear SVM is used and a one-versus-all (OVA) classification scheme is implemented. Given an M-class classification problem with N training samples (x_1, y_1), ..., (x_N, y_N), where x_i ∈ R^m is an m-dimensional feature vector and y_i ∈ {1, 2, ..., M} is the corresponding class label, the one-versus-all approach constructs M binary SVM classifiers, each of which separates one class from all the rest. The i-th SVM is trained with all the training examples of the i-th class as positive labels and all the others as negative labels; that is, the j-th sample is relabeled as L_j = +1 if y_j = i, and L_j = -1 otherwise. Since OVA classification is applied, each keyframe is assigned only one label.
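The construction can be sketched with scikit-learn's linear SVM: one binary classifier per concept, trained on relabeled data, with prediction by the largest decision value. This is an illustrative sketch of the one-versus-all scheme described above, not the SVCAT code.

import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, n_classes, C=1.0):
    # Train M binary SVMs; the i-th one sees class i as +1 and every other class as -1.
    models = []
    for i in range(n_classes):
        labels = np.where(y == i, 1, -1)
        models.append(LinearSVC(C=C).fit(X, labels))
    return models

def predict_one_vs_all(models, X):
    # Assign each keyframe the single label whose SVM returns the largest decision value.
    scores = np.column_stack([m.decision_function(X) for m in models])
    return scores.argmax(axis=1)

rng = np.random.default_rng(0)
X, y = rng.random((60, 16)), rng.integers(0, 3, 60)
print(predict_one_vs_all(train_one_vs_all(X, y, 3), X)[:5])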

To refine the shot annotations resulting from the classification, we use a modified version of the strategy proposed by the authors of (Zhong & Miao, 2012). The temporal refinement term is calculated over a window of 10 shots and the spatial refinement is done using WordNet and DBpedia based voting.

Object selection approach

As stated in the requirements section, two types of approaches are good candidates for SVCAT's object selection function, namely graph cuts and active contours. To select one for implementation in SVCAT, we chose representative algorithms of each approach, implemented them and compared them with respect to segmentation accuracy and performance. For graph cut based approaches, we chose the GrowCut algorithm (Vezhnevets, 2004) and for contour evolution techniques we opted for a level set based snake implementation (Lankton, 2009) using the Chan/Vese energy (Chan & Vese, 2001), expressed below in equation 3.

E = ∫_interior (I − μ_1)² dx + ∫_exterior (I − μ_2)² dx     (3)
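To give an idea of how such an energy is minimized in practice, the sketch below refines a rough user selection with scikit-image's morphological Chan-Vese variant, where μ_1 and μ_2 are the mean intensities inside and outside the evolving contour. This is a stand-in for the level set snake implementation cited above, not the code used in SVCAT, and the iteration count is an arbitrary choice.

import numpy as np
from skimage.segmentation import morphological_chan_vese

def refine_selection(gray, rough_mask, iterations=100):
    # Evolve the annotator's rough binary selection towards a contour that
    # (approximately) minimizes the Chan/Vese energy of equation (3).
    return morphological_chan_vese(gray.astype(float), iterations,
                                   init_level_set=rough_mask.astype(np.int8))

# Toy example: a bright square on a dark background, initialized with a loose box.
img = np.zeros((64, 64)); img[20:40, 20:40] = 1.0
init = np.zeros((64, 64), dtype=bool); init[15:45, 15:45] = True
print(refine_selection(img, init).sum())  # roughly the 400 pixels of the square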

The experimental set-up for comparing the two approachesconsists of four classes of five images each. As depicted inTable 1, the classes represent different uniformity combina-tions (i.e. heterogeneous vs. heterogeneous) with respectto the color and texture characteristics of object and back-ground. To quantitatively evaluate the segmentation accu-

Object/Background Homogeneous HeterogeneousHomogeneous Class 1 Class 3Heterogeneous Class 2 Class 4

Table 1The experimental classes

racy of both approaches, we compare the segmented imageagainst the manually-segmented reference image (often re-ferred to as ground truth), which we represented as binarymasks. These masks enable the computation of the precisionand the recall measures at pixel-level accuracy. As shownin Figure 6 (a-b), the segmentation results of the Snake al-gorithm are slightly better than the one of GrowCut. Withrespect to the performance evaluation, we calculate for eachclass the average of the segmentation time in milliseconds.As illustrated in Figure 6 (c), the Snake algorithm outper-forms GrowCut. Moreover, we proved that the segmentationtime required using Snake is independent from the image


Figure 6. Quantitative comparison of the object selection: Snake against GrowCut. (a) Precision comparison; (b) Recall comparison; (c) Runtime comparison (in ms); (d) Runtime with image scaling comparison.

Indeed, we evaluated the runtime using the images of Class 1 with the image scaling increasing from 25% to 200% in steps of 25%. The resulting curve can be explained by the fact that the snake only performs calculations along the contour line, while GrowCut analyses every pixel of the image. Based on these experimental results, we decided to integrate the Snake approach into SVCAT.

Object tracking approach

Regarding the object tracking approach, we have implemented the method proposed by Shi and Karl (Shi & Karl, 2005), as discussed above. The algorithm assumes that each scene of the video is composed of a background region Ω0 and an object region Ω1. The contour of Ω1 is denoted as C1. Each of the two regions is modeled with a feature distribution p(v | Ωx), where v is the feature vector defined at each pixel. In our implementation, we use the HSV color space and a pixel-level texture descriptor (Ahmed, Karmakar, & Dooley, 2006). Assuming that the feature distributions of the pixels are independent, the tracking can be regarded as the minimization of the following region competition energy (equation 4),

E = − Σ_{i=0}^{1} ∫_{Ωi} log p(v(x) | Ωi) dx + λ Σ_{i=0}^{1} ∫_{Ci} ds    (4)

where the first term is the data term Ed and the second the smoothness term Es,

which results in the following speed functions (equations 5 and 6),

Fd = log [ p(v(x) | Ω1) / p(v(x) | Ω0) ]    (5)

Fs = λκ    (6)

Fd represents the competition between the two regions, while Fs, with κ denoting the curvature of the contour, smoothly regularizes it.
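As an illustration, the sketch below computes the data speed Fd of equation 5 for a single pixel from normalized feature histograms of the two regions. Representing p(v | Ωi) as a histogram over quantized feature bins is a simplifying assumption; SVCAT's tracker combines HSV colour with a texture descriptor as described above.

public class RegionCompetition {

    /** Log-likelihood ratio of a pixel's (quantized) feature under the object and background models. */
    public static double dataSpeed(int featureBin, double[] objectHist, double[] backgroundHist) {
        double eps = 1e-6;                              // avoid log(0) for empty histogram bins
        double p1 = objectHist[featureBin] + eps;       // p(v | Omega_1)
        double p0 = backgroundHist[featureBin] + eps;   // p(v | Omega_0)
        return Math.log(p1 / p0);                       // F_d > 0 pulls the contour to include the pixel in Omega_1
    }
}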

A nice feature of this algorithm, in the context of its integration into SVCAT, is the fact that it also uses level sets to represent the contour. Thus, it is easy to transform the contour output of the Snake algorithm into the representation that is necessary for the tracker.
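One way to perform this transformation, sketched below under the assumption that the Snake output is available as a binary mask, is to initialize the level set function as a signed distance to the contour (negative inside the object). A brute-force distance computation is used here for clarity; a real implementation would use a fast distance transform.

import java.util.ArrayList;
import java.util.List;

public class LevelSetInit {

    /** Builds a signed distance level set from a binary object mask (true = object pixel). */
    public static double[][] signedDistance(boolean[][] mask) {
        int h = mask.length, w = mask[0].length;
        List<int[]> contour = new ArrayList<int[]>();
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (mask[y][x] && isBoundary(mask, x, y)) contour.add(new int[] { x, y });
        double[][] phi = new double[h][w];
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++) {
                double min = Double.MAX_VALUE;
                for (int[] c : contour) {
                    double dx = x - c[0], dy = y - c[1];
                    min = Math.min(min, Math.sqrt(dx * dx + dy * dy));
                }
                phi[y][x] = mask[y][x] ? -min : min;  // negative inside, positive outside
            }
        return phi;
    }

    /** An object pixel lying on the image border or next to a 4-connected background pixel. */
    private static boolean isBoundary(boolean[][] mask, int x, int y) {
        int h = mask.length, w = mask[0].length;
        return x == 0 || y == 0 || x == w - 1 || y == h - 1
                || !mask[y][x - 1] || !mask[y][x + 1] || !mask[y - 1][x] || !mask[y + 1][x];
    }
}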

Annotation at the object level

In this section, we describe the representation of the object metadata, which is related to its high-level semantics (i.e., a descriptive term derived from a CS) and its spatio-temporal segmentation information (i.e., the exact object position in each frame in which it appears). In our approach, we decouple descriptive metadata from structural metadata in order to achieve a less verbose annotation. Thus, the object description is split into a static and a dynamic part.

The static part corresponds to a concrete instance of an object along with its semantics. This annotation is created when the user selects an object and attaches a descriptive term to it. The object is annotated using a MovingRegion descriptor and linked to a descriptive term of the CS.

The dynamic part represents the information related to the spatial segmentation (i.e., the contour of the object and its size in a frame) and the temporal segmentation (i.e., its appearance with respect to the scene and shot structure). To represent the spatial information at pixel accuracy, the usual MPEG-7 descriptors are not expressive enough. For instance, the MPEG-7 RegionLocator only allows the annotation of simple geometric shapes (at most polygons). A more expressive possibility would be the SpatioTemporalLocator in combination with a FigureTrajectory, but it lacks precision as well. Indeed, this representation of regions is based on parametric curves along with interpolation functions. These functions are expensive to calculate and the resulting curve only provides an approximation of the accurate contour. Thus, we opted for gathering the position information in a separate XML document according to the schema depicted in Listing 3.

Listing 3: XML schema for position information

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           elementFormDefault="qualified">
  <xs:element name="ObjectInformation">
    <xs:complexType>
      <xs:sequence>
        <xs:element maxOccurs="unbounded" ref="TimeStamp"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
  <xs:element name="TimeStamp">
    <xs:complexType>
      <xs:sequence>
        <xs:element ref="FrameNumber"/>
        <xs:element ref="ObjectSize"/>
        <xs:element ref="ObjectContour"/>
      </xs:sequence>
      <xs:attribute name="time" use="required" type="xs:integer"/>
    </xs:complexType>
  </xs:element>
  <xs:element name="FrameNumber" type="xs:integer"/>
  <xs:element name="ObjectSize" type="xs:integer"/>
  <xs:element name="ObjectContour" type="xs:string"/>
</xs:schema>

The root element, ObjectInformation, consists of an unbounded number of TimeStamp elements, where each one is a sequence of three elements, FrameNumber, ObjectSize and ObjectContour, together with an attribute time. The time attribute provides a temporal description, i.e., the position in time of the frame enclosing the object. In addition to the media time, we store the frame sequence number, the object size (i.e., the number of pixels that form the object in the frame) and, of course, the ObjectContour. The latter is obtained by transforming the frame containing the tracked object into its binary mask, with 0 and 1 values representing object pixels and background pixels, respectively. For reasons of size and performance, we use run-length encoding to encode the result and store it as a string. The run-length encoding scans the frame row by row, from left to right and from top to bottom. Note that the original mask can easily be re-established based on the resolution of the video.
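The sketch below illustrates such a run-length encoding of a frame's binary mask. The exact string format ("value:count" pairs separated by spaces) is an assumption made for illustration; only the scan order follows the description above.

public class ContourRle {

    /** Encodes a binary mask (values 0 or 1) row by row, left to right, as run-length pairs. */
    public static String encode(int[][] mask) {
        StringBuilder sb = new StringBuilder();
        int current = mask[0][0];
        int run = 0;
        for (int y = 0; y < mask.length; y++)
            for (int x = 0; x < mask[y].length; x++) {
                if (mask[y][x] == current) {
                    run++;
                } else {
                    sb.append(current).append(':').append(run).append(' ');
                    current = mask[y][x];
                    run = 1;
                }
            }
        sb.append(current).append(':').append(run);  // flush the last run
        return sb.toString();
    }
}

Decoding simply replays the runs in the same scan order, using the known video resolution to restore the two-dimensional mask.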

The dynamic part also describes the sub-structure of a shot regarding the appearance of objects. According to the model described earlier, an object can appear and disappear several times within a shot. We denote by SOFS the set of segments enclosing an object in a shot contained in a particular scene. As already mentioned, we have made this design choice in order to favor a flexible usage of the metadata later on. An example of a dynamic object description is depicted in Listing 4. The excerpt consists of a SpatioTemporalDecomposition, which is embedded in the VideoSegment of the corresponding shot. Each sub-VideoSegment corresponds to an SOFS, which is identified by the id attribute. It references both the .xml document that holds the position information (lines 3-7) and the static description part using the MovingRegionRef descriptor (line 18). In addition, the dynamic description contains a TemporalMask (lines 8-15), which describes the exact time interval in which the object occurs within the shot.

Listing 4: Dynamic object description

1  <ns1:SpatioTemporalDecomposition>
2    <ns1:VideoSegment id="SOFS_ID_0_1.2-(Swiss Flag)">
3      <ns1:MediaLocator>
4        <ns1:MediaUri>
5          SvcatMR1328277Swiss FlagID_0.xml
6        </ns1:MediaUri>
7      </ns1:MediaLocator>
8      <ns1:TemporalMask>
9        <ns1:SubInterval>
10         <ns1:MediaTimePoint>
11           2012-02-03T00:00:00:000F1000
12         </ns1:MediaTimePoint>
13         <ns1:MediaDuration>PT00H00M00S00N320F</ns1:MediaDuration>
14       </ns1:SubInterval>
15     </ns1:TemporalMask>
16     <!-- Reference to static, semantic description -->
17     <ns1:SpatioTemporalDecomposition>
18       <ns1:MovingRegionRef href="SvcatMR1328277Swiss Flag"/>
19     </ns1:SpatioTemporalDecomposition>
20   </ns1:VideoSegment>
21 </ns1:SpatioTemporalDecomposition>

EXPERIMENTAL EVALUATION

In this section, we present the results of the several evaluations we performed, dividing the discussion into two parts for clarity. First, we present the evaluations related to the semantic shot annotation process, followed by the evaluation of the object annotation task. The experiments were run on a 2.6 GHz quad core machine with 12 GB of RAM.

Evaluation of the semantic shot annotation task

As mentioned in the preceding section, the different parameters affecting the concept detection are tuned by the user of the system. What we present here is the result of such a test, performed considering the previously mentioned object level adaptation scenario. We selected 20 categories in ImageNet under the food, nutrient category, as these categories contain several images of soda cans and bottles. On average, each category had about 900 images, of which 50% were used for training and validation and the remaining 50% for testing.

We tried out vector quantization encoding (VQ), Fisher kernel encoding (FK) and locality-constrained linear encoding (LLC). K-means clustering is used to generate the codebook for VQ and LLC encoding, whereas GMM clustering with 256 Gaussian components (Perronnin, Sanchez, & Mensink, 2010) is used for FK encoding. K-means codebook generation was achieved by applying an approximate nearest neighbor search on a randomized KD-tree (Muja & Lowe, 2009) constructed from a set of 10^6 randomly selected training descriptors. The Fisher kernel method encodes the image by generating a high-dimensional vector which contains the average first and second order differences between the image descriptors and the GMM centers. The resulting Fisher vector has a dimension of 2KD, where D is the dimensionality of the descriptor and K is the number of Gaussians used. To reduce the storage requirement of these vectors, the dimensionality of the Dense SIFT descriptors is reduced to 64 via PCA. For LLC, the five nearest codewords are used to encode each descriptor. Codebooks of size 1024, 2048, 4096 and 8192 were used. The consequence of these setups is evaluated with respect to how they affect the classification accuracy, given as mean average precision (MAP).
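To make the encoding step concrete, the following sketch shows the simplest of the three variants, VQ encoding by hard assignment of each Dense SIFT descriptor to its nearest codeword; FK and LLC are more involved and are not shown, and the normalization choice here is illustrative.

public class VqEncoder {

    /** Encodes an image's descriptors as a normalized histogram over the codebook. */
    public static double[] encode(double[][] descriptors, double[][] codebook) {
        double[] histogram = new double[codebook.length];
        for (double[] d : descriptors) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int k = 0; k < codebook.length; k++) {
                double dist = 0;
                for (int i = 0; i < d.length; i++) {
                    double diff = d[i] - codebook[k][i];
                    dist += diff * diff;           // squared Euclidean distance to codeword k
                }
                if (dist < bestDist) { bestDist = dist; best = k; }
            }
            histogram[best] += 1.0;                // hard assignment to the nearest codeword
        }
        for (int k = 0; k < histogram.length; k++)
            histogram[k] /= Math.max(1, descriptors.length);  // normalize by descriptor count
        return histogram;
    }
}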

Table 2 shows the MAP values obtained for four codebook sizes and two different image encoding techniques. For FK encoding, a codebook of length 256 was used based on the recommendation in (Perronnin et al., 2010), and a MAP of 58.68 was obtained.

Table 2
MAP values for different encoding techniques and codebook sizes

Codebook length | 1024  | 2048  | 4096  | 8192
VQ              | 41.23 | 42.5  | 43.21 | 46.41
LLC             | 48.69 | 50.12 | 51.36 | 54.39
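For reference, the sketch below shows one common way to compute the reported MAP: per-class average precision over a ranked result list, averaged over all classes. It is not necessarily the exact evaluation procedure used here.

public class MeanAveragePrecision {

    /** relevant[i] tells whether the i-th ranked test image belongs to the class. */
    public static double averagePrecision(boolean[] relevant) {
        double sum = 0;
        int hits = 0;
        for (int i = 0; i < relevant.length; i++)
            if (relevant[i]) {
                hits++;
                sum += (double) hits / (i + 1);   // precision at this relevant rank
            }
        return hits > 0 ? sum / hits : 0;
    }

    /** Mean of the per-class average precisions. */
    public static double map(boolean[][] rankedRelevancePerClass) {
        double sum = 0;
        for (boolean[] r : rankedRelevancePerClass) sum += averagePrecision(r);
        return sum / rankedRelevancePerClass.length;
    }
}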

Regarding the computational time, FK based classification took 2-3 times longer than the VQ based approach, whereas LLC required far more time, almost seven times that of the VQ based approach. Figure 7 shows the computational time requirement of each of these techniques for comparison.


Figure 7. Time requirement of different codebook sizes

Looking closely at the MAP values in Table 2, one can clearly see that the computational requirement of LLC encoding does not bring a proportional increase in accuracy. Hence, we used FK based encoding for the shot annotation task.

Evaluation of the object annotation task

In this section, we present the results of the evaluation of the accuracy and performance of the object annotation task in SVCAT. All object processing algorithms presented in this paper were developed in Java version 1.6 and run on Windows XP as the operating system.

The data set comprises four videos, each one representing a different class (see Table 1). All videos are in DivX format, stored as AVI, with a resolution of 320 × 240 pixels and a frame rate of 25 fps. For each frame holding the object, we manually segmented it and generated the binary mask of the foreground. As this procedure is time-consuming, we only segmented the object in 45 frames.

To begin with, we studied the accuracy of the contour tracking algorithm for deformable objects captured by a moving camera. As the tracking result of the objects can be stored in a binary mask, we used the same evaluation methodology as for the image segmentation above. For each frame, we compared the segmentation result of the tracker against the manually segmented reference frame and computed precision and recall at pixel-level accuracy. Besides accuracy, we also evaluated the runtime performance of the tracking algorithm. For each of the video sequences, we launched three iterations of the contour object tracking and measured the average runtime in milliseconds per frame.

Comparative results of the accuracy evaluation are illustrated in Figure 8. It can be observed from the precision-recall curves that the tracking algorithm returns more relevant results than irrelevant ones, although not all relevant pixels are returned. For instance, the precision values lie within a range of 90% to 100%, while the recall values reach an average of approximately 80%. This is due to the contour evolution process according to the calculated energy, as described above. Indeed, the texture description and the feature representation of a pixel within a particular frame (n) rely on the luminance characteristics of its neighborhood. Thus, the texture descriptor for object pixels in areas close to the contour line might also incorporate background pixels. This can result in an imprecise description of such pixels, yielding a slightly distorted feature distribution for the object region. Due to this distribution, the contour evolution can sometimes erroneously regard object pixels in the consecutive frame (n + 1) as background pixels.

Figure 8. Accuracy evaluation of the tracking algorithm: per-frame precision and recall for (a) Video-Class 1, (b) Video-Class 2, (c) Video-Class 3 and (d) Video-Class 4.

An additional conclusion drawn from these curves is that tracking videos of Class 1 (Figure 8 (a)), which consist of a homogeneous object and background, obtains better results than tracking videos with heterogeneous regions (Figure 8 (b-c-d)). Due to the heterogeneity (e.g., different colors, various textures), the feature representation of both object and background regions is not as distinctive as under homogeneous conditions (e.g., a single color hue, smooth texture). As a result, the values in the histogram (i.e., the feature distribution) are scattered across a larger range. As a consequence, the failure rate increases with the contour evolution from one frame to another, since the pixels' region membership in the consecutive frame is estimated based on this feature distribution. With respect to the runtime evaluation, the results are depicted in Figure 9. By examining the performance curves of each video class, we easily observe the tremendous differences between their tracking times, although all videos have the same resolution.


For instance, the average runtime per frame for Video-Class 1, Video-Class 2, Video-Class 3 and Video-Class 4 is about 1.118, 2.857, 0.233 and 8.852 seconds, respectively. This can be attributed to the calculation of the texture feature during tracking. In fact, in order to obtain a good tracking accuracy, we used different neighborhood radii of 3, 6, 1 and 7 for Video-Class 1 to Video-Class 4, respectively, in our evaluation; SVCAT enables the adjustment of the neighborhood radius of the pixels that should be relevant for the texture description of a particular pixel. Hence, it became apparent that the tracking performance massively depends on the number of pixels that contribute to the texture description. Although the tracking runtime deteriorates with heterogeneous regions, we argue that the experimental results are quite acceptable for the purpose of SVCAT. Indeed, SVCAT aims to automatically provide object localization at pixel-level accuracy in each frame. To achieve such a strong requirement on precision, we consider relatively long computation times reasonable.

Figure 9. Performance evaluation of the tracking algorithm: per-frame runtime in milliseconds for (a) Video-Class 1, (b) Video-Class 2, (c) Video-Class 3 and (d) Video-Class 4.

Related Work

In the literature, several annotation tools have been developed to describe video content, whose metadata are exploited in the context of video indexing, querying, retrieval, etc. An overview and comparison of these tools can be found in the survey by Dasiopoulou et al. (Dasiopoulou et al., 2011). However, the available tools are not capable of providing rich and accurate metadata for an object-based application scenario. A positioning of SVCAT among these tools is given in Table 3. The comparison is done with respect to the criteria discussed earlier: 1) the metadata format, 2) the accuracy and granularity level of the annotation and 3) the degree of automation.

As depicted in the table, most tools use self-defined XML formats for their output descriptions, thus complicating the integration of the produced metadata in different contexts. The Semantic Web and the MPEG-7 standard are used rather sparsely. Only SVAS (Johanneum Research, 2008) and VideoAnnEx (Smith & Lugeon, 2000) follow the MPEG-7 standard, providing good exchangeability and compatibility of the produced metadata.

Regarding the video segmentation, very few tools can automatically identify temporal segments and provide them with a semantic description (e.g., frame start, length). Only Advene (LIRIS laboratory (UMR 5205 CNRS), 2008), SVAS and VideoAnnEx can perform automatic shot detection. With regard to scene segmentation, none of the tools provides automatic scene detection; indeed, no approach has so far proven to work reliably in a non-restricted domain.

With respect to automatic concept detection in shots, no existing tool is as easily customizable as SVCAT. In addition, none of the existing tools combines temporal concept detection with spatial object localization to guide users in localizing objects of interest with minimal effort. In fact, SVCAT is the only annotation tool that integrates research in concept detection with object detection and tracking to facilitate automatic video annotation while alleviating the expensive manual involvement.

Concerning the spatial localization of objects, the situation is quite similar. Only VideoAnnEx, VIA (Informatics and Telematics Institute (CERTH-ITI), 2009) and SVAS allow the selection and annotation of objects within the video. Nevertheless, the propagation of the object description and its spatial properties over consecutive frames still requires human intervention and is deemed to be manual. For instance, the propagation of the object description is done either by dragging it while the video is playing (VIA), or by copying it with one mouse click to detected similar regions in the consecutive frames (SVAS). Moreover, none of these tools supports an automatic propagation of the object's spatial properties such that its contour/boundary is accurately tracked in the video while the description of its shape and size is generated in parallel. VIA and VideoAnnEx represent objects using bounding rectangles. SVAS provides a slightly higher degree of precision and uses bounding polygons to represent the object contour.

To position SVCAT among these tools, we analyze it with regard to the aforementioned requirements, which are of utmost importance for an object-based application. The positioning is presented in the last row of Table 3. Compared to the presented tools, SVCAT provides significant advantages with respect to interoperability, accuracy of the object representation and degree of automation.


Table 3
Positioning of SVCAT among current video annotation tools.

Tool | Metadata format | Automatic concept detection | Automatic shot detection | Automatic scene detection | Object annotation | Selection accuracy | Propagation of description
VIA (Informatics and Telematics Institute (CERTH-ITI), 2009) | XML | no | no | no | yes | rectangular bounding box | manual
Ontolog (Heggland, 2006) | RDF | no | no | no | no | - | -
VideoAnnEx (Smith & Lugeon, 2000) | MPEG-7 | no | yes | no | yes | rectangular bounding box | manual
Advene (LIRIS laboratory (UMR 5205 CNRS), 2008) | custom XML | no | yes | no | no | - | -
Elan (Lausberg & Sloetjes, 2009) | custom XML | no | no | no | no | - | -
Anvil (Kipp, 2001) | custom XML | no | no | no | no | - | -
SVAS (Johanneum Research, 2008) | MPEG-7 | no | yes | no | yes | polygon region | manual
SVCAT | MPEG-7 | yes | yes | no | yes | exact region | automatic

Discussion and future work

In this paper, we proposed the extended Semantic Video Content Annotation Tool (SVCAT), which targets the creation of structural and semantic video metadata. SVCAT makes use of the MPEG-7 description tools, providing standardized annotations at different granularities, starting from the entire video, passing through the temporal segments (shots, scenes) and frames, down to regions and moving regions within the frame sequences in which they appear. In particular, it achieves a semi-automatic annotation at the object level. To this end, it first performs automatic shot level concept detection and semantic shot annotation. This annotation facilitates the selection of an object of interest by the user. Our system is capable of taking a rough user selection and performing an automatic exact selection of the object contour, which it then automatically propagates to the other frames in which the object appears, using a contour evolution tracking algorithm. By automating the processes of concept detection and object contour propagation, we extensively reduce the manual annotation time required for videos.

Moreover, SVCAT is a domain independent and easily customizable annotation tool which does not depend on detailed frame segmentation to generate object level annotations. It enables the semantic annotation of objects using a controlled vocabulary held in an MPEG-7 Classification Scheme (CS). In addition to the default CS provided with SVCAT, the tool also allows the annotator to explicitly supply his/her own CS. For the localization description, we have defined our own schema describing the object's size in pixels, its mask, as well as the number of the frame in which it appears. Furthermore, SVCAT has the functionality to export MPEG-7 metadata descriptions and validate them against the MPEG-7 schema. By using the MPEG-7 description tools at both the semantic and the syntactic level, we alleviate the problem of interoperability.


Also, we argue that semi-automatic annotation based on a CS is a promising approach for addressing the problems of subjectivity and incompleteness in manual annotation, as well as the limited semantic expressiveness of automatic annotation. SVCAT provides functionalities for frame-accurate keyframe navigation through the temporal structure of the video (i.e., scenes, shots) via a user-friendly interface using a timeline. Existing MPEG-7 descriptions can also be imported, enabling a more comfortable, gradual annotation process. These functions lighten the task of the annotator in associating the metadata with the video structure and in updating it.

Finally, we proposed a new video data model, which captures the low-level feature, high-level concept and structural information generated by SVCAT. The model extends the typical structure hierarchy to the region level and uses conceptual knowledge (i.e., keywords derived from a CS) to represent the objects with their spatial and temporal properties. We argue that a video data model that is expressive along both the semantic and the structural dimension helps to bridge the semantic gap. In order to justify our design choices for each part of the prototype, we conducted an analytical and experimental evaluation of the existing approaches. We then performed a global evaluation of SVCAT, demonstrating the accuracy of its metadata and that the tool provides a reliable input to object-based applications. Indeed, the experimental results showed that the precision values vary between 90% and 100% depending on the texture of the object versus the background, while recall values of approximately 80% are achieved on average.

Regarding future work, SVCAT can be improved towards being a completely interoperable annotation tool by supporting ontologies for the semantic metadata in addition to Classification Schemes. Furthermore, the object detection process in SVCAT can be fully automated for domain-specific applications. This requires a training set that covers many variations of the object's appearance. Such a training set could be created by extending SVCAT so that it can learn from existing annotations. This would enable the automatic detection of the spatio-temporal location of objects in new videos using object recognition techniques. Clearly, these improvements would drastically reduce the effort required from the annotator.

References

Ahmed, R., Karmakar, G. C., & Dooley, L. S. (2006).Region-based shape incorporation for probabilisticspatio-temporal video object segmentation. In Icip(pp. 2445–2448).

Bruyne, S., Hosten, P., Concolato, C., Asbach, M., Cock, J., Unger, M., . . . Walle, R. (2011). Annotation based personalized adaptation and presentation of videos for mobile applications. Multimedia Tools and Applications, 55, 307–331. doi:10.1007/s11042-010-0575-2

Caruana, R., Karampatziakis, N., & Yessenalina, A. (2008).An empirical evaluation of supervised learning in highdimensions. In Icml (pp. 96–103).

Chan, T. F. & Vese, L. A. (2001, February). An Active Con-tour Model without Edges. Image Processing, IEEETransactions on, 10(2), 266–277. doi:10 . 1109 / 83 .902291

Chatfield, K., Lempitsky, V., Vedaldi, A., & Zisserman, A.(2011). The devil is in the details: an evaluation ofrecent feature encoding methods. In British machinevision conference.

Comaniciu, D. & Meer, P. (2002). Mean shift: a robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24, 603–619.

Csurka, G., Dance, C. R., Fan, L., Willamowski, J., & Bray, C. (2004). Visual categorization with bags of keypoints. In Workshop on statistical learning in computer vision, ECCV (pp. 1–22).

Dasiopoulou, S., Giannakidou, E., Litos, G., Malasioti, P., & Kompatsiaris, Y. (2011). A survey of semantic image and video annotation tools. In G. Paliouras, C. D. Spyropoulos, & G. Tsatsaronis (Eds.), Knowledge-driven multimedia information extraction and ontology evolution (pp. 196–239). Berlin, Heidelberg: Springer-Verlag. Retrieved from http://dl.acm.org/citation.cfm?id=2001069.2001077

Deng, J., Berg, A. C., Li, K., & Fei-Fei, L. (2010). What doesclassifying more than 10,000 image categories tell us?

El Khoury, V., Jergler, M., Coquil, D., & Kosch, H. (2012). Semantic video content annotation at the object level. In 10th international conference on advances in mobile computing and multimedia (MoMM 2012). Retrieved from http://liris.cnrs.fr/publis/?id=5777

Heggland, J. (2006). Ontolog. http://www.idi.ntnu.no/~heggland/ontolog/.

Informatics and Telematics Institute (CERTH-ITI). (2009). VIA - video image annotation tool. http://mklab.iti.gr/via/.

Johanneum Research. (2008). SVAS - semantic video anno-tation suite. http://www.joanneum.at/digital/produkte-loesungen/semantic-video-annotation.html.

Kass, M., Witkin, A., & Terzopoulos, D. (1988). Snakes: ac-tive contour models. Int. Journal of Computer Vision,1(4), 321–331.

El-Khoury, V., Coquil, D., Bennani, N., & Brunie, L. (2012). Personalized vIdeo Adaptation Framework (PIAF): high-level semantic adaptation. Multimedia Tools and Applications, 1–42. doi:10.1007/s11042-012-1225-7. Retrieved from http://dx.doi.org/10.1007/s11042-012-1225-7


Kipp, M. (2001). Anvil - a generic annotation tool for multi-modal dialogue. Pro. of the 7th Europ. Conf. on SpeechCommunication and Technology (Eurospeech), 1367–1370.

Lankton, S. (2009, July). Sparse field methods.

Lausberg, H. & Sloetjes, H. (2009). Coding gestural behavior with the NEUROGES-ELAN system. Behavior Research Methods, Instruments, & Computers, 41(3), 841–849.

Levin, A., Viola, P., & Freund, Y. (2003). Unsupervisedimprovement of visual detectors using co-training. InProc. of ieee int. conf. on computer vision (Vol. 1,pp. 626–633).

Lindeberg, T. (2013). Scale selection. Springer.

LIRIS laboratory (UMR 5205 CNRS). (2008). Advene - annotate digital video, exchange on the net. http://liris.cnrs.fr/advene/.

Muja, M. & Lowe, D. G. (2009). Fast approximate nearest neighbors with automatic algorithm configuration. In VISAPP international conference on computer vision theory and applications (pp. 331–340).

Osher, S. & Sethian, J. A. (1988, November). Fronts prop-agating with curvature-dependent speed: algorithmsbased on hamilton-jacobi formulations. J. Comput.Phys. 79, 12–49. doi:http://dx.doi.org/10.1016/0021-9991(88)90002-2

Perronnin, F., Sanchez, J., & Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV 2010.

Shi, Y. & Karl, W. C. (2005). Real-time tracking using level sets. In Proc. of the 2005 IEEE conference on computer vision and pattern recognition (CVPR'05), volume 2 (pp. 34–41). Washington, DC, USA: IEEE Computer Society. doi:10.1109/CVPR.2005.294

Smeaton, A., Over, P., & Doherty, A. R. (2010, April). Videoshot boundary detection: seven years of TRECVIDactivity. Computer Vision and Image Understanding,114(4), 411–418.

Smith, J. R. & Lugeon, B. (2000, November). A visual annotation tool for multimedia content description. SPIE Photonics East, Internet Multimedia Management Systems.

Snoek, C. G. M. & Worring, M. (2009). Concept-based video retrieval. Foundations and Trends in Information Retrieval, 4(2), 215–322. Retrieved from http://www.science.uva.nl/research/publications/2009/SnoekFTIR2009

Stegmaier, F., Doeller, M., Coquil, D., El-Khoury, V., & Kosch, H. (2010). Vanalyzer: an MPEG-7 based semantic video annotation tool. Workshop on Interoperable Social Multimedia Applications (WISMA).

Ulges, A., Schulze, C., Koch, M., & Breuel, T. M. (2010).Learning automatic concept detectors from onlinevideo. Computer Vision and Image Understanding,114(4), 429–438.

Vezhnevets, V. (2004). "growcut" - interactive multi-label n-d image segmentation by cellular automata. Cybernet-ics, 127(2), 150–156. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.59.8092&rep=rep1&type=pdf

Weber, J., Lefevre, S., & Gancarski, P. (2010). Video ob-ject mining: issues and perspectives. In Proc. of the4th ieee int. conf. on semantic computing (pp. 85–90).ICSC ’10. Washington, DC, USA: IEEE Computer So-ciety. doi:10.1109/ICSC.2010.71

Yilmaz, A., Javed, O., & Shah, M. (2006, December). Objecttracking: a survey. ACM Comput. Surv. 38. doi:http ://doi.acm.org/10.1145/1177352.1177355

Yilmaz, A., Li, X., & Shah, M. (2004, November). Contour-based object tracking with occlusion handling in videoacquired using mobile cameras. IEEE Trans. PatternAnal. Mach. Intell. 26, 1531–1536. doi:http://dx.doi.org/10.1109/TPAMI.2004.96

Zhong, C. & Miao, Z. (2012). A two-view concept correla-tion based video annotation refinement. IEEE SignalProcess. Lett. 259–262.

Zhu, S. C. & Yuille, A. (1996, September). Region competi-tion: unifying snakes, region growing, and bayes/mdlfor multiband image segmentation. IEEE Trans. Pat-tern Anal. Mach. Intell. 18, 884–900. doi:10.1109/34.537343