
Multimed Tools Appl (2012) 59:795–832
DOI 10.1007/s11042-011-0774-5

On-line video abstract generation of multimedia news

Víctor Valdés · José M. Martínez

Published online: 23 March 2011
© Springer Science+Business Media, LLC 2011

Abstract The amount of video content available nowadays makes video abstraction techniques a necessary tool to ease access to the already huge and ever growing video databases. Nevertheless, many of the existing video abstraction approaches have high computational requirements, complicating the integration and exploitation of current technologies in real environments. This paper presents a novel method for news bulletin abstraction which combines on-line story segmentation, on-line video skimming and layout composition techniques. The developed algorithm provides an efficient, automatic and on-line news abstraction method which takes advantage of the specific characteristics of news bulletins for obtaining representative news abstracts.

Keywords Video · Abstraction · Skimming · Summarization · On-line · Real-time · Multimedia · News

1 Introduction

Nowadays video abstraction (also called video summarization) is becoming a necessity to deal with the increasing amount of available video content in networked or home repositories. The amount and variety of available video makes its search and retrieval an increasingly difficult task, and content is often lost and never used due to the difficulty of navigating such large repositories. The search and visualization effort implies a waste of time and, in many cases, of bandwidth.

V. Valdés (B) · J. M. Martínez
Escuela Politécnica Superior, Universidad Autónoma de Madrid,
C/Francisco Tomás y Valiente 11, 28049, Madrid, Spain
e-mail: [email protected]

J. M. Martínez
e-mail: [email protected]


These problems can be reduced or eliminated with the application of video abstraction techniques, which offer short and representative versions of the original videos that can be easily downloaded and watched in a shorter amount of time, also reducing the employed bandwidth. Nevertheless, one of the main disadvantages of existing technologies is the high computational resources and time needed for the generation of video abstracts. This is due to the large amount of audiovisual information that video content is composed of, and to the complex analysis techniques and abstraction algorithms typically used. These disadvantages discourage the implementation and integration of a great part of the existing solutions in real or commercial environments (an example of a commercial application can be found in [46], where one of the main challenges is the consumption of a minimal amount of resources).

We define on-line abstraction systems as systems with linear performance and progressive generation (see Section 2 for more details about both concepts). To fulfill the linear performance requirement, the amount of resources required by the abstraction approach must scale linearly with the length of the original video. On the other hand, progressive generation implies that the availability of the complete original video is not required to begin the output abstract generation. With the fulfillment of both conditions, the abstract can be generated 'on the fly', as the original video is being broadcasted or recorded, making a video abstract available with a limited delay once the original video finishes (the amount of acceptable delay will depend on the application scenario) and providing partial output while the original video is being processed. In this way, the video abstract can be generated while content is being broadcasted or uploaded to a repository, so the user has an instantly available video abstract without the need to wait for more complex off-line processing. Furthermore, the progressive generation approach makes it possible to provide partial abstracts before the end of a broadcast, so that a user is able to watch an abstract of already broadcasted content, for example during a sport match half-time break or once the news bulletin has already begun.
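As a rough illustration of these two properties, the following sketch (ours, not the paper's implementation) processes segments in a single pass, so the cost scales linearly, and yields each selected segment as soon as the decision is taken; the is_informative rule is a hypothetical placeholder for the selection logic developed in Sections 4-6.

```python
from typing import Iterable, Iterator

def online_abstract(segments: Iterable[object]) -> Iterator[object]:
    """Consume video segments as they arrive; emit selected ones immediately."""
    previous = None
    for segment in segments:                  # single pass: linear cost
        if is_informative(segment, previous):
            yield segment                     # partial abstract available at once
        previous = segment

def is_informative(segment, previous) -> bool:
    # Hypothetical stand-in: keep a segment when it differs from its predecessor.
    return previous is None or segment != previous
```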

Besides 'instantly' available video abstracts, the proposed philosophy and techniques are applicable as well to fast video abstract generation from already stored multimedia content (e.g., avoiding the storage of pregenerated video abstracts and requiring minimal resources for their on-demand generation) or for personalization purposes (e.g., making it possible to create abstracts with different characteristics for each user, thus avoiding the need to store thousands of abstract versions per video).

The work presented in this paper focuses on the integration of a complete end-to-end on-line news abstraction system; the way in which this 'on-line' operation modality is achieved is the main contribution of the present work. The on-line operation implies that each individual component must operate both efficiently and progressively and, for this reason, alternative solutions adapting common approaches to news story segmentation, video skimming and composition have been developed for real-time operation. In the rest of the paper, the implementation details of each individual component are provided, showing how high quality results can be obtained with the application of progressive and low computational complexity solutions, in contrast to the traditionally applied off-line techniques for shot classification and video skimming.

The result of the process is a video abstract in which, for each story found in the news bulletin, a visual composition is generated combining the anchorperson introduction and a video skim of the visual segments of the story.


The system is able to work with news bulletins composed of an arbitrary number of stories and without assumptions about their length or shot composition. The real-time generation capability of the module is especially interesting for this type of content (the fast availability of news information is of the highest relevance for journalists or regular users interested in accessing the latest news). A working version of the proposed algorithm has been successfully integrated in the IST FP6-027685 Mesh project (see footnote 1), which focuses on the development of tools for enhancing access to multimedia news for professional users.

This paper is structured as follows: after this introduction, Section 2 presents the state of the art on both generic video abstraction techniques and news content specific approaches. Section 3 gives an overview of the proposed abstraction system, which combines different stages for incoming content classification, video skimming and abstract composition. Section 4 presents a study of the specific characteristics of news video bulletins and the developed techniques for on-line news segment classification. Section 5 details the on-line video skim generation algorithm. In Section 6 the orchestration of all the previously defined modules is depicted. The obtained results, in terms of objective and subjective evaluations, are presented in Section 7. Finally, future work is foreseen and conclusions are drawn in Section 8.

2 State of the art

The work presented in this paper combines techniques usually applied both in generic video abstraction and in specific approaches dealing with news video content. In recent years a large number of video abstraction techniques have enabled applications for fast content browsing, transmission and retrieval (for a more detailed overview of the state of the art, the interested reader is referred to the different surveys on video abstraction approaches, e.g., [1, 26, 28, 36, 40, 51]). According to [28] and [51] there are two main types of video abstraction techniques: keyframe extraction and video skimming. In the first case, keyframe extraction, the output of the process is a set of still images representing the original video according to a specific criterion. This kind of approach can be found, for example, in [13, 33, 66] and allows further formatting of the output keyframes for the creation of mosaics or video-posters [9]. According to [28], existing work in video keyframe extraction can be categorized into three classes: sampling based, shot based and segment based. In the first group, sampling based approaches, keyframes are extracted by sampling the original audiovisual content uniformly or randomly [8, 13, 35, 37, 49, 60]. Shot change detection is a common starting point in most video abstraction techniques, applied to segment the videos into shots, which are used as the basic unit for analysis. In shot based algorithms each shot is considered the basic abstraction unit and a number of keyframes are extracted from each shot according to different criteria [13, 50, 66, 67]. Segment based approaches perform keyframe selection over higher-level units known as segments.

1 http://www.mesh-ip.eu


An example of such segments are the different scenes in a movie, each of them composed of several shots (see more details about video content structuring and decomposition in [58]).

Video skimming consists of the extraction of several continuous video segments from the original video, which can later be composed (edited) in different ways. In this case the temporal sequence of frames is preserved between the beginning and the end of each selected segment. A clear advantage of this method is the inclusion of motion and audio information. In highlight oriented video skimming the output is composed of a set of relevant parts of the original content, as in the case of movie trailers or sport highlights abstracts [62]. In the case of summary oriented video skimming, the output is composed of different segments which provide an overview of the whole original video [20, 48]. This category is usually related to approaches where the abstraction process is treated as a global optimization problem. Clustering [16] and rate-distortion optimization methods [27] fall into this category.

Some typical features applied to measure the relevance of different parts of the video are, for example, motion activity [25] or the video's associated audio [20]. Other systems, applied to very specific domains, make use of several specialized features (e.g., person detection or gesture analysis [21]) but cannot be considered generalized techniques. Other approaches take into consideration video segment/shot comparison techniques for the elimination of duplicate or too similar video segments based on camera motion patterns or visual similarity (for example making use of color histogram calculation [55]).

Taking into consideration the computational complexity of abstraction approaches, we can differentiate between linear and non-linear complexity systems. Linear methods are those in which the amount of processing resources needed by the abstraction algorithm scales proportionally with the original video duration; they are, therefore, the most suitable for real-time abstract generation. Non-linear approaches include techniques which require computationally costly algorithms which do not scale linearly and, in consequence, are commonly applied only in off-line scenarios (see [54] for a more detailed discussion about different types of abstraction systems). In most cases, abstraction systems with linear complexity are those performing local optimization or selection of the original video fragments while maintaining a constant analysis and selection complexity. Many abstraction approaches rely on visual redundancy elimination and, in those cases, costly image and video fragment comparisons must be carried out. If those comparisons are avoided or reduced, the abstraction systems are more likely to perform linearly. Straightforward solutions are, for example, the selection of the first frame of each shot [15] or direct video subsampling [18]. More complex systems, where the comparisons are applied only to surrounding frames [3] or to a limited number of preceding video fragments [55], fall into the linear performance category as well. In [19] a real-time system which includes audiovisual analysis and automatic content editing for home video is presented. Approaches aimed at implementation in commercial devices such as personal video recorders (PVRs) pay special attention to the computational performance of the system. Examples can be found in [46], an automatic highlight scene detection system; in [44], which presents a fast-forward abstraction approach relying on the detection of face tracks in the original video; and in [45], a recorded program browsing system which classifies the content according to the number and position of detected faces.


On the other hand, methods dealing with abstraction as an optimization problem [6, 31], as the maximization of an objective function [14], or through clustering based approaches [7], make use of the whole available original content for the abstract generation and require a number of comparisons which increases heavily with the amount of original information, yielding non-linear performance.

One of the principal characteristics proposed in the present work, related to the computational complexity of the algorithms, is the progressive generation of the video abstract: the system does not require the complete original video to be available to begin outputting the abstract. The most common approach is the off-line operation mode, that is, the abstraction algorithm requires the complete original data before producing the abstract. Clustering [7, 16, 67] and rate-distortion [27] approaches, or other methods such as [24], where the complete original video is mapped to a polyline later simplified for the generation of the video abstract, are typical off-line solutions. Most of the existing progressive abstraction approaches are reduced to subsampling methods like fast-forward approaches [18, 60] or systems where one keyframe is selected from each incoming shot [15] or group of frames (e.g., in [13] keyframes are extracted from each video segment accumulating a predefined amount of variation). Other more complex methods include potentially progressive adaptive playback approaches [43] or sufficient content change based methods [55], where video segments are added to the output if no visually similar fragments are already included. Progressive analysis for the identification of motion acceleration or deceleration points as keyframes [33], or methods based on local analysis of a feature curve extracted from the original video [11], generate progressive abstracts as well. To the best of our knowledge there are no works in the literature similar to the present one, which includes on-line (that is, in real time with progressive processing and output) content classification, skimming and abstract composition.

When applying abstraction techniques to news content, one of the main problems to deal with is the identification of story boundaries. In the TRECVID 2003 story segmentation task, aimed at identifying story boundaries within news bulletins, participants employed a wide variety of effective techniques, including text-based (the original videos were provided with closed caption text) and audiovisual approaches. In [10] several of the presented techniques are compared. The best results were obtained when applying audiovisual or combined audiovisual+text techniques (up to 0.77 F1 scores), while the text-only approaches obtained worse results. Some of the participants [2, 39] obtained up to 80% accuracy in anchorperson detection, and in [10] it is stated that it would be possible to obtain an F1 measure of 0.62 in story segmentation based only on an anchorperson detection process if the detection rate approached 100%. Correct anchorperson detection is, therefore, of the utmost relevance for news abstraction. Face detection techniques have been commonly applied for this purpose: for example, in [32] a list of major casts (including the anchorperson in news content) is generated by clustering the content based on face detection and audio features. In [45] a system for news content browsing is presented which makes use of the same face detection technique [57] as this work. O'Hare et al. [41] include, as part of the extracted feature set for story segmentation, a face detection algorithm based on flesh color detection followed by a shape analysis. Their work assumes that each story begins with an anchorperson followed by a more detailed report. The video bulletin is divided into shots clustered using shot length, distribution, motion activity and face detection features.


The authors found that anchorperson shots tend to be clustered together due to their high similarity and make use of a Support Vector Machine (SVM) for their classification. In [29] it is proposed to make use of features extracted in the compressed domain (motion activity and DC-images) for the detection of the anchorperson, based on color comparison in high motion areas of the image. In this case, the anchorperson audio is kept and an abstract is generated by combining it with a video skim of the following news report segment, constrained to a length equal to the kept audio length. In [23] a system for the selection of news highlights based on the analysis of closed captions and their alignment with the news bulletin audio is depicted. Zhang et al. [66] deal with the presentation aspects of video search in the news domain, proposing video collages as the tool for fast browsing. The work described in [65] proposes the division of news bulletins into anchorperson and news shots. Anchorperson shots are identified by calculating the difference between consecutive frames and comparing those with small differences (anchorperson shots are almost static) against a quite simple anchorperson model which defines certain areas, like head or body, where motion should be found. Another off-line clustering-based approach can be found in [64], where face detection is performed including the consideration of the clothing color under the head. Shots with faces are clustered based on this information and the largest cluster is assumed to correspond to the anchorperson (it should be noted that this approach could fail in cases where, as in the content set we used, there is more than one anchorperson during the news bulletin). Weather report shots are detected as well by making use of color histograms (blue and green predominance can usually be found) and motion vector information. In cases where an anchorperson appears between two reports corresponding to the same story, a merging process is carried out based on textual information analysis and visual comparison, allowing the fusion of segmented stories sharing the same topic. The usage of anchorperson clothing color can also be found in [63], where faces are detected based on flesh-color analysis of the images. Other systems consider a high number of possible shot categories: in [4] a decision tree, based on low level (color histogram, motion activity, shot duration, etc.) and high level (face detection and text captions) features, is applied to differentiate between 13 possible shot categories. A further Hidden Markov Model (HMM) analysis is then applied to locate scene boundaries. A completely different approach can be found in [17], where story segmentation heavily relies on closed captions, speech alignment and commercial detection (the latter based on shot change rate and black frame detection). The type of techniques suitable for the classification of different news content categories is closely related to those applied in high-level concept detection systems, such as those presented at the TRECVid high-level feature extraction task (consisting of the detection of high level concepts in video content) [38, 47, 59], which include, among others, concepts such as 'news subject monologue' or 'weather news'. However, in the present application scenario, where only news content is processed, the task is simpler as it is only necessary to discriminate between a constrained number of predefined shot categories.

In summary, the studied systems for video abstraction and their specific application to news content include a wide variety of extracted features and applied techniques, many of them focused on the detection of anchorperson shots. For this purpose, face detection, color and shape analysis algorithms are commonly applied.


Nevertheless, although several of the existing techniques are quite efficient, none of the existing approaches seems to work as a real-time and progressive system providing instant abstract availability at any moment during the broadcast/abstraction process. Most techniques assume the complete availability of the original content and unlimited time for the generation of the news bulletin abstracts. Even those systems which provide real-time browsing capabilities or high efficiency rely on content analysis carried out with the complete original content available and should, therefore, be considered off-line systems. When studying existing generic video abstraction systems not necessarily focused on news content, it is possible to find several progressive generation systems; nonetheless, the complexity of the existing techniques and the types of generated abstracts are limited. In this work we propose a complete system able to carry out content feature analysis, classification, video skimming based on visual redundancy elimination and, finally, output abstract composition and coding. The solution is able to operate in real time and progressively, that is, sequentially processing and outputting content; for this purpose a set of novel techniques has been developed, and existing ones have been adapted, with a focus on efficiency.

3 Overview of the on-line abstraction module

In this section an overview of the On-Line Abstraction Module (OLAM) architecture and functionalities is presented. The OLAM is in charge of generating on-line multimedia abstracts of news bulletins by combining real-time techniques for shot classification, news story segmentation, video skimming and video layout composition. The main challenge is to build an abstraction system running on-line, that is, while the content is being broadcasted (e.g., for making the content available in an Internet portal simultaneously with the program creation), and finishing the abstraction process with a negligible delay after the original video broadcast finishes. The application of on-line algorithms to these problems is not a common approach and most studied works do not aim to develop efficient progressive generation solutions. The efficiency and on-line generation we look for raise a number of technical challenges, both because of the high efficiency required for the different processes carried out and because only partial information (the already received/broadcasted original content) is available at any given instant during the abstraction process.

In order to enable on-line abstract generation and reduce the complexity of frame/shot analysis and comparison algorithms, the input video is divided into short segments, never longer than 30 frames (slightly over the commonly accepted minimal perceptible size of 25 frames [18], in order to increase the output smoothness in the video skimming stage), which are processed sequentially, being analyzed, classified, selected or discarded separately in the different stages of the abstraction process. The length of the obtained video segments can be smaller than 30 frames in cases where the segment contains a shot change. Such small granularity in the video processing makes it possible to output partial results from the abstraction process before the original video has been completely received. As discussed in Section 2, most common approaches deal with visual classification, skimming and composition problems without taking into account computational constraints such as those imposed by the on-line and real-time operation provided by the OLAM.


As depicted in Fig. 1, the system is divided into four modules:

– Analysis: The analysis stage is in charge of the extraction of low-level features from the original video stream for their use in the following classification and video skimming stages. The original video stream is divided into small segments and, for each one, features such as the MPEG-7 Color Layout, frame differences, color analysis and face detection are extracted (see Section 4.1).

– Classification: In this stage each received video segment, annotated in the analysis stage, is classified based on the information provided by a set of independently trained SVMs for the different possible shot categories (see Section 4.2). This stage also works at subshot level, with small video segments composed of a maximum of 30 frames, so once each segment is classified it can almost immediately be discarded, selected for the composition stage or sent to the skimming stage. The actions associated with each type of video segment may vary depending on the configuration of the OLAM system, as will be described in the following sections.

– Skimming: The skimming module is in charge of generating video skims from the combination of segments received after their classification. In this case a novel algorithm [52], described in Section 5, is applied for the on-line generation of the video skims. The result of the skimming process is sent to the composition stage for its combination with other selected video segments.

– Composition: In the composition stage the final abstract presentation is generated. The final layout is a combination of resized video segments rendered in the foreground of the image together with full sized segments in the background plane (a minimal compositing sketch is given after this list). By default, the OLAM generates video abstracts with a foreground window including the anchorperson's complete story introduction while, in the background, a condensed video skim of the news report is presented (see Fig. 2). The configuration of the abstract presentation layout can vary depending on the desired abstract characteristics, as will be explained in Section 6. The usage of the anchorperson's audio introduction is similar to the approach proposed in [29] although, in that case, the visual composition of the anchorperson is not carried out. Moreover, the algorithm presented in [29] does not work on-line (the whole video is needed to begin the process), it is applied to individual news stories that must always begin with an anchorperson shot, and it does not provide flexibility in terms of abstract length, number of stories in the news bulletin or number of anchorperson appearances.
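As a hedged illustration of the kind of layout the composition stage produces, the following OpenCV sketch overlays a resized foreground frame (e.g., the anchorperson) on a full-size background frame (the report skim). The window scale and bottom-left placement are illustrative assumptions, not the OLAM's actual layout parameters.

```python
import cv2
import numpy as np

def compose_frame(background: np.ndarray, foreground: np.ndarray,
                  scale: float = 0.35, margin: int = 16) -> np.ndarray:
    """Overlay a scaled-down foreground frame on the bottom-left of the background."""
    out = background.copy()
    h, w = background.shape[:2]
    fg = cv2.resize(foreground, (int(w * scale), int(h * scale)))
    fh, fw = fg.shape[:2]
    y0 = h - fh - margin                       # bottom-left placement (assumed)
    out[y0:y0 + fh, margin:margin + fw] = fg   # paste the foreground window
    return out
```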

Fig. 1 OLAM modules

Fig. 2 Layout for the news video abstract

The OLAM system is prepared for on-line processing of arbitrary length news bulletins. The on-line approach, apart from the previously enumerated functionalities in terms of instant abstract availability, efficiency and personalization potential, would be easily adaptable to continuous abstraction of 24-h news broadcasts and to any other kind of broadcasting or recording system (e.g., video surveillance systems).

4 News content classification

The first stage in the abstraction process consists in the classification of the incoming video segments into the different possible categories included in a news bulletin. The available corpus for development and testing consists of 54 complete news bulletins, each about 28 min long, provided by Deutsche Welle to the IST-FP6-027685 Mesh project (see footnote 1), totaling about 25 h of news content. The basic structure of the news bulletins is quite similar to that identified in previous works [29, 41, 63, 65] and consists of a number of concatenated news stories, each introduced by an anchorperson section and followed by a visual report with the details of the story. The assumption of such a basic structure has been successfully applied for the segmentation of news stories, but a further refinement is useful for a meaningful abstraction process: a news bulletin contains many other types of shots, such as reporters, interviews, commercials, etc. By observing the available content, the following types of shots have been identified: Anchorperson, Animation, Black(Frame), Commercial, Communication, Interview, Map, Report, Reporter, Studio, Synthetic and Weather.

Figure 3 shows a typical example of each of the defined categories (except Black and Commercial). The most common are the Anchorperson and Report categories, which can be found in almost every news story. The Weather, Animation and Studio categories are not associated with news stories and usually appear at the beginning or end of a news bulletin or as transitions between different parts of it. The Map, Communication, Report and Synthetic categories are usually found interleaved with other shot categories as part of a news story but may not appear at all.


Fig. 3 News shot categories

4.1 Analysis

The proposed system aims to provide on-line abstract generation, that is, to process the original content and generate the output abstract progressively and in real time. For this reason, every stage in the whole abstraction process must fulfill certain requirements related to efficiency and progressive operation. Small video segments are the basic processing unit in all the stages of the abstraction process, providing the granularity needed for on-line generation while reducing the complexity of the shot analysis processes and comparisons that depend on the video segment length. In addition, the small video segment approach reduces the dependency on accurate shot boundary detection and makes it possible to eliminate intra-shot redundancies (other systems in which the basic unit is the shot do not allow discarding only short portions to reduce the length of visually steady segments). This approach has been successfully applied in our previous on-line video abstraction works [52, 53, 55].

The feature extraction process starts with the calculation of the MPEG-7 Color Layout descriptor [22] for each decoded video frame. This descriptor is particularly suitable for our system as it was designed as a fast solution for high-speed image retrieval. After its calculation, frames are grouped in blocks of at most 30 consecutive frames, depending on a simple threshold-based shot change detection mechanism. This mechanism was previously evaluated in [53] and is implemented by calculating the Color Layout distance between consecutive frames and splitting the video segments when the difference exceeds an experimentally set threshold (see [53]). This mechanism provides only a slight improvement with respect to fixed block separation, avoiding the mixing of different shots in a single video segment. Nevertheless, the performance of the overall abstraction algorithm does not depend strongly on this mechanism because the small video segment size minimizes the possible impact of shot change location errors.
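A minimal sketch of this splitting mechanism is shown below, assuming frames as OpenCV BGR arrays. The descriptor is a simplified stand-in for the MPEG-7 Color Layout (per-channel DCT of an 8 × 8 YCrCb thumbnail) and the distance threshold is an illustrative value, not the one tuned in [53].

```python
import cv2
import numpy as np

def color_layout(frame: np.ndarray) -> np.ndarray:
    """Simplified Color Layout: per-channel DCT of an 8x8 YCrCb thumbnail."""
    thumb = cv2.resize(frame, (8, 8))
    ycc = cv2.cvtColor(thumb, cv2.COLOR_BGR2YCrCb).astype(np.float32)
    return np.stack([cv2.dct(ycc[:, :, c]) for c in range(3)])

def split_segments(frames, threshold=400.0, max_len=30):
    """Yield segments of consecutive frames, closed at shot changes or 30 frames."""
    segment, prev = [], None
    for frame in frames:
        desc = color_layout(frame)
        change = prev is not None and np.linalg.norm(desc - prev) > threshold
        if segment and (change or len(segment) == max_len):
            yield segment
            segment = []
        segment.append(frame)
        prev = desc
    if segment:
        yield segment                          # flush the last partial segment
```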

Fig. 4 Anchorperson face position examples

For the classification of each video segment into one of the defined categories, additional features must be extracted segment by segment. This extraction must be efficient enough that the classification process, together with the execution of the rest of the abstract generation modules, can be completed in real time. To reduce the required computational complexity, the features are not extracted on a frame-by-frame basis but subsampled and averaged for each video segment. It is assumed that, given the small length of the video segments and the subsampling rate, only small variations may occur in the short time intervals between feature extraction instants. The set of extracted features has been selected trying to maximize their meaningfulness (given the different shot categories a news bulletin can contain) while keeping the extraction complexity low. The classification results, shown later, demonstrate the feasibility of applying the following 'light' descriptors for category classification:

– Face detection: The OpenCV library (see footnote 2) provides a very fast method for arbitrary object detection based on Haar features [30, 57]. In our case a frontal face detection model has been applied, particularly suitable for anchorperson detection, as the anchorperson is always staring at the camera with the face located in particular positions (see Fig. 4). The average number, size and coordinates of detected faces, as well as the variance of these features, are calculated for each video segment (a sketch of this per-segment aggregation is given after this list). These features are aimed at allowing the differentiation between the Anchorperson, Reporter and Interview categories and the rest of the possible categories.

– Color Variety: The color distribution varies between natural and synthetically generated images and represents a good feature for their differentiation. To measure the number of representative colors in an image, the Y, U and V channel histograms are calculated. For each of them a color representativeness threshold is experimentally defined as 1/3 of the maximum histogram value. For each video segment we obtain a single color variety value by averaging the number of colors in the histograms with a value over the defined threshold. Figure 5 shows an example of the calculation of the histograms and representative colors (colors over the threshold) for a synthetic and a natural image.

– Frame Differences: As part of the Color Layout descriptor extraction, an 8 × 8 thumbnail image is generated for each decoded frame. As an estimation of the video activity, the average variation within each video segment is calculated by subtracting the thumbnails of consecutive frames. In order to differentiate between activity types, for example local or global motion patterns, five different activity areas, shown in Fig. 6a, have been defined.

– Shot Variation: In order to obtain an average segment variation measure, the Color Layout difference is calculated every three frames within the video segment and then averaged. This provides an activity measure different from that obtained with thumbnail subtraction.

2 http://sourceforge.net/projects/opencvlibrary/


Fig. 5 Representative color calculation

– DCT Coefficients Energy: The Color Layout descriptor consists of the Discrete Cosine Transform (DCT) coefficients of each color plane's 8 × 8 thumbnail. Making use of those pre-calculated coefficients, it is possible to characterize images with smooth or abrupt changes, as images with different variation characteristics present different energy distributions within the DCT coefficients. In this case the descriptor coefficients have been divided into four areas (see Fig. 6b), which are added up and averaged within each video segment to obtain four frequency measures.

– Image Intensity: Shots recorded in a TV set usually have constant and controlled illumination conditions. In this case the mean and variance of the intensity of each frame are calculated and averaged for each video segment.

The set of extracted features has been selected trying to keep both simplicity and discrimination capacity. Several of the extracted features make use of the information associated with the Color Layout descriptor, which is later used for shot comparison in the video skimming process, avoiding the need to extract new features which could slow down the process.
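To make the per-segment aggregation concrete, the sketch below computes two of the described descriptors, assuming frames as OpenCV BGR arrays and the frontal-face Haar cascade shipped with OpenCV; the subsampling rates follow Table 1 (face detection every 7 frames, color features every 4), YCrCb is used as the closest OpenCV equivalent of YUV, and the exact aggregation used by the OLAM may differ.

```python
import cv2
import numpy as np

cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def face_stats(segment):
    """Average and variance of the number of frontal faces over subsampled frames."""
    counts = []
    for frame in segment[::7]:                       # detect every 7th frame
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        counts.append(len(cascade.detectMultiScale(gray)))
    return float(np.mean(counts)), float(np.var(counts))

def color_variety(segment):
    """Average count of representative colors: bins above 1/3 of the histogram max."""
    counts = []
    for frame in segment[::4]:                       # sample every 4th frame
        ycc = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
        for c in range(3):                           # luma and two chroma histograms
            hist = cv2.calcHist([ycc], [c], None, [256], [0, 256]).ravel()
            counts.append(int(np.sum(hist > hist.max() / 3.0)))
    return float(np.mean(counts))
```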

All the extracted features are visual only. Several experiments were carried out with the inclusion of simple audio features (audio energy, zero-crossing rate, etc.) which did not produce significant improvement in the classification process. This may be because in most of the news content the audio track contains only narration, with ambient sound or music appearing in a minority of the shots, which are already well characterized by the visual descriptors alone (for example the news bulletin introductory animations with music). On the other hand, the most relevant categories, such as anchorperson or report, are separable taking into consideration visual-only features. Previous work on audiovisual scene change detection [61] found that news is one of the genres in which audio features are least effective. Moreover, in this case, the performance constraints and the lack of available information inherent to on-line operation complicate the inclusion of more sophisticated (e.g., speech recognition, prosodic analysis, speaker change detection) and potentially effective audio analysis techniques.

Fig. 6 a Frame block variation areas; b DCT coefficient blocks

Table 1 Feature extraction average time per second of video

Feature                  Average extraction time    Extraction frequency
                         per second (ms)            (every × frames)
Frame decoding           120.3                      1
Color layout             12.4                       1
Face detection           75.6                       7
Color variety            1.5                        4
Frame differences        8.7                        4
Shot variation           0.6                        4
DCT coefficient blocks   0.0045                     4
Image intensity          0.018                      4
Total                    219.1                      –

Table 1 summarizes the extraction times (see footnote 3) for each of the described features together with the decoding time. Values are averaged, so the reported time represents the average feature calculation time for every second of incoming video. The extraction frequency is included in the table as well. It must be noted that, to achieve real-time performance, the average decoding, feature extraction, classification, selection and coding time required for each incoming video segment must be smaller than its playing time. In this case, the average feature extraction time per second, including the frame decoding, is 219.1 ms, leaving 780.9 ms per second of video available for the rest of the abstraction processes.

4.2 Video segment classification

4.2.1 Category classifiers

Once all the features have been extracted, each incoming segment must be classified into one of the categories defined in Section 4. The chosen classifier is the widely used Support Vector Machine (SVM) [12], which has proven to provide good performance in different classification problems [34]. The libSVM library [5], integrated in the OLAM, provides a fast and easy to use SVM implementation.

For the training process, ten complete Deutsche Welle (DW) news bulletins have been manually annotated, classifying each shot according to the defined categories.

3 Hardware platform: Intel Core 2 Duo @ 2.53 GHz with 4 GB of RAM.


The videos are split and features are extracted following the process described in Section 4.1. The purpose is to feed the classifier training process with a set of features extracted in the same way as in the real abstraction process. As a result, a total of 13,855 annotated segments are available for the training and validation processes.

An independent binary SVM classifier with an RBF (Radial Basis Function) kernel has been trained for each category, with a grid search over the C and gamma parameters of the SVM. The numbers of positive and negative samples have been equalized for each training process: a uniform sampling was carried out to obtain 5,000 negative samples, while the positive samples were oversampled in order to get balanced data sets. For each possible C and gamma parameter combination a five-fold cross validation is carried out with 90% of the training set. The obtained classifier is used for the classification of the remaining 10% of the samples for validation purposes. Table 2 summarizes the obtained C and gamma parameters as well as the precision and recall obtained on the 10% validation samples.
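The protocol can be sketched with scikit-learn, whose SVC is itself backed by libSVM; the grid values and the use of class_weight="balanced" (in place of the paper's over/undersampling) are our assumptions, not the authors' exact settings.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

def train_category_classifier(X: np.ndarray, y: np.ndarray):
    """Grid-searched RBF SVM for one binary category; returns model and val accuracy."""
    # Hold out 10% of the samples for validation, as in the paper's protocol.
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.1, stratify=y)
    grid = GridSearchCV(
        SVC(kernel="rbf", class_weight="balanced"),
        {"C": np.logspace(-2, 8, 11),        # illustrative grid, not the paper's
         "gamma": np.logspace(-9, 1, 11)},
        cv=5)                                # five-fold cross validation
    grid.fit(X_tr, y_tr)
    return grid.best_estimator_, grid.best_estimator_.score(X_val, y_val)
```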

It can be observed that the synthetically generated categories, Black, Weather, Synthetic, Animation and Communication, are the ones with the best classification rates due to the low variability in the specific characteristics of this kind of content. The Map category classifier performs slightly below the other synthetic categories, probably because the variability of the maps is higher and they can occasionally contain animations. The Anchorperson classifier has a very high classification performance as well, given the well defined characteristics (face presence and location, illumination conditions) of this kind of shot in the DW content. The Reporter and Interview categories are two of those with the lowest classification performance because, in many cases, the classifiers are not able to differentiate between them or, under specific circumstances, may consider a reporter or interviewee to be an anchorperson. The Commercial category is another one with low performance, an expectable result because commercials contain very different kinds of shots, easily mistaken for other categories.

For the proposed abstraction process, the good results obtained with the Anchorperson classification are very important: the correct identification of the anchorperson shots is of the highest relevance for correct news segmentation, the extraction of relevant news story introductions and correct overlapping with news reports.

Table 2 DW single category classification results

Category        # segments   log(C)   log(gamma)   Precision   Recall
Anchorperson    3,124        3        −1.025       0.989       0.984
Animation       236          10.25    −3.87        0.997       0.996
Black           131          −1.37    5.75         1.00        1.00
Commercial      293          24.92    −18.55       0.829       0.873
Communication   128          6.55     −7.87        0.995       1.00
Interview       1,466        −0.125   −0.9         0.862       0.780
Map             286          22.5     −17.75       0.978       0.997
Report          6,212        4.5      −3           0.928       0.965
Reporter        821          21.3     −20.98       0.969       0.792
Studio          387          11.25    −4.35        0.950       1.00
Synthetic       317          −1       0.2          1.00        1.00
Weather         454          0.925    −0.2         1.00        1.00


4.2.2 Multi-class classifier

The individual SVM classifiers provide a very good starting point for the classification of the different kinds of shots in the news bulletin. Nevertheless, the final decision about which category a shot belongs to is not straightforward: the classification of a shot with the complete set of trained binary classifiers produces, in many cases, a multiple positive situation, that is, the shot is simultaneously considered to belong to more than one category.

Another point to take into consideration is the consistency of the category of consecutive video segments. The proposed approach works at sub-shot level and, therefore, it is very likely to find consecutive video segments belonging to the same category.

Both situations have been addressed by training an additional 'global', multi-class SVM which is fed with the individual classifiers' predictions in a five-segment window (including a temporal dimension in the classification data) and outputs a video segment category prediction in one of the 12 possible categories. Figure 7 depicts the two steps of the classification process: in the first step a Classification Vector for a given time instant is composed of the 12 individual classifications of the video segment obtained from the 12 binary SVMs. In the second step a Global Feature Vector for a given time instant is composed of the segment's extracted features (enumerated in Section 4.1) and the Classification Vectors of the two previous and two subsequent segments as well as the current Classification Vector. The original set of extracted features is included again in the Global Feature Vector because it can provide useful information, not taken into account by the binary classifiers, for the discrimination between two or more specific categories. For example, the anchorperson single category classifier is trained for the discrimination between the anchorperson and any other type of video shot (reporter, animation, maps, etc.). Such classification relies on those features which provide the best overall classification performance. Considering, for example, the situation of having both the anchorperson and reporter binary classifiers activated, the inclusion of the low level features in the Global Feature Vector allows the multi-class classifier to 'reconsider' the features which better discriminate between anchorperson and reporter. The set of most discriminative features could vary if other binary classifiers are activated.

Fig. 7 Multi-class classification steps


Table 3 DW multi-class classifier confusion matrix

                     1      2      3    4     5    6      7     8      9      10   11   12
Anchorperson    1    99.75  0      0    0     0    0.25   0     0      0      0    0    0
Animation       2    0      91.5   0    0     0    0      0     8.5    0      0    0    0
Black           3    0      0      100  0     0    0      0     0      0      0    0    0
Commercial      4    0      33.25  0    58    0    0      0     8.75   0      0    0    0
Communication   5    0      0      0    0     100  0      0     0      0      0    0    0
Interview       6    0.75   0      0    0     0    98.75  0     0.5    0      0    0    0
Map             7    0      0      0    0     0    0      100   0      0      0    0    0
Report          8    0      0      0    7.25  0    1.25   1.25  89.25  0      0    0    0
Reporter        9    0      0      0    0     0    22.75  0     0      76.25  0    0    0
Studio          10   0      0      0    0     0    0      0     0      0      100  0    0
Synthetic       11   0      0      0    0     0    0      0     0      0      0    100  0
Weather         12   0      0      0    0     0    0      0     0      0      0    0    100

The experiments during classifier training showed that the overall classification rate improved when the original set of extracted features was included in the Global Feature Vector.
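A minimal sketch of how such Global Feature Vectors could be assembled is given below; the clamping of the window at the sequence boundaries is our assumption, since the paper does not specify the edge handling.

```python
import numpy as np

def global_feature_vectors(features: np.ndarray, cls_vectors: np.ndarray) -> np.ndarray:
    """Concatenate each segment's features with the Classification Vectors of a
    five-segment window (two previous, current, two subsequent).
    features: (n_segments, n_features); cls_vectors: (n_segments, 12)."""
    n = len(features)
    rows = []
    for i in range(n):
        window = [cls_vectors[min(max(i + d, 0), n - 1)] for d in (-2, -1, 0, 1, 2)]
        rows.append(np.concatenate([features[i], *window]))
    return np.vstack(rows)
```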

The multi-class classifier has been trained with the same corpus and methodology as the individual classifiers, balancing the number of samples per category. The optimal parameters, log(C) = 2.45 and log(gamma) = −58.9, yielded an overall correct classification rate of 92.79%. Table 3 shows the confusion matrix values for the multi-class classifier when applied to the 10% validation samples. The behavior of the classifier is similar to that of the individual classifiers: the best results are obtained for the Anchorperson, Map, Studio, Synthetic and Black categories. The highest incorrect classification rates were obtained between the Reporter and Interview categories and between the Commercial and Report ones, for the previously exposed reasons.

Finally, in order to reduce possible classification mistakes, a temporal filtering of outliers is carried out, considering that video segment categories cannot present many consecutive changes. For this reason a video segment classified in a category a but surrounded by two segments of a different category b is reassigned to category b. For the same reason, a segment classified in a category a, preceded by a segment belonging to category b and followed by a segment of a different category c, is classified as b or c depending on which of the adjacent segments is more similar from a visual point of view (the comparison is carried out making use of the Color Layout descriptor, with the same mechanism as the one applied in the video skimming process described in Section 5).
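The filter can be sketched as follows, with visual_distance standing in for the Color Layout comparison of Section 5 (a hypothetical callback, as the paper does not expose this interface):

```python
def filter_outliers(labels, segments, visual_distance):
    """Reassign isolated category outliers based on their temporal neighbors."""
    out = list(labels)
    for i in range(1, len(out) - 1):
        prev_l, cur, next_l = out[i - 1], out[i], out[i + 1]
        if cur == prev_l or cur == next_l:
            continue                       # not an isolated outlier
        if prev_l == next_l:
            out[i] = prev_l                # a between two b's becomes b
        else:                              # a between b and c: pick the visually closer
            d_prev = visual_distance(segments[i], segments[i - 1])
            d_next = visual_distance(segments[i], segments[i + 1])
            out[i] = prev_l if d_prev <= d_next else next_l
    return out
```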

Table 4 summarizes the average time (see footnote 3) per second of classified video consumed by the binary and multi-class classification stages. The total average classification time is about 2 ms, which is negligible and shows the high efficiency of SVMs once trained.

Table 4 Average classification time per second of video

Classification step      Classification time (ms)
Binary classifiers       1.111
Multi-class classifier   1.114
Total                    2.226


4.2.3 Alternative content classification

For a further validation of the proposed descriptors and segment classification mechanism, the training and testing processes were repeated with a smaller set of alternative news bulletins. The steps followed in Sections 4.2.1 and 4.2.2 were repeated making use of news broadcasts from the Chinese channel CCTV available in the TRECVID 2005 content set. In this case ten news bulletins of about 10 min each were manually annotated for the training process. The CCTV news bulletins have a structure similar to the DW ones, but no map, weather or black categories were found in the training set. Moreover, the content has a smaller resolution and worse quality than the DW content. On the other hand, the anchorperson shots are very stable, without changes of anchorperson or background during a single news bulletin, a fact that should ease the anchorperson classification.

A total of 5,309 individual segments resulted from the annotation process and were used for the feature extraction and training processes. Table 5 summarizes the number of segments per category, the optimal C and gamma parameters, and the classification precision and recall obtained on the 10% validation samples (with positive and negative samples balanced in the same way as for the DW content). The obtained individual classification results are, in principle, better than those obtained for the DW news bulletins. Nevertheless, the DW results are more reliable for several reasons: the amount of DW content is about three times larger than the CCTV content used, and the DW content presents a higher number of different anchorperson, camera and background combinations as well as different kinds of animations and maps. Furthermore, it was found that the commercial segments in the CCTV content are the same in all the bulletins, which explains the unusually good results obtained in the CCTV commercial category classification. The worst results were obtained in the interview category, a behavior found in the DW content as well. It was observed that in several of the interviews people were recorded in profile, which tends to produce failures in the frontal face detection. The reporter category obtained a surprisingly good result, but the small number of annotated reporter fragments makes this result unreliable. In the same way, the small amount of synthetic content does not allow us to assert such a high classification precision.

The training process of the multi-class classifier, following the same steps described in Section 4.2.2, resulted in the optimal parameters log(C) = −11.5 and log(gamma) = −26.75. Table 6 shows the confusion matrix obtained for the multi-class classifier. In this case the categories with the highest classification error are the Interview and Report ones.

Table 5 CCTV single category classification results

Category        # segments   log(C)   log(gamma)   Precision   Recall
Anchorperson    1,138        0.85     −0.8         0.991       0.993
Animation       74           −0.65    −3.35        0.998       0.999
Commercial      115          19.2     −11.45       0.982       0.980
Communication   122          −1.7     −1.5         0.999       0.999
Interview       625          4.55     −9.8         0.924       0.854
Report          2,990        7.25     −3.25        0.923       0.938
Reporter        78           12.2     −7           0.991       1.00
Studio          139          3.3      −0.8         0.999       0.999
Synthetic       28           −8.85    0.4          0.998       0.999


Table 6 CCTV multi-class classifier confusion matrix

                     1      2    3     4    5      6    7     8    9
Anchorperson    1    100    0    0     0    0      0    0     0    0
Animation       2    0      100  0     0    0      0    0     0    0
Commercial      3    0      0    100   0    0      0    0     0    0
Communication   4    0      0    0     100  0      0    0     0    0
Interview       5    10.67  0    10    0    75.33  4.0  0     0    0
Report          6    0      0    1.33  0    4.0    92   1.33  1.33 0
Reporter        7    0      0    0     0    0      0    100   0    0
Studio          8    0      0    0     0    0      0    0     100  0
Synthetic       9    0      0    0     0    0      0    0     0    100

Most of the erroneous interview segments are misclassified as anchorperson, which is expectable given the common characteristics of both categories. The misclassification of interview segments as report or commercial may be produced by failures in the frontal face detection process. In the opposite case, we find report fragments incorrectly classified as interview, probably because frontal faces appear in such segments even though they are not interviews. The classification results would be expected to improve with a bigger training content set. Nonetheless, apart from those specific issues, the overall results are coherent with those obtained with the DW bulletins and demonstrate that the extracted descriptors and classification scheme can be applied to content different from the one used during the development of the system. In Section 7.1.2 additional evaluations of the CCTV content abstraction results are reported.

5 Video skimming

The on-line approach implies that the video skim is generated progressively, while the video is being received, with limited delay and in real time, i.e., the processing time will be equal to or smaller than the video play time. For this reason, there is a lack of knowledge about the characteristics and duration of the incoming video content (the available information is reduced to the already received video only) and it is not possible to substitute video segments already included in the output. Therefore, the control of the skim size and the selection of segments become rather complex problems.

The video skimming algorithm presented in this paper is a slightly improved version of the implementation [52] evaluated in the TRECVID 2008 BBC Rushes Summarization Task [42] and continues our work of recent years on on-line video skimming techniques [53, 55, 56]. It provides a generic method for on-line video skimming which is scalable in terms of computational requirements (amount of processing, video skim generation delay and memory consumption) and hence particularly suitable for integration in the OLAM module, in which the computational resources must be carefully distributed among the different abstraction stages.

The proposed approach aims to generate a video skim that is as informative as possible and, for this reason, the principal criterion for the selection of the video segments is to avoid the inclusion of repeated content (visually similar content).


At the same time, the selection process is constrained in order to select as many consecutive segments as possible, maximizing the continuity of the output video skim and producing more pleasant results. Additionally, it is possible to prioritize the selection of high-activity fragments, aiming to capture more events from the original content. The mechanism for carrying out a segment selection that fulfills those conditions while keeping a certain target length is implicit in the binary tree algorithm described next.

The algorithm is based on the dynamic generation of a skimming tree which models the different possibilities for inclusion or exclusion of the incoming video segments. The path in this binary tree representing the best skim (according to a predefined criterion) is iteratively selected and provides the information required for deciding between selecting or discarding each incoming video segment. This approach is based on the assumption that, considering a limited-size buffer of n incoming video segments, the selection process can be improved by choosing 1 to n segments on each iteration instead of taking an instantaneous decision, inclusion or discard, for each incoming video segment. The segment selection precision of the skimming system increases with higher n values, making it possible to achieve the same precision as an off-line approach with n equal to the number of video segments in the original video. The drawback of increasing n is the introduction of a minimum delay of n video segments in the video skim generation and the increase in the complexity of the selection system (2^n possible combinations of segment inclusions). For the implementation of a buffered on-line video skimming system, and to control the complexity of the process, a binary tree representation has been chosen: for each incoming video segment two nodes are added to the skimming tree (see Fig. 8). Such nodes represent the inclusion or discard of the received video segment in the output video skim and are appended to every previously existing terminal node. Each branch in the skimming tree represents a possible video skim as a sequence of included or discarded nodes, and it is scored taking into account the following criteria (an illustrative scoring sketch follows the list):

– Size: Relative size of the resulting video skim, calculated by considering the number of inclusion/discard nodes on each branch. The video skimming module integrated within the OLAM aims for a 1/3 length ratio, close to the relation between the number of anchorperson segments and other segments in the news bulletins (see Table 2).

– Continuity: In order to generate more pleasant video skims, a continuity measure is considered. This measure computes the ratio of consecutive video segments included in the output skim. Avoiding too many cuts in the video skim produces smoother results, although fewer events can then be included. This parameter should be balanced in order to obtain both good event inclusion and smoothness.

– Redundancy: The main purpose of most abstraction approaches is the elimination of redundancy in the original video. In this case, the similarity of each video segment with respect to the other video segments included in the same tree branch is calculated by making use of the previously extracted Color Layout descriptors.

– Activity: In many video abstraction systems it is common to give a higher priority to the inclusion of segments with high activity, due to their higher probability of containing 'relevant' events. In this case, the variation of the segments included in each branch has been previously calculated in the feature extraction stage (see Section 4.1).
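To make the branch evaluation concrete, the following sketch shows one possible way of combining the four criteria into a single branch score. It is illustrative only: the function name, weights and normalizations are placeholders (the actual scoring mechanism is detailed in [52]), and the similarity and activity values are assumed to be precomputed in the feature extraction stage.

def branch_score(decisions, similarities, activities,
                 target_ratio=1/3, w_size=1.0, w_cont=0.5, w_red=1.0, w_act=0.5):
    """Score one skimming-tree branch (hypothetical weights/normalizations).
    decisions:    include/discard flags (True/False), one per buffered segment
    similarities: Color Layout similarities (0..1) between included segments
    activities:   precomputed activity values (0..1) of the included segments
    """
    n = len(decisions)
    if n == 0:
        return 0.0
    included = sum(decisions)
    # Size: reward closeness to the target length ratio (1/3 in the OLAM module)
    size = 1.0 - abs(included / n - target_ratio)
    # Continuity: fraction of included segments whose predecessor is also included
    pairs = sum(1 for i in range(1, n) if decisions[i] and decisions[i - 1])
    continuity = pairs / max(included - 1, 1)
    # Redundancy: mean visual similarity among included segments (to be minimized)
    redundancy = sum(similarities) / max(len(similarities), 1)
    # Activity: mean activity of the included segments (to be maximized)
    activity = sum(activities) / max(included, 1)
    return (w_size * size + w_cont * continuity
            - w_red * redundancy + w_act * activity)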


Fig. 8 Dynamic tree example

The on-line video skim generation and the computational complexity of the approach are controlled by considering only small sub-trees instead of generating the complete skimming tree for the whole video (see Fig. 8). The depth of the skimming tree and the number of branches are limited by keeping the subtrees containing the highest-scoring paths and eliminating those with lower scores. A higher depth and tree leaf limit imply better quality output (more possibilities are evaluated) but also higher computational requirements (processing time and memory consumption). The proposed algorithm allows the usage of different tree parameters according to each situation: a real-time skimming application would require high computational efficiency and, therefore, the tree parameters should be set for fast skimming.
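The expansion and pruning loop can be sketched as follows. This is a minimal illustration, not the actual implementation: each branch is reduced to its list of include/discard decisions, the score function is assumed to be of the kind outlined above, and the depth and leaf limits mirror the values reported later in the text (20 nodes and 200 branches, cf. Table 8).

MAX_DEPTH, MAX_LEAVES = 20, 200   # tree depth / evaluated branches (cf. Table 8)

def process_segment(leaves, score):
    """Expand every leaf with include/discard children, prune to the best
    branches, and commit the oldest (root) decision once the tree is full.
    Returns the surviving branches and the committed decision (or None)."""
    expanded = [branch + [flag] for branch in leaves for flag in (True, False)]
    expanded.sort(key=score, reverse=True)   # best-scoring branches first
    leaves = expanded[:MAX_LEAVES]           # leaf limit bounds memory and CPU
    committed = None
    if len(leaves[0]) >= MAX_DEPTH:          # depth limit reached:
        committed = leaves[0][0]             # commit the best branch's root
        leaves = [b[1:] for b in leaves if b[0] == committed]
    return leaves, committed

# Usage sketch: start with leaves = [[]]; on each incoming segment call
# process_segment(leaves, score) and, whenever a decision is committed,
# emit or discard the oldest buffered segment accordingly.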

The target of the system, applied in the present context, is to maximize the representativeness of the video skims while reducing the news report redundancy. The system was evaluated by sending two different runs of the algorithm, generated with slightly different parameters, to the TRECVID 2008 BBC rushes summarization task [42]. The machine used was a Pentium Xeon @ 3.7 GHz with 3 GB of RAM and, in both cases, the algorithm's processing time (120 s on average for Run1 and 99 s for Run2) outperformed most of the other systems (4,879 s on average); only one of the baselines, based on uniform sampling (cmubase3, 17.2 s), was faster than the proposed system.

Table 7 TRECVID 2008 BBC rushes summarization evaluation results

Run    DU     XD    TT     VT     IN     JU     RE     TE
Run1   31.2   0.5   45.1   33.1   0.55   3.27   2.97   2.71
Run2   31.1   0.5   47.7   33.5   0.56   3.32   2.96   2.62
Avg.   27     4.5   41.4   29.0   0.46   3.15   3.2    2.72


Table 8 Average tree processing time per second

Insertion (ms)   Extraction (ms)   Total (ms)
2.97             0.15              3.12

The duration (DU) and time difference with the target (XD), both in seconds, demonstrate the good performance of the output size control incorporated in the proposed system, especially considering a fully on-line approach without information about the length of the incoming video (Table 7). The metrics related to judging time, in seconds (TT, VT), are slightly higher than the average, an expected result as the obtained video skim lengths are above the average values as well. The junk metric (JU) (absence of undesired shots such as blank frames and clapboards, on a scale from 1 to 5) cannot be extrapolated to the current system as no junk filtering mechanism is incorporated in it. Both runs obtained very good results in the inclusion rate (IN), that is, the fraction of events from the original video included in the summary, which is particularly relevant for measuring the representativeness of the generated video skims. The redundancy (RE) and tempo (TE) scores (subjective measures expressed on a 1–5 scale) of the video skims are slightly under the average results, a common outcome for systems with high inclusion rates and not very relevant in the current context, as news report videos are not as redundant as unedited content (as the BBC Rushes are). The obtained results demonstrate the good performance of the only on-line video skimming system presented to the BBC rushes summarization task when compared with off-line approaches. Further technical details about the binary-tree-based on-line skimming approach, the branch scoring mechanism and the results can be found in [52], while more information about the TRECVID 2008 BBC Rushes Summarization Task and its evaluation can be found in [42].

Table 8 shows the average processing time (see footnote 3) for each incoming video fragment in terms of insertion (time needed to add a new video fragment to the tree) and extraction (time needed to extract the tree root fragment). The times are obtained with a maximum tree depth of 20 nodes and 200 evaluated tree branches. This small skimming tree is sufficient for the type of content we are dealing with, short news reports with low redundancy values, and benefits the performance of the skimming process. The descriptors needed for the evaluation of each branch are already extracted in the OLAM analysis module. For these two reasons, small skimming trees and available descriptors, the overall process is very fast, requiring only an average of 3.12 ms for every processed second of video.

6 Abstraction process

In this section the news story abstract creation process is detailed. It is considered that the anchorperson provides the essential audio information to allow the users to get an idea of what each news story is about and that, in most cases, the anchorperson is followed by a report section in which the introductory information is extended. From this starting point, three abstraction strategies are combined:

– Video Composition: The simultaneous display of different video segments makes it possible to reduce the video abstract length, condensing the information presented. The anchorperson segments contain compact and high-interest audio information but not relevant visual information, while the report sections include extended audio information together with relevant visual content. It is possible to take advantage of this particularity of news bulletins by presenting the anchorperson segments, which provide a natural audio abstract of the news story, in a reduced window over a full-size background composed of the most relevant visual information of each news story.



– Video Skimming: The video segments that are more relevant from a visual point of view, those corresponding to the news story report, are selected to be displayed in the full-size background of the abstract layout. Any kind of video content usually contains redundant visual information, so the news report length can be reduced with a video skimming process; for this purpose the algorithm described in Section 5 is applied.

– Segment Filtering: Segments which are included as part of the news story report section but do not provide relevant visual information can be directly eliminated. For example, if it is considered that the reporter or interview segments within a news report do not provide additional visual information about the news story, they can be discarded.

The selection of which video segment categories, from those defined in Section 4, should be displayed in the small foreground window, which should be skimmed and presented in the background, and which should be directly eliminated is easily configurable in the developed abstraction system. In the same way, the presentation layout can have different configurations, as shown in Fig. 9, where different combinations of foreground window and background are presented, depending on which shot category is to be emphasized.

Fig. 9 Abstract composition layouts


Fig. 10 State machine for abstract generation


In the implemented system, the layout depicted in Fig. 9a has been applied: the initial news introduction is completely kept (audio and images) and displayed in a reduced-size window in the top-left corner of the image, together with other information such as maps or synthetic content. This size reduction allows the full-size display of the most informative images from a visual point of view, in this case the abstraction of the report section of each news story.
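Since both the category-to-action mapping and the layout are configuration matters, a possible configuration could look like the following sketch. Names, values and structure are hypothetical (the paper does not specify the configuration format); the assignments reflect the behaviour described in this section, with the remaining categories of Section 4 simply discarded.

OVERLAY, SKIM, DISCARD = "overlay", "skim", "discard"

# Action applied to each segment category from Section 4 (illustrative)
CATEGORY_ACTIONS = {
    "Anchorperson": OVERLAY,   # kept whole: its audio narrates the story
    "Map":          OVERLAY,   # shown with the anchorperson in the small window
    "Synthetic":    OVERLAY,
    "Report":       SKIM,      # skimmed to ~1/3 and shown full-size behind
    "Interview":    DISCARD,   # considered visually non-informative here
    "Reporter":     DISCARD,
    "Animation":    DISCARD,
}

# Fig. 9a-style layout: reduced anchorperson window in the top-left corner
LAYOUT = {"overlay_corner": "top-left", "overlay_scale": 1/3}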

The on-line, and hence progressive, operation mode requires solving the correct alignment of the anchorperson and the corresponding report sections of each news story as the incoming video segments are received. Such sections can be of very different sizes and it is possible that, in some cases, one of them does not exist (for example, special reports with no introductory anchorperson section). The on-line abstract generation process has been implemented by means of a three-state machine and individual buffers for the temporal storage of foreground and background content. Figure 10 shows the three-state machine diagram, the temporal buffers and the state change conditions (detailed in Table 9). The overall abstraction process begins in the Start state, where the incoming, already classified video segments are received. Any kind of incoming content is discarded with the exception of Anchorperson or Report segments, which are accumulated until the conditions for changing to the News Intro or Report states are fulfilled, that is, until a sufficient amount of Anchorperson or Report content has accumulated (see Table 9).

Table 9 State change conditions

Condition               Description
Anchorperson detected   At least 5 s of consecutive Anchorperson segments accumulated.
Report detected         At least 5 s of Report video segments accumulated.
Intro end               At least 5 s of no Anchorperson, Map, Synthetic or Report content accumulated.
Report end              At least 5 s of no Anchorperson, Interview, Map or Synthetic content accumulated.


In order to avoid undesired effects caused by incorrect segment classification, all the state change conditions require the accumulation of a minimum number of segments from a given category, that is, a stable category classification. The News Intro state is reached when the incoming video corresponds to an anchorperson section; in this case, all Anchorperson, Map or Synthetic incoming video is stored in the Overlay Buffer, which will later be displayed in the reduced-size interface window. The Report state is reached from the Start or News Intro states when a number of Report segments have been received. In this state the incoming Interview, Reporter or Animation segments are discarded while the Report video segments are stored for further video skimming. Both the News Intro and Report states return to the Start state if a few seconds of unexpected content categories are received.

In a typical news bulletin structure the state machine mainly switches between the News Intro and Report states. Each Report segment received in the Report state is skimmed following the process described in Section 5, targeting 1/3 of the original size (which is the average proportion between the anchorperson and other kinds of content in the news bulletins). The result of the video skimming process is progressively presented in the output abstract background. If the Overlay Buffer contains previously stored content, it is presented simultaneously in the foreground reduced-size window, obtaining the anchorperson-report synchronization; otherwise, if no foreground content is available, the overlay window is not displayed.

Each time the News Intro state is reached, the Overlay Buffer is flushed to the abstract output, so, if the report video skim of a news story is shorter than the anchorperson introduction, the abstract corresponding to that story will finish with a full-screen anchorperson. Additionally, at the beginning of the foreground/background composition, when the Report state is reached, the first seconds are composed only of a full-screen anchorperson, taking out part of the Overlay Buffer content, before making the foreground and background composition. This mechanism, besides providing a pleasant editing effect, helps to avoid incorrect anchorperson-report alignment in situations when, after a news report, the anchorperson makes a short comment about the preceding news story before starting with the following one, or when introductory sections of news stories without report are concatenated. For dealing with those cases in which the anchorperson section is too long, a length limit has been defined for the Overlay Buffer. If this limit is exceeded, the buffer begins to be progressively displayed in full-screen size automatically, avoiding excessive delay in the abstract output and excessive memory consumption for the storage of too-long video segments, as well as improving the alignment of the anchorperson and report sections when several news stories are concatenated within the anchorperson section.
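A compact sketch of the three-state machine may help fix ideas. It is deliberately simplified with respect to Fig. 10 and Table 9 (buffer flushing, the Overlay Buffer length limit and the exact 'intro end'/'report end' category sets are reduced to their essentials), and the class and attribute names are illustrative rather than taken from the actual implementation.

START, NEWS_INTRO, REPORT = "Start", "NewsIntro", "Report"
MIN_STABLE_S = 5.0   # Table 9: 5 s of stable content triggers a state change

class AbstractComposer:
    def __init__(self):
        self.state = START
        self.overlay_buffer = []   # Anchorperson/Map/Synthetic (foreground window)
        self.report_buffer = []    # Report segments awaiting skimming (background)
        self.stable = {}           # seconds accumulated towards each condition

    def _accumulate(self, condition, seconds):
        self.stable[condition] = self.stable.get(condition, 0.0) + seconds
        return self.stable[condition] >= MIN_STABLE_S

    def feed(self, category, seconds):
        """Process one incoming, already classified video segment."""
        if self.state == START:
            # everything is discarded until enough stable content accumulates
            if category == "Anchorperson" and self._accumulate("intro", seconds):
                self.state, self.stable = NEWS_INTRO, {}
            elif category == "Report" and self._accumulate("report", seconds):
                self.state, self.stable = REPORT, {}
        elif self.state == NEWS_INTRO:
            if category in ("Anchorperson", "Map", "Synthetic"):
                self.overlay_buffer.append((category, seconds))  # small window
                self.stable = {}
            elif category == "Report" and self._accumulate("report", seconds):
                self.state, self.stable = REPORT, {}
            elif self._accumulate("intro_end", seconds):
                self.state, self.stable = START, {}   # unexpected content: reset
        elif self.state == REPORT:
            if category == "Report":
                self.report_buffer.append((category, seconds))   # to be skimmed
                self.stable = {}
            elif category in ("Interview", "Reporter", "Animation"):
                pass                                  # discarded in this state
            elif self._accumulate("report_end", seconds):
                self.state, self.stable = START, {}   # back to Start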

Figure 11 shows a simple example of the abstraction process for two consecutive news stories. Both abstracts begin with a full-screen anchorperson followed by the simultaneous display of the anchorperson and report skim sections in a composed layout. Finally, the first story ends with a full-screen report skim while the second one, whose video skim is considerably shorter, finishes with the news story in a full-sized layout.

Fig. 11 Abstraction example

The proposed model enables the sequential processing of the incoming video, and therefore real-time abstract generation and progressive output: each received video segment is immediately analyzed, classified and processed with one of the defined actions according to the current state of the abstract generation, and each news story abstract is completed with negligible delay once the story ends.

Table 10 summarizes the average time per second of written video spent in the output layout composition and video coding processes. The time is averaged over the output abstract length, which is only a fraction of the original video length. The results are calculated for an average output abstract length of 30% of the original video length.

Table 11 summarizes the average time (see footnote 3) required by each abstraction stage, and by the complete abstraction process, when considering a 28-min original news bulletin and a generated abstract of 30% of its length. The average amount of time required for the generation of a video abstract is below 1/3 of the original news bulletin length. As will be shown in the results section, the experimental times measured for the generation of a complete news bulletin abstract are coherent with these average results and are considerably below the original video length itself, achieving the required real-time operation mode.

7 Results

The OLAM system has been evaluated at both objective and subjective levels. From an objective point of view, the correct identification of the introductory anchorperson and news report sections and their synchronization for a correct layout composition, as well as the inclusion of all the news bulletin stories in the abstract, have been measured. The obtained results are described in Section 7.1.

The overall validation of the system has been carried out with a set of user tests in which the quality and representativeness of the proposed approach have been evaluated through the visualization of several of the generated abstracts by different users.

Table 10 Average composition & coding time per second

Composition (ms)   Coding (ms)   Total (per written second) (ms)   Total (30% abstract length) (ms)
135.05             140.56        275.62                            82.68


Table 11 Average abstraction time (30% length abstract)

Step                 Average time (per second) (ms)   28 min bulletin (s)
Decoding             120.3                            202.10
Feature extraction   98.84                            166.05
Classification       2.6                              4.37
Skimming             3.12                             5.24
Composition          40.51                            68.05
Coding               42.168                           70.84
Total                326.81                           516.65 (∼8′36′′)
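As a worked example of how the right-hand column of Table 11 follows from the per-second averages: a 28-min bulletin contains 28 × 60 = 1680 s of video, so for the decoding stage

\[
120.3\ \tfrac{\text{ms}}{\text{s}} \times 1680\ \text{s} \approx 202.1\ \text{s},
\]

and the complete process, 516.65 s ≈ 8.6 min, stays below one third of the original duration (1680/3 = 560 s), matching the real-time claim in the text.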

The user evaluation includes examples of incorrectly composed news stories in order to study their impact on the users' perception. The quality of the video skimming approach, previously discussed in Section 5 within the scope of the TRECVID 2008 BBC rushes summarization task [42], is implicitly validated within the user tests carried out, which are described in Section 7.2.

7.1 Objective evaluation

In this section the correct identification of the anchorperson and report sections of the news stories, as well as their correct composition, are measured. The individual video segment classification performance of the system has been previously evaluated and described in Section 4.2.1. Such individual category classification results have a great influence on the overall system performance because the correct news story identification and the anchorperson-report alignment depend on them. The results obtained in the Anchorperson classification are very important for the overall performance of the system, while the incorrect classification of Report, Interview or Commercial segments has a very reduced impact on the output abstract quality.

For the evaluation of the system, the inclusion of each news story in the output abstract and the correct synchronization of its anchorperson (where one exists) with the report video skim have been taken into account. For reports without an introductory anchorperson, the inclusion of the associated video skim is considered a correct news story inclusion and it is not counted in the anchorperson-report alignment statistics. The inclusion of non-relevant content has not been penalized but, in order to have a reference to measure the length reduction performance (targeting a 1/3 length ratio), the average relation between the anchorperson and the rest of the content in complete news bulletins has been considered an optimal result (it is the maximum possible reduction if all anchorperson sections are kept).

7.1.1 DW data set

Table 12 summarizes the results obtained for six complete DW news bulletins (different from those annotated and used for segment classification training). The table presents the original video and obtained abstract lengths, the number of news stories in each original video, and how many of them are compound ones, that is, composed of an anchorperson and a report and therefore subject to a possible foreground anchorperson/background video skim alignment. The obtained results are presented as the number of stories included in the abstract (when the corresponding anchorperson introduction is included) and the number of correct alignments between the anchorperson and report skims when dealing with compound stories (some results are available, at reduced resolution, at http://www-vpu.eps.uam.es/publications/newsAbstraction/).


Table 12 DW abstraction results

Video               Original length (min)   Abstract length (min)   # Stories (# compound)   # Inclusion     # Alignment     Process time (min)
321070_4 (spa)      28                      9.63 (34%)              17 (14)                  17/17 (100%)    12/14 (85.7%)   7.9
327904_3 (spa)      28                      8.68 (31%)              21 (19)                  17/21 (81%)     13/19 (68.4%)   7.4
326156_3_16 (eng)   28                      7.34 (26%)              12 (10)                  12/12 (100%)    9/10 (90.0%)    7.6
326156_3_18 (eng)   28                      7.15 (25%)              14 (10)                  14/14 (100%)    8/10 (80%)      7.9
326181_3_08 (eng)   28                      7.93 (28%)              16 (16)                  14/16 (87.5%)   13/16 (81%)     8.0
327916_3_18 (eng)   28                      10.43 (37%)             19 (15)                  18/19 (94.7%)   13/15 (86%)     8.2
Total               168                     51.16 (30.4%)           99 (84)                  92/99 (92.9%)   68/84 (80.9%)   47.0


The obtained results demonstrate the feasibility of obtaining a good size reduction (close to the 'optimal' target of 1/3 of the original length) with news video content while retaining most of the news stories in the news bulletins (92.9%) and with a correct overlapping between the anchorperson and the report sections in 80.9% of the cases. Most of the incorrect news story inclusions are due to the incorrect classification of anchorperson sections as interview or reporter ones. In the cases where incorrect alignment occurs, it is usually because Interview fragments are misclassified as Anchorperson, being overlapped over Report fragments, or because a Report fragment classification error produces an incorrect state change (for example, Report fragments classified as Commercial may produce a premature finalization of the news story composition). It is expected that those situations can be corrected, and the results improved, with the development of more precise classification mechanisms. It should be pointed out that all the introductory and end animations of the news bulletins, as well as the studio shots, were correctly eliminated in all bulletins. A reduced part of the commercial sections included in the news bulletins (those not correctly classified as Commercial content) are skimmed and included in the output abstract, but with a heavily reduced length.

The overall process is carried out in an on-line processing fashion and therefore, given the operative constraints, the classification, inclusion and alignment results can be considered very good. The total processing time (see footnote 3) is considerably below the original video duration and slightly under the output abstract length, thus demonstrating the feasibility of the proposed approach for continuous broadcast processing and for real-time abstraction (displaying the output as it is being generated) of already stored content.

7.1.2 CCTV data set

In Section 4.2.3 the feature extraction and classification processes were validated with a content set different from the one used during the development of the system (the DW news bulletins). In this section we carry out a simplified validation test of the complete abstraction process, trying to determine the applicability of the whole abstraction system to a different content set after just retraining the segment classifiers.

The anchorperson overlapping window proportions were slightly modified for a better visualization given the smaller resolution of the CCTV bulletins (see Fig. 12).


Fig. 12 CCTV news abstract composition example

With respect to the overall bulletin structure, the news bulletins are shorter (about 10 min) than the DW ones (28 min) but the anchorperson-report news story structure is kept along the news bulletin. A commercial section is included within the news bulletin but, as commented in previous sections, it is the same in all the news bulletins used and, therefore, it is always correctly detected and eliminated. In this case, exact alignment between the anchorperson speech and the background video skim was not evaluated (see footnote 4) and therefore only the correct inclusion of all the anchorperson sections and the following reports (if applicable) was evaluated. Table 13 summarizes the obtained results for three CCTV news videos, depicting the original and abstract lengths, the number of existing and included anchorpersons, the number of correct alignments (anchorperson overlapped over the following news report) and the total processing time per video (it can be observed that the processing times are very small in comparison with the DW content due to the reduced resolution of the CCTV videos).

The anchorperson inclusion errors are produced in all cases because the news bulletins include short anchorperson fragments under the established 5 s minimum length (see Table 9). In two cases, this situation occurred at the end of the news bulletin and therefore no relevant information was missed. Only one of the cases occurred in the middle of a news story. All the reports were appropriately skimmed and included in the abstracts, with a correct alignment in 16 out of 17 cases (the only exception being the non-detection of the anchorperson within a news story). The obtained abstract length is close to the 'optimal' 1/3 value (it is substantially higher in one of the bulletins due to a long anchorperson appearance at the end of the video without being followed by a news report).

The obtained results are quite good in general terms, even considering the relatively small amount of data used in the training of the classifiers (see Section 4.2.3), and demonstrate the applicability of the proposed approach to different news broadcast content.

4 Due to the lack of knowledge about the Chinese language.


Table 13 CCTV anchorperson–report inclusion results

Video             Original length (min)   Abstract length (min)   # Anchorperson inclusion   # Reports inclusion   # Alignment      Process time (min)
20041101_110000   10:00                   3:14 (32.3%)            9/10 (90.0%)               8/8 (100%)            8/8 (100%)       1.1
20041108_110000   9:40                    4:56 (51.0%)            6/7 (85.7%)                5/5 (100%)            4/5 (80.0%)      1.2
20041109_110100   9:00                    3:02 (33.7%)            6/7 (85.7%)                4/4 (100%)            4/4 (100.0%)     1
Total             28:40                   11:12 (39%)             21/24 (87.5%)              17/17 (100%)          16/17 (94.12%)   3.3

7.2 Subjective evaluation

For the validation of the proposed abstraction approach from a subjective point of view, a user test campaign was carried out. The tests focused on three principal aspects: the representativeness of the proposed approach, the pleasantness of the generated abstracts, and the usefulness of the abstracts. The tests were carried out with a total of 27 users. Three different tests were implemented, combining different news bulletin fragments from the DW content set, and each user was asked to visualize and evaluate one of the tests (yielding a total of nine users per test). Each test was composed of four different summarized news bulletin fragments and their corresponding original videos. Instead of evaluating complete bulletins, small fragments of one or two news stories were presented to the users. This design decision was taken because a complete 28-min bulletin would have been too long to keep the users' attention and to allow the user to remember the details of all the individual news stories. Nevertheless, some of the evaluated videos are composed of two consecutive stories so that the user can check the individual story abstract concatenation. Table 14 summarizes the different news story fragments used in each of the three different tests, including the number of stories that each segment contained, the original video duration, the abstract length, the news story language and the correct alignment (specifying whether the anchorperson introduction was correctly overlapped with the news report or whether there were overlapping errors). Most of the segments are in Spanish because most of the evaluators were Spanish native speakers and the correct understanding of the news stories is highly relevant.

Table 14 News segments for user evaluation

Segment   Test   Original video   # Stories   Original length (s)   Abstract length (s)   Output length ratio   Language   Correct alignment
S1        1      327904_3         2           94                    37                    0.39                  SPA        No
S2        1      327904_3         2           123                   92                    0.75                  SPA        Yes
S3        1      327904_3         2           113                   50                    0.44                  SPA        No
S4        1      327904_3         1           83                    27                    0.33                  SPA        Yes
S5        2      327904_3         1           93                    49                    0.53                  SPA        Yes
S6        2      327904_3         2           111                   37                    0.33                  SPA        Yes
S7        2      327904_3         2           159                   49                    0.31                  SPA        Yes
S8        2      327904_3         2           129                   64                    0.5                   SPA        No
S9        3      327916_3         2           121                   43                    0.36                  ENG        Yes
S10       3      327916_3         1           38                    13                    0.34                  ENG        Yes
S11       3      327916_3         2           144                   45                    0.31                  ENG        Yes
S12       3      327916_3         1           174                   81                    0.47                  ENG        Yes


Table 15 Test questions

Question ID   Assertion

Video questions
Q1            The video abstract adequately represents the original bulletin...
Q2            The video abstract rhythm and composition are pleasant...
Q3            There is relevant/fundamental information missing in the video abstract...
Q4            The video abstract length is...

General questions
GQ1           The proposed video abstraction technique is useful for news video content...
GQ2           The displayed video abstracts are pleasant to see...
GQ3           The video abstracts provide a proper understanding about the original news bulletin video...

Possible answers

Applied for       Choices
Q1–Q3; GQ1–GQ3    1—strongly disagree, 2—disagree, 3—no opinion, 4—agree, 5—strongly agree
Q4                1—too short, 2—short, 3—adequate, 4—long, 5—too long


After the visualization of each pair of abstract/original video, the users were asked to rate their level of agreement with several assertions (Q1–Q4) about each video abstract and, at the end of the test, they were asked to rate a final set of three general assertions (GQ1–GQ3). Table 15 shows the different assertions. For each question the user was able to choose between five different levels of agreement, except for question Q4, in which the user had to indicate his/her opinion about the length of the abstract. Finally, at the end of the questionnaire, the users were able to make any desired comment about the abstraction method or the questions.

Table 16 shows the evaluation results (average and standard deviation) per video for questions Q1–Q4 (asked after the visualization of each news bulletin fragment).

Table 16 Evaluation results per video segment

Segment   Q1 (avg : dev)   Q2 (avg : dev)   Q3 (avg : dev)   Q4 (avg : dev)
S1        4.22 : 1.09      4.44 : 0.53      2.44 : 1.01      3.00 : 0.71
S2        4.33 : 0.50      3.77 : 1.20      1.78 : 0.97      4.00 : 1.00
S3        3.88 : 0.93      3.22 : 1.30      2.11 : 1.17      3.11 : 0.78
S4        4.66 : 0.50      4.66 : 0.50      1.89 : 0.93      3.22 : 0.44
S5        4.44 : 0.53      4.22 : 0.44      2.67 : 0.87      3.56 : 0.53
S6        3.55 : 1.01      2.44 : 1.13      3.00 : 1.00      2.89 : 0.33
S7        4.44 : 0.52      3.00 : 1.22      2.67 : 1.12      3.11 : 0.33
S8        4.11 : 1.17      3.67 : 1.12      2.67 : 0.87      3.33 : 0.50
S9        4.44 : 0.53      4.00 : 1.00      2.11 : 0.33      3.11 : 0.33
S10       4.11 : 0.78      4.67 : 0.50      3.11 : 1.45      3.00 : 0.50
S11       3.89 : 0.93      3.44 : 1.01      2.67 : 0.87      2.78 : 0.44
S12       4.55 : 0.53      3.66 : 1.12      1.89 : 1.05      3.33 : 0.70
Average   4.22 : 0.75      3.76 : 0.92      2.42 : 0.97      3.20 : 0.55


The obtained results are, in general terms, quite positive for the validation of the proposed approach. Q1 ('The video abstract adequately represents the original bulletin') obtained very good results, with values for most of the videos close to or above 4 ('agree') and an average value of 4.22. Q2 results ('The video abstract rhythm and composition are pleasant') present more variation. The average value, 3.76, is close to the agreement value, so users tend to think that the video abstracts are pleasant. Nevertheless, some of the abstracts obtained values closer to a neutral opinion (S3 and S7, with scores 3.22 and 3.00, respectively) or even to the disagreement score (S6, score 2.44). In the case of S3 there is a composition mistake in the video abstract which clearly affects the user perception. In news segment S7 the main problem may be related to the anchorperson introductory narration, which does not finish before the visual report starts and is therefore incomplete in the output abstract. The S6 abstract presents the case of short visual report fragments after the anchorperson finishes, which may produce an unpleasant rhythm. Several of the users commented that the news report video skim, in the cases in which it was longer than the anchorperson introduction, presented audio cuts which had a negative influence on the abstracts' pleasantness. Such an issue is a typical problem in many video skimming approaches and should be addressed in the future to enhance the abstracts' pleasantness (for example, by selecting a continuous audio fragment from the report instead of the audio of each fragment). Nonetheless, the average results are good and the representativeness of the abstracts is still high even for abstracts with lower pleasantness scores. The third question ('There is relevant/fundamental information missing in the video abstract') was aimed at determining whether really important information was missing in the abstract, and complements question Q1. The average obtained score was 2.42, between the 'no opinion' and 'disagree' values, showing a slight tendency of the users to consider that no really fundamental information is lost in the abstracts. Of course, the complete news stories provide more information about the story than just the single anchorperson introduction, but, considering the combination of Q1 and Q3, it can be stated that the abstracts adequately represent the original video information. Several users pointed out that this possible lack of information depended strongly on how well the anchorperson introduction described the rest of the news. For example, segment S10 obtained one of the worst results for Q3, 3.11 (which is, however, a neutral result), while its scores for Q1, Q2 and Q4 (analyzed later) were quite good. These results can only be explained by the specific content of that news story and the information it contains.

The last question presented to the users, Q4 ('The video abstract length is...'), was aimed at determining whether the users had any preference about the ratio between the original and abstract lengths. The average score obtained was 3.22, which is very close to the 'adequate' length choice in the test. Therefore, in general terms, the length of the generated abstracts seems to be correct. The abstract length ratios are depicted in Table 14 and a high correlation between those values and the obtained Q4 scores can be observed. Segments S2, S5, S8 and S12, with abstract length ratios of 0.75, 0.53, 0.50 and 0.47, obtained Q4 scores of, respectively, 4.00, 3.56, 3.33 and 3.33, showing a perception of long abstracts by the users. The abstracts with the best Q4 scores, presenting an adequate length for the users, are those with a length ratio of about 1/3.

Figure 13 depicts the answer frequencies for questions Q1–Q4 for the whole set of segments included in the three different tests carried out.


Fig. 13 Q1–Q4 answer frequencies

Summarizing the results: in 87% of the cases, the users agreed or strongly agreed with 'The video abstract adequately represents the original bulletin'. In 65.7% of the cases, the users disagreed or strongly disagreed with 'There is relevant/fundamental information missing in the video abstract', against 23.14% of the cases where the users considered that relevant information was missing (agreed or strongly agreed with Q3). 'The video abstract rhythm and composition are pleasant' for the users in 71.29% of the cases, while in 20.3% of them users disagreed or strongly disagreed with that assertion. Finally, users considered the video abstract lengths adequate in 66.6% of the cases, somewhat short or long in 29.6% of them, and too long or too short in only 3.7% of the displayed abstracts.

After watching and rating each individual abstract, the users were asked to specify their level of agreement with three general questions (GQ1–GQ3, see Table 15). In this case, the users had to consider the whole set of displayed abstracts and original videos in order to provide an overall impression of the proposed abstraction approach. Figure 14 presents the results obtained for the 27 users who carried out the tests: 96.3% of the users agreed or strongly agreed with GQ1, 'The proposed video abstraction technique is useful for news video content'; 81.6% of the users considered that (GQ2) 'The displayed video abstracts are pleasant to see'; and, finally, for question GQ3, 'The video abstracts provide a proper understanding about the original news bulletin video', 96% of the users agreed or strongly agreed.

The obtained results validate the proposed news video abstraction approach in both the individual evaluation of the abstracts and the general questions about the usefulness and quality of the abstraction approach.

Fig. 14 GQ1–GQ3 answer frequencies


The main purpose of the abstracts, to provide a representative short version of the original news content, is, according to the obtained results, successfully achieved. It seems that the pleasantness of the abstract visualization, although validated by the users' opinions, could be improved if, as several users commented, the cuts in the audio track were corrected in the cases where the report video skim is longer than the anchorperson introduction. In general terms, the obtained user evaluation results are very good, especially considering the constraints of real-time processing and progressive output generation (the latter limiting the amount of information available for the abstract generation).

8 Conclusions and future work

This paper has presented a system for the on-line generation of complete multimedia news bulletin abstracts. The on-line operation mode requires the sequential processing of the incoming video as well as progressive output generation, and implies working with only partial information about the original content (the already broadcast content at any given instant). Considering the on-line and efficiency constraints, the individual way in which the different techniques for content classification, video skimming and abstract composition have been applied, and how such techniques have been combined, represents a novel way to deal with news abstract generation. User validations demonstrate that the proposed approach, even with the on-line processing restrictions, produces useful, representative and pleasant abstracts. The generalization of the abstraction algorithm has been validated by its application to content from different news providers, and the obtained results demonstrate that the developed system provides a complete solution for instant news abstract availability during or at the end of the broadcast. The progressive abstract generation scheme allows continuous abstract generation for 24-h channels and provides new application possibilities, such as its adaptation to other fields where continuous abstraction could be applicable (for example, surveillance recordings). The proposed system philosophy could be extended to real-time visualization systems where abstracts are generated at viewing time, which could allow many personalization and interactivity possibilities.

Future work will mainly focus on the extraction of additional low-level features for the improvement of the video segment classification process and on the improvement of the internal mechanisms of that classification process. Better classification of visual content and the possibility of discriminating between a higher number of different shot categories will allow more possibilities for abstract composition in the future (e.g., identification and presentation of a set of overlapping windows with the faces of the story protagonists, or selection and presentation of relevant fragments/content with different layouts). The addition of audio and textual information to the abstract could enable better video story identification, categorization and semantics-based video skimming, which could yield personalized news bulletin abstracts for different users interested in different kinds of news and visual content. Real-time abstraction of already stored news bulletins could allow applications for interactive browsing over huge news bulletin repositories, generating news abstracts according to specific user queries.


Acknowledgements The authors want to thank Wilfried Runde and Jochen Spangenberg from Deutsche Welle for their collaboration in order to make the results of this work available on the web site. All the news bulletins used as content set for this work are © Deutsche Welle and/or respective copyright holders. (Material has been kindly provided for research/academic purposes only. Not for commercial use. No duplication, copying, re-use of any kind allowed.) Work supported by the European Commission (IST-FP6-027685—Mesh), Spanish Government (TIN2007-65400—SemanticVideo), Comunidad de Madrid (S-0505/TIC-0223 (ProMultiDis) CM), Consejería de Educación of the Comunidad de Madrid and by The European Social Fund.

References

1. Aigrain P, Zhang H, Petkovic D (1996) Content-based representation and retrieval of visual media: a state-of-the-art review. Multimed Tools Appl 3(3):179–202. Available: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.50.5650 (Online)

2. Browne P, Czirjek C, Gaughan G, Gurrin C, Jones GJF, Lee H, Marlow S, Donald KM, Murphy N, O'connor NE, Smeaton AF, Ye J (2003) Dublin City University video track experiments for TREC 2003. In: TREC video retrieval evaluation online proceedings

3. Calic J, Izquierdo E (2002) Efficient key-frame extraction and video analysis. In: International conference on information technology: coding and computing, vol 0, p 0028

4. Chaisorn L, Chua T-S, Lee C-H (2002) The segmentation of news video into story units. In: ICME '02: proceedings of the IEEE international conference on multimedia and expo, vol 1, pp 73–76

5. Chang CC, Lin CJ (2001) LIBSVM: a library for support vector machines. Available: http://www.csie.ntu.edu.tw/~cjlin/libsvm (Online)

6. Chang HS, Sull S, Lee SU (1999) Efficient video indexing scheme for content-based retrieval. IEEE Trans Circuits Syst Video Technol 9(8):1269–1279

7. Chen F, Adcock J, Cooper M (2008) A simplified approach to rushes summarization. In: TVS '08: proceedings of the 2nd ACM TRECVid video summarization workshop. ACM, New York, pp 60–64

8. Chien HJ, Smoliar SW, Wu JH (1995) Video parsing, retrieval and browsing: an integrated and content-based solution. In: MULTIMEDIA '95: proceedings of the third ACM international conference on multimedia

9. Christel MG (2006) Evaluation and user studies with respect to video summarization and browsing. In: SPIE MCAMR '06: proceedings of the conference on multimedia content analysis, management, and retrieval, vol 6073, no 1, pp 196–210

10. Chua T-S, Chang S-F, Chaisorn L, Hsu W (2004) Story boundary detection in large broadcast news video archives: techniques, experience and trends. In: MULTIMEDIA '04: proceedings of the 12th annual ACM international conference on multimedia. ACM, New York, pp 656–659

11. Ciocca G, Schettini R (2005) Dynamic key-frame extraction for video summarization. In: Santini S, Schettini R, Gevers T (eds) Internet imaging VI, vol 5670, no 1. SPIE, San Jose, pp 137–142. Available: http://link.aip.org/link/?PSI/5670/137/1 (Online)

12. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297

13. Divakaran A, Radhakrishnan R, Peker K (2002) Motion activity-based extraction of key-frames from video shots. In: ICIP '02: proceedings of the 2002 international conference on image processing, vol 1, pp I-932–I-935

14. Fayzullin M, Subrahmanian VS, Picariello A, Sapino ML (2003) The CPR model for summarizing video. In: MMDB '03: proceedings of the 1st ACM international workshop on multimedia databases. ACM, New York, pp 2–9

15. Gunsel B, Tekalp A (1998) Content-based video abstraction. In: ICIP '98: proceedings of the 1998 international conference on image processing, vol 3, pp 128–132

16. Hanjalic A, Zhang H (1999) An integrated scheme for automated video abstraction based on unsupervised cluster-validity analysis. IEEE Trans Circuits Syst Video Technol 9(8):1280–1289

17. Hauptmann AG, Witbrock MJ (1998) Story segmentation and detection of commercials in broadcast news video. In: ADL '98: proceedings of the IEEE advances in digital libraries conference, pp 168–179


18. Hauptmann AG, Christel MG, Lin W-H, Maher B, Yang J, Baron RV, Xiang G (2007) Clever clustering vs. simple speed-up for summarizing rushes. In: TVS '07: proceedings of the international workshop on TRECVID video summarization. ACM, New York, pp 20–24

19. Hua X-S, Lu L, Zhang H-J (2004) Optimization-based automated home video editing system. IEEE Trans Circuits Syst Video Technol 14(5):572–583

20. Huang Q, Liu Z, Rosenberg A, Gibbon D, Shahraray B (1999) Automated generation of news content hierarchy by integrating audio, video, and text information. In: ICASSP '99: proceedings of the 1999 international conference on acoustics, speech, and signal processing

21. Ju S, Black M, Minneman S, Kimber D (1998) Summarization of videotaped presentations: automatic analysis of motion and gesture. IEEE Trans Circuits Syst Video Technol 8(5):686–696

22. Kasutani E, Yamada A (2001) The MPEG-7 color layout descriptor: a compact image feature description for high-speed image/video segment retrieval. In: ICIP '01: proceedings of the 2001 international conference on image processing, vol 1, pp 674–677

23. Kim J, Chang H, Kang K, Kim M, Kim J, Kim H (2003) Summarization of news video and its description for content-based access. Int J Imaging Syst Technol 13(5):267–274

24. Latecki L, DeMenthon D, Rosenfeld A (2001) Extraction of key frames from videos by polygon simplification. In: Sixth international symposium on signal processing and its applications, vol 2, pp 643–646

25. Li B, Sezan MI (2001) Event detection and summarization in American football broadcast video. In: Proceedings of the conference on storage and retrieval for media databases, vol 4676, no 1, pp 202–213

26. Li Y, Zhang T, Tretter D (2001) An overview of video abstraction techniques. HP Laboratories, Palo Alto

27. Li Z, Schuster G, Katsaggelos A, Gandhi B (2005) Rate-distortion optimal video summary generation. IEEE Trans Image Process 14(10):1550–1560

28. Li Y, Lee S-H, Yeh C-H, Kuo C-C (2006) Techniques for movie content analysis and skimming: tutorial and overview on video abstraction techniques. IEEE Signal Process Mag 23(2):79–89

29. Lie W-N, Lai C-M (2004) News video summarization based on spatial and motion feature analysis. In: Aizawa K, Nakamura Y, Satoh S (eds) Advances in multimedia information processing - PCM 2004, ser. Lecture notes in computer science, vol 3332. Springer, Berlin, pp 246–255

30. Lienhart R, Maydt J (2002) An extended set of Haar-like features for rapid object detection. In: ICIP '02: proceedings of the 2002 international conference on image processing, vol 1, pp 900–903

31. Liu T, Kender JR (2002) Optimization algorithms for the selection of key frame sequences of variable length. In: ECCV '02: proceedings of the 7th European conference on computer vision, part IV. Springer, London, pp 403–417

32. Liu Z, Wang Y (2001) Major cast detection in video using both audio and visual information. In: ICASSP '01: proceedings of the acoustics, speech, and signal processing on IEEE international conference. IEEE Computer Society, Washington, pp 1413–1416

33. Liu T, Zhang H-J, Qi F (2003) A novel video key-frame-extraction algorithm based on perceived motion energy model. IEEE Trans Circuits Syst Video Technol 13(10):1006–1013

34. Meyer D, Leisch F, Hornik K (2003) The support vector machine under test. Neurocomputing 55(1–2):169–186

35. Mills M, Cohen J, Wong YY (1992) A magnifier tool for video data. In: CHI '92: proceedings of the SIGCHI conference on human factors in computing systems. ACM, New York, pp 93–98

36. Money AG, Agius H (2008) Video summarisation: a conceptual framework and survey of the state of the art. J Vis Commun Image Represent 19(2):121–143

37. Nam J, Tewfik A (1999) Video abstract of video. In: MMSP '99: proceedings of the IEEE 3rd international workshop on multimedia signal processing, pp 117–122

38. Naphade MR, Smith JR (2002) On the detection of semantic concepts at TRECVID. In: MULTIMEDIA '04: proceedings of the 12th annual ACM international conference on multimedia. ACM, New York, pp 660–667

39. Chaisorn L, Chua TS, Koh CK, Zhao Y, Xu H, Feng H, Tian Q (2003) A two-level multi-modal approach for story segmentation of large news video corpus. In: Proc. of TRECVID conference

40. Oh J, Wen Q, Hwang S, Lee J (2004) Video abstraction. In: Video data management and information retrieval, pp 321–346


41. O'hare N, Smeaton A, Czirjek C, O'Connor N, Murphy N (2004) A generic news story segmentation system and its evaluation. In: ICASSP '04: proceedings of the 2004 international conference on acoustics, speech, and signal processing, vol 3, pp iii-1028–iii-1031

42. Over P, Smeaton AF, Awad G (2008) The TRECVID 2008 BBC rushes summarization evaluation. In: TVS '08: proceedings of the 2nd ACM TRECVid video summarization workshop. ACM, New York, pp 1–20

43. Peker KA, Divakaran A (2004) Adaptive fast playback-based video skimming using a compressed-domain visual complexity measure. In: ICME. IEEE, Piscataway, pp 2055–2058

44. Peker KA, Otsuka I, Divakaran A (2006) Broadcast video program summarization using face tracks. In: ICME. IEEE, Piscataway, pp 1053–1056

45. Peker KA, Divakaran A, Lanning T (2005) Browsing news and talk video on a consumer electronics platform using face detection. In: Vetro A, Chen CW, Kuo C-CJ, Zhang T, Tian Q, Smith JR (eds) Multimedia systems and applications VIII, vol 6015, no 1. SPIE, Boston, p 601519. Available: http://link.aip.org/link/?PSI/6015/601519/1 (Online)

46. Shipman S, Divakaran A, Flynn M (2007) Highlight scene detection and video summarization for PVR-enabled high-definition television systems. In: ICCE '07: proceedings of the international conference on consumer electronics, pp 1–2

47. Smeaton A, Over P, Kraaij W (2009) High-level feature detection from video in TRECVID: a 5-year retrospective of achievements. In: Multimedia content analysis, pp 1–24

48. Sundaram H, Xie L, Chang S-F (2002) A utility framework for the automatic generation of audio-visual skims. In: MULTIMEDIA '02: proceedings of the tenth ACM international conference on multimedia. ACM, New York, pp 189–198

49. Taniguchi Y, Akutsu A, Tonomura Y, Hamada H (1995) An intuitive and efficient access interface to real-time incoming video based on automatic indexing. In: MULTIMEDIA '95: proceedings of the third ACM international conference on multimedia. ACM, New York, pp 25–33

50. Toklu C, Liou S-P (1999) Automatic key-frame selection for content-based video indexing and access. In: Proceedings of the conference on storage and retrieval for media databases, vol 3972, no 1. SPIE, Bellingham, pp 554–563

51. Truong BT, Venkatesh S (2007) Video abstraction: a systematic review and classification. ACM Trans Multimedia Comput Commun Appl 3(1):1–37

52. Valdés V, Martínez JM (2008) Binary tree based on-line video summarization. In: TVS '08: proceedings of the 2nd ACM TRECVid video summarization workshop. ACM, New York, pp 134–138

53. Valdés V, Martínez JM (2008) On-line video summarization based on signature-based junk and redundancy filtering. In: WIAMIS '08: proceedings of the 2008 ninth international workshop on image analysis for multimedia interactive services. IEEE Computer Society, Washington, pp 88–91

54. Valdés V, Martínez J (2010) A framework for video abstraction systems analysis and modelling from an operational point of view. Multimed Tools Appl 49(1):7–35

55. Valdés V, Martínez JM (2007) On-line video skimming based on histogram similarity. In: TVS '07: proceedings of the international workshop on TRECVID video summarization. ACM, New York, pp 94–98

56. Valdés V, Martínez JM (2007) Post-processing techniques for on-line adaptive video summarization based on relevance curves. In: Falcidieno B, Spagnuolo M, Avrithis YS, Kompatsiaris I, Buitelaar P (eds) SAMT, ser. Lecture notes in computer science, vol 4816. Springer, Berlin, pp 144–157

57. Viola P, Jones M (2001) Rapid object detection using a boosted cascade of simple features. In: CVPR '01: proceedings of the IEEE Computer Society conference on computer vision and pattern recognition, vol 1. IEEE Computer Society, Los Alamitos, p 511

58. Wang M, Zhang H (2009) Video content structuring. Scholarpedia 4(8):9431

59. Wang M, Hua X-S, Hong R, Tang J, Qi G-J, Song Y (2009) Unified video annotation via multigraph learning. IEEE Trans Circuits Syst Video Technol 19(5):733–746

60. Wildemuth B, Marchionini G, Yang M, Geisler G, Wilkens T, Hughes A, Gruss R (2003) How fast is too fast? Evaluating fast forward surrogates for digital video. In: JCDL '03: proceedings of the joint conference on digital libraries, pp 221–230

61. Wilson KW, Divakaran A (2009) Discriminative genre-independent audio-visual scene change detection. In: Schettini R, Jain RC, Santini S (eds) Multimedia content access: algorithms and systems III, vol 7255, no 1. SPIE, San Jose, p 725502. Available: http://link.aip.org/link/?PSI/7255/725502/1 (Online)


62. Xiong Z, Radhakrishnan R, Divakaran A (2003) Generation of sports highlights using motion activity in combination with a common audio feature extraction framework. In: ICIP '03: proceedings of the 2003 international conference on image processing, vol 1, pp I-5–I-8

63. Yeh C, Chang M, Lu K, Shih M (2006) Robust TV news story identification via visual characteristics of anchorperson scenes. In: PSIVT '06: proceedings of the Pacific-rim symposium on image and video technology, pp 621–630

64. Zhai Y, Yilmaz A, Shah M (2005) Story segmentation in news videos using visual and text cues. In: CIVR '05: proceedings of the international conference on image and video retrieval, pp 92–102

65. Zhang H-J, Gong Y, Smoliar S, Tan SY (1994) Automatic parsing of news video. In: ICMCS '94: proceedings of the international conference on multimedia computing and systems, pp 45–54

66. Zhang H, Wu J, Zhong D, Smoliar SW (1997) An integrated system for content-based video retrieval and browsing. Pattern Recogn 30(4):643–658

67. Zhuang Y, Rui Y, Huang T, Mehrotra S (1998) Adaptive key frame extraction using unsupervised clustering. In: ICIP '98: proceedings of the 1998 international conference on image processing, vol 1, pp 866–870

Víctor Valdés received the Ingeniero en Informática degree from the Universidad Autónoma de Madrid in 2004 and the PhD in Computer Science and Communications in 2010. Since 2004 he has been working as a researcher in the Video Processing and Understanding Laboratory (VPULab) at the Escuela Politécnica Superior of the Universidad Autónoma de Madrid, where he is a teaching assistant in the Telecommunications degree, having taught for 3 years (2 years as coordinator) the laboratory units of the “Source and channel coding” course and for 2 years the “Digital Television” course. His research interests include multimedia systems with a focus on multimedia content adaptation. He has co-authored several international papers and has participated in two EU-funded projects: acemedia (IST FP6-001765), where he acted as technical coordinator of the GTI work during the last 2 years, and Mesh (IST FP6-027685).


José M. Martínez received the Ingeniero de Telecomunicación degree (six-year engineering program) with honors in 1991 and the Doctor Ingeniero de Telecomunicación degree (PhD in Communications) with “Summa Cum Laude” in 1998, both from the E.T.S. Ingenieros de Telecomunicación of the Universidad Politécnica de Madrid. From 1998 he was an associate professor at the Department of Signals, Systems, and Communications of the Universidad Politécnica de Madrid, and since 2002 he has been an associate professor at the Escuela Politécnica Superior of the Universidad Autónoma de Madrid. Since 1998 he has been involved in the development of the MPEG-7 standard, acting as contributor, chair and co-chair of different AHGs and editor of several documents, among them the “Multimedia Description Schemes Committee Draft” (part 5 of the ISO/IEC 15938 standard) and the “Overview of MPEG-7”. He is also following and contributing to MPEG-21. He is author and co-author of more than 60 papers in conferences and magazines and co-author of the first book about the MPEG-7 standard, published in 2002, and is an active participant in IST projects as well as national projects. He is an auditor and reviewer for the EC for projects of the framework programs for research in Information Society and Technology (IST). He has acted as a reviewer for journals and conferences, is a Program Committee member of several conferences (e.g., ICIP, AMR, CBMI, SAMT, PCS) and has been Technical Co-chair of the International Workshop VLBV'03, co-organizer of the 70th MPEG meeting at Palma de Mallorca (October 2004) and Special Session Chair of the International Conference SAMT 2006.