A Fast Method for Textual Annotation of Compressed Video

Amit Jain and Subhasis Chaudhuri
Department of Electrical Engineering

Indian Institute of Technology, Bombay, Mumbai - 400076, INDIA.

{ajain, sc}@ee.iitb.ac.in

Abstract

We present a method for automatic textual annotation of compressed video. Initially, cuts are detected in the video sequence where consecutive frames show a large difference. Key frames are then extracted to represent each shot. A video sequence is thus condensed into a few images; this forms a compact key frame representation of the video. Each key frame is compared with an annotated database of images, using standard CBIR techniques, to obtain a textual description of the scene. The entire process is designed to work on MPEG-DC sequences, where only a partial decoding is required for performing content-based operations on MPEG compressed video streams.

1. Introduction

In recent years, technology has reached a level where a vast amount of digital information is available at a low price. The ease and low cost of obtaining and storing digital information, as well as the almost unlimited possibility to manipulate it, make it a popular means of storing data. One of the key features required in a visual information system is efficient indexing to enable meaningful and fast classification of objects in a video database.

It is generally accepted that content analysis of video sequences requires a preprocessing procedure that first breaks up the sequences into temporally homogeneous segments called shots [1], [5], [7], [9], [12]. A shot is a sequence of frames generated during a continuous operation and therefore represents continuous action in time or space. Shots represent temporal segments of a video with smoothly changing contents. These segments are condensed into one or a few representative frames (key frames) [2], [13] to yield a pictorial summary of the video. Semantic video indexing based on object segmentation is discussed in [8], [14]. Object and camera motion can also be used for analyzing and annotating video [3], [11]. Video annotation is often facilitated by prior knowledge of some general structure for the class of video under study. A basketball annotation system is discussed in [11].

Since most of the video streams that modern digital storage systems have to deal with are available in the MPEG compressed format, no content-related operations are possible on these streams directly. The analysis of compressed video can proceed in one of two fundamental ways. The first is by decompressing the video bitstream and using the individual frames to gather information about various characteristics of the video, such as content or motion, and extracting indexable features in the pixel domain. Operations on fully decompressed video do not permit rapid processing because of the data size. The second method involves exploiting encoded information contained in the compressed representation. This method operates directly on a small fraction of the compressed data. Such greatly reduced data is readily extracted from compressed video without full frame decompression and still captures the essential information of the video. The information available in the compressed representation of video includes the type of each macroblock (MB), the Discrete Cosine Transform (DCT) coefficients of each MB, and the motion vector components for the forward, backward, and bidirectionally predicted MBs.

In this paper, we present a computationally efficient method for automatic annotation of video sequences. The approach makes use of the compressed-domain video data and can be split into two parts: segmentation and indexing. Temporal segmentation divides the video into shots or scenes. The segmentation is done by partially decoding the compressed video bitstream to obtain DC sequences. By comparing the subsampled frames, a scene change is detected where successive frames show a large difference. Indexing is done by choosing key frames to represent each scene. This gives a compact pictorial representation of the contents of the video. By comparing the key frames to a database of annotated still images, using standard CBIR techniques, a textual annotation of the scene is obtained. Hence, a textual representation of the complete video is now available.

The organization of the paper is as follows. In Section 2, we briefly discuss the generation of the DC image and DC sequence. Section 3 addresses the problem of temporal segmentation of the video. Section 4 deals with key frame selection and generation of a textual index of the video. The results are presented in Section 5, followed by conclusions in Section 6.

Figure 1: Full image at 352 x 240 and its DC image at 44 x 30.

2. DC Image and DC Sequences

DC images are spatially reduced (one pixel per macroblock) versions of the original images. The DC image is nothing but a top level representation of an image in its multi-resolution pyramid. Such spatially reduced images, once extracted, are very useful for efficient scene change detection and other applications [12]. In this section we briefly discuss how the DC image and the DC sequence can be efficiently extracted from compressed videos [10].

We focus on MPEG-2 video streams. We restrict ourselves to using data which can be easily extracted from MPEG bit streams without full frame decompression. Specifically, we use the frame number, the frame encoding type (I, P, or B), and the DC coefficient of each DCT-encoded pixel block.

The (i, j)-th pixel of the DC image is the average value of the (i, j)-th macroblock of the original image. While the DC image is much smaller than the original image, it still retains a significant amount of information. The subsampled video sequence formed in such a manner is called the DC sequence. This entails significant savings in computational and storage costs, resulting in a faster annotation. Fig. 1 illustrates an original image of size 352 x 240 and its DC image of size 44 x 30 using N = 8.
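The block-averaging construction of the DC image can be sketched as follows (a minimal illustration; the function name and the use of NumPy are ours, not from the paper):

```python
import numpy as np

def dc_image(frame, block=8):
    """Spatially reduce a grayscale frame by averaging each
    block x block region, as in the DC-image construction."""
    h, w = frame.shape
    # Crop to a multiple of the block size for simplicity.
    h, w = h - h % block, w - w % block
    f = frame[:h, :w].astype(np.float64)
    # Average each block: reshape into (rows, block, cols, block) and mean.
    return f.reshape(h // block, block, w // block, block).mean(axis=(1, 3))
```

Applied to a 240 x 352 frame, this yields the 30 x 44 DC image of Fig. 1.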

The extraction of the DC image from an I-frame in MPEG is trivial since the DCT coefficients are readily accessible for I-frames. For the I-frames of MPEG, the original image is grouped into 8 x 8 blocks, and the DC term $c(0,0)$ of the 2-D DCT of a block is related to its pixel values $f(i,j)$ via

$c(0,0) = \frac{1}{8} \sum_{i=0}^{7} \sum_{j=0}^{7} f(i,j),$

Figure 2: Illustration of how the DC coefficient for a macroblock is computed in a P-frame.

which is 8 times the average intensity of the block.

Since P- and B-frames are represented by the residual error after prediction or interpolation, their DCT coefficients need to be estimated. To calculate the DCT coefficients of a macroblock (MB) in a P-frame or B-frame, the DCT coefficients of the 8 x 8 area of the reference frame that the current MB was predicted from need to be calculated. Suppose this area is called the reference MB (though it is not an actual MB). Since the DCT is a linear transform, the DCT coefficients of the reference MB in the reference frame can be calculated from the DCT coefficients of the four MBs that can overlap this reference MB, albeit with substantial computational expense. It is easy, however, to calculate reasonable approximations to the DC coefficients of an MB of a P- or B-frame. Fig. 2 shows an MB in a P-frame, $MB_{curr}$, being predicted from an 8 x 8 area denoted by $MB_{ref}$. While encoding the P-frame, only the residual error of $MB_{curr}$ with respect to $MB_{ref}$ is stored. The DC coefficients of $MB_{ref}$ can be calculated from the DCT coefficients of four MBs: $MB_1$, $MB_2$, $MB_3$, and $MB_4$. To avoid expensive computation, the DC coefficient alone is approximated by a weighted sum of the DC coefficients of the four MBs, with the weights being the fractions of the areas of these MBs that overlap the reference MB, i.e.,

$DC(MB_{ref}) = \sum_{i=1}^{4} w_i \, DC(MB_i)$

where $w_i$ is given by the ratio of the area of the overlap region of $MB_i$ to its total area. Similarly, if an MB in a B-frame is interpolated from two reference MBs, its DC coefficient is approximated by an average of the estimated DC coefficients of each of these two MBs.
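The overlap-weighted approximation can be sketched directly from the geometry of Fig. 2 (a hedged illustration; the function and argument names are ours):

```python
def approx_ref_dc(dc, dy, dx, block=8):
    """Approximate the DC coefficient of the 8x8 reference area that is
    offset by (dy, dx) pixels (0 <= dy, dx < block) from the top-left
    macroblock of the 2x2 grid `dc` of underlying DC coefficients.
    Each block is weighted by the fraction of the reference area it covers."""
    h = [block - dy, dy]   # rows of the reference area covered by top/bottom blocks
    w = [block - dx, dx]   # cols of the reference area covered by left/right blocks
    total = block * block
    return sum(dc[r][c] * (h[r] * w[c]) / total
               for r in range(2) for c in range(2))
```

When the motion vector is block-aligned (dy = dx = 0), the weights collapse to a single block and the approximation is exact.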

The partial decoding of an MPEG video bitstream, using only the DC term of the DCT, gives a spatially decimated version of the video. This is called the DC sequence. The frames in a DC sequence are spatially reduced by a factor of 8 in both dimensions. In our proposed scheme, we form the DC sequence of the video by operating only on the luminance channel, which leads to significant savings in computational and storage costs.


Figure 3: Interframe difference $D_n^q$ for a DC sequence of the movie for (a) abrupt scene change (q = 1), (b) gradual scene change (q = 10).

3. Temporal Segmentation

To analyze the content of a given video sequence, we need to first segment the sequence into individual shots. The DC sequences extracted from compressed video are used to perform temporal segmentation of the video. As discussed in [12], two separate detection algorithms are used to detect abrupt and gradual scene changes, respectively. We further improve the shot detection method to handle transmission disturbances. The method is based on pixel-level differences between DC images. Let $X = \{x_{i,j}\}$ and $Y = \{y_{i,j}\}$ be two DC images and let $d(X, Y)$ denote their difference. The difference metric is set as the pixel-level difference, i.e.

$d(X, Y) = \sum_{i,j} |x_{i,j} - y_{i,j}|.$

3.1 Abrupt Scene Changes

To detect abrupt scene changes, peaks are detected in the plot of the pixel-level differences of successive frames in a DC sequence. Since a scene change is a local activity in the temporal domain, a threshold to detect the peak is set to match the local activity. A sliding window is used to examine $m$ successive frame differences. Let $F_n$, $n = 1, 2, \ldots, N$ be a sequence of DC images. A difference sequence $D_n = d(F_n, F_{n+1})$ is formed. A scene change is declared from $F_l$ to $F_{l+1}$ if

i. $D_l = \max\{D_{l-m}, \ldots, D_{l+m}\}$, and

ii. $D_l \ge \alpha \, D_l^{*}$,

where $D_l^{*}$ is the next largest value within the same window $[l-m, l+m]$.

Condition (i) detects the actual peaks, and condition (ii) guards against fast panning or zooming scenes. The window size $m$ is set to be smaller than the minimum duration between two consecutive scene changes. It has been found that values of $\alpha$ ranging from 3.0 to 4.0 give good results. Fig. 3(a) shows the plot of the interframe pixel-level difference applied to the DC sequence of an MPEG bitstream.
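The two-part peak test can be sketched as follows (a minimal reading of conditions (i) and (ii); the function name and default parameter values are ours):

```python
def abrupt_cuts(D, m=10, alpha=3.0):
    """Declare a cut at l if D[l] is the maximum over the window
    [l-m, l+m] (condition i) and is at least alpha times the next
    largest value in that window (condition ii), which guards against
    fast panning or zooming."""
    cuts = []
    for l in range(m, len(D) - m):
        window = D[l - m:l + m + 1]
        if D[l] < max(window):
            continue                     # condition (i) fails
        rest = sorted(window)[-2]        # next largest value in the window
        if D[l] >= alpha * rest:         # condition (ii)
            cuts.append(l)
    return cuts
```

An isolated spike in an otherwise flat difference sequence is reported; a ramp of comparable values is not.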

3.2 Gradual Scene Changes

A robust way of detecting gradual transitions is by using not the single frame difference $D_n$, defined earlier, but the $q$-th frame difference $D_n^q$ in the DC sequence, where $D_n^q$ is defined as

$D_n^q = d(F_n, F_{n+q}).$

By selecting $q$ greater than the duration of the transitions, we get "plateaus" in the plot of $D_n^q$. One now needs to detect such plateaus in the plot of $D_n^q$. We use the following criteria to detect a plateau. We declare the start of a plateau at frame $n = l$ (i.e., the start of a gradual scene change) if

i. $|D_l^q - D_{l+j}^q| \le T_1$ for $j = 1, 2, \ldots, q$, and

ii. $|D_l^q - D_{l-1}^q| \ge T_2$,

where $T_1$ is a threshold that allows small variations in the height of the plateau, $T_2$ is a large threshold (usually $T_2 \gg T_1$) that demands that the plateau rise quite steeply when the gradual transition starts, and the parameter $q$ (as explained earlier) defines the minimum dissolve or fade-out duration. Fig. 3(b) shows the plot of $D_n^q$ for q = 10. The arrow in Fig. 3(b) points to a "plateau".
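The plateau criteria translate directly into code (a sketch under our reading of conditions (i) and (ii); names and default thresholds are illustrative):

```python
def plateau_starts(Dq, q=10, T1=2.0, T2=20.0):
    """Declare the start of a plateau (gradual transition) at l if the
    q-th frame difference stays within T1 of Dq[l] over the next q
    frames (condition i) and rises by at least T2 from the previous
    frame (condition ii)."""
    starts = []
    for l in range(1, len(Dq) - q):
        flat = all(abs(Dq[l] - Dq[l + j]) <= T1 for j in range(1, q + 1))
        steep = abs(Dq[l] - Dq[l - 1]) >= T2
        if flat and steep:
            starts.append(l)
    return starts
```

Only the steep leading edge of a sufficiently long flat region is reported, so a slow drift in $D_n^q$ does not trigger a detection.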

3.3 Scene Changes in the Presence of Transmission Disturbances

Large differences between the DC images may be encountered because of transmission disturbances. This may happen due to bursty network congestion, when a number of packets may get dropped during the streaming process. We use the observation that such disturbances manifest themselves as two sharp peaks quite close to each other, interspersed with fairly large values in the difference plot of the DC images. The two peaks correspond to the start of the burst error and the resumption of quality transmission, respectively. Ideally these should not be considered as scene changes. We solve this problem by discarding the detections whenever there are two peaks in the difference plots close to each other. It should be noted that during normal transmission there will be occasional packet drops, but the increase in the difference metric is quite marginal. The conditions given in Section 3.1 are able to handle such a situation. A scheme as simple as the one described above is found to provide enough resilience to transmission disturbances.
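The burst-suppression rule amounts to dropping pairs of peaks that fall too close together (a sketch; the function name and the minimum-gap value are our assumptions, not the paper's):

```python
def drop_burst_pairs(cuts, min_gap=15):
    """Discard pairs of detected peaks closer than min_gap frames apart,
    treating them as the start and end of a transmission burst error
    rather than two genuine scene changes."""
    kept, i = [], 0
    while i < len(cuts):
        if i + 1 < len(cuts) and cuts[i + 1] - cuts[i] < min_gap:
            i += 2                       # skip both peaks of the burst
        else:
            kept.append(cuts[i])
            i += 1
    return kept
```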

3.4 Overall Scene Change Detection

To detect scene changes in a video, we combine both the algorithms discussed earlier to detect both abrupt and gradual scene changes. We keep in mind that after any scene change, a gradual scene change cannot take place before a particular number of frames $n_0$ (where $n_0 \ge m$, the window size). Hence if a scene change is declared at a particular frame, then we look for a gradual scene change only $n_0$ frames after the abrupt scene change. However, there could be another abrupt scene change, and hence the scheme restricts the computation to searching only for an abrupt scene change immediately after a detected shot. We also include the modification suggested in Section 3.3 to robustly detect the shots when transmission disturbances are expected.

Figure 4: Pictorial summarization of the movie through key frame selection in each shot (Shots 1-7).

Each shot, thus obtained, forms the basis for further analysis, that is, selection of key frames.

4. Indexing

The approach to indexing a video involves extracting a set of key frames for the entire video clip, to capture as much of the content of the video as possible while excluding redundant frames. To create a pictorial summary of the video, representative images have to be selected from each shot. The representative frame for each shot can be selected from amongst the frames in the middle, to avoid special effects such as fade-in and fade-out, which are more often found at the beginning or end of a shot. This gives a crude pictorial summary of the video, since the use of only one frame per shot is insufficient to capture the time-varying essential features of video sequences. Therefore, there could be multiple key frames in the same shot.

We use the DC histogram comparison technique [4] for the nonuniform sampling of all frames that constitute a shot. The proposed algorithm for key frame selection is based on DC coefficients that are calculated only for the Y (luminance) component. This is because (1) the human visual system is more sensitive to the Y component than to the other chrominance components, and (2) the MPEG standards typically use a higher spatial sampling for Y than for the other two components. The histograms of DC coefficients of the Y component of all the I-frames constituting a shot are compared. Further, to reduce the computation, we use only the DC image (the thumbnail image) for all analyses. The similarity metric used to compare the histograms of the $u$-th and $v$-th DC images is the $\chi^2$ metric, defined as:

$\chi^2(u, v) = \sum_{k=1}^{K} \frac{(H_u(k) - H_v(k))^2}{H_u(k)}$

where $H_u(k)$ denotes the $k$-th histogram bin value of the $u$-th DC image. We declare the $u$-th frame to be a key frame if it is sufficiently different from the previous key frame, i.e. if

$\chi^2(u, KF) \ge T$   (1)

and

$\chi^2(u, KF) = \max_v \, \chi^2(v, KF)$   (2)

Here $T$ is a threshold that controls the density of temporal sampling and KF denotes the last key frame detected in the sequence. The second condition ensures that the chosen key frame is maximally apart from the previous KF compared to the rest of the candidate frames.
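A greedy sketch of conditions (1) and (2) over per-frame DC histograms follows (the helper names, the exact $\chi^2$ normalization, and the commit rule are our assumptions, not the authors' implementation):

```python
import numpy as np

def chi2(hu, hv, eps=1e-9):
    """Chi-squared distance between two DC histograms (eps avoids
    division by empty bins)."""
    hu, hv = np.asarray(hu, float), np.asarray(hv, float)
    return float(np.sum((hu - hv) ** 2 / (hu + eps)))

def select_key_frames(hists, T=0.5):
    """The first frame is always a key frame. Among later frames whose
    distance to the last key frame exceeds T (condition 1), the frame at
    which that distance peaks (condition 2) becomes the next key frame."""
    keys, last = [0], 0
    best, best_d = None, 0.0
    for i in range(1, len(hists)):
        d = chi2(hists[i], hists[last])
        if d >= T and d > best_d:
            best, best_d = i, d          # new running-maximum candidate
        elif best is not None and d < best_d:
            keys.append(best)            # distance dropped: commit the peak
            last, best, best_d = best, None, 0.0
    if best is not None:                 # commit a pending candidate at the end
        keys.append(best)
    return keys
```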

The described process is applied to I-frames only, since two consecutive I-frames have a larger interframe difference than two consecutive P- or B-frames. Hence I-frames are the best candidates to capture the temporal variations in a shot, and this requires the minimal decompression. The first I-frame in a video clip (shot) is always declared as the key frame. Then the other frames are compared to this frame. We declare the $u$-th frame to be the new key frame if it satisfies conditions (1) and (2). Criterion (1) ensures that the current frame is significantly different from the previously declared key frame. Criterion (2) is imposed to ensure that the frame which is most different from the current key frame is selected as the new key frame. The subsequent frames are then compared to this new key frame.

Table 1: Textual Summary of the Movie Segment "Sound of Music".

Shot No. | Frames    | Key Frames | Key Frame Annotation | Shot Annotation
1        | 0500-0725 | 0500       | Snow covered Slope   | Mountain Peak
         |           | 0560       | Mountain Peak        |
         |           | 0694       | Mountain Peak        |
2        | 0726-0949 | 0726       | Cliff                | Cliff
         |           | 0787       | Cliff                |
         |           | 0834       | Cliff                |
3        | 0950-1501 | 0950       | Snow covered Slope   | Cliff
         |           | 1185       | Cliff                |
         |           | 1247       | Cliff                |
         |           | 1309       | Cliff                |
         |           | 1394       | Cliff                |
         |           | 1462       | Cliff                |
4        | 1502-1748 | 1502       | River                | River
5        | 1749-1901 | 1749       | Gorge                | Gorge
6        | 1902-2110 | 1902       | Forest               | Forest
         |           | 2098       | Forest               |
7        | 2111-2286 | 2111       | Gorge                | Gorge
         |           | 2217       | Gorge                |

It is appropriate to set the threshold locally, since the number of key frames required to represent a scene depends on the local activity in the scene. The proposed method sets the threshold to be $\beta$ times the standard deviation of the DC image of the preceding key frame. It has been experimentally found that values of $\beta$ ranging from 2.5 to 4.5 give good results.

To obtain the textual annotation of the video, each key frame from every shot is compared with a database of textually annotated still images, using any standard content-based image retrieval (CBIR) technique. Any CBIR technique can be used, in principle, and we use the CBIR method proposed in [6]. We form the image database by key framing a large number of video sequences and manually annotating these key frames. The best match for a given key frame from a test video sequence is then obtained. The textual annotation of the best-match image from the database highlights the content of a key frame and gives a textual description of the scene. Thus a sparse but coherent and temporally ordered textual description of the video is generated.

5. Results

To test the proposed algorithm on realistic, compressed video, we have applied it to a wide variety of MPEG video test sequences. All video clips are digitized at 25 frames/sec and at a resolution of 352 x 288 pixels. All the images presented in this section are converted to black and white for the purpose of presenting results in the paper. For the performance analysis of the proposed scheme, the ground truth was obtained through human intervention. The database used for comparing the key frames is obtained by manually annotating each image present in the database.

The result for the entire scheme of annotating a video is presented for a clip from the movie "Sound of Music", which is of 5 minutes 11 seconds duration. It mostly shows the panning of a camera around natural surroundings before the lead actress enters the scene. We neglect frames 0-499 because they contain captions. Fig. 3 shows the plot of the frame difference for the DC sequence obtained from the movie clip, and Fig. 4 depicts the pictorial summary of the movie up to the 7th shot (truncated for brevity of presentation). The textual summary of the video sequence is presented in Table 1. For scenes which generate multiple key frames, we define the shot annotation to be the most frequently occurring key frame annotation. The first shot is represented by three key frames. Out of these three key frames, the first key frame is annotated as "Snow covered Slope" whereas the remaining two key frames are annotated as "Mountain Peak". The first key frame is actually an image of a cloudy sky, but the CBIR technique annotates it as "Snow covered Slope". We select the shot annotation as the most frequently occurring key frame annotation, i.e. "Mountain Peak". The second shot is represented by three key frames, and since each key frame is annotated as "Cliff", the scene annotation is also "Cliff". The third shot has six key frames, out of which the first is annotated as "Snow covered Slope" and the rest are annotated as "Cliff". Hence the shot annotation is "Cliff". The next shot is annotated as "River" since the key frame which represents the shot is annotated as "River". The fifth shot is depicted by a single key frame which is annotated as "Gorge", and the shot annotation is the same. The sixth shot is annotated as "Forest", since both the key frames are annotated as "Forest". The last scene has only one key frame, which is annotated as "Gorge".

The efficacy of the proposed algorithm depends on the scheme used for temporal segmentation. The scheme implemented for temporal segmentation gives a detection rate of 95% and a precision rate of 92%. The ability of the key frames to highlight the temporal variations in a shot is, however, a subjective issue. The effectiveness of the key frame selection algorithm is also tested using the ground truth data obtained manually. It is observed that for the movie clip, 18 frames are declared as key frame candidates, while there are 2 additional frames which could have been declared as possible key frame candidates. Hence the accuracy in key frame selection is 90%. Needless to say, the quality of textual annotation depends on the effectiveness of the CBIR algorithm and on how exhaustive the annotated database is.

6. Conclusions

We present in this paper the concept of video visualization in the form of pictorial and textual summary generation. The pictorial summary created by key frame selection compactly represents the original video with considerable fidelity. The temporal progress of the story is preserved by presenting the key frames in time order. A technique has been proposed to create a textual summary of the video. The textual and pictorial summaries contribute to semantic visualization and a succinct presentation of the pictorial content of video.

All key frame based approaches represent a video by a small set of images. Thus, they lose the motion information of the original video. The proposed algorithm also suffers from the same defect. With the additional use of motion information from the video, better indexing and retrieval techniques can be developed, which is our future research problem.

References

[1] P. Bouthemy, M. Gelgon, and F. Ganasia. A Unified Approach to Shot Change Detection and Camera Motion Characterization. IEEE Transactions on Circuits and Systems for Video Technology, 9(7):1030-1044, October 1999.

[2] H. S. Chang and S. U. Lee. Efficient Video Indexing Scheme for Content Based Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 9(8):1269-1279, December 1999.

[3] S. Dagtas, W. Al-Khatib, A. Ghafoor, and R. L. Kashyap. Models for Motion-Based Video Indexing and Retrieval. IEEE Transactions on Image Processing, 9(1):88-101, January 2000.

[4] B. Furht and P. Saksobhavivat. A Fast Content-Based Multimedia Retrieval Technique Using Compressed Data. www.cs.fau.edu/...

[5] U. Gargi, R. Kasturi, and S. H. Strayer. Performance Characterization of Video-Shot-Change Detection Methods. IEEE Transactions on Circuits and Systems for Video Technology, 10(1):1-13, February 2000.

[6] N. Jhawar, S. Chaudhuri, G. Seetharaman, and B. Zavidovique. Content Based Image Retrieval Using Optimum Peano Scan. Proc. Intl. Conf. on Pattern Recognition (ICPR), Quebec City, Canada, August 2002.

[7] S.-W. Lee, Y.-M. Kim, and S. W. Choi. Fast Scene Change Detection Using Direct Feature Extraction from MPEG Compressed Videos. IEEE Transactions on Multimedia, 2(4):240-254, December 2000.

[8] M. R. Naphade and T. S. Huang. A Probabilistic Framework for Semantic Video Indexing, Filtering and Retrieval. IEEE Transactions on Multimedia, 3(1):141-151, March 2001.

[9] S.-C. Pei and Y.-Z. Chou. Efficient MPEG Compressed Video Analysis Using Macroblock Type Information. IEEE Transactions on Multimedia, 1(4):321-333, December 1999.

[10] J. Song and B.-L. Yeo. Fast Extraction of Spatially Reduced Image Sequences from MPEG-2 Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, 9(7):1100-1114, October 1999.

[11] Y.-P. Tan, D. D. Saur, S. R. Kulkarni, and P. J. Ramadge. Rapid Estimation of Camera Motion from Compressed Video with Application to Video Annotation. IEEE Transactions on Circuits and Systems for Video Technology, 10(1):133-146, February 2000.

[12] B.-L. Yeo and B. Liu. Rapid Scene Change Analysis on Compressed Video. IEEE Transactions on Circuits and Systems for Video Technology, 5(6):533-544, December 1995.

[13] M. M. Yeung and B.-L. Yeo. Video Visualisation for Compact Presentation and Fast Browsing of Pictorial Content. IEEE Transactions on Circuits and Systems for Video Technology, 7(5):771-785, October 1997.

[14] D. Zhong and S.-F. Chang. An Integrated Approach for Content-Based Video Object Segmentation and Retrieval. IEEE Transactions on Circuits and Systems for Video Technology, 9(8):1259-1269, December 1999.