Feature Fusion and Redundancy Pruning for Rush Video Summarization
TRECVID Video Summarization Workshop, ACM-MM '07, 9/28/07
Vision Research Lab – ECE Department University of California, Santa Barbara
Jim Kleban, Anindya Sarkar, Emily Moxley, Stephen Mangiat, Swapna Joshi, Thomas Kuo, B. S. Manjunath
Outline
- Motivation
- Feature selection for object/event inclusion and retake detection
- Feature fusion and keyframe selection by adaptive sampling
- 'Junk' shot removal
- Overall system workflow
- Discussion of results
- Future direction
Motivation
- Good summarization is a function of a data set's structure (what are the users interested in?): news - stories, sports - highlights, rushes - ?
- Ground truth events denote interest for the pilot; ground truth annotation can be divided into 4 types
- Can certain features capture types of events? Inspired by a user attention model1
- 'Junk shots' (re-takes, clapboards) should be identified and removed
- Combine feature time functions into an adaptive sampling function that selects candidate keyframes
1 Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In ACM Multimedia, pages 533–542. ACM Press, 2002.
Ground Truth Categories
i) shots showing stationary objects and distinct backgrounds
ii) shots of people entering or leaving a scene
iii) shots containing camera motion, panning and zooming
iv) shots of distinct events

Examples: type i) man in car; type ii) woman enters scene; type iii) camera follows as men enter building; type iv) woman helps man from car
Feature Selection
Our summarization system combines the following features to indicate importance:
- k-means clustering: type i) background shots
- camera motion model: type iii) pan/tilt/zoom
- rate of change in color space: types ii) and iv), entering/leaving & events
- dynamic time warping (DTW)-based retake detection: penalize repeated shots
- speech/tone/background audio classification: high RMS speech correlates with event importance
K-means clustering
Cluster to weight visually distinct keyframes. For each movie, generate fKMEANS[n] as:
1) Over-segment shots according to a global 2D affine motion model2
2) Merge segments to at least 40 frames; extract one keyframe from the center of each segment
3) Cluster to K = 2/3 the number of input segments according to a 12-dim HSV color feature; select the keyframes closest to the centroids
4) Re-iterate the process five times with different initial allocations
5) Generate the feature function fKMEANS[n] by summing keyframe locations, smoothing with a 90-frame Hamming window, and normalizing

2 P. Bouthemy, M. Gelgon, and F. Ganansia. A unified approach to shot change detection and camera motion characterization. IEEE Trans. on Circuits and Systems for Video Technology, 9(7):1030–1044, 1999.
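The final smoothing/normalization step can be sketched in Python as follows. The 90-frame Hamming window and peak normalization follow the slide; representing each selected keyframe location as a unit impulse is our assumption about the implementation.

```python
import numpy as np

def f_kmeans(keyframe_locs, n_frames, win_len=90):
    """Sketch of the last k-means step: sum keyframe-location impulses,
    smooth with a 90-frame Hamming window, and normalize to peak 1.
    (The earlier segmentation/HSV-clustering steps are assumed done.)"""
    impulses = np.zeros(n_frames)
    for loc in keyframe_locs:
        impulses[loc] += 1.0          # one impulse per selected keyframe
    smoothed = np.convolve(impulses, np.hamming(win_len), mode="same")
    return smoothed / smoothed.max()  # normalize so the peak value is 1

# Toy example: two nearby keyframes reinforce each other into one broad peak
f = f_kmeans([100, 110, 400], 600)
```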
K-means Example from Test
[Figure: fkmeans[n] plotted against frame number n; marked peaks near n = 11,090 (0.944), n = 17,800 (0.796), and n = 26,510 (0.973).]
Camera Motion Model
Attempt to capture pans, tilts, and zooms longer than roughly 1/2 second. Parameterized affine motion model from Bouthemy et al.2:

  wθ(p) = ( a1 + a2(x − xg) + a3(y − yg),
            a4 + a5(x − xg) + a6(y − yg) )

where p is a point (x, y), (xg, yg) is the image center, and θ = (a1, …, a6) is the model. The parameters can be rewritten as (a1, a4, div, rot, hyp1, hyp2), with

  div = ½(a2 + a6)    rot = ½(a5 − a3)
  hyp1 = ½(a2 − a6)   hyp2 = ½(a3 + a5)

These parameters characterize the camera motion:
- Pure panning (tilting): only a1 (a4) nonzero
- Zooming or forward tracking: only div nonzero
- Sideways tracking: all parameters nonzero

1) Compute the global 2D affine motion model θ[n], subsampling by a rate of 5 frames
2) For each subsampled frame n with model θ[n], set s[n] = 1 if the L1 norm of θ[n] exceeds a threshold T (i.e., there is a pan, tilt, or zoom), and s[n] = 0 otherwise
3) Convolve s[n] with an L = 16 order (16/25 sec) average filter
4) The output feature function fCAM[n] is binary
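A sketch of the thresholding and smoothing steps in Python. The threshold value T, which parameters enter the L1 norm, and the rule for re-binarizing after the average filter are assumptions where the slide is ambiguous.

```python
import numpy as np

def f_cam(theta, T=0.05, L=16):
    """Sketch of the camera-motion feature. theta is an (N, 3) array of
    per-(subsampled)-frame motion parameters, assumed here to be
    (a1, a4, div) from the global affine model; T is an assumed threshold.
    s[n] = 1 where the L1 norm exceeds T (pan/tilt/zoom), then s is
    smoothed with a length-16 moving average and re-binarized."""
    s = (np.abs(theta).sum(axis=1) > T).astype(float)
    smoothed = np.convolve(s, np.ones(L) / L, mode="same")
    return (smoothed > 0.5).astype(int)   # binary output, as on the slide

# Toy example: a sustained pan between subsampled frames 30 and 60
theta = np.zeros((100, 3))
theta[30:60, 0] = 0.2
out = f_cam(theta)
```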
Camera Example
[Figure: binary fcam[n] plotted against frame number n; a marked detection at n = 7,116.]
Adaptive Weighting Feature
- Basic event model uses the correlation between changes in the overall color space and events; probably works better for drama rushes than for sports
- Captures changes in the visual domain as the rate of change in HSV feature space

1) For each subsampled (by 5) frame vector x[n], compute the L2-distance in the 12-dim HSV space:

   fΔHSV[n] = sqrt( Σd=1..12 (xd[n] − xd[n+1])² )

2) Apply a 25-frame (1 sec) median filter to reduce noise and disregard abrupt cuts between shots
3) Normalize
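The three steps above can be sketched directly, assuming the HSV features arrive as a (frames x 12) array; SciPy's `medfilt` stands in for the 25-frame median filter.

```python
import numpy as np
from scipy.signal import medfilt

def f_delta_hsv(x):
    """Sketch of the adaptive-weighting feature. x is an (N, 12) array of
    per-(subsampled)-frame HSV features. Compute the L2 distance between
    consecutive frames, median-filter (25 taps ~ 1 s) so single-frame
    shot cuts are discarded, and normalize to [0, 1]."""
    d = np.sqrt(((x[:-1] - x[1:]) ** 2).sum(axis=1))  # L2 over 12 dims
    d = medfilt(d, kernel_size=25)                    # 1-second median filter
    return d / d.max()

# Toy example: steady drift in one dimension plus one abrupt shot cut
x = np.zeros((60, 12))
x[:, 0] = np.linspace(0.0, 1.0, 60)   # slow, sustained change
x[30:, 1] += 5.0                      # abrupt cut at frame 30
out = f_delta_hsv(x)
```

The single-frame spike from the cut is removed by the median filter, so the sustained drift still dominates after normalization.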
Adaptive Example
[Figure: fadapt[n] plotted against frame number n; marked peaks near n = 5,476 (0.860), n = 20,630 (0.854), and n = 28,820 (1.0).]
Speech detection
- Observe that speech segments with high RMS often occur when an actor is speaking on-camera during a scene
- Speech/environment/pure tone/silence VQ classifier with features: spectral flux, high zero-crossing rate ratio (HZCRR), low short-term energy ratio (LSTER)
- Inclusion of speech improves precision, but not recall (fraction of included events)
- Downweight long silences (black screens); identify the tone occurring with color bars
- Binary output feature function: '1' where speech with the highest 30% RMS is detected
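Two of the listed classifier features can be sketched as follows, using common definitions of HZCRR and LSTER from the audio-classification literature; the 1.5x and 0.5x multipliers are assumed, not stated on the slide.

```python
import numpy as np

def hzcrr_lster(frames):
    """Sketch of two speech-detection features. frames is an (N, W) array
    of short-time analysis windows.
    HZCRR: fraction of windows whose zero-crossing rate exceeds 1.5x the
    mean ZCR (assumed multiplier).
    LSTER: fraction of windows whose short-time energy falls below 0.5x
    the mean energy (assumed multiplier)."""
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    ste = (frames ** 2).mean(axis=1)
    hzcrr = (zcr > 1.5 * zcr.mean()).mean()
    lster = (ste < 0.5 * ste.mean()).mean()
    return hzcrr, lster

# Toy example: five silent windows and five loud, rapidly alternating ones
frames = np.zeros((10, 100))
frames[5:] = np.tile([1.0, -1.0], 50)
h, l = hzcrr_lster(frames)
```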
Speech Example
[Figure: binary fspeech[n] plotted against frame number n; a marked detection at n = 17,550.]
DTW for Redundancy Removal
- DTW: a dynamic programming method using local path constraints3 to compare similarity between segments of differing length
- Idea: as a scene is re-shot, time may be stretched, but the basic events and camera actions occur in order as scripted; similar to multiple instances of pronouncing the same word or sentence
- DTW is suited to matching with missing information, provided the segments are long enough
- Most re-takes are not repeated in the ground truth: include one in the summary, discard the rest
3 L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, Englewood Cliffs, New Jersey, 1993.
DTW cont. (1)
Sub-shot segments (as from Bouthemy) are identified as one of three types:
(a) unique and without repeat
(b) repeated and the longest of similar segments
(c) repeated but not the longest

Example from the dev. set: pairwise DTW distance. Individual takes were manually segmented for training video CU497924. There are four sets of segment repeats: 1–7, 8–10, 11–13, and 14–16.
DTW cont. (2)
- Uses the 1125-dim local color histogram from Gong et al.4
- A point on the ROC curve is selected for the similarity threshold
- Let there be M segments {si}, i = 1, …, M, and form the M x M inter-segment DTW distance matrix D. Create Li groups of pairs of segments similar to si by finding the entries Dij below the similarity threshold. Score as:

  score(i) = (2·Si,1 + 0.1·Si,0) / Li   if Li > 1
             1.5                        if Li = 1

where Si,1 is the number of times si is the longest-duration member among the Li groups, and Si,0 is the number of times it is not.
4 Y. Gong and X. Liu. Video summarization and retrieval using singular value decomposition. Multimedia Systems, 9(2):157–168, 2003.
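A minimal DTW sketch of the underlying comparison, using the standard textbook recursion; this is not necessarily the authors' exact local path constraints or local distance, and the length normalization is our assumption.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal DTW sketch. a and b are (N, d) and (M, d) feature
    sequences (e.g. per-frame color histograms). Each cell extends the
    cheapest of match / insertion / deletion; returns the accumulated
    cost normalized by the combined length, so segments of differing
    length remain comparable."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)

# Toy example: a "take", the same content time-stretched, and unrelated content
a = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
b = np.linspace(0.0, 1.0, 15).reshape(-1, 1)
c = np.ones((10, 1))
```

A re-take that merely stretches time stays close to the original under DTW, while unrelated content scores much farther away.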
DTW Example
[Figure: fdtw[n] plotted against frame number n.]

Mean DTW function values over scene re-takes for CU044500:

Scene 1, takes 1–5: 1.91, 0.64, 0.75, 1.25, 1.38
Scene 2, take 1: 1.83
Scene 3, takes 1–3: 2, 2, 0.108
Scene 4, take 1: 1.9325
Scene 5, takes 1–3: 1.88, 2, 1.99
Scene 6, takes 1–3: 1.97, 1.85, 1.87
Scene 7, takes 1–6: 0.11, 0.13, 1.34, 1.3, 0.25, 0.77
Feature Fusion
Linearly combine the five feature function outputs to create a sampling importance function:

  ftotal[n] = w0 + Σi=1..5 wi · fi[n]

Learn w via a gradient descent search to maximize an approximation of the fraction of ground truth events included, Rfrac,approx. We annotated frame numbers for 20 dev. set movies as ground truth, with Rfrac,approx = napprox / Ngt: for the included keyframes, napprox is incremented when a keyframe uniquely overlaps one ground truth subshot for at least 15 frames.

The best weights found gave a ~6% improvement over our own uniform-sampling baseline system:

  Constant, w0           0.09
  Speech, w1             0.49
  Camera, w2             0.55
  K-means, w3            1.00
  DTW-retake, w4         1.00
  Adaptive sampling, w5  0.55
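The fusion formula, with the learned weights from the table, can be sketched directly; the (5, N) array layout for the feature functions is our assumption.

```python
import numpy as np

def f_total(features, w):
    """Sketch of the fusion step. features is a (5, N) array holding the
    five normalized feature functions (speech, camera, k-means,
    DTW-retake, adaptive); w = (w0, ..., w5) are the learned weights.
    Implements f_total[n] = w0 + sum_i w_i * f_i[n]."""
    w = np.asarray(w)
    return w[0] + w[1:] @ features   # bias plus weighted sum over features

# Weights reported on the slide; toy features of all ones over four frames
w = [0.09, 0.49, 0.55, 1.00, 1.00, 0.55]
features = np.ones((5, 4))
out = f_total(features, w)
```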
Importance Function Adaptive Sampling
[Figure: ftotal[n] versus the flat fbaseline[n] over frame index n.]
- Select candidate keyframes by sampling at a rate proportional to the area under the importance function; this neither removes nor guarantees inclusion of any particular frames
- Avoided peak selection by derivative: too focused on rapid changes, speech, and camera motion
- Base sampling rate of 1/20th of the total frames, i.e. 20/25 Hz
- A final k-means step is run to further remove duplicates, with the number of clusters K now selected to create a 4% summary
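One common way to realize "sampling at a rate proportional to the area under the importance function" is inverse-CDF sampling; the sketch below illustrates that idea and is not necessarily the authors' exact procedure.

```python
import numpy as np

def adaptive_sample(f, num_keyframes):
    """Sketch of importance-proportional sampling: place keyframes at
    uniform steps along the cumulative integral of f, so regions with
    more area under f receive proportionally more candidate keyframes."""
    cdf = np.cumsum(f)
    cdf = cdf / cdf[-1]                                   # normalized CDF
    targets = (np.arange(num_keyframes) + 0.5) / num_keyframes
    return np.searchsorted(cdf, targets)                  # frame indices

# Toy example: a high-importance stretch draws most of the keyframes
f = np.ones(100)
f[60:80] = 10.0
idx = adaptive_sample(f, 20)
```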
'Junk' Shot Removal
- 'Junk' shots are removed in postprocessing, after sampling selects the candidate keyframes
- Black/gray screens removed by a simple global color-entropy threshold
- Color bars removed via template matching in a localized color space
- Bag-of-features SIFT distance5 used to remove various clapboard types: trained on various examples; constructs a vocabulary tree of features to reduce distance computations; true positive rate on devel. ~90%, false positive ~2%

5 D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. of CVPR, pages 2161–2168, Washington, DC, USA, 2006.
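The black/gray-screen test can be sketched as a global histogram-entropy threshold; the bin count and threshold value here are assumptions, not the authors' settings.

```python
import numpy as np

def is_blank_screen(frame, threshold=1.0):
    """Sketch of the black/gray-screen test: compute the entropy of the
    frame's global gray-level histogram and flag frames below an
    (assumed) threshold. A flat black or gray frame concentrates its
    mass in one bin, so its entropy is near zero."""
    hist, _ = np.histogram(frame, bins=64, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    return entropy < threshold

black = np.zeros((48, 64))                       # all-black frame
rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(48, 64))   # richly varying frame
```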
System Workflow
- Fixed 4% summary length
- No audio: interrupted speech can be perceptually annoying
- Selected keyframes padded with +/- 15 frames; a smaller padding size may improve the fraction of inclusion but was found to be hard to watch

[Block diagram: the input video is down-sampled (by 5) for motion estimation and decoded for audio. Motion estimation drives the motion model and shot boundaries. Five feature modules (speech detection, camera motion detection, adaptive weighting, color clustering, retake detection) are combined with the learned weights w0 … w5 (w0 a constant), followed by adaptive importance sampling, screen & clapboard removal, final clustering, and building the summary.]
Results
- 4th highest mean recall, Rfrac = 0.60; yet a mere 8.6% improvement over the CMU uniform baseline
- Judged easy to understand: mean 3.46 (4th highest)
- Fair amount of redundancy: mean 3.67 (13th of 22)
- Did not optimize for system run time: 3.5 hours (!) to decode & summarize a 25.42 min movie serially on a 2.3 GHz P-IV cluster node; SIFT extraction and clapboard detection take ~50% of the total time
Result Analysis
- Evaluation difficulties continue, i.e. are "hut" summaries good, or just short? The table shows ranked mean Rfrac as-is (unnorm) and Rfrac normalized by summary length (norm)
- Avg. durations: hut 26.1 s, cityu 42.15 s, ucal (UCSB) 63.6 s
- A feasible proposal? Each system produces 2 summaries: one at a fixed percentage of the video length for straightforward comparison, and one at a system-determined 'optimum' length
Conclusions/Future Work
- The feature-fusion adaptive sampling technique does improve ground truth inclusion over the baseline; should reduce computation time
- DTW is a good start for re-take detection; should compare with other methods
- How to deal with the remaining redundancy in our summaries? Consider taking the 30-frame summary shots and running DTW again to match and remove similar clips
- Camera model improvement: which motions to include? Does time location within the shot matter?
Support for this work provided by NSF IGERT Grant#DGE-0221713
Thank You!
This presentation is kaput!