Feature Fusion and Redundancy Pruning for Rush Video Summarization
TRECVID Video Summarization Workshop, ACM-MM '07, 9/28/07
Vision Research Lab – ECE Department University of California, Santa Barbara
Jim Kleban, Anindya Sarkar, Emily Moxley, Stephen Mangiat, Swapna Joshi, Thomas Kuo, B. S. Manjunath
Outline
- Motivation
- Feature selection for object/event inclusion and retake detection
- Feature fusion and keyframe selection by adaptive sampling
- 'Junk' shot removal
- Overall system workflow
- Discussion of results
- Future direction
Motivation
- Good summarization is a function of a data set's structure (what are the users interested in?): news - stories, sports - highlights, rushes - ?
- Ground truth events denote interest for the pilot; ground truth annotation can be divided into 4 types
- Can certain features capture types of events? Inspired by a user attention model1
- 'Junk shots' (re-takes, clapboards) should be identified and removed
- Combine feature time functions into an adaptive sampling function that selects candidate keyframes
1 Y.-F. Ma, L. Lu, H.-J. Zhang, and M. Li. A user attention model for video summarization. In ACM Multimedia, pages 533–542. ACM Press, 2002.
Ground Truth Categories
i) shots showing stationary objects and distinct backgrounds
ii) shots of people entering or leaving a scene
iii) shots containing camera motion, panning and zooming
iv) shots of distinct events

Examples: type i) man in car; type ii) woman enters scene; type iii) camera follows as men enter building; type iv) woman helps man from car
Feature Selection
Our summarization system combines the following features to indicate importance:
- k-means clustering: type i) background shots
- camera motion model: type iii) pan/tilt/zoom
- rate of change in color space: types ii) and iv), entering/leaving & events
- dynamic time warping (DTW)-based retake detection: penalize repeated shots
- speech/tone/background audio classification: high RMS speech correlates with event importance
K-means clustering
Cluster to weight visually distinct keyframes. For each movie, generate fKMEANS[n] as:
1) Over-segment shots according to a global 2D affine motion model2
2) Merge segments to at least 40 frames; extract one keyframe from the center of each segment
3) Cluster to K = 2/3 the number of input segments according to a 12-dim HSV color feature; select the keyframes closest to the centroids
4) Re-iterate the process five times with different initial allocations
5) Generate the feature function fKMEANS[n] by summing keyframe locations, smoothing with a 90-frame Hamming window, and normalizing

2 P. Bouthemy, M. Gelgon, and F. Ganansia. A unified approach to shot change detection and camera motion characterization. IEEE Trans. on Circuits and Systems for Video Technology, 9(7):1030–1044, 1999.
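The final smoothing/normalization step can be sketched in Python as follows. The 90-frame Hamming window and peak normalization follow the slide; representing each selected keyframe location as a unit impulse is our assumption about the implementation.

```python
import numpy as np

def f_kmeans(keyframe_locs, n_frames, win_len=90):
    """Sketch of the last k-means step: sum keyframe-location impulses,
    smooth with a 90-frame Hamming window, and normalize to peak 1.
    (The earlier segmentation/HSV-clustering steps are assumed done.)"""
    impulses = np.zeros(n_frames)
    for loc in keyframe_locs:
        impulses[loc] += 1.0          # one impulse per selected keyframe
    smoothed = np.convolve(impulses, np.hamming(win_len), mode="same")
    return smoothed / smoothed.max()  # normalize so the peak value is 1

# Toy example: two nearby keyframes reinforce each other into one broad peak
f = f_kmeans([100, 110, 400], 600)
```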
K-means Example from Test
[Figure: fkmeans[n] plotted against frame number n; marked peaks near n = 11,090 (0.944), n = 17,800 (0.796), and n = 26,510 (0.973).]
Camera Motion Model
Attempt to capture pans, tilts, and zooms longer than roughly 1/2 second. Parameterized affine motion model from Bouthemy et al.2:

  wθ(p) = ( a1 + a2(x − xg) + a3(y − yg),
            a4 + a5(x − xg) + a6(y − yg) )

where p is a point (x, y), (xg, yg) is the image center, and θ = (a1, …, a6) is the model. The parameters can be rewritten as (a1, a4, div, rot, hyp1, hyp2), with

  div = ½(a2 + a6)    rot = ½(a5 − a3)
  hyp1 = ½(a2 − a6)   hyp2 = ½(a3 + a5)

These parameters characterize the camera motion:
- Pure panning (tilting): only a1 (a4) nonzero
- Zooming or forward tracking: only div nonzero
- Sideways tracking: all parameters nonzero

1) Compute the global 2D affine motion model θ[n], subsampling by a rate of 5 frames
2) For each subsampled frame n with model θ[n], set s[n] = 1 if the L1 norm of θ[n] exceeds a threshold T (i.e., there is a pan, tilt, or zoom), and s[n] = 0 otherwise
3) Convolve s[n] with an L = 16 order (16/25 sec) average filter
4) The output feature function fCAM[n] is binary
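A sketch of the thresholding and smoothing steps in Python. The threshold value T, which parameters enter the L1 norm, and the rule for re-binarizing after the average filter are assumptions where the slide is ambiguous.

```python
import numpy as np

def f_cam(theta, T=0.05, L=16):
    """Sketch of the camera-motion feature. theta is an (N, 3) array of
    per-(subsampled)-frame motion parameters, assumed here to be
    (a1, a4, div) from the global affine model; T is an assumed threshold.
    s[n] = 1 where the L1 norm exceeds T (pan/tilt/zoom), then s is
    smoothed with a length-16 moving average and re-binarized."""
    s = (np.abs(theta).sum(axis=1) > T).astype(float)
    smoothed = np.convolve(s, np.ones(L) / L, mode="same")
    return (smoothed > 0.5).astype(int)   # binary output, as on the slide

# Toy example: a sustained pan between subsampled frames 30 and 60
theta = np.zeros((100, 3))
theta[30:60, 0] = 0.2
out = f_cam(theta)
```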
Camera Example
[Figure: binary fcam[n] plotted against frame number n; a marked detection at n = 7,116.]
Adaptive Weighting Feature
- Basic event model uses the correlation between changes in the overall color space and events; probably works better for drama rushes than for sports
- Captures changes in the visual domain as the rate of change in HSV feature space

1) For each subsampled (by 5) frame vector x[n], compute the L2-distance in the 12-dim HSV space:

   fΔHSV[n] = sqrt( Σd=1..12 (xd[n] − xd[n+1])² )

2) Apply a 25-frame (1 sec) median filter to reduce noise and disregard abrupt cuts between shots
3) Normalize
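The three steps above can be sketched directly, assuming the HSV features arrive as a (frames x 12) array; SciPy's `medfilt` stands in for the 25-frame median filter.

```python
import numpy as np
from scipy.signal import medfilt

def f_delta_hsv(x):
    """Sketch of the adaptive-weighting feature. x is an (N, 12) array of
    per-(subsampled)-frame HSV features. Compute the L2 distance between
    consecutive frames, median-filter (25 taps ~ 1 s) so single-frame
    shot cuts are discarded, and normalize to [0, 1]."""
    d = np.sqrt(((x[:-1] - x[1:]) ** 2).sum(axis=1))  # L2 over 12 dims
    d = medfilt(d, kernel_size=25)                    # 1-second median filter
    return d / d.max()

# Toy example: steady drift in one dimension plus one abrupt shot cut
x = np.zeros((60, 12))
x[:, 0] = np.linspace(0.0, 1.0, 60)   # slow, sustained change
x[30:, 1] += 5.0                      # abrupt cut at frame 30
out = f_delta_hsv(x)
```

The single-frame spike from the cut is removed by the median filter, so the sustained drift still dominates after normalization.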
Adaptive Example
[Figure: fadapt[n] plotted against frame number n; marked peaks near n = 5,476 (0.860), n = 20,630 (0.854), and n = 28,820 (1.0).]
Speech detection
- Observe that speech segments with high RMS often occur when an actor is speaking on-camera during a scene
- Speech/environment/pure tone/silence VQ classifier with features: spectral flux, high zero-crossing rate ratio (HZCRR), low short-term energy ratio (LSTER)
- Inclusion of speech improves precision, but not recall (fraction of included events)
- Downweight long silences (black screens); identify the tone occurring with color bars
- Binary output feature function: '1' where speech with the highest 30% RMS is detected
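Two of the listed classifier features can be sketched as follows, using common definitions of HZCRR and LSTER from the audio-classification literature; the 1.5x and 0.5x multipliers are assumed, not stated on the slide.

```python
import numpy as np

def hzcrr_lster(frames):
    """Sketch of two speech-detection features. frames is an (N, W) array
    of short-time analysis windows.
    HZCRR: fraction of windows whose zero-crossing rate exceeds 1.5x the
    mean ZCR (assumed multiplier).
    LSTER: fraction of windows whose short-time energy falls below 0.5x
    the mean energy (assumed multiplier)."""
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)
    ste = (frames ** 2).mean(axis=1)
    hzcrr = (zcr > 1.5 * zcr.mean()).mean()
    lster = (ste < 0.5 * ste.mean()).mean()
    return hzcrr, lster

# Toy example: five silent windows and five loud, rapidly alternating ones
frames = np.zeros((10, 100))
frames[5:] = np.tile([1.0, -1.0], 50)
h, l = hzcrr_lster(frames)
```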
Speech Example
[Figure: binary fspeech[n] plotted against frame number n; a marked detection at n = 17,550.]
DTW for Redundancy Removal
- DTW: a dynamic programming method using local path constraints3 to compare similarity between segments of differing length
- Idea: as a scene is re-shot, time may be stretched, but the basic events and camera actions occur in order as scripted; similar to multiple instances of pronouncing the same word or sentence
- DTW is suited to matching with missing information, provided the segments are long enough
- Most re-takes are not repeated in the ground truth: include one in the summary, discard the rest
3 L. Rabiner and B. H. Juang. Fundamentals of Speech Recognition. Prentice Hall Signal Processing Series, Englewood Cliffs, New Jersey, 1993.
DTW cont. (1)
Sub-shot segments (as from Bouthemy) are identified as one of three types:
(a) unique and without repeat
(b) repeated and the longest of similar segments
(c) repeated but not the longest

Example from the dev. set: pairwise DTW distance. Individual takes were manually segmented for training video CU497924. There are four sets of segment repeats: 1–7, 8–10, 11–13, and 14–16.
DTW cont. (2)
- Uses the 1125-dim local color histogram from Gong et al.4
- A point on the ROC curve is selected for the similarity threshold
- Let there be M segments {si}, i = 1, …, M, and form the M x M inter-segment DTW distance matrix D. Create Li groups of pairs of segments similar to si by finding the entries Dij below the similarity threshold. Score as:

  score(i) = (2·Si,1 + 0.1·Si,0) / Li   if Li > 1
             1.5                        if Li = 1

where Si,1 is the number of times si is the longest-duration member among the Li groups, and Si,0 is the number of times it is not.
4 Y. Gong and X. Liu. Video summarization and retrieval using singular value decomposition. Multimedia Systems, 9(2):157–168, 2003.
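A minimal DTW sketch of the underlying comparison, using the standard textbook recursion; this is not necessarily the authors' exact local path constraints or local distance, and the length normalization is our assumption.

```python
import numpy as np

def dtw_distance(a, b):
    """Minimal DTW sketch. a and b are (N, d) and (M, d) feature
    sequences (e.g. per-frame color histograms). Each cell extends the
    cheapest of match / insertion / deletion; returns the accumulated
    cost normalized by the combined length, so segments of differing
    length remain comparable."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local distance
            D[i, j] = cost + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    return D[n, m] / (n + m)

# Toy example: a "take", the same content time-stretched, and unrelated content
a = np.linspace(0.0, 1.0, 10).reshape(-1, 1)
b = np.linspace(0.0, 1.0, 15).reshape(-1, 1)
c = np.ones((10, 1))
```

A re-take that merely stretches time stays close to the original under DTW, while unrelated content scores much farther away.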
DTW Example
[Figure: fdtw[n] plotted against frame number n.]

Mean DTW function values over scene re-takes for CU044500:

Scene 1, takes 1–5: 1.91, 0.64, 0.75, 1.25, 1.38
Scene 2, take 1: 1.83
Scene 3, takes 1–3: 2, 2, 0.108
Scene 4, take 1: 1.9325
Scene 5, takes 1–3: 1.88, 2, 1.99
Scene 6, takes 1–3: 1.97, 1.85, 1.87
Scene 7, takes 1–6: 0.11, 0.13, 1.34, 1.3, 0.25, 0.77
Feature Fusion
Linearly combine the five feature function outputs to create a sampling importance function:

  ftotal[n] = w0 + Σi=1..5 wi · fi[n]

Learn w via a gradient descent search to maximize an approximation of the fraction of ground truth events included, Rfrac,approx. We annotated frame numbers for 20 dev. set movies as ground truth, with Rfrac,approx = napprox / Ngt: for the included keyframes, napprox is incremented when a keyframe uniquely overlaps one ground truth subshot for at least 15 frames.

The best weights found gave a ~6% improvement over our own uniform-sampling baseline system:

  Constant, w0           0.09
  Speech, w1             0.49
  Camera, w2             0.55
  K-means, w3            1.00
  DTW-retake, w4         1.00
  Adaptive sampling, w5  0.55
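The fusion formula, with the learned weights from the table, can be sketched directly; the (5, N) array layout for the feature functions is our assumption.

```python
import numpy as np

def f_total(features, w):
    """Sketch of the fusion step. features is a (5, N) array holding the
    five normalized feature functions (speech, camera, k-means,
    DTW-retake, adaptive); w = (w0, ..., w5) are the learned weights.
    Implements f_total[n] = w0 + sum_i w_i * f_i[n]."""
    w = np.asarray(w)
    return w[0] + w[1:] @ features   # bias plus weighted sum over features

# Weights reported on the slide; toy features of all ones over four frames
w = [0.09, 0.49, 0.55, 1.00, 1.00, 0.55]
features = np.ones((5, 4))
out = f_total(features, w)
```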
Importance Function Adaptive Sampling
[Figure: ftotal[n] versus the flat fbaseline[n] over frame index n.]
- Select candidate keyframes by sampling at a rate proportional to the area under the importance function; this neither removes nor guarantees inclusion of any particular frames
- Avoided peak selection by derivative: too focused on rapid changes, speech, and camera motion
- Base sampling rate of 1/20th of the total frames, i.e. 20/25 Hz
- A final k-means step is run to further remove duplicates, with the number of clusters K now selected to create a 4% summary
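One common way to realize "sampling at a rate proportional to the area under the importance function" is inverse-CDF sampling; the sketch below illustrates that idea and is not necessarily the authors' exact procedure.

```python
import numpy as np

def adaptive_sample(f, num_keyframes):
    """Sketch of importance-proportional sampling: place keyframes at
    uniform steps along the cumulative integral of f, so regions with
    more area under f receive proportionally more candidate keyframes."""
    cdf = np.cumsum(f)
    cdf = cdf / cdf[-1]                                   # normalized CDF
    targets = (np.arange(num_keyframes) + 0.5) / num_keyframes
    return np.searchsorted(cdf, targets)                  # frame indices

# Toy example: a high-importance stretch draws most of the keyframes
f = np.ones(100)
f[60:80] = 10.0
idx = adaptive_sample(f, 20)
```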
'Junk' Shot Removal
- 'Junk' shots are removed in postprocessing, after sampling selects the candidate keyframes
- Black/gray screens removed by a simple global color-entropy threshold
- Color bars removed via template matching in a localized color space
- Bag-of-features SIFT distance5 used to remove various clapboard types: trained on various examples; constructs a vocabulary tree of features to reduce distance computations; true positive rate on devel. ~90%, false positive ~2%

5 D. Nister and H. Stewenius. Scalable recognition with a vocabulary tree. In Proc. of CVPR, pages 2161–2168, Washington, DC, USA, 2006.
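The black/gray-screen test can be sketched as a global histogram-entropy threshold; the bin count and threshold value here are assumptions, not the authors' settings.

```python
import numpy as np

def is_blank_screen(frame, threshold=1.0):
    """Sketch of the black/gray-screen test: compute the entropy of the
    frame's global gray-level histogram and flag frames below an
    (assumed) threshold. A flat black or gray frame concentrates its
    mass in one bin, so its entropy is near zero."""
    hist, _ = np.histogram(frame, bins=64, range=(0, 256))
    p = hist / hist.sum()
    p = p[p > 0]
    entropy = -(p * np.log2(p)).sum()
    return entropy < threshold

black = np.zeros((48, 64))                       # all-black frame
rng = np.random.default_rng(0)
textured = rng.integers(0, 256, size=(48, 64))   # richly varying frame
```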
System Workflow
- Fixed 4% summary length
- No audio: interrupted speech can be perceptually annoying
- Selected keyframes padded with +/- 15 frames; a smaller padding size may improve the fraction of inclusion but was found to be hard to watch

[Block diagram: the input video is down-sampled (by 5) for motion estimation and decoded for audio. Motion estimation drives the motion model and shot boundaries. Five feature modules (speech detection, camera motion detection, adaptive weighting, color clustering, retake detection) are combined with the learned weights w0 … w5 (w0 a constant), followed by adaptive importance sampling, screen & clapboard removal, final clustering, and building the summary.]
Results
- 4th highest mean recall, Rfrac = 0.60; yet a mere 8.6% improvement over the CMU uniform baseline
- Judged easy to understand: mean 3.46 (4th highest)
- Fair amount of redundancy: mean 3.67 (13th of 22)
- Did not optimize for system run time: 3.5 hours (!) to decode & summarize a 25.42 min movie serially on a 2.3 GHz P-IV cluster node; SIFT extraction and clapboard detection take ~50% of the total time
Result Analysis
- Evaluation difficulties continue, i.e. are "hut" summaries good, or just short? The table shows ranked mean Rfrac as-is (unnorm) and Rfrac normalized by summary length (norm)
- Avg. durations: hut 26.1 s, cityu 42.15 s, ucal (UCSB) 63.6 s
- A feasible proposal? Each system produces 2 summaries: one at a fixed percentage of the video length for straightforward comparison, and one at a system-determined 'optimum' length
Conclusions/Future Work
- The feature-fusion adaptive sampling technique does improve ground truth inclusion over the baseline; should reduce computation time
- DTW is a good start for re-take detection; should compare with other methods
- How to deal with the remaining redundancy in our summaries? Consider taking the 30-frame summary shots and running DTW again to match and remove similar clips
- Camera model improvement: which motions to include? Does time location within the shot matter?
Support for this work provided by NSF IGERT Grant#DGE-0221713
Thank You!
This presentation is kaput!