Poster
Transcript of Poster
Exploring Inter-Frame Correlation Analysis and Wavelet-Domain Modeling for Real-Time Caption Detection in Streaming Video
Jia Li1, Yonghong Tian2, Wen Gao1,2
1 Key Laboratory of Intelligent Information Processing, ICT, CAS
2 Institute of Digital Media, School of EE & CS, PKU
Outline
Background
Problem statement
System architecture
Experiments
Conclusion
Background: Caption Detection
Caption Text
Scene Text
Scrolling Text
Background: Frequently Used Methods

Sobel Edges
Chunmei Liu, et al. "Text detection in images based on unsupervised classification of edge-based features". ICDAR, 2005.
Lyu, M.R., et al. "A comprehensive method for multilingual video text detection, localization, and extraction". CSVT, 2005.

Wavelet Domain
Huiping Li, et al. "Automatic text detection and tracking in digital video". IEEE Transactions on Image Processing, 2000.
Qixiang Ye, et al. "Fast and robust text detection in images and video frames". Image and Vision Computing, 2005.

Directly on the Image
Kwang In Kim, et al. "Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm". PAMI, 2003.
The Problems

Text detection in images & video vs. text detection in streaming video:
Detection speed
Discerning the text type
Removal of texture

Why streaming videos are different:
Faster: frames arrive in real time.
More detailed: few clues are available, so texts are organized by their types.
More accurate: simple features must remove text-like textures.
Our Solution

Inter-frame correlations over the frame sequence, with fast & robust quantification, distinguish:
Caption texts: static edges
Scene text: stable background
Scrolling text: moving edges / moving textures
System Architecture

Data: the previous and next frames are used to assist text detection in the current frame.
Temporal Analysis: remove unstable edges.
Spatial Analysis: remove weak edges.
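As an illustration of the wavelet front end, here is a minimal one-level Haar decomposition sketch; the poster does not state which wavelet is used, and LH/HL naming conventions vary between references, so treat the subband labels below as one common convention:

```python
# Illustrative one-level 2-D Haar decomposition of a grayscale frame
# (lists of lists, even height and width). Subband naming here: LH
# responds to horizontal edges, HL to vertical edges.

def haar_subbands(img):
    """Return (LL, LH, HL, HH) subbands for an even-sized 2-D list `img`."""
    h, w = len(img), len(img[0])
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, h, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]       # top pair of the 2x2 block
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom pair
            ll.append((a + b + c + d) / 4.0)  # low-low: local average
            lh.append((a + b - c - d) / 4.0)  # low-high: horizontal edges
            hl.append((a - b + c - d) / 4.0)  # high-low: vertical edges
            hh.append((a - b - c + d) / 4.0)  # high-high: diagonal detail
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

# A vertical step edge shows up in HL, not in LH:
frame = [[0, 9, 9, 9]] * 4
LL, LH, HL, HH = haar_subbands(frame)
print(HL[0])  # -> [-4.5, 0.0]
print(LH[0])  # -> [0.0, 0.0]
```

The temporal and spatial analyses described above would then operate on the LH and HL maps of consecutive frames.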
Temporal Analysis
The goal: remove unstable edges in the subbands LH and HL of the wavelet domain.

Edge stability in subband WS (WS ∈ {LH, HL}) is evaluated with Inter-Subband Correlation Coefficients (ISCC), which lie in [−1, 1]:

ISCC(x, y, i, i+1, WS) =
  1,  if σ_i(x, y, WS) · σ_{i+1}(x, y, WS) = 0
  MAX( MIN( σ_{i,i+1}(x, y, WS) / (σ_i(x, y, WS) · σ_{i+1}(x, y, WS)), 1 ), −1 ),  elsewise

where σ_i(x, y, WS) is the square root of the local variance.

Local covariance:
σ_{i,i+1}(x, y, WS) = 1/(2M+1)² · Σ_{a=x−M}^{x+M} Σ_{b=y−M}^{y+M} WS_i(a, b) · WS_{i+1}(a, b)

Local variance:
σ_i²(x, y, WS) = 1/(2M+1)² · Σ_{a=x−M}^{x+M} Σ_{b=y−M}^{y+M} WS_i²(a, b)
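A minimal pure-Python sketch of the ISCC computation, assuming a (2M+1)×(2M+1) local window and zero-mean subband coefficients; `WS_i` and `WS_i1` are hypothetical coefficient maps (e.g. LH) of two adjacent frames:

```python
# Hypothetical sketch of ISCC: local covariance of adjacent-frame subband
# coefficients, normalised by the two local standard deviations and
# clipped to [-1, 1]. Zero-mean coefficients are assumed, so the windowed
# average of a product serves as both covariance and variance.

def local_moment(A, B, x, y, M):
    """Windowed average of A(a,b)*B(a,b) around (x, y)."""
    total = 0.0
    for a in range(x - M, x + M + 1):
        for b in range(y - M, y + M + 1):
            total += A[a][b] * B[a][b]
    return total / (2 * M + 1) ** 2

def iscc(WS_i, WS_i1, x, y, M=1):
    """Inter-Subband Correlation Coefficient at (x, y), clipped to [-1, 1]."""
    cov = local_moment(WS_i, WS_i1, x, y, M)      # sigma_{i,i+1}
    var_i = local_moment(WS_i, WS_i, x, y, M)     # sigma_i^2
    var_i1 = local_moment(WS_i1, WS_i1, x, y, M)  # sigma_{i+1}^2
    denom = (var_i * var_i1) ** 0.5
    if denom == 0.0:  # degenerate (flat) window
        return 1.0
    return max(min(cov / denom, 1.0), -1.0)

# A stable edge (identical coefficients in both frames) correlates perfectly:
patch = [[1.0, -1.0, 1.0]] * 3
print(iscc(patch, patch, 1, 1, M=1))  # -> 1.0
```

Interior pixels only: the window must fit inside the maps, so a real implementation would pad or skip the borders.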
Temporal Analysis (continued)

Based on ISCC, the inter-frame correlation coefficient (IFCC) and the temporal stability (TS) are defined as:

Inter-frame correlation coefficient:
IFCC(x, y, i, i+1) = MAX_{WS ∈ {LH, HL}} { ISCC(x, y, i, i+1, WS) }

Temporal stability:
TS(x, y, i) = MAX( IFCC(x, y, i−1, i), IFCC(x, y, i, i+1) )
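The IFCC and TS combination steps can be sketched directly from these definitions; the ISCC inputs below are stand-in values, not the full windowed computation:

```python
# Sketch of the two max-combination steps: IFCC takes the maximum ISCC
# over the LH and HL subbands; TS takes the maximum IFCC over the
# previous and next frame pairs.

def ifcc(iscc_lh, iscc_hl):
    """IFCC(x, y, i, i+1) = max over WS in {LH, HL} of ISCC(..., WS)."""
    return max(iscc_lh, iscc_hl)

def temporal_stability(ifcc_prev, ifcc_next):
    """TS(x, y, i) = max(IFCC(x, y, i-1, i), IFCC(x, y, i, i+1))."""
    return max(ifcc_prev, ifcc_next)

# An edge stable against either neighbouring frame keeps a high TS:
print(temporal_stability(ifcc(0.95, 0.40), ifcc(0.10, 0.20)))  # -> 0.95
```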
Temporal Analysis (continued)

Temporal stability of edges: inter-subband correlation, then inter-frame correlation.

ISCC: Robust and Sensitive
(a)-(d): ISCC is robust to background changing, yet sensitive to slight motions.
Spatial Analysis

The goal: remove static backgrounds. Texts are collections of strong edges, so adaptive global thresholds on the wavelet subbands LH/HL are used to remove the backgrounds.

The wavelet coefficients in LH/HL are modeled with two Generalized Gaussian Distributions:

f(x; μ, σ, γ) = a · e^(−[b·|x − μ|]^γ),  with  b = (1/σ) · [Γ(3/γ) / Γ(1/γ)]^(1/2)  and  a = b·γ / (2·Γ(1/γ))

Parameters: mean μ, variance σ², and shape parameter γ. What is the shape parameter?
Shape Parameter and Threshold Selection

f(x; μ, σ, γ) = a · e^(−[b·|x − μ|]^γ)

Sharifi, K., and Leon-Garcia, A. "Estimation of shape parameter for generalized Gaussian distribution in subband decompositions of video". CSVT, 5(1), 1995, pp. 52-56.

In the wavelet subbands: more edges give a larger γ, calling for a larger threshold; fewer edges give a smaller γ, calling for a smaller threshold.

Usually the coefficients are zero-mean, so the adaptive global threshold is selected according to the variance and the shape parameter:

threshold_WS = c_WS · σ_WS,  c_WS = 4.0 · γ_WS
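Here is a sketch of moment-matching shape-parameter estimation in the spirit of Sharifi and Leon-Garcia, assuming zero-mean coefficients: for a generalized Gaussian, E|x|/σ = Γ(2/γ) / √(Γ(1/γ)·Γ(3/γ)), which increases with γ, so the sample ratio can be inverted by bisection. The sample data below are synthetic:

```python
import math
import random

# Moment-matching estimate of the GGD shape parameter gamma:
# match the sample ratio E|x| / sigma against
#   r(gamma) = Gamma(2/gamma) / sqrt(Gamma(1/gamma) * Gamma(3/gamma)),
# which is monotonically increasing in gamma, and invert by bisection.

def ggd_ratio(g):
    return math.gamma(2.0 / g) / math.sqrt(math.gamma(1.0 / g) * math.gamma(3.0 / g))

def estimate_shape(coeffs, lo=0.1, hi=5.0):
    n = len(coeffs)
    sigma = math.sqrt(sum(c * c for c in coeffs) / n)  # zero-mean assumed
    target = sum(abs(c) for c in coeffs) / (n * sigma)
    for _ in range(60):  # bisection on the monotone ratio function
        mid = 0.5 * (lo + hi)
        if ggd_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Laplacian-like data (true gamma = 1) should recover a shape near 1:
random.seed(0)
lap = [random.choice([-1, 1]) * random.expovariate(1.0) for _ in range(20000)]
print(estimate_shape(lap))
```

A larger estimated γ (more edge energy in the subband) would then scale the global threshold upward, as on the slide above.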
Spatial Analysis (continued)

Remove moving regions
Remove weak edges
Morphological operations
Inter-frame tracking
SVM-based classification
Experiments: Algorithms for Comparison

Algorithm 1: Lyu, M.R., et al. "A comprehensive method for multilingual video text detection, localization, and extraction". CSVT, 2005.
1. Sobel edges  2. Local thresholding  3. Iterative projection

Algorithm 2: Qixiang Ye, et al. "Fast and robust text detection in images and video frames". IVC, 2005.
1. Wavelet domain  2. Select 15% edges  3. Sophisticated SVM
The Data Set

Test Set I: 720 × 576 (size of the original video)
Test Set II: 400 × 328 (used in Algorithm 2, for comparison)
Test Set III: 352 × 288 (down-sampled to CIF)
Test Set IV: 176 × 144 (down-sampled to QCIF)

In total: 16 video clips, 6 h 49 min, sampled at 2 frames/second: 49,177 frames, 89,639 captions.
Test environment: Pentium IV 3.2 GHz CPU, 512 MB RAM.
Experiment 1: Robustness and Sensitivity of IFCC

Notes:
(a) IFCC histograms between caption regions with the same texts in adjacent frames are used to evaluate robustness.
(b) IFCC histograms between caption regions with different texts in adjacent frames are used to evaluate sensitivity.
Four resolutions: 720×576, 400×328, 352×288, 176×144.
Experiment 2: Detection Speed

Table 1. Detection speed and speed nonstationarity

Algorithm            Ours    Algorithm 1    Algorithm 2
Speed (frames/s)     9.09    4.46           1.18
Nonstationarity (%)  5.13    11.54          12.69

Notes:
7 frame sequences with no scene/scrolling texts were selected from Test Set II.
Nonstationarity is defined as:

Nonstationarity = (1/7) · Σ_{i=1}^{7} [ Σ_{j=1}^{N_i} (S_{ij} − S_ALL)² / N_i ]^(1/2) / S_ALL

where S_{ij} is the time spent on frame j of sequence i, N_i is the number of frames in sequence i, and S_ALL is the average time per frame.
A smaller nonstationarity means the algorithm spends similar time on simple and complex frames.
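A small sketch of this nonstationarity measure, assuming `times_per_seq` holds the per-frame processing times of each test sequence; the exact normalization is one plausible reading of the definition, so treat it as an assumption:

```python
# Nonstationarity: per-sequence RMS deviation of per-frame processing
# time from the overall mean time, normalised by that mean, then
# averaged over the sequences.

def nonstationarity(times_per_seq):
    all_times = [t for seq in times_per_seq for t in seq]
    s_all = sum(all_times) / len(all_times)  # overall mean time per frame
    total = 0.0
    for seq in times_per_seq:
        ms = sum((t - s_all) ** 2 for t in seq) / len(seq)  # mean sq. dev.
        total += (ms ** 0.5) / s_all
    return total / len(times_per_seq)

# An algorithm with constant per-frame cost has zero nonstationarity:
print(nonstationarity([[0.25, 0.25], [0.25, 0.25, 0.25]]))  # -> 0.0
```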
Why We Are Faster

1. Works on LH and HL: fewer pixels to deal with.
2. ISCC is calculated with 2-D separable filters.
3. The adaptive global threshold removes background faster than local thresholding.
4. Simple but robust features in the SVM classification.
5. Our demo.
Experiment 3: Scene/Scrolling Text Removal: Examples of Success

Notes:
9 videos with scene/scrolling texts from all 4 test sets were used.
Captions can be well distinguished from scene/scrolling texts.
(a) Small scrolling texts at the bottom with a simple background.
(b) Small scrolling texts at the bottom with a transparent background.
(c) Big scrolling texts above the caption with a transparent background.
(d) Scene texts.
Experiment 3: Scene/Scrolling Text Removal: Examples of Failures

Explanations:
It is hard to distinguish captions from static scene text lines.
Better features are required for distinguishing text-like textures.
(a) (b) (c) (d)
Experiment 4: Recall and False Alarm Rate

Table 2. Performance comparison in Experiment 1 and Experiment 2

             Experiment 1                Experiment 2
Algorithm    Recall (%)  FA Rate (%)    Recall (%)  FA Rate (%)  Temporal Coverage (%)
Ours         90.66       28.98          90.66       28.98        100
Algorithm 1  82.11       38.17          82.11       38.17        100
Algorithm 2  88.68       37.49          55.99       36.93        65.02

Recall = S_O / S_G;  False alarm rate = 1 − S_O / S_D
(detected area S_D, ground-truth area S_G, intersection area S_O)
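The area-based recall and false-alarm definitions, sketched for axis-aligned rectangles (x1, y1, x2, y2); the poster does not specify the region shape, so the rectangle assumption is ours:

```python
# Recall = S_O / S_G and False alarm rate = 1 - S_O / S_D, where S_D is
# the detected area, S_G the ground-truth area, and S_O their overlap.

def area(r):
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def intersection(r1, r2):
    x1, y1 = max(r1[0], r2[0]), max(r1[1], r2[1])
    x2, y2 = min(r1[2], r2[2]), min(r1[3], r2[3])
    return area((x1, y1, x2, y2))  # zero if the rectangles are disjoint

def recall_false_alarm(detected, truth):
    s_d, s_g = area(detected), area(truth)
    s_o = intersection(detected, truth)
    return s_o / s_g, 1.0 - s_o / s_d

# A detection covering half the ground truth, with no excess area:
print(recall_false_alarm((0, 0, 10, 5), (0, 0, 10, 10)))  # -> (0.5, 0.0)
```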
Why We Perform Better

1. Moving text-like textures are removed by IFCC, giving a higher precision.
2. The adaptive global threshold makes the text edges "suppress" other edges.
3. Selecting a fixed percentage of pixels (as in Algorithm 2) leads to additional false alarms in frames with no text.
4. Only a small number of parameters to adjust.
Conclusion

1. Inter-frame correlation is useful for high-speed caption detection.
2. Better features are needed for distinguishing captions from textures.
3. Distributed architecture: test various parameters at several terminals and merge the results together.
4. Incremental learning of the parameter settings (automatic or semi-automatic).
Thanks! Q&A?