Poster
Transcript of Poster
Exploring Inter-Frame Correlation Analysis and Wavelet-Domain Modeling for Real-Time Caption Detection in Streaming Video
Jia Li1, Yonghong Tian2, Wen Gao1,2
1 Key Laboratory of Intelligent Information Processing, ICT, CAS
2 Institute of Digital Media, School of EE & CS, PKU
Outline
Background
Problem statement
System architecture
Experiments
Conclusion
Background: Caption Detection
Caption Text
Scene Text
Scrolling Text
Background: Frequently Used Methods

Sobel Edges
Chunmei Liu, et al. "Text detection in images based on unsupervised classification of edge-based features". ICDAR, 2005.
Lyu, M.R., et al. "A comprehensive method for multilingual video text detection, localization, and extraction". CSVT, 2005.

Wavelet Domain
Huiping Li, et al. "Automatic text detection and tracking in digital video". IEEE Transactions on Image Processing, 2000.
Qixiang Ye, et al. "Fast and robust text detection in images and video frames". Image and Vision Computing, 2005.

Directly on the Image
Kwang In Kim, et al. "Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm". PAMI, 2003.
The Problems

Text detection in images & video vs. text detection in streaming video:
Detection speed
Discerning the text type
Removal of texture

Why streaming videos are different:
Faster: frames arrive in real time.
More detailed: few clues are available, so texts are organized by their types.
More accurate: simple features must remove text-like textures.
Our Solution

Inter-frame correlations over the frame sequence, with fast & robust quantification, distinguish:
Caption texts: static edges
Scene text: stable background
Scrolling text: moving edges / moving textures
System Architecture

Data: the previous and next frames are used to assist text detection in the current frame.
Temporal Analysis: remove unstable edges.
Spatial Analysis: remove weak edges.
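As an illustration of the wavelet front end, here is a minimal one-level Haar decomposition sketch; the poster does not state which wavelet is used, and LH/HL naming conventions vary between references, so treat the subband labels below as one common convention:

```python
# Illustrative one-level 2-D Haar decomposition of a grayscale frame
# (lists of lists, even height and width). Subband naming here: LH
# responds to horizontal edges, HL to vertical edges.

def haar_subbands(img):
    """Return (LL, LH, HL, HH) subbands for an even-sized 2-D list `img`."""
    h, w = len(img), len(img[0])
    LL, LH, HL, HH = [], [], [], []
    for i in range(0, h, 2):
        ll, lh, hl, hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]       # top pair of the 2x2 block
            c, d = img[i + 1][j], img[i + 1][j + 1]  # bottom pair
            ll.append((a + b + c + d) / 4.0)  # low-low: local average
            lh.append((a + b - c - d) / 4.0)  # low-high: horizontal edges
            hl.append((a - b + c - d) / 4.0)  # high-low: vertical edges
            hh.append((a - b - c + d) / 4.0)  # high-high: diagonal detail
        LL.append(ll); LH.append(lh); HL.append(hl); HH.append(hh)
    return LL, LH, HL, HH

# A vertical step edge shows up in HL, not in LH:
frame = [[0, 9, 9, 9]] * 4
LL, LH, HL, HH = haar_subbands(frame)
print(HL[0])  # -> [-4.5, 0.0]
print(LH[0])  # -> [0.0, 0.0]
```

The temporal and spatial analyses described above would then operate on the LH and HL maps of consecutive frames.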
Temporal Analysis
The goal: remove unstable edges in the subbands LH and HL of the wavelet domain.

Edge stability in subband WS (WS ∈ {LH, HL}) is evaluated with Inter-Subband Correlation Coefficients (ISCC), which lie in [−1, 1]:

ISCC(x, y, i, i+1, WS) =
  1,  if σ_i(x, y, WS) · σ_{i+1}(x, y, WS) = 0
  MAX( MIN( σ_{i,i+1}(x, y, WS) / (σ_i(x, y, WS) · σ_{i+1}(x, y, WS)), 1 ), −1 ),  elsewise

where σ_i(x, y, WS) is the square root of the local variance.

Local covariance:
σ_{i,i+1}(x, y, WS) = 1/(2M+1)² · Σ_{a=x−M}^{x+M} Σ_{b=y−M}^{y+M} WS_i(a, b) · WS_{i+1}(a, b)

Local variance:
σ_i²(x, y, WS) = 1/(2M+1)² · Σ_{a=x−M}^{x+M} Σ_{b=y−M}^{y+M} WS_i²(a, b)
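A minimal pure-Python sketch of the ISCC computation, assuming a (2M+1)×(2M+1) local window and zero-mean subband coefficients; `WS_i` and `WS_i1` are hypothetical coefficient maps (e.g. LH) of two adjacent frames:

```python
# Hypothetical sketch of ISCC: local covariance of adjacent-frame subband
# coefficients, normalised by the two local standard deviations and
# clipped to [-1, 1]. Zero-mean coefficients are assumed, so the windowed
# average of a product serves as both covariance and variance.

def local_moment(A, B, x, y, M):
    """Windowed average of A(a,b)*B(a,b) around (x, y)."""
    total = 0.0
    for a in range(x - M, x + M + 1):
        for b in range(y - M, y + M + 1):
            total += A[a][b] * B[a][b]
    return total / (2 * M + 1) ** 2

def iscc(WS_i, WS_i1, x, y, M=1):
    """Inter-Subband Correlation Coefficient at (x, y), clipped to [-1, 1]."""
    cov = local_moment(WS_i, WS_i1, x, y, M)      # sigma_{i,i+1}
    var_i = local_moment(WS_i, WS_i, x, y, M)     # sigma_i^2
    var_i1 = local_moment(WS_i1, WS_i1, x, y, M)  # sigma_{i+1}^2
    denom = (var_i * var_i1) ** 0.5
    if denom == 0.0:  # degenerate (flat) window
        return 1.0
    return max(min(cov / denom, 1.0), -1.0)

# A stable edge (identical coefficients in both frames) correlates perfectly:
patch = [[1.0, -1.0, 1.0]] * 3
print(iscc(patch, patch, 1, 1, M=1))  # -> 1.0
```

Interior pixels only: the window must fit inside the maps, so a real implementation would pad or skip the borders.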
Temporal Analysis (continued)

Based on ISCC, the inter-frame correlation coefficient (IFCC) and the temporal stability (TS) are defined as:

Inter-frame correlation coefficient:
IFCC(x, y, i, i+1) = MAX_{WS ∈ {LH, HL}} { ISCC(x, y, i, i+1, WS) }

Temporal stability:
TS(x, y, i) = MAX( IFCC(x, y, i−1, i), IFCC(x, y, i, i+1) )
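The IFCC and TS combination steps can be sketched directly from these definitions; the ISCC inputs below are stand-in values, not the full windowed computation:

```python
# Sketch of the two max-combination steps: IFCC takes the maximum ISCC
# over the LH and HL subbands; TS takes the maximum IFCC over the
# previous and next frame pairs.

def ifcc(iscc_lh, iscc_hl):
    """IFCC(x, y, i, i+1) = max over WS in {LH, HL} of ISCC(..., WS)."""
    return max(iscc_lh, iscc_hl)

def temporal_stability(ifcc_prev, ifcc_next):
    """TS(x, y, i) = max(IFCC(x, y, i-1, i), IFCC(x, y, i, i+1))."""
    return max(ifcc_prev, ifcc_next)

# An edge stable against either neighbouring frame keeps a high TS:
print(temporal_stability(ifcc(0.95, 0.40), ifcc(0.10, 0.20)))  # -> 0.95
```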
Temporal Analysis (continued)

Temporal stability of edges: inter-subband correlation, then inter-frame correlation.

ISCC: Robust and Sensitive
(a)-(d): ISCC is robust to background changing, yet sensitive to slight motions.
Spatial Analysis

The goal: remove static backgrounds. Texts are collections of strong edges, so adaptive global thresholds on the wavelet subbands LH/HL are used to remove the backgrounds.

The wavelet coefficients in LH/HL are modeled with two Generalized Gaussian Distributions:

f(x; μ, σ, γ) = a · e^(−[b·|x − μ|]^γ),  with  b = (1/σ) · [Γ(3/γ) / Γ(1/γ)]^(1/2)  and  a = b·γ / (2·Γ(1/γ))

Parameters: mean μ, variance σ², and shape parameter γ. What is the shape parameter?
Shape Parameter and Threshold Selection

f(x; μ, σ, γ) = a · e^(−[b·|x − μ|]^γ)

Sharifi, K., and Leon-Garcia, A. "Estimation of shape parameter for generalized Gaussian distribution in subband decompositions of video". CSVT, 5(1), 1995, pp. 52-56.

In the wavelet subbands: more edges give a larger γ, calling for a larger threshold; fewer edges give a smaller γ, calling for a smaller threshold.

Usually the coefficients are zero-mean, so the adaptive global threshold is selected according to the variance and the shape parameter:

threshold_WS = c_WS · σ_WS,  c_WS = 4.0 · γ_WS
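Here is a sketch of moment-matching shape-parameter estimation in the spirit of Sharifi and Leon-Garcia, assuming zero-mean coefficients: for a generalized Gaussian, E|x|/σ = Γ(2/γ) / √(Γ(1/γ)·Γ(3/γ)), which increases with γ, so the sample ratio can be inverted by bisection. The sample data below are synthetic:

```python
import math
import random

# Moment-matching estimate of the GGD shape parameter gamma:
# match the sample ratio E|x| / sigma against
#   r(gamma) = Gamma(2/gamma) / sqrt(Gamma(1/gamma) * Gamma(3/gamma)),
# which is monotonically increasing in gamma, and invert by bisection.

def ggd_ratio(g):
    return math.gamma(2.0 / g) / math.sqrt(math.gamma(1.0 / g) * math.gamma(3.0 / g))

def estimate_shape(coeffs, lo=0.1, hi=5.0):
    n = len(coeffs)
    sigma = math.sqrt(sum(c * c for c in coeffs) / n)  # zero-mean assumed
    target = sum(abs(c) for c in coeffs) / (n * sigma)
    for _ in range(60):  # bisection on the monotone ratio function
        mid = 0.5 * (lo + hi)
        if ggd_ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

# Laplacian-like data (true gamma = 1) should recover a shape near 1:
random.seed(0)
lap = [random.choice([-1, 1]) * random.expovariate(1.0) for _ in range(20000)]
print(estimate_shape(lap))
```

A larger estimated γ (more edge energy in the subband) would then scale the global threshold upward, as on the slide above.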
Spatial Analysis (continued)

Remove moving regions
Remove weak edges
Morphological operations
Inter-frame tracking
SVM-based classification
Experiments: Algorithms for Comparison

Algorithm 1: Lyu, M.R., et al. "A comprehensive method for multilingual video text detection, localization, and extraction". CSVT, 2005.
1. Sobel edges  2. Local thresholding  3. Iterative projection

Algorithm 2: Qixiang Ye, et al. "Fast and robust text detection in images and video frames". IVC, 2005.
1. Wavelet domain  2. Select 15% edges  3. Sophisticated SVM
The Data Set

Test Set I: 720 × 576 (size of the original video)
Test Set II: 400 × 328 (used in Algorithm 2, for comparison)
Test Set III: 352 × 288 (down-sampled to CIF)
Test Set IV: 176 × 144 (down-sampled to QCIF)

In total: 16 video clips, 6 h 49 min, sampled at 2 frames/second: 49,177 frames, 89,639 captions.
Test environment: Pentium IV 3.2 GHz CPU, 512 MB RAM.
Experiment 1: Robustness and Sensitivity of IFCC

Notes:
(a) IFCC histograms between caption regions with the same texts in adjacent frames are used to evaluate robustness.
(b) IFCC histograms between caption regions with different texts in adjacent frames are used to evaluate sensitivity.
Four resolutions: 720×576, 400×328, 352×288, 176×144.
Experiment 2: Detection Speed

Table 1. Detection speed and speed nonstationarity

Algorithm            Ours    Algorithm 1    Algorithm 2
Speed (frames/s)     9.09    4.46           1.18
Nonstationarity (%)  5.13    11.54          12.69

Notes:
7 frame sequences with no scene/scrolling texts were selected from Test Set II.
Nonstationarity is defined as:

Nonstationarity = (1/7) · Σ_{i=1}^{7} [ Σ_{j=1}^{N_i} (S_{ij} − S_ALL)² / N_i ]^(1/2) / S_ALL

where S_{ij} is the time spent on frame j of sequence i, N_i is the number of frames in sequence i, and S_ALL is the average time per frame.
A smaller nonstationarity means the algorithm spends similar time on simple and complex frames.
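A small sketch of this nonstationarity measure, assuming `times_per_seq` holds the per-frame processing times of each test sequence; the exact normalization is one plausible reading of the definition, so treat it as an assumption:

```python
# Nonstationarity: per-sequence RMS deviation of per-frame processing
# time from the overall mean time, normalised by that mean, then
# averaged over the sequences.

def nonstationarity(times_per_seq):
    all_times = [t for seq in times_per_seq for t in seq]
    s_all = sum(all_times) / len(all_times)  # overall mean time per frame
    total = 0.0
    for seq in times_per_seq:
        ms = sum((t - s_all) ** 2 for t in seq) / len(seq)  # mean sq. dev.
        total += (ms ** 0.5) / s_all
    return total / len(times_per_seq)

# An algorithm with constant per-frame cost has zero nonstationarity:
print(nonstationarity([[0.25, 0.25], [0.25, 0.25, 0.25]]))  # -> 0.0
```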
Why We Are Faster

1. Works on LH and HL: fewer pixels to deal with.
2. ISCC is calculated with 2-D separable filters.
3. The adaptive global threshold removes background faster than local thresholding.
4. Simple but robust features in the SVM classification.
5. Our demo.
Experiment 3: Scene/Scrolling Text Removal: Examples of Success

Notes:
9 videos with scene/scrolling texts from all 4 test sets were used.
Captions can be well distinguished from scene/scrolling texts.
(a) Small scrolling texts at the bottom with a simple background.
(b) Small scrolling texts at the bottom with a transparent background.
(c) Big scrolling texts above the caption with a transparent background.
(d) Scene texts.
Experiment 3: Scene/Scrolling Text Removal: Examples of Failures

Explanations:
It is hard to distinguish captions from static scene text lines.
Better features are required for distinguishing text-like textures.
(a) (b) (c) (d)
Experiment 4: Recall and False Alarm Rate

Table 2. Performance comparison in Experiment 1 and Experiment 2

             Experiment 1                Experiment 2
Algorithm    Recall (%)  FA Rate (%)    Recall (%)  FA Rate (%)  Temporal Coverage (%)
Ours         90.66       28.98          90.66       28.98        100
Algorithm 1  82.11       38.17          82.11       38.17        100
Algorithm 2  88.68       37.49          55.99       36.93        65.02

Recall = S_O / S_G;  False alarm rate = 1 − S_O / S_D
(detected area S_D, ground-truth area S_G, intersection area S_O)
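The area-based recall and false-alarm definitions, sketched for axis-aligned rectangles (x1, y1, x2, y2); the poster does not specify the region shape, so the rectangle assumption is ours:

```python
# Recall = S_O / S_G and False alarm rate = 1 - S_O / S_D, where S_D is
# the detected area, S_G the ground-truth area, and S_O their overlap.

def area(r):
    return max(0, r[2] - r[0]) * max(0, r[3] - r[1])

def intersection(r1, r2):
    x1, y1 = max(r1[0], r2[0]), max(r1[1], r2[1])
    x2, y2 = min(r1[2], r2[2]), min(r1[3], r2[3])
    return area((x1, y1, x2, y2))  # zero if the rectangles are disjoint

def recall_false_alarm(detected, truth):
    s_d, s_g = area(detected), area(truth)
    s_o = intersection(detected, truth)
    return s_o / s_g, 1.0 - s_o / s_d

# A detection covering half the ground truth, with no excess area:
print(recall_false_alarm((0, 0, 10, 5), (0, 0, 10, 10)))  # -> (0.5, 0.0)
```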
Why We Perform Better

1. Moving text-like textures are removed by IFCC, giving a higher precision.
2. The adaptive global threshold makes the text edges "suppress" other edges.
3. Selecting a fixed percentage of pixels (as in Algorithm 2) leads to additional false alarms in frames with no text.
4. Only a small number of parameters to adjust.
Conclusion

1. Inter-frame correlation is useful for high-speed caption detection.
2. Better features are needed for distinguishing captions from textures.
3. Distributed architecture: test various parameters at several terminals and merge the results together.
4. Incremental learning of the parameter settings (automatic or semi-automatic).
Thanks! Q&A?