A New Approach for Video Text Detection and Localization

M. Cai, J. Song and M.R. Lyu

VIEW Technologies

The Chinese University of Hong Kong

Related work

Text Area Detection
– Uncompressed domain methods
    • Texture-based
    • Color-based
    • Edge-based
– Compressed domain methods
    • DCT coefficients
    • Number of intra-coded blocks on P- / B- frames

Text String Localization
– Bottom-up scheme
– Top-down scheme

Language-independent characteristics

Contrast
– An adaptive contrast threshold according to the background complexity

Color
– Color bleeding caused by compression

Orientation
– Well-defined size and orientation make it easy to understand

Stationary location
– Text stays at the same location for a certain length of time

Language-dependent characteristics

Characteristic               English            Chinese
Stroke density               Roughly similar    Varies dramatically
Min(Font size)               10 pixels high     20 pixels high
Min(Aspect ratio)            Relatively large   Relatively small
Stroke direction statistics  Mainly vertical    Vertical, horizontal, left diagonal, right diagonal

Workflow

Sampling & color space conversion

Multi-frame comparison

Video text detection and localization on every sampled frame

A sequential multi-resolution paradigm

[Figure: sequential multi-resolution paradigm. Levels l = 1, 2, ..., n-1, n are processed in sequence. At each level the image is resized (Size / f(l)), edge detection produces an edge map, and text area detection followed by text string localization produces text regions. The regions are then mapped back to the original coordinates (Size × f(l)). The output is the final text regions with original coordinates.]
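The coordinate bookkeeping of the multi-resolution paradigm can be sketched as follows. This is only an illustration under assumptions: the slides do not define f(l), so `scale_factor` uses a hypothetical power-of-two choice, and `detect_at_level` is a placeholder for whatever per-level detector is used.

```python
def scale_factor(level):
    # Hypothetical choice of f(l); the slides do not define it.
    return 2 ** (level - 1)


def to_original_coords(box, level):
    """Map a box (x0, y0, x1, y1) found at `level` back to full resolution."""
    f = scale_factor(level)
    return tuple(v * f for v in box)


def detect_all_levels(detect_at_level, n_levels):
    """Run a per-level detector and collect boxes in original coordinates.

    `detect_at_level(level)` is an assumed callback returning boxes expressed
    in the coordinates of the image resized by 1 / f(level).
    """
    results = []
    for level in range(1, n_levels + 1):
        for box in detect_at_level(level):
            results.append(to_original_coords(box, level))
    return results
```

With this choice of f(l), a box found at level 2 has its coordinates doubled before it is added to the final list.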

Text detection

Edge detection
– Sobel edge detector

Local thresholding
– Adaptive to background complexity

Text-like area recovery
– Enhance the density of text areas
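For the edge detection step, a minimal sketch of a Sobel edge map is given here. It assumes a grayscale image stored as a 2-D list and uses |Gx| + |Gy| as the edge strength. The slides name only the Sobel detector, so the magnitude approximation and the zero border are assumptions.

```python
def sobel_edge_map(gray):
    """Return the edge strength |Gx| + |Gy| for a 2-D list of gray values.

    Border pixels are left at 0 because the 3x3 kernels do not fit there.
    """
    h, w = len(gray), len(gray[0])
    edges = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # Horizontal gradient: right column minus left column.
            gx = (gray[y - 1][x + 1] + 2 * gray[y][x + 1] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y][x - 1] - gray[y + 1][x - 1])
            # Vertical gradient: bottom row minus top row.
            gy = (gray[y + 1][x - 1] + 2 * gray[y + 1][x] + gray[y + 1][x + 1]
                  - gray[y - 1][x - 1] - 2 * gray[y - 1][x] - gray[y - 1][x + 1])
            edges[y][x] = abs(gx) + abs(gy)
    return edges
```

A vertical intensity step produces a strong response along the step and no response in flat areas.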

Local Thresholding

Use a small kernel (shown in gray) to scan the whole edge map row by row.

In the bigger window surrounding the kernel, check the background type: “Clear” or “Noisy”.

For a Clear background, determine the local threshold from the low part of the edge strength histogram in the bigger window; for a Noisy background, determine it from the high part.

[Figure: (a) Concentric kernel (height h) and window (height 3h). (b) A window on the multi-line text area and the horizontal projection in it (P1 ... P3h). (c) Local threshold selection: histogram of count against edge strength (0 to MAX), divided into a low part and a high part.]
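The local thresholding idea can be sketched roughly as below. The slides do not give the exact rules, so everything numeric here is a hypothetical stand-in: `is_noisy` guesses the background type from the horizontal projection of the window, and `local_threshold` uses two quantiles (`low_q`, `high_q`) to represent the low and high parts of the edge strength histogram.

```python
def is_noisy(window, blank_ratio=0.2):
    """Hypothetical background test on a window of edge strengths.

    The window counts as "Clear" when enough rows of its horizontal
    projection carry almost no edge energy, and as "Noisy" otherwise.
    """
    proj = [sum(row) for row in window]  # horizontal projection
    peak = max(proj)
    if peak == 0:
        return False
    blank_rows = sum(1 for p in proj if p < 0.1 * peak)
    return blank_rows / len(proj) < blank_ratio


def local_threshold(window, noisy, low_q=0.25, high_q=0.75):
    """Pick a threshold from the low or high part of the edge strengths.

    low_q and high_q are assumed quantiles, not values from the slides.
    """
    values = sorted(v for row in window for v in row if v > 0)
    if not values:
        return 0
    q = high_q if noisy else low_q
    idx = min(len(values) - 1, int(q * len(values)))
    return values[idx]
```

The intent is that a noisy background gets a higher threshold, so fewer background edges survive, while a clear background keeps weaker text edges.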

Thresholding result comparison

[Figure: video image, local thresholding results, and global thresholding results]

Text-like area recovery

Labeling: classify each current edge pixel as “TEXT” or “NON_TEXT” based on its local density.

Recovery/Suppression:
– Bring back the neighboring lower-strength edge pixels of the TEXT edge pixels.
– Suppress the NON_TEXT edge pixels.

[Figure: edge map before recovery and after recovery]
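A minimal sketch of the labeling and recovery/suppression steps follows. It assumes a 2-D list of edge strengths plus a 0/1 mask of the pixels that survived local thresholding. The neighborhood `radius` and the `min_density` cut-off are hypothetical parameters, since the slides do not give values.

```python
def recover_text_like(edges, strong, radius=1, min_density=0.5):
    """Label thresholded edge pixels by local density, then recover or suppress.

    edges:  2-D list of edge strengths.
    strong: 2-D 0/1 mask of pixels that passed local thresholding.
    Returns a 0/1 mask of text-like pixels.
    """
    h, w = len(edges), len(edges[0])

    def neighbours(y, x):
        for ny in range(max(0, y - radius), min(h, y + radius + 1)):
            for nx in range(max(0, x - radius), min(w, x + radius + 1)):
                yield ny, nx

    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if not strong[y][x]:
                continue
            cells = list(neighbours(y, x))
            density = sum(strong[ny][nx] for ny, nx in cells) / len(cells)
            if density >= min_density:
                # TEXT pixel: bring back every neighbouring edge pixel,
                # including those whose strength was below the threshold.
                for ny, nx in cells:
                    if edges[ny][nx] > 0:
                        out[ny][nx] = 1
            # NON_TEXT pixels are never written, so they stay suppressed
            # unless a TEXT neighbour recovers them.
    return out
```

In a small example, a dense cluster of strong pixels pulls its weaker neighbors back in, while an isolated strong pixel is removed.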

Coarse-to-fine Text localization

Projection-based top-down localization.

To handle complex text layout.

[Flowchart: coarse-to-fine localization. Initialization: the whole edge map is the only region in the processing array. Pop the first region from the processing array; if the array is empty, terminate. Apply horizontal projection and vertical projection to the region and test whether it is divisible. If it is divisible (Y), add each sub-region to the processing array. If it is not (N), check the aspect ratio of the indivisible region: add it to the resulting text regions (Y) or discard it as a false region (N).]
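The projection-based top-down procedure can be sketched as a queue of regions that are split repeatedly. This is an illustration under assumptions, not the authors' exact algorithm: a region counts as divisible when an empty row or column of a binary edge mask separates or trims it, the horizontal projection is tried before the vertical one, and `min_aspect` is a hypothetical width-to-height cut-off.

```python
from collections import deque


def runs(profile, offset):
    """Return half-open (start, end) spans where the projection is non-zero."""
    spans, start = [], None
    for i, v in enumerate(profile):
        if v and start is None:
            start = i
        elif not v and start is not None:
            spans.append((offset + start, offset + i))
            start = None
    if start is not None:
        spans.append((offset + start, offset + len(profile)))
    return spans


def split(mask, region):
    """Split a region (y0, y1, x0, x1) by projection.

    Returns [region] itself when neither projection can divide it.
    """
    y0, y1, x0, x1 = region
    rows = [sum(mask[y][x0:x1]) for y in range(y0, y1)]  # horizontal projection
    parts = [(a, b, x0, x1) for a, b in runs(rows, y0)]
    if parts != [region]:
        return parts
    cols = [sum(mask[y][x] for y in range(y0, y1)) for x in range(x0, x1)]
    return [(y0, y1, a, b) for a, b in runs(cols, x0)]  # vertical projection


def localize(mask, min_aspect=2.0):
    """Return indivisible regions whose width / height is at least min_aspect."""
    queue = deque([(0, len(mask), 0, len(mask[0]))])  # whole edge map
    found = []
    while queue:
        region = queue.popleft()
        subs = split(mask, region)
        if subs != [region]:          # divisible: process the sub-regions later
            queue.extend(subs)
            continue
        y0, y1, x0, x1 = region
        if (x1 - x0) / (y1 - y0) >= min_aspect:
            found.append(region)      # plausible text line
        # otherwise the region is discarded as a false region
    return found
```

A queue rather than recursion mirrors the processing array in the flowchart: every sub-region goes back into the array until nothing can be divided further.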

Localization steps

[Figure: localization steps (1) to (4)]

Experimental results

Performance statistics

Statistics of 10 news videos:

Processing time per frame: 0.25 s (PIII 1 GHz CPU)

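\[
\text{Detection rate} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{ground truth text regions})} = 93.6\%
\]

\[
\text{Detection accuracy} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{all detected text regions})} = 87.2\%
\]

\[
\text{Localization accuracy} = \frac{\mathrm{Area}(\text{detected text regions} \cap \text{ground truth text regions})}{\mathrm{Area}(\text{ground truth text regions})} > 90\%
\]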
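The three measures can be computed with a few lines of code. This sketch assumes axis-aligned boxes given as (x0, y0, x1, y1) and one already matched detected/ground-truth pair for the localization measure. The function names are illustrative, and the numbers in the usage note are made-up inputs, not results from the slides.

```python
def detection_rate(n_correct, n_ground_truth):
    """Correctly detected regions divided by ground-truth regions."""
    return n_correct / n_ground_truth


def detection_accuracy(n_correct, n_detected):
    """Correctly detected regions divided by all detected regions."""
    return n_correct / n_detected


def box_area(box):
    """Area of a box (x0, y0, x1, y1); degenerate boxes have area 0."""
    x0, y0, x1, y1 = box
    return max(0, x1 - x0) * max(0, y1 - y0)


def localization_accuracy(detected, truth):
    """Overlap area divided by ground-truth area for one matched pair."""
    overlap = (
        max(detected[0], truth[0]), max(detected[1], truth[1]),
        min(detected[2], truth[2]), min(detected[3], truth[3]),
    )
    return box_area(overlap) / box_area(truth)
```

For example, a detected box (0, 0, 10, 10) against a ground-truth box (5, 0, 15, 10) overlaps on half of the ground-truth area, so the localization accuracy for that pair is 0.5.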
