A New Approach for Video Text Detection and Localization


Page 1: A New Approach for Video Text Detection and Localization

A New Approach for Video Text Detection and Localization

M. Cai, J. Song and M.R. Lyu

VIEW Technologies

The Chinese University of Hong Kong

Page 2: A New Approach for Video Text Detection and Localization

Related work

Text Area Detection
– Uncompressed domain methods
  • Texture-based
  • Color-based
  • Edge-based
– Compressed domain methods
  • DCT coefficients
  • Number of intra-coded blocks on P- / B- frames

Text String Localization
– Bottom-up scheme
– Top-down scheme

Page 3: A New Approach for Video Text Detection and Localization

Language-independent characteristics

Contrast
– An adaptive contrast threshold according to the background complexity

Color
– Color bleeding caused by compression

Orientation
– Well-defined size and orientation make it easy to understand

Stationary location
– Text stays on screen for a certain length of time

Page 4: A New Approach for Video Text Detection and Localization

Language-dependent characteristics

Characteristic               English            Chinese
Stroke density               roughly similar    varies dramatically
Min(Font size)               10-pixel high      20-pixel high
Min(Aspect ratio)            relatively large   relatively small
Stroke direction statistics  mainly vertical    vertical, horizontal, left diagonal, right diagonal

Page 5: A New Approach for Video Text Detection and Localization

Workflow

Sampling & color space conversion

Multi-frame comparison

Video text detection and localization on every sampled frame

Page 6: A New Approach for Video Text Detection and Localization

A sequential multi-resolution paradigm

[Figure: sequential multi-resolution pipeline. Edge detection turns the original image into an edge map. At each level l (Level = 1, 2, ..., n-1, n) the edge map is resized to Size / f(l), text area detection and text string localization produce text regions, and the regions are scaled by f(l) back to their original coordinates. Output: final text regions with original coordinates.]
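The level loop in the figure can be sketched as follows. This is only an illustration under stated assumptions: the slides do not define f(l), so `scale_factor` below (a halving per level) is hypothetical, as are the max-pooling `downsample` helper and the `detect_and_localize` callback, which stands in for the detection and localization stages and is expected to return boxes as (top, left, bottom, right).

```python
import numpy as np

def scale_factor(level):
    # Hypothetical f(l): 1, 2, 4, ... (the slides do not give the real f).
    return 2 ** (level - 1)

def downsample(edge_map, factor):
    # Block-wise max pooling, an assumed way to shrink the map by `factor`.
    h, w = edge_map.shape
    h2, w2 = h // factor, w // factor
    cropped = edge_map[:h2 * factor, :w2 * factor]
    return cropped.reshape(h2, factor, w2, factor).max(axis=(1, 3))

def to_original(box, factor):
    # Scale a (top, left, bottom, right) box back to full-size coordinates.
    return tuple(int(v) * factor for v in box)

def multi_resolution(edge_map, n_levels, detect_and_localize):
    regions = []
    for level in range(1, n_levels + 1):
        f = scale_factor(level)
        small = downsample(np.asarray(edge_map), f)   # Size / f(l)
        for box in detect_and_localize(small):
            regions.append(to_original(box, f))       # back to original coordinates
    return regions
```

The sketch simply collects the regions from every level; how the slides' sequential paradigm merges or removes regions between levels is not stated, so that step is left out.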

Page 7: A New Approach for Video Text Detection and Localization

Text detection

Edge detection
– Sobel edge detector

Local thresholding
– Adaptive to background complexity

Text-like area recovery
– Enhance the density of text areas
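As a minimal sketch of the first stage, the Sobel edge strength of a grayscale frame can be computed with plain NumPy. The function name `sobel_edge_map`, the edge-replicating border handling, and the use of the gradient magnitude as edge strength are assumptions; the slides only name the Sobel detector.

```python
import numpy as np

def sobel_edge_map(gray):
    g = np.asarray(gray, dtype=float)
    p = np.pad(g, 1, mode="edge")  # replicate the border so the output keeps the input size
    # Horizontal gradient: right column minus left column, weighted 1-2-1.
    gx = (p[:-2, 2:] + 2 * p[1:-1, 2:] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[1:-1, :-2] - p[2:, :-2])
    # Vertical gradient: bottom row minus top row, weighted 1-2-1.
    gy = (p[2:, :-2] + 2 * p[2:, 1:-1] + p[2:, 2:]
          - p[:-2, :-2] - 2 * p[:-2, 1:-1] - p[:-2, 2:])
    return np.hypot(gx, gy)  # edge strength per pixel
```

The result is a floating-point edge map of the same shape as the input, which the later stages would then threshold.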

Page 8: A New Approach for Video Text Detection and Localization

Local Thresholding

Use a small kernel (gray) to scan the whole edge map row by row.

In the bigger window surrounding the kernel, check the background type: “Clear” or “Noisy”.

For Clear and Noisy backgrounds, determine the local threshold from the low and high parts, respectively, of the edge strength histogram in the bigger window.

[Figure: (a) Concentric kernel and window (kernel size h, window size 3h). (b) A window on the multi-line text area and the horizontal projection in it (P1 ... P3h). (c) Local threshold selection: histogram of count against edge strength from 0 to MAX, split into a low part and a high part.]
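The scan described on this slide might look roughly like the sketch below. Several details are assumptions, because the slides do not give them: a window counts as "Clear" when at least one of its rows has an almost empty horizontal projection (`blank_ratio`), the histogram is split at MAX / 2, and the threshold is the mean edge strength of the chosen part. The function name `local_threshold` is hypothetical.

```python
import numpy as np

def local_threshold(edge_map, h=4, blank_ratio=0.1):
    edges = np.asarray(edge_map, dtype=float)
    rows, cols = edges.shape
    out = np.zeros_like(edges)
    for top in range(0, rows, h):            # move the h x h kernel row by row
        for left in range(0, cols, h):
            # Window of about 3h x 3h around the kernel, clipped at the borders.
            win = edges[max(top - h, 0):top + 2 * h, max(left - h, 0):left + 2 * h]
            peak = win.max()
            if peak == 0:
                continue                     # no edges nearby, nothing to keep
            # Assumed background test: a clear window has a nearly blank row.
            proj = win.sum(axis=1)
            clear = proj.min() <= blank_ratio * proj.max()
            # Assumed split of the edge strength histogram at MAX / 2.
            strengths = win[win > 0]
            low = strengths[strengths < peak / 2]
            high = strengths[strengths >= peak / 2]
            part = low if (clear and low.size) else high
            thr = part.mean()
            kernel = edges[top:top + h, left:left + h]
            out[top:top + h, left:left + h] = np.where(kernel >= thr, kernel, 0)
    return out
```

With these choices a clear background yields a low threshold that keeps weak strokes, while a noisy background yields a high threshold that removes weak clutter.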

Page 9: A New Approach for Video Text Detection and Localization

Thresholding result comparison

[Figure panels: Video image; Local thresholding results; Global thresholding results]

Page 10: A New Approach for Video Text Detection and Localization

Text-like area recovery

Labeling: Classify the current edge pixels as “TEXT” or “NON_TEXT” based on their local density.

Recovery/Suppression:
– Bring back neighboring lower-strength edge pixels of the TEXT edge pixels.
– The NON_TEXT edge pixels are suppressed.

[Figure panels: Before recovery; After recovery]
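The labeling and recovery steps can be sketched as below. The neighbourhood sizes (`radius`, and a 3 x 3 neighbourhood for recovery), the density cut-off `min_density`, and the function names are assumptions chosen for illustration; the slides state only that labeling uses local density and that weaker neighbours of TEXT pixels are brought back.

```python
import numpy as np

def box_sum(mask, radius):
    # Number of set pixels in the (2r+1) x (2r+1) neighbourhood of each pixel.
    m = np.pad(mask.astype(int), radius)
    size = 2 * radius + 1
    out = np.zeros(mask.shape, dtype=int)
    for dy in range(size):
        for dx in range(size):
            out += m[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

def recover_text_like_areas(edge_map, thresholded, radius=2, min_density=0.2):
    edge_map = np.asarray(edge_map, dtype=float)
    kept = np.asarray(thresholded) > 0                 # pixels that survived thresholding
    density = box_sum(kept, radius) / (2 * radius + 1) ** 2
    text = kept & (density >= min_density)             # TEXT; the other kept pixels are NON_TEXT
    near_text = box_sum(text, 1) > 0                   # TEXT pixels and their 3 x 3 neighbours
    recovered = near_text & (edge_map > 0)             # bring back weaker neighbouring edges
    return np.where(recovered, edge_map, 0)            # NON_TEXT pixels are suppressed
```

Here `edge_map` is the edge strength before thresholding and `thresholded` is the output of the local thresholding stage, so recovery can restore pixels that thresholding removed.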

Page 11: A New Approach for Video Text Detection and Localization

Coarse-to-fine Text localization

Projection-based top-down localization.

To handle complex text layout.

[Flowchart: coarse-to-fine localization
1. Initialization: the whole edge map is the only region in the processing array.
2. Pop the first region from the processing array. If the array is empty, terminate.
3. Apply horizontal projection and vertical projection to the region and test whether it is divisible.
4. Divisible (Y): add each sub-region to the processing array.
5. Indivisible (N): check the aspect ratio. Add the region to the resulting text regions (Y) or discard it as a false region (N).]
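A compact sketch of this projection-based top-down loop follows. The gap parameters (`row_gap`, `col_gap`), the aspect-ratio limit `min_aspect`, and the rule that a region is divisible whenever a projection has blank gaps or blank margins are assumptions; the slides do not specify the divisibility test.

```python
import numpy as np
from collections import deque

def split_runs(profile, min_gap=1):
    # Runs [start, end) of non-empty entries; gaps shorter than min_gap are bridged.
    runs = []
    for i, v in enumerate(profile):
        if v > 0:
            if runs and i - runs[-1][1] < min_gap:
                runs[-1][1] = i + 1
            else:
                runs.append([i, i + 1])
    return [tuple(run) for run in runs]

def localize(edge_map, row_gap=1, col_gap=3, min_aspect=1.0):
    edges = np.asarray(edge_map) > 0
    # Processing array of (top, left, bottom, right); starts with the whole edge map.
    queue = deque([(0, 0, edges.shape[0], edges.shape[1])])
    results = []
    while queue:                                        # empty array: terminate
        t, l, b, r = queue.popleft()
        region = edges[t:b, l:r]
        rows = split_runs(region.sum(axis=1), row_gap)  # horizontal projection
        if len(rows) > 1 or (rows and rows[0] != (0, b - t)):
            queue.extend((t + s, l, t + e, r) for s, e in rows)
            continue
        cols = split_runs(region.sum(axis=0), col_gap)  # vertical projection
        if len(cols) > 1 or (cols and cols[0] != (0, r - l)):
            queue.extend((t, l + s, b, l + e) for s, e in cols)
            continue
        # Indivisible region: keep it only if it is wide enough to be a text string.
        if rows and (r - l) / (b - t) >= min_aspect:
            results.append((t, l, b, r))
    return results
```

Bridging short column gaps keeps the characters of one string together, while larger gaps separate strings on the same line; empty regions and regions that fail the aspect-ratio check are dropped.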

Page 12: A New Approach for Video Text Detection and Localization

Localization steps

[Figure: localization steps, panels (1) to (4)]

Page 13: A New Approach for Video Text Detection and Localization

Experimental results

Page 14: A New Approach for Video Text Detection and Localization

Experimental results

Page 15: A New Approach for Video Text Detection and Localization

Performance statistics

Statistics of 10 news videos:

Processing time per frame: 0.25 s (PIII 1G CPU)

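$$\text{Detection rate} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{ground truth text regions})} = 93.6\%$$

$$\text{Detection accuracy} = \frac{\mathrm{Num}(\text{correctly detected text regions})}{\mathrm{Num}(\text{all detected text regions})} = 87.2\%$$

$$\text{Localization accuracy} = \frac{\mathrm{Area}(\text{detected text regions}) \cap \mathrm{Area}(\text{ground truth text regions})}{\mathrm{Area}(\text{ground truth text regions})} > 90\%$$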
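The three measures can be computed as in the sketch below. The function names and the use of boolean pixel masks for the area measure are assumptions made for illustration, and the overlap is taken as the intersection of the detected and ground truth areas.

```python
import numpy as np

def detection_rate(n_correct, n_ground_truth):
    # Correctly detected text regions over ground truth text regions.
    return n_correct / n_ground_truth

def detection_accuracy(n_correct, n_detected):
    # Correctly detected text regions over all detected text regions.
    return n_correct / n_detected

def localization_accuracy(detected_mask, truth_mask):
    # Overlapping area of detected and ground truth regions over ground truth area.
    detected = np.asarray(detected_mask, dtype=bool)
    truth = np.asarray(truth_mask, dtype=bool)
    return (detected & truth).sum() / truth.sum()
```

For example, a detector that finds 9 of 10 ground truth regions has a detection rate of 0.9 regardless of how many false regions it also reports; the false regions lower only the detection accuracy.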