Visual Attention based Region of Interest Coding for Video-telephony Applications

CSNDSP’06

Nicolas Tsapatsoulis

Computer Science Dept.University of Cyprus

Outline: Aim, Overview, Visual Attention, The proposed algorithm, Combination of conspicuity maps, Experimental Results, Conclusions

◊ Develop an algorithm for Region Of Interest (ROI) estimation based on Visual Attention

◊ Visual attention:

◊ Area studying the behavior of humans when observing a scene

◊ Visually important areas are expected to be the first areas humans fixate on:

◊ Such areas are selected as ROIs

◊ ROIs are encoded with higher accuracy than non-ROIs

◊ Possible application: Video telephony

◊ Low bit-rates required, while

◊ Visual quality needs to be preserved

Aim of this study


◊ ROI areas are computed based on a saliency map

◊ Saliency map combines four conspicuity maps:

◊ Intensity

◊ Orientation

◊ Color

◊ Skin (Face)

◊ All conspicuity maps are constructed based on the center-surround principle:

◊ Visually important regions are those that stand out from their surround in terms of intensity, orientation and color

◊ Skin map corresponds to the a-priori knowledge that faces are common in video-telephony applications and humans either implicitly or explicitly fixate on such areas

Overview


◊ Stand-out areas are computed at various scales using a multiresolution approach:

◊ Both small and large objects can stand out from their surround

◊ Combination of conspicuity maps into a final saliency map is achieved by using a sigmoid function

◊ Contributions of the various channels (intensity, orientation, color, skin) are summed to indicate areas that moderately stand out from their surround in several channels

◊ Areas that strongly stand out from their surround in a single channel can dominate (saturate) the combined aggregation, preserving their importance

◊ Experiments:

◊ Non-ROI areas are smoothed and passed to the encoder. This results in better intra-frame encoding (concentrated DCT coefficients) and better prediction (inter-frame encoding)

Overview (II)
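The pre-processing step described above (smooth the non-ROI areas, then hand the frame to an unmodified encoder) can be sketched as follows. This is only an illustration under assumptions: the slides do not state which low-pass filter is used, so a 3x3 mean filter stands in for it, and the function name `smooth_non_roi`, the toy frame and the mask are hypothetical.

```python
def smooth_non_roi(frame, roi_mask):
    """Apply a 3x3 mean filter to pixels outside the ROI; ROI pixels are copied unchanged."""
    rows, cols = len(frame), len(frame[0])
    out = [row[:] for row in frame]
    for r in range(rows):
        for c in range(cols):
            if roi_mask[r][c]:
                continue  # ROI pixel: keep full detail
            # Average over the 3x3 neighbourhood, clipped at the frame border
            neigh = [frame[rr][cc]
                     for rr in range(max(0, r - 1), min(rows, r + 2))
                     for cc in range(max(0, c - 1), min(cols, c + 2))]
            out[r][c] = sum(neigh) / len(neigh)
    return out


# Toy 4x4 checkerboard frame (hypothetical data)
frame = [[0, 100, 0, 100],
         [100, 0, 100, 0],
         [0, 100, 0, 100],
         [100, 0, 100, 0]]
# Hypothetical mask: the two left-hand columns are treated as the ROI
roi_mask = [[True, True, False, False] for _ in range(4)]
smoothed = smooth_non_roi(frame, roi_mask)
```

The high-frequency checkerboard survives inside the ROI and is flattened outside it, which is the property the slides rely on for more compact DCT coefficients in non-ROI blocks.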


◊ Feature Integration Theory (FIT) by Treisman et al.:

◊ Visual features are registered early, automatically and in parallel along a number of separable dimensions (e.g. intensity, color, orientation, size, shape).

◊ FIT was the basis of several visual attention algorithms and computational models developed over the last two decades

◊ Saliency based model of Itti & Koch

◊ Low-level vision features (color channels tuned to red, green, blue and yellow hues, orientation and brightness) are extracted from the original color image at several spatial scales. Different spatial scales are produced using Gaussian pyramids, which consist of progressively low-pass filtering and sub-sampling the input image.

◊ Each feature is computed in a center-surround structure akin to visual receptive fields.

Visual Attention Models


◊ RGB color model

◊ Orientation computed in four directions, with the results summed

◊ Pyramid produced by Gaussian low-pass filtering and subsampling

◊ Center surround:

◊ Point by point differences of finer and coarser approximations, the latter being first interpolated.

◊ Normalize and add to create the saliency map.

◊ Winner Take All (WTA) architecture to model changes of fixation points

Model of Itti & Koch
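The center-surround step listed above (point-by-point difference between a finer level and an interpolated coarser level) can be illustrated with a deliberately simplified sketch. The assumptions are significant: 2x2 block averaging stands in for Gaussian low-pass filtering plus subsampling, nearest-neighbour replication stands in for interpolation, and the function names are hypothetical.

```python
def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (crude stand-in for Gaussian low-pass + subsampling)."""
    return [[(img[r][c] + img[r][c + 1] + img[r + 1][c] + img[r + 1][c + 1]) / 4.0
             for c in range(0, len(img[0]) - 1, 2)]
            for r in range(0, len(img) - 1, 2)]


def upsample(img, rows, cols):
    """Nearest-neighbour interpolation of a coarse level back to rows x cols."""
    return [[img[min(r // 2, len(img) - 1)][min(c // 2, len(img[0]) - 1)]
             for c in range(cols)]
            for r in range(rows)]


def center_surround(img):
    """Point-by-point absolute difference between the fine level and the interpolated coarse level."""
    coarse = upsample(downsample(img), len(img), len(img[0]))
    return [[abs(f - s) for f, s in zip(fine_row, coarse_row)]
            for fine_row, coarse_row in zip(img, coarse)]


# Flat 4x4 image with a single bright pixel (hypothetical data)
img = [[10.0] * 4 for _ in range(4)]
img[1][1] = 50.0
cs = center_surround(img)
```

In this toy case the response is largest at the bright pixel and zero in the blocks that contain no outlier, which is the "stands out from its surround" behaviour the model builds on.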


The proposed algorithm

◊ Based on Itti & Koch

◊ Addition of a skin branch to model prior knowledge (existence of faces in video-telephony applications)

◊ Wavelet based implementation of the pyramid.

◊ YCrCb color model to keep consistency with skin detection (skin color can be modeled via a small area in the Cr-Cb plane – NTSC broadcasting system of analog TV makes use of this property)

◊ Orientation computed as across-scale differences in detail bands (V,H,D)

◊ Combination of conspicuity maps through a sigmoid function to create the final saliency map

◊ Note: The assumption that a final saliency map is created in human brain has not been proved and remains a controversial issue among scientists


The proposed algorithm (II)


◊ Decomposition of the Y, Cr, Cb color channels using Daubechies wavelets with length-4 filter coefficients

[Figure: block diagram of the proposed algorithm. Blocks: Image Sequence; Center-Surround (Orientation, Intensity, Colors); Normalization; Normalization and Summation; Multiresolution Skin Detection; Static Saliency Map; Top-down Information; Application Dependent Static Saliency Map]

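◊ One decomposition level, from scale −j to scale −(j+1), where h is the length-4 low-pass filter, ∗ denotes convolution and ↓2 subsampling by two:

$$Y_A^{-(j+1)}(m,n) = \Big( h(m) * \big( Y_A^{-j}(m,n) * h(n) \big)_{\downarrow 2n} \Big)_{\downarrow 2m}$$

◊ The detail bands $Y_H^{-(j+1)}$, $Y_V^{-(j+1)}$ and $Y_D^{-(j+1)}$ are obtained from $Y_A^{-j}$ in the same way, with the high-pass counterpart of h applied along one direction (H, V) or both directions (D)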
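One level of the separable Daubechies length-4 decomposition into approximation (A) and detail (H, V, D) bands might look like the sketch below. Several choices are assumptions rather than details from the slides: periodic border extension, the particular sign convention for the high-pass filter, and reading "filtering of rows" as filtering along each row. The function names are hypothetical.

```python
import math

SQ3 = math.sqrt(3.0)
NORM = 4.0 * math.sqrt(2.0)
# Daubechies length-4 low-pass (scaling) filter
LOW = [(1 + SQ3) / NORM, (3 + SQ3) / NORM, (3 - SQ3) / NORM, (1 - SQ3) / NORM]
# High-pass filter derived as the quadrature mirror of LOW (one common convention)
HIGH = [((-1) ** k) * LOW[3 - k] for k in range(4)]


def filter_down(signal, taps):
    """Circular filtering followed by keeping every second sample."""
    n = len(signal)
    return [sum(taps[k] * signal[(2 * i + k) % n] for k in range(4))
            for i in range(n // 2)]


def transpose(img):
    return [list(col) for col in zip(*img)]


def along_columns(band, taps):
    """Apply filter_down to every column of a band."""
    return transpose([filter_down(col, taps) for col in transpose(band)])


def dwt2_level(img):
    """One analysis level; returns the bands (A, H, V, D)."""
    row_lo = [filter_down(row, LOW) for row in img]    # low-pass along rows
    row_hi = [filter_down(row, HIGH) for row in img]   # high-pass along rows
    A = along_columns(row_lo, LOW)    # low rows, low columns
    V = along_columns(row_lo, HIGH)   # low rows, high columns
    H = along_columns(row_hi, LOW)    # high rows, low columns
    D = along_columns(row_hi, HIGH)   # high rows, high columns
    return A, H, V, D


# A constant image has no detail: all energy ends up in the approximation band
flat = [[5.0] * 4 for _ in range(4)]
A, H, V, D = dwt2_level(flat)
```

Calling `dwt2_level` again on `A` gives the next coarser scale, which is how a wavelet pyramid would replace the Gaussian pyramid of the Itti & Koch model.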

The proposed algorithm (III)


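◊ Center surround at scale j (↑2 denotes upsampling by two, h the low-pass interpolation filter):

$$I^{-j} = \left| Y_A^{-j} - \Big( \big( Y_A^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$

$$C^{-j} = C_r^{-j} + C_b^{-j}$$

$$C_b^{-j} = \left| C_{bA}^{-j} - \Big( \big( C_{bA}^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$

$$C_r^{-j} = \left| C_{rA}^{-j} - \Big( \big( C_{rA}^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$

$$O^{-j} = \left| Y_D^{-j} - \hat{Y}_D^{-j} \right| + \left| Y_V^{-j} - \hat{Y}_V^{-j} \right| + \left| Y_H^{-j} - \hat{Y}_H^{-j} \right|$$

$$\hat{Y}_D^{-j} = \Big( \big( Y_D^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n)$$

$$\hat{Y}_V^{-j} = \Big( \big( Y_V^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n)$$

$$\hat{Y}_H^{-j} = \Big( \big( Y_H^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n)$$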


The proposed algorithm (I)

◊ Conspicuity maps:

◊ Interpolate $I^{-j}$, $O^{-j}$, $C^{-j}$ to the finest scale (j = 0) and add the center-surround differences of all scales (for all j)

◊ Three conspicuity maps:

◊ Intensity (I)

◊ Color (C)

◊ Orientation (O)

◊ Plus skin map (F)

◊ Max depth of analysis $J_{\max}$:

$$J_{\max} = \left\lfloor \tfrac{1}{2} \log_2 N \right\rfloor, \qquad N = \min(R, C)$$
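As a quick worked example of the analysis depth, the sketch below reads the slide's expression as J_max = floor(0.5 * log2(N)) with N = min(R, C). Two things are assumed here and not stated on the slide: that R and C are the frame's row and column counts, and that the result is rounded down to an integer. The function name is hypothetical.

```python
import math


def max_depth(rows, cols):
    """Maximum analysis depth, assuming J_max = floor(0.5 * log2(min(rows, cols)))."""
    n = min(rows, cols)
    return int(math.floor(0.5 * math.log2(n)))


depth_cif = max_depth(288, 352)    # CIF-sized frame
depth_qcif = max_depth(144, 176)   # QCIF-sized frame
```

Under this reading, halving the frame size from CIF to QCIF costs one level of analysis depth.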


◊ Top left figure:

◊ Original frame

◊ Top right figure:

◊ Skin Map

◊ Bottom left figure:

◊ Multiscale texture map (range filtering at various scales)

◊ Bottom right figure:

◊ Face map created by multiplying texture and skin maps

Face Map

◊ Skin probability computed at various scales

◊ 2D-Gaussian probability density function for skin

◊ Pseudoprobability computed based on Mahalanobis distance

◊ Face modeled as textured skin area
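The skin pseudoprobability described above (a 2D Gaussian model in the Cr-Cb plane, evaluated through the Mahalanobis distance) could be sketched as follows. The mean and covariance values are placeholders chosen for illustration only; the slides do not give the model parameters, and the function names are hypothetical.

```python
import math

# Hypothetical skin-colour model in the (Cr, Cb) plane; placeholder values, not from the slides
SKIN_MEAN = (150.0, 110.0)
SKIN_COV = ((60.0, -20.0), (-20.0, 80.0))


def mahalanobis_sq(cr, cb, mean=SKIN_MEAN, cov=SKIN_COV):
    """Squared Mahalanobis distance of (cr, cb) from the skin model."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    # Inverse of the 2x2 covariance matrix
    inv = ((d / det, -b / det), (-c / det, a / det))
    dx, dy = cr - mean[0], cb - mean[1]
    return (dx * (inv[0][0] * dx + inv[0][1] * dy)
            + dy * (inv[1][0] * dx + inv[1][1] * dy))


def skin_pseudoprobability(cr, cb):
    """exp(-d^2 / 2): equals 1 at the model mean and decays with Mahalanobis distance."""
    return math.exp(-0.5 * mahalanobis_sq(cr, cb))


p_center = skin_pseudoprobability(150.0, 110.0)   # exactly at the model mean
p_far = skin_pseudoprobability(120.0, 140.0)      # far from the model mean
```

Dropping the Gaussian normalisation constant is what makes this a pseudoprobability: the value lies in (0, 1] and can be multiplied directly with a texture map to form a face map.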



◊ Left figure:

◊ Original frame

◊ Center figure:

◊ Orientation Map

◊ Right figure:

◊ Intensity map (not enough for accurately identifying areas that stand out from their surround due to orientation)

Orientation Map

◊ Across-scale differences of detail bands in the luminance (Y) channel:

◊ V = Vertical detail: low pass filtering of rows, high pass filtering of columns

◊ H = Horizontal detail: high pass filtering of rows, low pass filtering of columns

◊ D = Diagonal detail: high pass filtering of rows, high pass filtering of columns


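$$O^{-j} = \left| Y_D^{-j} - \hat{Y}_D^{-j} \right| + \left| Y_V^{-j} - \hat{Y}_V^{-j} \right| + \left| Y_H^{-j} - \hat{Y}_H^{-j} \right|$$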


Intensity Map

◊ Across-scale difference of the approximation band in the luminance (Y) channel

◊ In the figures below the eyes of the newscaster are small areas that stand out from their surround

◊ Blouse and channel’s logo are larger areas that stand out from their surround

◊ The whole head of the newscaster is a large area standing out from its surround due to intensity.


$$I^{-j} = \left| Y_A^{-j} - \Big( \big( Y_A^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$


Color Map

◊ Across scale differences of approximation bands in chromaticity channels (Cr, Cb) added together:

◊ Channel’s logo and newscaster’s hair are the areas with the most prominent difference from their surround

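$$C^{-j} = C_r^{-j} + C_b^{-j}$$

$$C_b^{-j} = \left| C_{bA}^{-j} - \Big( \big( C_{bA}^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$

$$C_r^{-j} = \left| C_{rA}^{-j} - \Big( \big( C_{rA}^{-(j+1)}(m,n) \big)_{\uparrow 2m} * h(m) \Big)_{\uparrow 2n} * h(n) \right|$$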


Combination of conspicuity maps

◊ Combination of individual conspicuity maps (I,O,C,F) into the final saliency map (S) through the following sigmoid function:

◊ ROI computed by thresholding the saliency map (using Otsu’s method) and filling possible holes in the mask that is produced.

◊ Smooth non-ROI areas by low-pass filtering and encode frames as usual (see figure to the right)
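The thresholding step can be illustrated with a minimal histogram-based version of Otsu's method. The sketch assumes saliency values in [0, 1] and a 64-bin histogram (both assumptions), uses hypothetical names, and leaves out the hole-filling step mentioned above.

```python
def otsu_threshold(values, bins=64):
    """Return the threshold in [0, 1] that maximises the between-class variance."""
    hist = [0] * bins
    for v in values:
        hist[min(int(v * bins), bins - 1)] += 1
    total = len(values)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0, sum0 = 0, 0.0
    for t in range(bins - 1):
        # Class 0 holds bins 0..t, class 1 holds bins t+1..bins-1
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return (best_t + 1) / bins  # values at or above this fall in the upper class


# Hypothetical saliency samples: four background pixels, three salient ones
saliency_values = [0.05, 0.10, 0.08, 0.12, 0.90, 0.85, 0.95]
threshold = otsu_threshold(saliency_values)
roi_mask = [v >= threshold for v in saliency_values]
```

On a real frame the same function would be applied to the flattened saliency map, and the resulting mask reshaped to the frame size before holes are filled.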


$$S = \frac{2}{1 + e^{-(I + O + C + F)}} - 1$$
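The behaviour of the sigmoid combination can be checked numerically. The sketch reads the slide's formula as S = 2 / (1 + exp(-(I + O + C + F))) - 1 and assumes non-negative conspicuity values; the input values below are arbitrary illustrations, not measurements from the experiments.

```python
import math


def saliency(i, o, c, f):
    """S = 2 / (1 + exp(-(I + O + C + F))) - 1 for non-negative conspicuity values."""
    return 2.0 / (1.0 + math.exp(-(i + o + c + f))) - 1.0


nothing = saliency(0.0, 0.0, 0.0, 0.0)               # no channel responds
moderate_everywhere = saliency(0.8, 0.8, 0.8, 0.8)   # moderate response in all four channels
single_dominant = saliency(4.0, 0.0, 0.0, 0.0)       # strong response in one channel only
```

Both the "moderate in several channels" and the "strong in one channel" cases end up close to the upper bound of 1, while a location with no response stays at 0, which matches the two design goals stated in the overview.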


Experimental Results

◊ Aim:

◊ Check if deterioration in ROI encoded videos is observable (visual trial tests)

◊ Compute bit-rate gain

◊ 10 video clips with varying content, both indoor and outdoor

◊ Humans always present

◊ 10 human observers

◊ Non experts (students)

◊ 5 female, 5 male

◊ 60 seconds to watch the video clips (ROI-based and standard MPEG-1 encoding)

◊ Select best

◊ Each video clip viewed twice (200 tests in total)



Content (selected frames)

[Figure: selected frames from the clips grandma, fashion, eye_witness and news_cast1]


Selections per video clip and average bit rate

1. eye_witness

2. fashion

3. grandma

4. justice

5. lecturer

6. news_cast1

7. news_cast2

8. night_interview

9. old_man

10. soldier


Visual trials

Encoding Method | Preferences | Average Bit Rate (Kbps)
VA-ROI | 95 | 224.4
Standard MPEG-1 | 105 | 308.1


Bit rate gain

Video Clip | VA-ROI Bit Rate (Kbps) | Standard Bit Rate (Kbps) | Bit Rate Gain (%)
eye_witness | 319 | 386 | 17
fashion | 296 | 354 | 16
grandma | 217 | 256 | 15
justice | 228 | 318 | 28
lecturer | 201 | 274 | 27
news_cast1 | 205 | 297 | 31
news_cast2 | 170 | 270 | 37
night_interview | 174 | 335 | 48
old_man | 241 | 321 | 25
soldier | 193 | 270 | 29
Average | 224.4 | 308.1 | 27.2
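The gain column is consistent with a relative saving computed against the standard encoding, i.e. gain = (standard - VA-ROI) / standard. The short check below recomputes three of the table entries under that assumption; the function name is hypothetical.

```python
def bit_rate_gain(va_roi_kbps, standard_kbps):
    """Relative saving of VA-ROI over standard encoding, in percent."""
    return 100.0 * (standard_kbps - va_roi_kbps) / standard_kbps


# (VA-ROI, Standard) bit rates in Kbps, copied from the table
rates = {
    "eye_witness": (319, 386),
    "night_interview": (174, 335),
    "Average": (224.4, 308.1),
}
gains = {clip: round(bit_rate_gain(*pair), 1) for clip, pair in rates.items()}
```

Rounded to the precision used in the table, the recomputed values agree with the reported 17%, 48% and 27.2%.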


Conclusions - Further work

◊ Visual attention based ROI estimation can be used to indicate regions that need to be encoded with higher accuracy. In this way:

◊ Significant bit-rate gain, compared to MPEG-1, can be achieved, while

◊ the areas identified as visually important by the VA algorithm are in conformance with the ones identified by the human subjects, as can be deduced from the visual trial tests,

◊ VA-ROI based encoding leads to better compression of both intra-coded and inter-coded frames, though the gain is higher for the former.

◊ Further work includes

◊ conducting experiments to test the efficiency of the proposed method in the MPEG-4 framework.

◊ examining the effect of incorporating priority encoding by varying the quality factor of the DCT quantization table across VA-ROI and non-ROI frame blocks.

