Visual Attention based Region of Interest Coding for Video-telephony Applications
CSNDSP’06
Nicolas Tsapatsoulis
Computer Science Dept., University of Cyprus
Outline: Aim, Overview, Visual Attention, The proposed algorithm, Combination of conspicuity maps, Experimental Results, Conclusions
Aim of this study
◊ Develop an algorithm for Region Of Interest (ROI) estimation based on visual attention
◊ Visual attention:
◊ The field that studies how humans behave when observing a scene
◊ Visually important areas are expected to be the first areas humans fixate on
◊ Such areas are selected as ROIs
◊ ROIs are encoded with higher accuracy than non-ROIs
◊ Possible application: video telephony
◊ Low bit rates are required, while
◊ Visual quality needs to be preserved
Overview
◊ ROI areas are computed based on a saliency map
◊ The saliency map combines four conspicuity maps:
◊ Intensity
◊ Orientation
◊ Color
◊ Skin (Face)
◊ All conspicuity maps are constructed based on the center-surround principle:
◊ Visually important regions are those that stand out from their surround in terms of intensity, orientation and color
◊ The skin map encodes the a priori knowledge that faces are common in video-telephony applications and that humans, implicitly or explicitly, fixate on such areas
Overview (II)
◊ Stand-out areas are computed at various scales using a multiresolution approach:
◊ Both small and large objects can stand out from their surround
◊ The conspicuity maps are combined into a final saliency map using a sigmoid function
◊ The contributions of the various channels (intensity, orientation, color, skin) are summed, which highlights areas that moderately stand out from their surround in several channels
◊ Areas that strongly stand out from their surround in a single channel can dominate (saturate) the combined aggregation, preserving their importance
◊ Experiments:
◊ Non-ROI areas are smoothed before being passed to the encoder. This results in better intra-frame encoding (concentrated DCT coefficients) and better prediction (inter-frame encoding)
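The smoothing of non-ROI areas described above can be sketched as follows. This is a minimal illustration, not the filter used in the experiments: it assumes a grayscale frame and a boolean ROI mask stored as NumPy arrays, and the box filter, the `radius` parameter and the function names `box_blur` and `smooth_non_roi` are hypothetical choices.

```python
import numpy as np

def box_blur(frame: np.ndarray, radius: int = 1) -> np.ndarray:
    """Mean filter over a (2*radius+1)^2 window, replicating edge pixels."""
    padded = np.pad(frame.astype(float), radius, mode="edge")
    size = 2 * radius + 1
    out = np.zeros(frame.shape, dtype=float)
    for dy in range(size):
        for dx in range(size):
            out += padded[dy:dy + frame.shape[0], dx:dx + frame.shape[1]]
    return out / (size * size)

def smooth_non_roi(frame: np.ndarray, roi_mask: np.ndarray, radius: int = 1) -> np.ndarray:
    """Leave ROI pixels untouched and low-pass filter everything else."""
    blurred = box_blur(frame, radius)
    return np.where(roi_mask, frame.astype(float), blurred)

# Toy frame: a checkerboard (high-frequency texture) with the ROI in the top-left corner.
rows, cols = np.indices((8, 8))
frame = ((rows + cols) % 2) * 255.0
roi = np.zeros((8, 8), dtype=bool)
roi[:4, :4] = True
smoothed = smooth_non_roi(frame, roi)
```

Outside the ROI the checkerboard collapses towards its local mean, so a block transform of that region has far less high-frequency energy, while the ROI block is passed through unchanged.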
Visual Attention Models
◊ Feature Integration Theory (FIT) by Treisman et al.:
◊ Visual features are registered early, automatically and in parallel along a number of separable dimensions (e.g. intensity, color, orientation, size, shape)
◊ FIT was the basis of several visual attention algorithms and computational models developed over the last two decades
◊ Saliency-based model of Itti & Koch:
◊ Low-level vision features (color channels tuned to red, green, blue and yellow hues, orientation and brightness) are extracted from the original color image at several spatial scales. The spatial scales are produced using Gaussian pyramids, which progressively low-pass filter and sub-sample the input image.
◊ Each feature is computed in a center-surround structure akin to visual receptive fields.
Model of Itti & Koch
◊ RGB color model
◊ Orientation computed at four directions, with the results summed
◊ Pyramid produced by Gaussian low-pass filtering and subsampling
◊ Center-surround:
◊ Point-by-point differences between finer and coarser approximations, the latter being interpolated first
◊ Normalize and add to create the saliency map
◊ Winner-Take-All (WTA) architecture to model changes of fixation points
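The pyramid and center-surround steps listed above can be sketched for a single feature channel as follows. This is a simplified illustration under stated assumptions, not the Itti & Koch implementation: a 2x2 block average stands in for Gaussian low-pass filtering plus subsampling, nearest-neighbour replication stands in for interpolation, and the function names and the `levels` parameter are hypothetical.

```python
import numpy as np

def downsample(img: np.ndarray) -> np.ndarray:
    """One pyramid level: 2x2 block average (stand-in for Gaussian filtering + subsampling)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def upsample_to(img: np.ndarray, shape: tuple) -> np.ndarray:
    """Nearest-neighbour interpolation of a coarse level back to `shape`."""
    rows = (np.arange(shape[0]) * img.shape[0]) // shape[0]
    cols = (np.arange(shape[1]) * img.shape[1]) // shape[1]
    return img[np.ix_(rows, cols)]

def center_surround(img: np.ndarray, levels: int = 3) -> np.ndarray:
    """Sum of |finer - interpolated coarser| over adjacent pyramid levels, scaled to [0, 1]."""
    pyramid = [img.astype(float)]
    for _ in range(levels):
        pyramid.append(downsample(pyramid[-1]))
    saliency = np.zeros(img.shape)
    for fine, coarse in zip(pyramid[:-1], pyramid[1:]):
        diff = np.abs(fine - upsample_to(coarse, fine.shape))
        saliency += upsample_to(diff, img.shape)
    peak = saliency.max()
    return saliency / peak if peak > 0 else saliency

# A small bright patch on a dark background stands out from its surround.
image = np.zeros((16, 16))
image[6:8, 6:8] = 1.0
response = center_surround(image, levels=3)
```

The Winner-Take-All stage that shifts the fixation point is not modelled here; the sketch stops at the normalized map.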
The proposed algorithm
◊ Based on Itti & Koch
◊ Addition of a skin branch to model prior knowledge (the presence of faces in video-telephony applications)
◊ Wavelet based implementation of the pyramid.
◊ YCrCb color model for consistency with skin detection (skin color can be modeled by a small area in the Cr-Cb plane; the NTSC analog TV broadcasting system makes use of this property)
◊ Orientation computed as across-scale differences in detail bands (V,H,D)
◊ Combination of conspicuity maps through a sigmoid function to create the final saliency map
◊ Note: The assumption that a final saliency map is created in the human brain has not been proven and remains controversial among scientists
The proposed algorithm (II)
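[Figure: block diagram of the proposed algorithm. Block labels: Image Sequence; Center-Surround (Orientation, Intensity, Colors); Normalization; Normalization and Summation; Multiresolution Skin Detection; Top-down Information; Static Saliency Map; Application Dependent Static Saliency Map]
◊ Decomposition of the Y, Cr, Cb color channels using Daubechies wavelets (filter length 4):
$$
\begin{aligned}
Y_D^{-(j+1)}(m,n) &= \Big( h_\psi(-m) * \big( Y_A^{-j}(m,n) * h_\psi(-n) \big)_{\downarrow 2n} \Big)_{\downarrow 2m} \\
Y_V^{-(j+1)}(m,n) &= \Big( h_\varphi(-m) * \big( Y_A^{-j}(m,n) * h_\psi(-n) \big)_{\downarrow 2n} \Big)_{\downarrow 2m} \\
Y_H^{-(j+1)}(m,n) &= \Big( h_\psi(-m) * \big( Y_A^{-j}(m,n) * h_\varphi(-n) \big)_{\downarrow 2n} \Big)_{\downarrow 2m} \\
Y_A^{-(j+1)}(m,n) &= \Big( h_\varphi(-m) * \big( Y_A^{-j}(m,n) * h_\varphi(-n) \big)_{\downarrow 2n} \Big)_{\downarrow 2m}
\end{aligned}
$$
where $h_\varphi$ and $h_\psi$ are the low-pass and high-pass filters, $*$ denotes convolution, and $\downarrow 2n$, $\downarrow 2m$ denote subsampling by 2 along $n$ (columns) and $m$ (rows).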
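One level of the wavelet decomposition can be sketched with NumPy as follows. This is a minimal sketch under assumptions: it uses the standard Daubechies length-4 filter taps, periodic boundary extension (the talk does not state how borders are handled), and a band naming in which H is high-pass along rows and V is high-pass along columns. The function names `analyze` and `dwt2_level` are hypothetical.

```python
import numpy as np

SQRT3 = np.sqrt(3.0)
# Daubechies length-4 low-pass taps and the matching high-pass (quadrature mirror) taps.
LOW = np.array([1 + SQRT3, 3 + SQRT3, 3 - SQRT3, 1 - SQRT3]) / (4 * np.sqrt(2.0))
HIGH = np.array([LOW[3], -LOW[2], LOW[1], -LOW[0]])

def analyze(x: np.ndarray, taps: np.ndarray, axis: int) -> np.ndarray:
    """Filter along `axis` with periodic extension, then keep every second sample."""
    out = sum(c * np.roll(x, -k, axis=axis) for k, c in enumerate(taps))
    index = [slice(None)] * x.ndim
    index[axis] = slice(0, None, 2)
    return out[tuple(index)]

def dwt2_level(approx: np.ndarray):
    """Split an approximation band into (A, H, V, D) bands one level coarser."""
    lo_cols = analyze(approx, LOW, axis=1)   # low-pass along n (columns)
    hi_cols = analyze(approx, HIGH, axis=1)  # high-pass along n (columns)
    a = analyze(lo_cols, LOW, axis=0)    # low-pass rows, low-pass columns
    h = analyze(lo_cols, HIGH, axis=0)   # high-pass rows, low-pass columns
    v = analyze(hi_cols, LOW, axis=0)    # low-pass rows, high-pass columns
    d = analyze(hi_cols, HIGH, axis=0)   # high-pass rows, high-pass columns
    return a, h, v, d

# A flat image has no detail: all of its energy ends up in the approximation band.
flat = np.full((8, 8), 3.0)
a, h, v, d = dwt2_level(flat)
```

Applying `dwt2_level` again to `a` gives the next coarser scale; the same routine would be applied to the Cr and Cb channels.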
The proposed algorithm (III)
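◊ Center surround at scale j:
$$
\begin{aligned}
I^{-j} &= \Big| Y_A^{-j}(m,n) - \big( (Y_A^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big| \\
C^{-j} &= C_r^{-j} + C_b^{-j} \\
C_b^{-j} &= \Big| C_{b,A}^{-j}(m,n) - \big( (C_{b,A}^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big| \\
C_r^{-j} &= \Big| C_{r,A}^{-j}(m,n) - \big( (C_{r,A}^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big| \\
O^{-j} &= \big| Y_D^{-j} - \hat{Y}_D^{-j} \big| + \big| Y_V^{-j} - \hat{Y}_V^{-j} \big| + \big| Y_H^{-j} - \hat{Y}_H^{-j} \big| \\
\hat{Y}_D^{-j}(m,n) &= \big( (Y_D^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \\
\hat{Y}_V^{-j}(m,n) &= \big( (Y_V^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \\
\hat{Y}_H^{-j}(m,n) &= \big( (Y_H^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n)
\end{aligned}
$$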
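The across-scale difference for a single approximation band can be sketched as follows: the band is reduced one scale, interpolated back by zero insertion and low-pass filtering, and compared point by point with the original. This is a sketch under assumptions, not the talk's implementation: it uses the Daubechies length-4 low-pass taps for both steps, periodic boundary extension, an absolute difference, and hypothetical function names.

```python
import numpy as np

S3 = np.sqrt(3.0)
# Daubechies length-4 low-pass taps.
LOW = np.array([1 + S3, 3 + S3, 3 - S3, 1 - S3]) / (4 * np.sqrt(2.0))

def reduce_axis(x: np.ndarray, axis: int) -> np.ndarray:
    """Low-pass filter along `axis` (periodic extension) and keep every second sample."""
    y = sum(c * np.roll(x, -k, axis=axis) for k, c in enumerate(LOW))
    return np.take(y, np.arange(0, x.shape[axis], 2), axis=axis)

def expand_axis(x: np.ndarray, axis: int) -> np.ndarray:
    """Insert zeros between samples along `axis`, then low-pass filter (interpolation)."""
    shape = list(x.shape)
    shape[axis] *= 2
    up = np.zeros(shape)
    index = [slice(None)] * x.ndim
    index[axis] = slice(0, None, 2)
    up[tuple(index)] = x
    return sum(c * np.roll(up, k, axis=axis) for k, c in enumerate(LOW))

def center_surround(band: np.ndarray) -> np.ndarray:
    """|band - interpolated coarser approximation|, for even-sized bands."""
    coarse = reduce_axis(reduce_axis(band, 1), 0)
    interpolated = expand_axis(expand_axis(coarse, 0), 1)
    return np.abs(band - interpolated)

# A single bright pixel on a dark background gives its strongest response at that pixel.
spot = np.zeros((8, 8))
spot[4, 4] = 1.0
response = center_surround(spot)
```

A flat band produces a zero response everywhere, since the interpolated coarse approximation reproduces it exactly.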
The proposed algorithm (IV)
◊ Conspicuity maps:
◊ Interpolate $I^{-j}$, $O^{-j}$, $C^{-j}$ to the finest scale (j = 0) and add the center-surround differences of all scales (for all j)
◊ Three conspicuity maps:
◊ Intensity (I)
◊ Color (C)
◊ Orientation (O)
◊ Plus skin map (F)
◊ Max depth of analysis $J_{\max}$:
$$
J_{\max} = \left\lfloor \tfrac{1}{2} \log_2 N \right\rfloor, \qquad N = \min(R, C)
$$
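As a quick worked example of the maximum analysis depth, the sketch below evaluates it for two common video-telephony frame sizes. It assumes that R and C are the number of rows and columns of the frame and that the result is rounded down to an integer; the function name `max_depth` is hypothetical.

```python
import math

def max_depth(rows: int, cols: int) -> int:
    """Maximum decomposition depth: floor(0.5 * log2(min(rows, cols)))."""
    n = min(rows, cols)
    return math.floor(0.5 * math.log2(n))

print(max_depth(288, 352))  # CIF frame  -> 4
print(max_depth(144, 176))  # QCIF frame -> 3
```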
Face Map
◊ Skin probability computed at various scales
◊ 2D Gaussian probability density function for skin
◊ Pseudo-probability computed based on the Mahalanobis distance
◊ Face modeled as a textured skin area
◊ Top left figure: original frame
◊ Top right figure: skin map
◊ Bottom left figure: multiscale texture map (range filtering at various scales)
◊ Bottom right figure: face map created by multiplying the texture and skin maps
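The face map construction can be sketched at a single scale as follows. The skin mean and covariance below are made-up illustrative values, not the model used in the talk; the pseudo-probability is taken here as exp(-d²/2), with d the Mahalanobis distance, which is one common reading of the bullets above. The function names are hypothetical and the multiscale part is omitted.

```python
import numpy as np

# Hypothetical skin model in the (Cb, Cr) plane; illustrative values only.
SKIN_MEAN = np.array([110.0, 150.0])
SKIN_COV = np.array([[80.0, -20.0], [-20.0, 60.0]])

def skin_pseudo_probability(cb, cr, mean=SKIN_MEAN, cov=SKIN_COV):
    """exp(-0.5 * squared Mahalanobis distance) per pixel, in (0, 1]."""
    delta = np.stack([cb - mean[0], cr - mean[1]], axis=-1)
    inv_cov = np.linalg.inv(cov)
    d2 = np.einsum("...i,ij,...j->...", delta, inv_cov, delta)
    return np.exp(-0.5 * d2)

def range_filter(img, radius=1):
    """Local max minus local min over a (2*radius+1)^2 window (texture measure)."""
    padded = np.pad(img.astype(float), radius, mode="edge")
    size = 2 * radius + 1
    windows = [padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
               for dy in range(size) for dx in range(size)]
    stack = np.stack(windows)
    return stack.max(axis=0) - stack.min(axis=0)

def face_map(luma, cb, cr):
    """Textured skin: normalized texture map multiplied by the skin map."""
    texture = range_filter(luma)
    peak = texture.max()
    if peak > 0:
        texture = texture / peak
    return texture * skin_pseudo_probability(cb, cr)

# Skin-colored frame with a textured patch in the middle.
luma = np.zeros((6, 6))
luma[2:4, 2:4] = [[0, 255], [255, 0]]
cb = np.full((6, 6), 110.0)
cr = np.full((6, 6), 150.0)
faces = face_map(luma, cb, cr)
```

Flat skin-colored regions and textured non-skin regions both score low, since the product requires texture and skin color at the same location.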
Orientation Map
◊ Across-scale differences of the detail bands in the illumination (Y) channel:
$$
O^{-j} = \big| Y_D^{-j} - \hat{Y}_D^{-j} \big| + \big| Y_V^{-j} - \hat{Y}_V^{-j} \big| + \big| Y_H^{-j} - \hat{Y}_H^{-j} \big|
$$
◊ V = Vertical detail: low-pass filtering of rows, high-pass filtering of columns
◊ H = Horizontal detail: high-pass filtering of rows, low-pass filtering of columns
◊ D = Diagonal detail: high-pass filtering of rows, high-pass filtering of columns
◊ Left figure: original frame
◊ Center figure: orientation map
◊ Right figure: intensity map (not enough on its own to accurately identify areas that stand out from their surround due to orientation)
Intensity Map
◊ Across-scale difference of the approximation band in the illumination (Y) channel:
$$
I^{-j} = \Big| Y_A^{-j}(m,n) - \big( (Y_A^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big|
$$
◊ In the figures, the eyes of the newscaster are small areas that stand out from their surround
◊ The blouse and the channel’s logo are larger areas that stand out from their surround
◊ The whole head of the newscaster is a large area that stands out from its surround due to intensity
Color Map
◊ Across-scale differences of the approximation bands in the chromaticity channels (Cr, Cb), added together:
$$
\begin{aligned}
C^{-j} &= C_r^{-j} + C_b^{-j} \\
C_b^{-j} &= \Big| C_{b,A}^{-j}(m,n) - \big( (C_{b,A}^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big| \\
C_r^{-j} &= \Big| C_{r,A}^{-j}(m,n) - \big( (C_{r,A}^{-(j+1)}(m,n))_{\uparrow 2m} * h_\varphi(m) \big)_{\uparrow 2n} * h_\varphi(n) \Big|
\end{aligned}
$$
◊ The channel’s logo and the newscaster’s hair are the areas with the most prominent difference from their surround
Combination of conspicuity maps
◊ The individual conspicuity maps (I, O, C, F) are combined into the final saliency map (S) through the following sigmoid function:
$$
S = \frac{2}{1 + e^{-(I + O + C + F)}} - 1
$$
◊ The ROI is computed by thresholding the saliency map (using Otsu’s method) and filling any holes in the resulting mask
◊ Non-ROI areas are smoothed by low-pass filtering and the frames are then encoded as usual (see figure)
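The combination and thresholding steps can be sketched as follows. This is a minimal illustration under assumptions: the sigmoid is taken as S = 2 / (1 + exp(-(I + O + C + F))) - 1, Otsu's method is implemented directly on a 256-bin histogram, and the hole-filling step is omitted. The function names are hypothetical.

```python
import numpy as np

def combine_maps(i_map, o_map, c_map, f_map):
    """Sigmoid of the summed conspicuity maps; non-negative inputs map to [0, 1)."""
    total = i_map + o_map + c_map + f_map
    return 2.0 / (1.0 + np.exp(-total)) - 1.0

def otsu_threshold(values, bins=256):
    """Threshold that maximizes the between-class variance of the histogram."""
    hist, edges = np.histogram(values, bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    weight_low = np.cumsum(hist)
    weight_high = weight_low[-1] - weight_low
    cum_sum = np.cumsum(hist * centers)
    with np.errstate(divide="ignore", invalid="ignore"):
        mean_low = cum_sum / weight_low
        mean_high = (cum_sum[-1] - cum_sum) / weight_high
        between = weight_low * weight_high * (mean_low - mean_high) ** 2
    between = np.nan_to_num(between)  # empty classes contribute nothing
    return centers[np.argmax(between)]

def roi_mask(saliency):
    """Binary ROI: pixels whose saliency exceeds the Otsu threshold."""
    return saliency > otsu_threshold(saliency)

# One channel responds strongly in a 3x3 block; the others are silent.
intensity = np.zeros((8, 8))
intensity[2:5, 2:5] = 3.0
silent = np.zeros((8, 8))
saliency = combine_maps(intensity, silent, silent, silent)
mask = roi_mask(saliency)
```

Because the sigmoid saturates, a summed response of 3 already gives a saliency of about 0.91, and larger sums approach but never exceed 1.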
Experimental Results
◊ Aim:
◊ Check if deterioration in ROI encoded videos is observable (visual trial tests)
◊ Compute bit-rate gain
◊ 10 video clips with varying content, both indoor and outdoor
◊ Humans always present
◊ 10 human observers
◊ Non experts (students)
◊ 5 female, 5 male
◊ 60 seconds to watch the video clips (ROI-based and standard MPEG-1 encoding)
◊ Select best
◊ Each video clip viewed twice (200 tests in total)
Content (selected frames)
[Figure: selected frames from the grandma, fashion, eye_witness and news_cast1 clips]
Selections per video clip and average bit rate
1. eye_witness
2. fashion
3. grandma
4. justice
5. lecturer
6. news_cast1
7. news_cast2
8. night_interview
9. old_man
10. soldier
Visual trials
Encoding Method     Preferences   Average Bit Rate (Kbps)
VA-ROI              95            224.4
Standard MPEG-1     105           308.1
Bit rate gain
Video Clip        VA-ROI (Kbps)   Standard (Kbps)   Bit Rate Gain (%)
eye_witness       319             386               17
fashion           296             354               16
grandma           217             256               15
justice           228             318               28
lecturer          201             274               27
news_cast1        205             297               31
news_cast2        170             270               37
night_interview   174             335               48
old_man           241             321               25
soldier           193             270               29
Average           224.4           308.1             27.2
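The gain figures are consistent with computing the gain as the relative bit-rate reduction with respect to standard encoding, (standard - ROI) / standard. The short script below recomputes them from the bit rates reported in the table; the formula is inferred from the reported numbers rather than stated in the slides, and the variable and function names are illustrative.

```python
# (VA-ROI, standard MPEG-1) bit rates in Kbps, as reported in the table.
BIT_RATES_KBPS = {
    "eye_witness": (319, 386),
    "fashion": (296, 354),
    "grandma": (217, 256),
    "justice": (228, 318),
    "lecturer": (201, 274),
    "news_cast1": (205, 297),
    "news_cast2": (170, 270),
    "night_interview": (174, 335),
    "old_man": (241, 321),
    "soldier": (193, 270),
}

def gain_percent(roi_kbps: float, standard_kbps: float) -> float:
    """Relative bit-rate reduction of VA-ROI encoding versus standard encoding."""
    return 100.0 * (standard_kbps - roi_kbps) / standard_kbps

for clip, (roi, std) in BIT_RATES_KBPS.items():
    print(f"{clip:16s} {gain_percent(roi, std):5.1f}%")

mean_roi = sum(roi for roi, _ in BIT_RATES_KBPS.values()) / len(BIT_RATES_KBPS)
mean_std = sum(std for _, std in BIT_RATES_KBPS.values()) / len(BIT_RATES_KBPS)
overall = gain_percent(mean_roi, mean_std)
print(f"average: {mean_roi:.1f} vs {mean_std:.1f} Kbps, gain {overall:.1f}%")
```

The overall gain is computed from the two average bit rates, which reproduces the 27.2% in the last row.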
Conclusions - Further work
◊ Visual attention based ROI estimation can be used to indicate regions that need to be encoded with higher accuracy. In this way:
◊ Significant bit-rate gain, compared to MPEG-1, can be achieved, while
◊ the areas identified as visually important by the VA algorithm conform to the ones identified by the human subjects, as can be deduced from the visual trial tests,
◊ VA-ROI based encoding leads to better compression of both intra-coded and inter-coded frames, though the gain is higher for the former.
◊ Further work includes
◊ conducting experiments to test the efficiency of the proposed method in the MPEG-4 framework.
◊ examining the effect of incorporating priority encoding by varying the quality factor of the DCT quantization table across VA-ROI and non-ROI frame blocks.