Video image assessment with a distortion-weighing
spatiotemporal visual attention model
Hua Zhang · Xiang Tian · Yaowu Chen
Published online: 29 January 2010
© Springer Science+Business Media, LLC 2010
Abstract For the purpose of extracting attention regions from distorted videos, a distortion-weighing spatiotemporal visual attention model is proposed. Visual attention regions are acquired in a bottom-up manner from spatial and temporal saliency maps. Meanwhile, a blocking-artifact saliency map is detected according to intensity gradient features. An attention selection, directed in a top-down manner, identifies the visual attention region with the relatively more serious blocking artifact as the Focus of Attention (FOA). Experimental results show that, compared with Walther's and You's models, the proposed model not only accurately analyzes the spatiotemporal saliency based on the intensity, texture, and motion features, but is also able to estimate the blocking artifact caused by distortions.
Keywords Visual attention model · Focus of Attention (FOA) · Saliency map · Spatiotemporal · Distortion
Abbreviations
HVS Human Visual System
FOA Focus of Attention
LGN Lateral Geniculate Nucleus
hMT+ human Middle Temporal+
IPS Intra Parietal Sulcus
FEF Frontal Eye Field
Multimed Tools Appl (2011) 52:221–233
DOI 10.1007/s11042-010-0470-x
The work was partially presented at the 2nd International Congress on Image and Signal Processing (CISP'09).
H. Zhang · X. Tian (*) · Y. Chen
Institute of Advanced Digital Technology and Instrumentation, Zhejiang University, Hangzhou 310027, People's Republic of China
e-mail: [email protected]
H. Zhang
e-mail: [email protected]
Y. Chen
e-mail: [email protected]
VQEG Video Quality Experts Group
HRC Hypothetical Reference Circuits
1 Introduction
Human beings have a remarkable ability to interpret complex scenes. Psychophysical evidence suggests that the Human Visual System (HVS) preprocesses simple features in parallel over the entire visual field and devotes most of its visual attention to the object-selective region called the Focus of Attention (FOA) [4, 6, 13]. Hence, video analyses such as video summarization [9], video content re-composition [2], and video quality assessment [10] all consider modeling the visual attention system to select the FOA.
Many studies concerning visual attention models have been carried out. A feature-integration theory of attention was proposed by Treisman et al. [13]; it suggested that attention is serially directed to each stimulus in a particular region whenever conjunctions of separable features are needed to characterize the selective objects. A computational visual attention model for images was proposed by Walther et al. [15], built on a biologically plausible architecture proposed by Koch et al. [4]; this model computed multi-scale image features (e.g. intensity, color, and orientation) and combined them into a visual saliency map, but it could only analyze the visual saliency of static images, and temporal features were not taken into consideration. Later, a motion attention model was constructed for video skimming by Ma et al. [5]; in that model, only motion information was used to detect attention regions, so the results were not very accurate because spatial properties of the attention regions were not taken into account. Recently, several improved visual models, such as You et al.'s [16] and Rapantzikos et al.'s [7], which incorporate both spatial and temporal features, were proposed and can locate the FOA more accurately than ever before.
Nowadays, video resolutions keep increasing, which makes it necessary to compress videos for storage and transmission. As a result, videos are distorted by coding with MPEG or H.26x, or by transmission over channels with bit errors. However, distortions are not taken into consideration by any current visual attention model. A spatiotemporal visual attention model which takes the blocking artifact of such distortions into consideration is proposed in this paper. Experimental results show that the proposed model can not only accurately analyze the spatiotemporal saliency based on the intensity, texture, and motion features, but can also identify the more severe blocking artifacts caused by distortions.
2 System model
The HVS controls the FOA in a rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [6]. Psychophysical and physiological experiments reveal that the hierarchically organized HVS processes visual information as follows. First, the early visual areas, such as the Lateral Geniculate Nucleus (LGN) and V1-V4 in the occipital visual cortex, code the low-level features or basic combinations of features [8]. Then, the area called human Middle Temporal+ (hMT+), located in the parietal cortex, is selectively activated by moving versus stationary stimuli and exhibits high contrast sensitivity [3].
Last, the later visual areas, such as the Intra Parietal Sulcus (IPS) in the posterior parietal cortex, which is in charge of visual feature selection [8], and the Frontal Eye Field (FEF), which plays a role in generating contralateral saccades [8], together complete the process of visual attention and locate the FOA.
Figure 1 illustrates the framework of the proposed spatiotemporal visual attention model, which is directed by the HVS; the FOA of a distorted video is detected frame by frame. First of all, spatial features, such as intensities and edge orientations, and temporal features, such as motion intensities and motion orientations, are extracted from the distorted frame. Secondly, five saliency maps representing the intensity contrast, the texture complexity, the motion intensity, the motion contrast coherence, and the motion spatial coherence are jointly considered to produce the spatiotemporal saliency map and identify the spatiotemporal visual attention regions. Meanwhile, the severity of the blocking artifact for each block is calculated according to intensity gradient features, and the blocking artifact saliency map is produced. The procedures above are conducted in a bottom-up manner; an attention selection is then applied, in a top-down manner, to identify the spatiotemporal visual attention region with the relatively more serious blocking artifact as the FOA.
3 Spatiotemporal and distorted saliency maps
3.1 Spatial saliency maps
Contrasts, such as the intensity contrast [1], attract visual attention in the HVS. Humans are usually not sensitive to the local intensity itself, but are easily attracted by regions with higher intensity contrast. Assume that a frame of a distorted video sequence I(x, y) is divided into non-overlapping blocks of N×N pixels each.
Fig. 1 Framework of the visual attention model
Let the standard deviation of the intensities in the (i, j)th block stand for the intensity contrast; the intensity contrast saliency map Ic(i, j) is then
$$I_c(i,j)=\sqrt{\frac{1}{N\times N}\sum_{y=jN}^{(j+1)N-1}\;\sum_{x=iN}^{(i+1)N-1}\big(I(x,y)-\bar{I}\big)^2}\qquad(1)$$

where $\bar{I}$ is the mean intensity of the (i, j)th block. Regions with higher intensity contrast have higher values of Ic(i, j).
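As a minimal illustration (the authors' implementation is in C on Simulink; this sketch, including the helper name intensity_contrast_map, is hypothetical), Eq. 1 amounts to a per-block standard deviation of the luminance:

```python
import numpy as np

def intensity_contrast_map(frame: np.ndarray, N: int = 8) -> np.ndarray:
    """Sketch of Eq. 1: standard deviation of the intensities in each
    N x N block; higher values mark stronger intensity contrast."""
    H, W = frame.shape
    cropped = frame[:H - H % N, :W - W % N].astype(float)
    # axes after reshape: (block row, pixel row, block col, pixel col)
    blocks = cropped.reshape(H // N, N, W // N, N)
    return blocks.std(axis=(1, 3))
```

The standard deviation over both within-block axes is exactly the square root of the mean squared deviation in Eq. 1.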
As mentioned in [12], humans are usually more attracted by regions with higher spatial contrast and weaker correlation with the intensities of nearby image pixels, which are called structure-texture regions. A structure-texture region is typically composed of consistent long edges, while a random-texture region is composed of small edges in various orientations. The texture complexity saliency map is produced according to the number of edge points and edge orientations in the local blocks.
The Sobel edge detector is applied to the frame I(x, y), and E(x, y) is the result of the edge detection. The gradient orientation of each pixel is computed as

$$\theta(x,y)=\arctan\frac{G_{ver}(x,y)}{G_{hor}(x,y)}\qquad(2)$$

where $G_{ver}(x,y)$ and $G_{hor}(x,y)$ are the vertical and horizontal gradients of the pixel, respectively. $\theta(x,y)$ is classified into 4 edge orientations:

$$\theta'(x,y)\in\{0^\circ/180^\circ,\;45^\circ/225^\circ,\;90^\circ/270^\circ,\;135^\circ/315^\circ\}$$
Then, the number of edge orientations cd (the number of different quantized angles) and the number of edge points ne (the points where E(x, y) equals 1) in the (i, j)th N×N block can be counted. If ne > ne* (where ne* is a given threshold), the edge flag ce is set to 1; otherwise, ce is set to 0. The texture complexity saliency map Tc(i, j) is defined as
$$T_c(i,j)=\begin{cases}0.5, & \text{if } c_d=0\\ 1.0, & \text{if } c_d=1\\ (2-c_e)/2, & \text{if } c_d=2\\ (1-c_e)/2, & \text{if } c_d=3\\ 0, & \text{else}\end{cases}\qquad(3)$$
Lastly, Tc(i, j) is smoothed by a 3×3 filter [0, 1, 0; 1, 2, 1; 0, 1, 0]. The closer Tc(i, j) is to 1, the more the texture of the block resembles structure-texture and the more attractive the region is. Conversely, the closer Tc(i, j) is to 0, the more the texture resembles random-texture and the less attractive the region is. Regions where Tc(i, j) equals 0.5 are flat.
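The texture classification of Eqs. 2-3 can be sketched as follows; this is an illustrative reading, not the authors' code. The binarization threshold edge_thresh for the Sobel edge map is an assumed parameter (the paper does not state one), and SciPy's Sobel operator stands in for the Simulink edge detector.

```python
import numpy as np
from scipy import ndimage

def texture_complexity_map(frame, N=8, ne_star=16, edge_thresh=128.0):
    """Sketch of Eqs. 2-3: per-block texture complexity Tc in [0, 1]."""
    g = frame.astype(float)
    ghor = ndimage.sobel(g, axis=1)            # horizontal gradient
    gver = ndimage.sobel(g, axis=0)            # vertical gradient
    E = np.hypot(ghor, gver) > edge_thresh     # edge map (threshold assumed)
    # Quantize gradient orientation into the paper's four classes (Eq. 2)
    theta = np.degrees(np.arctan2(gver, ghor)) % 180.0
    cls = ((theta + 22.5) // 45).astype(int) % 4
    H, W = E.shape
    Tc = np.zeros((H // N, W // N))
    for bi in range(H // N):
        for bj in range(W // N):
            sl = np.s_[bi * N:(bi + 1) * N, bj * N:(bj + 1) * N]
            pts = E[sl]
            ne = int(pts.sum())
            ce = 1 if ne > ne_star else 0                    # edge flag
            cd = len(np.unique(cls[sl][pts])) if ne else 0   # orientations
            Tc[bi, bj] = {0: 0.5, 1: 1.0,
                          2: (2 - ce) / 2,
                          3: (1 - ce) / 2}.get(cd, 0.0)      # Eq. 3
    # 3x3 smoothing with the paper's kernel [0,1,0; 1,2,1; 0,1,0]
    k = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], float)
    return ndimage.convolve(Tc, k / k.sum(), mode='nearest')
```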
3.2 Temporal saliency maps
Motion is the most salient feature attracting visual attention, especially for video sequences [3]. Many current visual attention models are based on motion features [5, 16]. In the proposed model, temporal saliency maps are built on motion features of the scene, namely the motion intensity, the motion contrast coherence, and the motion spatial coherence.
Motion vectors estimated by the full-search algorithm form the foundation of this process. However, video sequences are often captured with the camera panning or zooming, so motion features cannot be computed correctly from the original
motion vectors. Thus, global motion estimation and motion compensation are indispensable; a fast global motion estimation method based on symmetrical eliminations and the differences of motion vectors is adopted [18].
After this global motion compensation, the motion vectors are ready for computing the temporal saliency maps. Let the motion vector of the block indexed (i, j) be (MVx(i, j), MVy(i, j)); the motion intensity MI(i, j) of the block is then
$$MI(i,j)=\sqrt{MV_x(i,j)^2+MV_y(i,j)^2}\qquad(4)$$
The motion intensity saliency map Mi(i, j) represents the mean motion intensity inside a spatial window whose width is (2W+1) blocks and which is centered on the (i, j)th block:

$$M_i(i,j)=\frac{1}{(2W+1)^2}\sum_{n=j-W}^{j+W}\;\sum_{m=i-W}^{i+W}MI(m,n)\qquad(5)$$
The motion contrast coherence saliency map Mc(i, j) measures the contrast of motion intensities within the neighborhood of the (i, j)th block:

$$M_c(i,j)=\begin{cases}\dfrac{MI_{\max}-MI_{\min}}{MI_{\max}}, & \text{if } MI_{\min}\neq 0\\[2mm] \dfrac{MI_{\max}}{MI'_{\max}}, & \text{otherwise}\end{cases}\qquad(6)$$

where MImin and MImax are the minimum and maximum motion intensities in the spatial window, respectively, and MI′max is the maximum motion intensity in a larger spatial window whose width is (10W+1) blocks, centered on the (i, j)th block.
The motion spatial coherence saliency map Ms(i, j) is defined as the entropy of the orientations of the motion vectors inside the spatial window [11]:

$$M_s(i,j)=-\sum_{k=1}^{n_s}p_k\log p_k\qquad(7)$$

where pk is the probability of the kth orientation bin. It measures the consistency of the directions of the motion vectors in a region.
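Under the reconstruction of Eqs. 4-7 above, the three temporal maps can be sketched as below; this is an illustration, not the authors' code. The per-block motion vectors are assumed to be already compensated for global motion, the window half-width W and the number of orientation bins n_bins are assumed parameters, and border windows are simply clipped.

```python
import numpy as np

def temporal_saliency_maps(mvx, mvy, W=1, n_bins=8):
    """Sketch of Eqs. 4-7 over a grid of block motion vectors."""
    MI = np.hypot(mvx, mvy)                       # motion intensity, Eq. 4
    ang = np.arctan2(mvy, mvx)                    # motion orientation
    rows, cols = MI.shape
    Mi, Mc, Ms = (np.zeros_like(MI) for _ in range(3))

    def window(a, i, j, w):                       # (2w+1)-wide, clipped
        return a[max(i - w, 0):i + w + 1, max(j - w, 0):j + w + 1]

    for i in range(rows):
        for j in range(cols):
            win = window(MI, i, j, W)
            Mi[i, j] = win.mean()                 # Eq. 5
            lo, hi = win.min(), win.max()
            if hi > 0:
                if lo != 0:
                    Mc[i, j] = (hi - lo) / hi     # Eq. 6, MI_min != 0
                else:
                    big = window(MI, i, j, 5 * W).max()  # (10W+1)-wide
                    Mc[i, j] = hi / big if big > 0 else 0.0
            p, _ = np.histogram(window(ang, i, j, W),
                                bins=n_bins, range=(-np.pi, np.pi))
            p = p[p > 0] / p.sum()
            Ms[i, j] = -(p * np.log(p)).sum()     # Eq. 7
    return Mi, Mc, Ms
```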
3.3 Blocking artifact coefficient map
The blocking artifact is the most significant distortion among all coding artifacts [17]. The block-wise discrete cosine transform (DCT) is a popular approach for video coding, as it is the foundation of most current video coding standards (e.g. MPEG and H.26x). When the bit-rate is low, blocking artifacts caused by coarse quantization of the DCT coefficients appear across block boundaries. Videos are also transmitted with packet losses or bit errors, which can easily cause serious blocking artifacts. The blocking artifact of a frame is detected as follows.
First of all, the Sobel edge detector is applied to the frame I(x, y); Gver(x, y) and Ghor(x, y) are the vertical and horizontal intensity gradients, respectively.
Secondly, I(x, y) is divided into non-overlapping blocks of N×N pixels each. The horizontal and vertical blocking artifacts of the (i, j)th block, Bhor(i, j) and Bver(i, j), are estimated within template blocks as shown in Fig. 2a and Fig. 2b. Each template block is formed by N×N pixels, exactly one DCT block. The light gray
points are boundary pixels and the dark gray points are inner pixels. The horizontal blocking artifact Bhor(i, j) is acquired as

$$B_{hor}(i,j)=\begin{cases}\alpha\,G_{hb}(i,j), & \text{if } G_{hin}(i,j)=0\\[1mm] \dfrac{G_{hb}(i,j)}{G_{hin}(i,j)}, & \text{otherwise}\end{cases}\qquad(8)$$

where Ghb(i, j) and Ghin(i, j) are the average gradients of the block boundary and the block interior, respectively, and α is a weight.
Finally, Bver(i, j) is estimated in an identical way using the template block shown in Fig. 2b. The blocking artifact saliency map is then given by

$$B(i,j)=\operatorname{mean}\big(B_{hor}(i,j),\;B_{ver}(i,j)\big)\qquad(9)$$
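A simplified sketch of Eqs. 8-9 follows. The split of each block into boundary and inner pixels imitates the template blocks of Fig. 2, but since the exact pixel templates are not recoverable from the text, a plain ring/interior split is assumed here; SciPy's Sobel operator stands in for the paper's edge detector, and α defaults to 0.25 as in the experiments of Section 5.

```python
import numpy as np
from scipy import ndimage

def blocking_artifact_map(frame, N=8, alpha=0.25):
    """Sketch of Eqs. 8-9: per-block blocking-artifact saliency B(i, j)."""
    g = frame.astype(float)
    gver = np.abs(ndimage.sobel(g, axis=0))   # vertical gradients
    ghor = np.abs(ndimage.sobel(g, axis=1))   # horizontal gradients

    def boundary_vs_inner(grad_block):
        """Eq. 8: mean boundary gradient over mean inner gradient."""
        ring = np.concatenate([grad_block[0], grad_block[-1],
                               grad_block[1:-1, 0], grad_block[1:-1, -1]])
        inner = grad_block[1:-1, 1:-1]
        gb, gin = ring.mean(), inner.mean()
        return alpha * gb if gin == 0 else gb / gin

    H, W = g.shape
    B = np.zeros((H // N, W // N))
    for i in range(H // N):
        for j in range(W // N):
            sl = np.s_[i * N:(i + 1) * N, j * N:(j + 1) * N]
            b_hor = boundary_vs_inner(gver[sl])   # horizontal block edges
            b_ver = boundary_vs_inner(ghor[sl])   # vertical block edges
            B[i, j] = (b_hor + b_ver) / 2         # Eq. 9
    return B
```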
4 FOA extraction
The spatiotemporal visual saliency map SMst(i, j) is fused from two spatial saliency maps, the intensity contrast Ic(i, j) and the texture complexity Tc(i, j), and three temporal saliency maps, the motion intensity Mi(i, j), the motion contrast coherence Mc(i, j), and the motion spatial coherence Ms(i, j):
$$SM_{st}(i,j)=\begin{cases}\lambda_1\big[F_s(i,j)+\big(F_t(i,j)+1\big)/2\big], & \text{if } M_c(i,j)\text{ or }M_s(i,j)\geq\mu\\[1mm] \lambda_2\big[F_s(i,j)+\big(F_t(i,j)+M_i(i,j)\big)/2\big], & \text{otherwise}\end{cases}$$

$$F_s(i,j)=\log\big[\big(a_1 I_c(i,j)+a_2 T_c(i,j)\big)/2\big],\qquad F_t(i,j)=a_3 M_c(i,j)+a_4 M_s(i,j)\qquad(10)$$

where the five saliency maps are all normalized to the range [0, 1] for computation and comparison. Blocks whose Mc(i, j) or Ms(i, j) is higher than μ (which is set experimentally) are considered regions with complex motion, and the motion intensities of these blocks are lifted to the maximal value 1. λ1 and λ2 are the weights for blocks with different motion complexities, and a1, a2, a3, and a4 adjust the influence of the individual features.
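Taking the reconstruction of Eq. 10 above at face value (the grouping inside the branches is partly inferred from the surrounding text), the fusion step could look as follows; the small eps guarding the logarithm is an added assumption, since the paper does not say how Fs behaves when both inputs are zero.

```python
import numpy as np

def spatiotemporal_saliency(Ic, Tc, Mi, Mc, Ms, mu=0.8,
                            lam=(1.0, 2.0), a=(1.0, 1.0, 2.0, 1.0),
                            eps=1e-6):
    """Sketch of Eq. 10; all five input maps must be normalized to [0, 1]."""
    a1, a2, a3, a4 = a
    Fs = np.log((a1 * Ic + a2 * Tc) / 2 + eps)   # spatial term (eps assumed)
    Ft = a3 * Mc + a4 * Ms                       # temporal term
    complex_motion = (Mc >= mu) | (Ms >= mu)     # motion intensity lifted to 1
    SM = np.where(complex_motion,
                  lam[0] * (Fs + (Ft + 1.0) / 2),
                  lam[1] * (Fs + (Ft + Mi) / 2))
    SM -= SM.min()                               # renormalize to [0, 1]
    return SM / (SM.max() + eps)
```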
Fig. 2 The template blocks for the blocking artifacts: a for Bhor(i, j); b for Bver(i, j)
A modified spatiotemporal saliency map SM′st(i, j) is acquired by applying a 3×3 median filter. As described in [16], the boundary blocks of visual attention regions, where SM′st(i, j) changes abruptly, can be detected with the Canny edge detection operator. The threshold for extracting the spatiotemporal visual attention regions is

$$S_{threshold}=\Big[\operatorname{avg}(S_b)+\big(\max(S_b)+\min(S_b)\big)/2\Big]\Big/2\qquad(11)$$

where Sb is the set of visual saliency values SM′st(i, j) of the boundary blocks. Regions where SM′st(i, j) is higher than the threshold Sthreshold are identified as visual attention regions:

$$VAR_{st}(i,j)=\begin{cases}1, & \text{if } SM'_{st}(i,j)\geq S_{threshold}\\ 0, & \text{otherwise}\end{cases}\qquad(12)$$
Then, connected regions are extracted from VARst(i, j). The kth connected region VARst^k(i, j) (k = 1, 2, ..., K), having a relatively large area, is selected from the candidates, and a morphological closing operator is used to fill holes.
Moreover, serious blocking artifacts in distorted videos are always identified as a special kind of motion, so they are included in the spatiotemporal visual attention regions VARst^k(i, j). The average blocking artifact of each spatiotemporal visual attention region is calculated as

$$B_{var}(k)=\operatorname{mean}\big(B(s,t)\big),\quad (s,t)\in VAR_{st}^{k}(i,j)\qquad(13)$$
The FOA is defined as a spatiotemporal visual attention region with a relatively serious blocking artifact. There is sometimes more than one visual attention region VARst^k(i, j) that includes serious blocking artifacts, and their average blocking artifacts Bvar(k) may be very close to each other. Thus, not only the visual attention region with the maximum average blocking artifact, max(Bvar(k)), but also the regions whose Bvar(k) is very close to the maximum are selected as the final FOA:

$$VAR_{st\_d}=\bigcup_{l}VAR_{st}^{l}(i,j),\quad l\in\big\{k\;\big|\;B_{var}(k)\geq\varepsilon\%\cdot\max\big(B_{var}(k)\big)\big\}\qquad(14)$$

where VARst^l(i, j) is the lth visual attention region, whose average blocking artifact Bvar(l) is higher than ε percent of the maximum average blocking artifact. ε, which is set experimentally, is very close to 100.
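The pipeline of Eqs. 11-14 can be sketched end-to-end as below; scikit-image's Canny detector and SciPy's morphology stand in for the operators named in the text, the minimum region area of 8 blocks is taken from the experiments in Section 5, and extract_foa is a hypothetical helper name.

```python
import numpy as np
from scipy import ndimage
from skimage.feature import canny

def extract_foa(SM, B, min_area=8, eps_pct=95.0):
    """Sketch of Eqs. 11-14: distortion-weighted FOA selection.

    SM -- spatiotemporal saliency map on the block grid
    B  -- blocking-artifact saliency map on the same grid
    """
    SM = ndimage.median_filter(SM, size=3)        # modified map SM'_st
    boundary = canny(SM)                          # region boundary blocks
    Sb = SM[boundary]
    if Sb.size == 0:
        return np.zeros(SM.shape, bool)
    thr = (Sb.mean() + (Sb.max() + Sb.min()) / 2) / 2      # Eq. 11
    VAR = ndimage.binary_closing(SM >= thr)       # Eq. 12 + hole filling
    labels, n = ndimage.label(VAR)
    bvar = {k: B[labels == k].mean()              # Eq. 13
            for k in range(1, n + 1) if (labels == k).sum() >= min_area}
    foa = np.zeros(SM.shape, bool)
    if bvar:
        bmax = max(bvar.values())
        for k, v in bvar.items():                 # Eq. 14
            if v >= (eps_pct / 100.0) * bmax:
                foa |= labels == k
    return foa
```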
5 Experiments and discussion
The proposed spatiotemporal visual attention model is implemented in Simulink (MATLAB R2008a). All algorithms mentioned above are implemented in the C language, except the edge detectors, which are supplied by the Simulink library. Experiments were run on a PC with an Intel(R) Pentium(R) Duo CPU at 2.80 GHz and 1.50 GB of memory. A set of video sequences chosen from the Video Quality Experts Group (VQEG) Phase I 50 Hz datasets [14] is used in the following experiments; these videos contain different texture and motion complexities.
Each frame is scaled to CIF (352×288), and the luminance I(x, y) is divided into non-overlapping blocks of 8×8 pixels each. For calculating the texture complexity, the threshold on the number of edge points, ne*, is set to 16 experimentally, and the weight α for the blocking artifact coefficient map is set to 0.25. For calculating the spatiotemporal visual saliency map, μ is set to 0.8, λ1 to 1, λ2 to 2, and a1, a2, a3, and a4 to 1, 1, 2, and 1, respectively, in this experiment.
Because the HVS is not very sensitive to temporal features in random-texture regions (where Tc(i, j) < 0.5), Mi(i, j), Mc(i, j), and Ms(i, j) in these regions are suppressed to 0. Regions with a Tc(i, j) value greater than 0.5 are defined as structure-texture regions; there, the temporal features keep their original values. ε, which is used in the FOA extraction, is set to 95 experimentally.
Src3 (Harp) belongs to a type of videos with highly complex local textures and moderate average motion, where the camera is zooming. Src3_hrc11_625 is produced by the no. 11 Hypothetical Reference Circuit (HRC), which is coded by MPEG-2 with I frames only and contains some bit errors [14]. The 72nd frame of this video is shown in Fig. 3a. The left corner of the frame is enlarged in Fig. 3b; an obvious distortion near the hands of the woman is marked with a red circle. The shirt of the man playing the piano, with a branch near him, has highly complex textures, which appear dark in Fig. 3c. Hence, the spatiotemporal saliency in these regions is effectively restrained in Fig. 3d. In contrast, the woman playing the harp has structure-texture, which appears light, and her hands, with high motion intensities, are the most salient feature in Fig. 3d. The threshold is detected near 0.233 (Fig. 3e). Only one connected region with an area larger than 8 blocks is extracted; after filtering with an 8×8 2-D FIR filter, it is shown in Fig. 3f. The acquired FOA, shown in Fig. 3g, contains both the hands and the distortion.
Src4 (Moving graphic) belongs to a type of videos with low texture complexity, low average motion, and no global motion. Src4_hrc12_625 is produced by the no. 12 HRC, which is coded by MPEG-2 and contains some bit errors [14]. The 73rd frame of this video is shown in Fig. 4a. An obvious distortion is near the top icon in the first row and second column, which is enlarged in Fig. 4b. The threshold is detected near 0.306 (Fig. 4e), and three connected regions with areas larger than 8 blocks are extracted. After filtering with an 8×8 2-D FIR filter, they are shown in Fig. 4f: the spider and the moving characters are detected because of their high motion intensity, while the distortion near the icon is detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, shown in Fig. 4g.
Src5 (Canoe) has highly complex textures and high average motion, with the camera following the canoe continuously. Src5_hrc12_625 is also produced by the no. 12 HRC [14].
Fig. 3 FOA extracted from the 72nd frame of src3_hrc11_625: a distorted frame; b local distortion; c map of Tc(i, j); d map of SMst(i, j); e the threshold plane; f spatiotemporal visual attention regions VARst^k(i, j); g FOA VARst_d
The 174th frame of this video is shown in Fig. 5a, with an obvious distortion under the oar and around the canoe in the water, which is enlarged in Fig. 5b. The threshold is detected near 0.267 (Fig. 5e), and three connected regions with areas larger than 8 blocks are extracted. After filtering with an 8×8 2-D FIR filter, they are shown in Fig. 5f: the arm raised by the man is detected because of its high motion intensity, and the distortions under the oar and around the canoe in the water are detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, shown in Fig. 5g.
To validate the proposed visual attention model on distorted videos, Walther's Saliency Toolbox 2.1 for images [15] and You's visual attention model for videos [16] are introduced to extract the FOA.
Fig. 4 FOA extracted from the 73rd frame of src4_hrc12_625: a distorted frame; b local distortion; c map of Tc(i, j); d map of SMst(i, j); e the threshold plane; f spatiotemporal visual attention regions VARst^k(i, j); g FOA VARst_d
Fig. 5 FOA extracted from the 174th frame of src5_hrc12_625: a distorted frame; b local distortion; c map of Tc(i, j); d map of SMst(i, j); e the threshold plane; f spatiotemporal visual attention regions VARst^k(i, j); g FOA VARst_d
In Fig. 6, the 1st column shows the distorted frames: the 72nd frame of src3_hrc11_625, the 73rd frame of src4_hrc12_625, and the 174th frame of src5_hrc12_625. The 2nd column shows the first 4 FOA extracted by Walther's Saliency Toolbox, the 3rd column shows the FOA extracted by You's model, and the last column shows the FOA extracted by the proposed model. Comparing the FOA extracted by the different methods, Walther's Saliency Toolbox and You's model can accurately locate the hands of the woman in Fig. 6a, the spider and moving characters in Fig. 6b, and the raised arms in Fig. 6c, but they cannot identify the serious distortion regions that attract human attention. Moreover, You's model does not consider the influence of texture, so it takes the branch (in src3_hrc11_625) as a visual attention region. The proposed model can not only detect the same regions as the reference models (in Fig. 3f, Fig. 4f, and Fig. 5f), successfully restraining the features in random-texture regions, but also select the regions with distortion as the final FOA (Fig. 6).
The comparison of the time efficiency of FOA extraction by the different visual attention models is shown in Fig. 7. Walther's Saliency Toolbox is designed for images, and the times it reports for the frames in Fig. 6 are 320 ms, 405 ms, and 360 ms, respectively. The times consumed by You's model and by the proposed model, recorded with the Profiler tool of Simulink, are about 430 ms and 490 ms per frame, respectively. The time consumed by each computational module of the FOA extraction is detailed in Table 1 for You's model and the proposed model. The proposed model has to compute more features than You's model (e.g. the texture complexity, the motion contrast coherence, and the motion spatial coherence), especially the blocking artifact of the distorted frame, which takes the distortion into account; on average, it takes about 60 ms per frame more than You's model.
Fig. 6 Comparison of the FOA extracted by Walther's, You's, and the proposed model: a 72nd frame of src3_hrc11_625; b 73rd frame of src4_hrc12_625; c 174th frame of src5_hrc12_625
6 Conclusions
A spatiotemporal visual attention model for video analysis is proposed, which is directed both in a bottom-up and in a top-down manner. The intensity, texture, and motion features are jointly considered to produce the spatiotemporal visual attention regions. Meanwhile, the blocking artifact saliency map is detected according to intensity gradient features. An attention selection is applied to identify the visual attention region with the relatively serious blocking artifact as the FOA. Experimental results show that, compared with Walther's and You's models, the proposed model can not only accurately analyze the spatiotemporal saliency, but can also identify the distortions that attract more attention. However, the time consumed by the proposed FOA extraction is about 490 ms per frame, which is 60 ms more than You's method. Applying a fast motion vector estimation algorithm and simplifying the computation of the temporal saliency are key points for future studies.
Table 1 Comparison of the time consumed by each computational module of FOA extraction in You's and the proposed models

Modules                    You's Model (ms/frame)   Proposed Model (ms/frame)
Intensity contrast         0.97                     0.97
Spatial position           1.01                     -
Texture complexity         -                        2.42
Temporal features          6.05                     50.45
Distortion                 -                        14.80
Motion vector estimation   420                      420
Others                     about 1.00               about 1.00
Average totals             about 429.48             about 489.64
Fig. 7 Comparison of the total time consumed for FOA extraction (ms/frame) by Walther's, You's, and the proposed models on the distorted sequences src3_hrc11_625, src4_hrc12_625, and src5_hrc12_625
Acknowledgements The authors would like to thank the editor and anonymous reviewers for their careful
reviews and valuable comments.
References
1. Aziz MZ, Mertsching B (2008) Fast and robust generation of feature maps for region-based visual attention. IEEE Trans Image Process 17(5):633–644
2. Chen WH, Wang CW, Wu JL (2007) Video adaptation for small display based on content recomposition. IEEE Trans Circuits Syst Video Technol 17(1):43–58
3. Kalanit GS, Rafael M (2004) The human visual cortex. Annu Rev Neurosci 27:649–677
4. Koch C, Ullman S (1985) Shifts in selection in visual attention: toward the underlying neural circuitry. Hum Neurobiol 4(4):219–227
5. Ma YF, Zhang HJ (2002) A model of motion attention for video skimming. Proc Int Conf Image Processing 1:22–25
6. Niebur E, Koch C (1998) Computational architectures for attention. In: Parasuraman R (ed) The attentive brain. MIT, Cambridge, pp 163–186
7. Rapantzikos K, Tsapatsoulis N, Avrithis Y, Kollias S (2007) Bottom-up spatiotemporal visual attention model for video analysis. IET Image Processing 1(2):237–248
8. Serences JT, Yantis S (2006) Selective visual attention and perceptual coherence. Trends Cognit Sci 10(1):38–45
9. Shih HC, Hwang JN, Huang CL (2009) Content-based attention ranking using visual and contextual attention model for baseball videos. IEEE Trans Multimedia 11(2):244–255
10. Stefan W, Praveen M (2008) The evolution of video quality measurement: from PSNR to hybrid metrics. IEEE Trans Broadcast 54(3):660–668
11. Tang CW (2007) Spatiotemporal visual considerations for video coding. IEEE Trans Multimedia 9(2):231–238
12. Tang CW, Chen CH, Yu YH, Tsai CJ (2006) Visual sensitivity guided bit allocation for video coding. IEEE Trans Multimedia 8(1):11–18
13. Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cogn Psychol 12(1):97–136
14. VQEG (2000) Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. VQEG. http://www.vqeg.org
15. Walther D, Koch C (2006) Modeling attention to salient proto-objects. Neural Netw 19:1395–1407
16. You JY, Liu GZ, Li HL (2007) A novel attention model and its application in video analysis. Appl Math Comput 185:963–975
17. Yuen M, Wu HR (1998) A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing 70:247–278
18. Zheng YY (2008) Research on H.264 region-of-interest coding based on visual perception. PhD thesis, Zhejiang University, China (in Chinese)
Hua Zhang was born in Zhejiang Province, China, in 1980. She received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2003 and 2009, respectively. She is currently a lecturer at Hangzhou Dianzi University, Hangzhou, China. Her major research field is image and video processing.
Xiang Tian was born in Anhui Province, China, in 1979. He received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2007, respectively. He is currently a postdoctoral researcher in the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research field is FPGA-based high-performance computing.
Yaowu Chen was born in Liaoning Province, China, in 1963. He received the Ph.D. degree from Zhejiang University, Hangzhou, China, in 1998. He is currently a professor and the director of the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research fields are embedded systems, networked multimedia systems, and electronic instrumentation systems.