
Video image assessment with a distortion-weighing spatiotemporal visual attention model

Hua Zhang · Xiang Tian · Yaowu Chen

Published online: 29 January 2010
© Springer Science+Business Media, LLC 2010

Abstract For the purpose of extracting attention regions from distorted videos, a distortion-weighing spatiotemporal visual attention model is proposed. Based on spatial and temporal saliency maps, visual attention regions are acquired in a bottom-up manner. Meanwhile, a blocking artifact saliency map is detected according to intensity gradient features. An attention selection is then applied in a top-down manner to identify the visual attention region with the relatively more serious blocking artifact as the Focus of Attention (FOA). Experimental results show that, compared with Walther's and You's models, the proposed model can not only accurately analyze the spatiotemporal saliency based on the intensity, texture, and motion features, but also estimate the blocking artifact of distortions.

Keywords Visual attention model · Focus of Attention (FOA) · Saliency map · Spatiotemporal · Distortion

Abbreviations
HVS   Human Visual System
FOA   Focus of Attention
LGN   Lateral Geniculate Nucleus
hMT+  human Middle Temporal+
IPS   Intra Parietal Sulcus
FEF   Frontal Eye Field
VQEG  Video Quality Experts Group
HRC   Hypothetical Reference Circuits

Multimed Tools Appl (2011) 52:221–233
DOI 10.1007/s11042-010-0470-x

The work was partially presented at the 2nd International Congress on Image and Signal Processing (CISP'09).

H. Zhang · X. Tian (*) · Y. Chen
Institute of Advanced Digital Technology and Instrumentation, Zhejiang University, Hangzhou 310027, People's Republic of China
e-mail: [email protected]

H. Zhang
e-mail: [email protected]

Y. Chen
e-mail: [email protected]


    1 Introduction

Human beings have a remarkable ability to interpret complex scenes in video analysis. Psychophysical evidence suggests that the Human Visual System (HVS) can preprocess simple features in parallel over the entire visual field and devote most of its visual attention to the object-selective region called the Focus of Attention (FOA) [4, 6, 13]. Hence, video analysis tasks such as video summarization [9], video content re-composition [2], and video quality assessment [10] all consider modeling the visual attention system in order to select the FOA.

Many studies of visual attention models have been carried out. A feature-integration theory of attention was proposed by Treisman et al. [13]. It suggested that attention is serially directed to each stimulus in a particular region whenever conjunctions of separable features are needed to characterize the selected objects. A computational visual attention model for images was proposed by Walther et al. [15], built on the biologically plausible architecture proposed by Koch et al. [4]; this model computed multi-scale image features (e.g., intensity, color, and orientation) and combined them into a visual saliency map, but it could only analyze the visual saliency of static images, and temporal features were not taken into consideration. Later, a motion attention model was constructed for video skimming by Ma et al. [5]; in that model, only motion information was used to detect attention regions, so the results were not very accurate because the spatial properties of the attention regions were not taken into account. Recently, several improved visual attention models that incorporate both spatial and temporal features, such as You et al.'s [16] and Rapantzikos et al.'s [7], have been proposed and can locate the FOA more accurately than before.

Nowadays, video resolutions keep increasing, which makes it necessary to compress videos for storage and transmission. As a result, videos are distorted by coding (MPEG, H.26x) or by transmission over source channels with bit errors. However, such distortions are not taken into consideration by current visual attention models. In this paper, a spatiotemporal visual attention model that takes the blocking artifact of distortions into consideration is proposed. Experimental results show that the proposed model can not only accurately analyze the spatiotemporal saliency based on the intensity, texture, and motion features, but also identify the more severe blocking artifacts of distortions.

    2 System model

The HVS controls the FOA in a rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [6]. Psychophysical and physiological experiments reveal that the HVS, which is hierarchically organized, processes visual information as follows. First, the early visual areas such as the Lateral Geniculate Nucleus (LGN) and V1–V4 in the occipital visual cortex code the low-level features or basic combinations of features [8]. Then, the area called human Middle Temporal+ (hMT+), located in the parietal cortex, is selectively activated by moving versus stationary stimuli and exhibits high contrast sensitivity [3]. Last, the later visual areas such as the Intra Parietal Sulcus (IPS) in the posterior parietal cortex, which is in charge of visual feature selection [8], and the Frontal Eye Field (FEF), which plays a role in generating contralateral saccades [8], together complete the process of visual attention and locate the FOA.

Figure 1 illustrates the framework of the proposed spatiotemporal visual attention model, which is directed by the HVS; the FOA of a distorted video is detected frame by frame. First of all, spatial features, such as intensities and edge orientations, and temporal features, such as motion intensities and motion orientations, are extracted from the distorted frame. Secondly, five saliency maps representing the intensity contrast, the texture complexity, the motion intensity, the motion contrast coherence, and the motion spatial coherence are jointly considered to produce the spatiotemporal saliency map and identify the spatiotemporal visual attention regions. Meanwhile, the severity of the blocking artifact of each block is calculated according to intensity gradient features, and the blocking artifact saliency map is produced. The procedures above are conducted in a bottom-up manner; an attention selection is then applied in a top-down manner to identify the spatiotemporal visual attention region with the relatively more serious blocking artifact as the FOA.

Fig. 1 Framework of the visual attention model
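To make the data flow of Fig. 1 concrete, the sketch below outlines the per-frame processing order as a minimal Python/NumPy skeleton. It is illustrative only: the authors' implementation is written in C under Simulink, and the function and parameter names here (detect_foa_frame and the injected stage callables) are assumptions. Possible implementations of the individual stages are sketched in Sections 3 and 4.

# A minimal per-frame pipeline skeleton following Fig. 1 (illustrative only; not the
# authors' C/Simulink code). Each stage is injected as a callable so the skeleton
# stays self-contained.
from typing import Callable, Tuple
import numpy as np

def detect_foa_frame(frame: np.ndarray,
                     prev_frame: np.ndarray,
                     spatial_stage: Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]],
                     temporal_stage: Callable[[np.ndarray, np.ndarray],
                                              Tuple[np.ndarray, np.ndarray, np.ndarray]],
                     blocking_stage: Callable[[np.ndarray], np.ndarray],
                     fuse_stage: Callable[..., np.ndarray],
                     select_stage: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Return a binary block mask marking the FOA of one distorted frame."""
    ic, tc = spatial_stage(frame)                   # bottom-up: Ic, Tc (Section 3.1)
    mi, mc, ms = temporal_stage(frame, prev_frame)  # bottom-up: Mi, Mc, Ms (Section 3.2)
    b_map = blocking_stage(frame)                   # blocking artifact saliency (Section 3.3)
    sm_st = fuse_stage(ic, tc, mi, mc, ms)          # spatiotemporal saliency map (Section 4)
    return select_stage(sm_st, b_map)               # top-down, distortion-weighted FOA (Section 4)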

    3 Spatiotemporal and distorted saliency maps

    3.1 Spatial saliency maps

Contrasts, such as intensity contrast [1], attract visual attention in the HVS. Humans are usually not sensitive to the local intensity itself, but are easily attracted by regions with higher intensity contrast. Assume that a frame of a distorted video sequence I(x, y) is divided into non-overlapping blocks of N×N pixels each, and let the standard deviation of the intensities in the (i, j)th block stand for the intensity contrast. The intensity contrast saliency map Ic(i, j) then becomes

I_c(i,j) = \sqrt{\frac{1}{N \times N} \sum_{y=jN+1}^{(j+1)N} \sum_{x=iN+1}^{(i+1)N} \bigl(I(x,y) - \bar{I}\bigr)^2}    (1)

where Ī is the mean intensity of the (i, j)th block. Regions with higher intensity contrast have higher values of Ic(i, j).
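For illustration, a minimal NumPy sketch of (1) is given below; the block size N = 8 follows the experiments in Section 5, while the function name and the cropping of partial edge blocks are assumptions rather than the authors' implementation.

import numpy as np

def intensity_contrast_map(frame: np.ndarray, N: int = 8) -> np.ndarray:
    """Ic(i, j): standard deviation of the luminance inside each N x N block (Eq. 1)."""
    H, W = frame.shape
    # Crop so the frame tiles exactly into N x N blocks, then group pixels by block.
    blocks = frame[:H - H % N, :W - W % N].astype(np.float64)
    blocks = blocks.reshape(H // N, N, W // N, N).swapaxes(1, 2)
    # Population standard deviation per block, i.e. sqrt(mean((I - mean)^2)).
    return blocks.std(axis=(2, 3))

# Example: a random 288 x 352 (CIF) luminance frame yields a 36 x 44 saliency map.
if __name__ == "__main__":
    ic = intensity_contrast_map(np.random.randint(0, 256, (288, 352)), N=8)
    print(ic.shape)  # (36, 44)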

As mentioned in [12], humans are usually more attracted by regions with higher spatial contrast and weaker correlations among the intensities of nearby image pixels, which are called structure-texture regions. A structure-texture region is typically composed of consistent long edges, while a random-texture region is composed of small edges in various orientations. The texture complexity saliency map is produced according to the number of edge points and edge orientations in the local blocks.

The Sobel edge detector is applied to the frame I(x, y), and E(x, y) is the result of the edge detection. The gradient orientation of each pixel is computed as

\theta(x,y) = \arctan\frac{G_{ver}(x,y)}{G_{hor}(x,y)}    (2)

where Gver(x, y) and Ghor(x, y) are respectively the vertical and horizontal gradients of the pixel. θ(x, y) is classified into 4 edge orientations:

\theta'(x,y) \in \{0^\circ/180^\circ,\; 45^\circ/225^\circ,\; 90^\circ/270^\circ,\; 135^\circ/315^\circ\}

Then, the number of distinct edge orientations cd and the number of edge points ne (the points where E(x, y) equals 1) in the (i, j)th N×N block can be counted. If ne > ne* (where ne* is a given threshold), the edge flag ce is set to 1; otherwise, ce is set to 0. The texture complexity saliency map Tc(i, j) is defined as

T_c(i,j) = \begin{cases} 0.5, & \text{if } c_d = 0 \\ 1.0, & \text{if } c_d = 1 \\ (2 - c_e)/2, & \text{if } c_d = 2 \\ (1 - c_e)/2, & \text{if } c_d = 3 \\ 0, & \text{otherwise} \end{cases}    (3)

Lastly, Tc(i, j) is smoothed by a 3×3 filter [0, 1, 0; 1, 2, 1; 0, 1, 0]. The closer Tc(i, j) is to 1, the more the texture of the block resembles structure-texture and the more attractive the region is; conversely, the closer Tc(i, j) is to 0, the more the texture resembles random texture and the less attractive the region is. Regions where Tc(i, j) equals 0.5 are flat.
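A possible NumPy/SciPy rendering of (2) and (3) is sketched below. The Sobel gradients, the four orientation classes, and the threshold ne* come from the text; the helper name, the gradient-magnitude threshold used to binarize E(x, y), and the normalization of the smoothing kernel are assumptions.

import numpy as np
from scipy import ndimage

def texture_complexity_map(frame: np.ndarray, N: int = 8,
                           ne_star: int = 16, edge_thresh: float = 100.0) -> np.ndarray:
    """Tc(i, j) from edge points and quantized edge orientations (Eqs. 2-3)."""
    f = frame.astype(np.float64)
    g_hor = ndimage.sobel(f, axis=1)                 # horizontal gradient Ghor
    g_ver = ndimage.sobel(f, axis=0)                 # vertical gradient Gver
    edges = np.hypot(g_hor, g_ver) > edge_thresh     # binary edge map E(x, y) (threshold assumed)
    theta = np.degrees(np.arctan2(g_ver, g_hor)) % 180.0   # orientation, Eq. 2
    bins = np.round(theta / 45.0).astype(int) % 4          # 0/45/90/135 degree classes

    H, W = f.shape
    rows, cols = H // N, W // N
    tc = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            blk = np.s_[i * N:(i + 1) * N, j * N:(j + 1) * N]
            e = edges[blk]
            ne = int(e.sum())                                 # number of edge points
            cd = len(np.unique(bins[blk][e])) if ne else 0    # distinct edge orientations
            ce = 1 if ne > ne_star else 0                     # edge flag
            tc[i, j] = {0: 0.5, 1: 1.0, 2: (2 - ce) / 2, 3: (1 - ce) / 2}.get(cd, 0.0)
    # 3x3 smoothing with the kernel [0 1 0; 1 2 1; 0 1 0] (normalized here).
    kernel = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], dtype=float)
    return ndimage.convolve(tc, kernel / kernel.sum(), mode='nearest')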

    3.2 Temporal saliency maps

Motion is the most salient feature attracting visual attention, especially for video sequences [3]. Many current visual attention models are based on motion features [5, 16]. In the proposed model, temporal saliency maps are built on the basis of motion features in the scene, such as the motion intensity, the motion contrast coherence, and the motion spatial coherence.

Motion vectors estimated by the full search algorithm form the foundation of this process. However, video sequences are often captured while the camera is panning or zooming, so motion features cannot be computed correctly from the original motion vectors. Thus, global motion estimation and motion compensation are indispensable, and a fast global motion estimation method based on symmetrical eliminations and the difference of motion vectors is adopted [18].

After compensation with this global motion model, the motion vectors are ready for computing the temporal saliency maps. Let the motion vector of the block indexed (i, j) be (MVx(i, j), MVy(i, j)); the motion intensity MI(i, j) of the block becomes

MI(i,j) = \sqrt{MV_x(i,j)^2 + MV_y(i,j)^2}    (4)

The motion intensity saliency map Mi(i, j) represents the mean motion intensity inside a spatial window that is (2W+1) blocks wide and centered on the (i, j)th block:

M_i(i,j) = \frac{1}{(2W+1)^2} \sum_{n=j-W}^{j+W} \sum_{m=i-W}^{i+W} MI(m,n)    (5)

The motion contrast coherence saliency map Mc(i, j) measures the contrast of motion intensities within the neighborhood of the (i, j)th block:

M_c(i,j) = \begin{cases} \dfrac{MI_{max} - MI_{min}}{MI_{max}}, & \text{if } MI_{min} \neq 0 \\[4pt] \dfrac{MI_{max}}{MI'_{max}}, & \text{otherwise} \end{cases}    (6)

where MImin and MImax are respectively the minimum and maximum motion intensities in the spatial window, and MI'max is the maximum motion intensity in a larger spatial window that is (10W+1) blocks wide and centered on the (i, j)th block.

The motion spatial coherence saliency map Ms(i, j) is defined as the entropy of the orientations of the motion vectors inside the spatial window [11]:

M_s(i,j) = -\sum_{k=1}^{n_s} p_k \log p_k    (7)

where pk is the probability of the kth of the ns orientation bins. It measures the consistency of the motion vectors in a region.
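The sketch below illustrates (4)–(7), assuming the block motion vectors have already been global-motion compensated; the window half-width W, the number of orientation bins ns, and the use of SciPy window filters are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy import ndimage

def temporal_saliency_maps(mvx: np.ndarray, mvy: np.ndarray,
                           W: int = 1, ns: int = 8, eps: float = 1e-12):
    """Return (Mi, Mc, Ms) from compensated block motion vectors (Eqs. 4-7)."""
    mi_raw = np.hypot(mvx, mvy)                                   # MI(i, j), Eq. 4
    win = 2 * W + 1
    Mi = ndimage.uniform_filter(mi_raw, size=win, mode='nearest') # Eq. 5

    # Eq. 6: contrast of MI inside the (2W+1) window, normalized by the local maximum,
    # or by the maximum over a larger (10W+1) window when the local minimum is zero.
    mi_max = ndimage.maximum_filter(mi_raw, size=win, mode='nearest')
    mi_min = ndimage.minimum_filter(mi_raw, size=win, mode='nearest')
    mi_max_big = ndimage.maximum_filter(mi_raw, size=10 * W + 1, mode='nearest')
    Mc = np.where(mi_min != 0,
                  (mi_max - mi_min) / (mi_max + eps),
                  mi_max / (mi_max_big + eps))

    # Eq. 7: entropy of motion-vector orientations inside the window.
    angles = np.arctan2(mvy, mvx)                                 # range [-pi, pi]
    k = np.floor((angles + np.pi) / (2 * np.pi) * ns).astype(int) % ns
    Ms = np.zeros_like(mi_raw, dtype=float)
    for b in range(ns):
        p_b = ndimage.uniform_filter((k == b).astype(float), size=win, mode='nearest')
        Ms -= np.where(p_b > 0, p_b * np.log(p_b + eps), 0.0)
    return Mi, Mc, Ms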

    3.3 Blocking artifact coefficient map

Blocking artifact is the most significant distortion among all coding artifacts [17]. The block-wise discrete cosine transform (DCT) is a popular approach for video coding since it is the foundation of most current video coding standards (e.g., MPEG and H.26x). When the bit rate is low, blocking artifacts caused by coarse quantization of the DCT coefficients appear across block boundaries. Videos are also transmitted with packet losses or bit errors, which can easily cause serious blocking artifacts. The blocking artifact of a frame is detected as follows.

First of all, the Sobel edge detector is applied to the frame I(x, y); Gver(x, y) and Ghor(x, y) are respectively the vertical and the horizontal intensity gradients.

Secondly, I(x, y) is divided into non-overlapping blocks of N×N pixels each. The horizontal and vertical blocking artifacts of the (i, j)th block, Bhor(i, j) and Bver(i, j), are respectively estimated within template blocks as shown in Fig. 2a and Fig. 2b. Each template block is formed by N×N pixels, which is exactly one DCT block. The light gray points are boundary pixels and the dark gray points are inner pixels. The horizontal blocking artifact Bhor(i, j) is acquired as

B_{hor}(i,j) = \begin{cases} \alpha \, G_{hb}(i,j), & \text{if } G_{hin}(i,j) = 0 \\[2pt] \dfrac{G_{hb}(i,j)}{G_{hin}(i,j)}, & \text{otherwise} \end{cases}    (8)

where Ghb(i, j) and Ghin(i, j) are the average gradients of the block boundary and the block interior, respectively.

Finally, Bver(i, j) is estimated in an identical way using the template block shown in Fig. 2b. The blocking artifact saliency map is given as

B(i,j) = \mathrm{mean}\bigl(B_{hor}(i,j),\; B_{ver}(i,j)\bigr)    (9)

Fig. 2 The template blocks for the blocking artifacts: a for Bhor(i, j), b for Bver(i, j)
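A minimal sketch of (8) and (9) follows, assuming 8×8 DCT-aligned blocks and α = 0.25 as in Section 5; which rows and columns count as boundary versus interior pixels is inferred from the description of Fig. 2 and should be read as an assumption.

import numpy as np
from scipy import ndimage

def blocking_artifact_map(frame: np.ndarray, N: int = 8, alpha: float = 0.25,
                          eps: float = 1e-12) -> np.ndarray:
    """B(i, j): block-wise blocking-artifact saliency (Eqs. 8-9).

    Boundary pixels are taken as the outermost rows/columns of each N x N block and
    interior pixels as the rest; this split is an assumption based on Fig. 2.
    """
    f = frame.astype(np.float64)
    g_hor = np.abs(ndimage.sobel(f, axis=1))   # |Ghor|, responds to vertical block boundaries
    g_ver = np.abs(ndimage.sobel(f, axis=0))   # |Gver|, responds to horizontal block boundaries

    H, W = f.shape
    rows, cols = H // N, W // N
    B = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            blk = np.s_[i * N:(i + 1) * N, j * N:(j + 1) * N]
            gh, gv = g_hor[blk], g_ver[blk]
            # Horizontal artifact: boundary = first/last columns, interior = the rest.
            ghb, ghin = gh[:, [0, -1]].mean(), gh[:, 1:-1].mean()
            b_hor = alpha * ghb if ghin < eps else ghb / ghin
            # Vertical artifact: boundary = first/last rows, interior = the rest.
            gvb, gvin = gv[[0, -1], :].mean(), gv[1:-1, :].mean()
            b_ver = alpha * gvb if gvin < eps else gvb / gvin
            B[i, j] = 0.5 * (b_hor + b_ver)    # Eq. 9: mean of the two directions
    return B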

    4 FOA extraction

The spatiotemporal visual saliency map SMst(i, j) is fused from two spatial saliency maps, the intensity contrast Ic(i, j) and the texture complexity Tc(i, j), and three temporal saliency maps, the motion intensity Mi(i, j), the motion contrast coherence Mc(i, j), and the motion spatial coherence Ms(i, j):

SM_{st}(i,j) = \begin{cases} \lambda_1\,\bigl[F_s(i,j) + F_t(i,j)\bigr], & \text{if } M_c(i,j) \text{ or } M_s(i,j) \ge \mu \\[4pt] \lambda_2\,\dfrac{F_s(i,j) + F_t(i,j) + M_i(i,j)}{2}, & \text{otherwise} \end{cases}

F_s(i,j) = \dfrac{\log\bigl[\alpha_1 I_c(i,j) + \alpha_2 T_c(i,j)\bigr]}{2}, \qquad F_t(i,j) = \alpha_3 M_c(i,j) + \alpha_4 M_s(i,j)    (10)

where the five saliency maps are all normalized to the range [0, 1] for computation and comparison. Blocks whose Mc(i, j) or Ms(i, j) is higher than μ (which is set experimentally) are considered regions with complex motions, and the motion intensities of these blocks are lifted to the maximal value 1. λ1 and λ2 are the weights for blocks with different motion complexities, and α1, α2, α3, and α4 are used to adjust the influence of the features.
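Since (10) is reconstructed here from a damaged scan, the grouping of terms in the sketch below is tentative; it only illustrates how the five normalized maps could be fused with the weights λ1, λ2, α1–α4 and the threshold μ reported in Section 5. The min-max normalization helper and the small eps inside the logarithm are assumptions.

import numpy as np

def _normalize(m: np.ndarray) -> np.ndarray:
    """Scale a saliency map to [0, 1] (assumed min-max normalization)."""
    m = m.astype(np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_spatiotemporal(Ic, Tc, Mi, Mc, Ms,
                        mu=0.8, lam1=1.0, lam2=2.0,
                        a1=1.0, a2=1.0, a3=2.0, a4=1.0, eps=1e-6) -> np.ndarray:
    """SMst(i, j) as reconstructed in Eq. 10 (term grouping is an assumption)."""
    Ic, Tc, Mi, Mc, Ms = map(_normalize, (Ic, Tc, Mi, Mc, Ms))
    complex_motion = (Mc >= mu) | (Ms >= mu)       # blocks with complex motion
    Mi = np.where(complex_motion, 1.0, Mi)         # lift their motion intensity to 1
    Fs = np.log(a1 * Ic + a2 * Tc + eps) / 2.0     # spatial term (eps avoids log(0))
    Ft = a3 * Mc + a4 * Ms                         # temporal term
    return np.where(complex_motion,
                    lam1 * (Fs + Ft),
                    lam2 * (Fs + Ft + Mi) / 2.0)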


  • 7/26/2019 Video Image Assessment With a Distortion-weighing

    7/13

    A modified spatiotemporal saliency map SMst(i, j) is acquired by using a 33 median

    filter. As described in [16], the detection of boundary blocks of visual attention regions

    whose SMst(i, j) change abruptly can be achieved by using the Canny edge detection

    operator. The threshold for extracting the spatiotemporal visual attention region is

    Sthreshold avg Sb max Sb ; min Sb =2

    2 11

    where Sb is a set of the visual saliency SMst(i, j) for boundary blocks. Regions where

    SMst(i, j) is higher than the threshold Sthresholdare identified to visual attention regions.

    VARst i;j 1; if SM0st i;j Sthreshold

    0; otherwise

    12

Then, connected regions are extracted from VARst(i, j). The kth connected region VARstk(i, j) (k = 1, 2, ..., K), which has a relatively larger area, is selected from the candidates, and a morphological closing operator is used to fill holes.

Moreover, serious blocking artifacts in distorted videos are always identified as a special kind of motion, so they are included in the spatiotemporal visual attention regions VARstk(i, j). The average blocking artifact of each spatiotemporal visual attention region is calculated by

B_{var}(k) = \mathrm{mean}\bigl(B(s,t)\bigr), \quad (s,t) \in VAR_{st}^{k}(i,j)    (13)

The FOA is defined as a spatiotemporal visual attention region with a relatively serious blocking artifact. There is sometimes more than one visual attention region VARstk(i, j) that includes serious blocking artifacts, and their average blocking artifacts Bvar(k) are very close to each other. Thus, not only the visual attention region with the maximum average blocking artifact (max(Bvar(k))), but also the regions whose Bvar(k) is very close to the maximum, are selected as the final FOA:

VAR_{st\_d} = \bigl\{ VAR_{st}^{l}(i,j) \bigr\}, \quad l \in \bigl\{ k \mid B_{var}(k) \ge \varepsilon\% \cdot \max\bigl(B_{var}(k)\bigr) \bigr\}    (14)

where VARstl(i, j) is the lth visual attention region whose average blocking artifact Bvar(l) is higher than ε percent of the maximum average blocking artifact. ε, which is set experimentally, is very close to 100.
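The sketch below strings together (11)–(14): median filtering, boundary detection with a Canny operator (here scikit-image's canny as a stand-in for the step described in [16]), thresholding, connected-region extraction with hole filling, and selection of the regions whose average blocking artifact is within ε percent of the maximum. The minimum region area of 8 blocks and ε = 95 follow Section 5; the function names and the fallback when no boundary blocks are found are assumptions.

import numpy as np
from scipy import ndimage
from skimage.feature import canny   # stand-in for the Canny step of [16]

def extract_foa(sm_st: np.ndarray, b_map: np.ndarray,
                min_area: int = 8, eps_percent: float = 95.0) -> np.ndarray:
    """Binary FOA mask from the spatiotemporal saliency and blocking maps (Eqs. 11-14)."""
    sm = ndimage.median_filter(sm_st, size=3)                  # modified map SM'st
    boundary = canny(sm)                                       # boundary blocks of attention regions
    sb = sm[boundary]
    if sb.size == 0:                                           # fallback if no boundary is found
        sb = sm.ravel()
    s_threshold = (sb.mean() + (sb.max() + sb.min()) / 2) / 2  # Eq. 11
    var = sm >= s_threshold                                    # Eq. 12
    var = ndimage.binary_closing(ndimage.binary_fill_holes(var))  # close and fill holes

    labels, k = ndimage.label(var)                             # connected regions VARst^k
    foa = np.zeros_like(var, dtype=bool)
    b_avg = {}
    for lab in range(1, k + 1):
        region = labels == lab
        if region.sum() >= min_area:                           # keep relatively large regions
            b_avg[lab] = b_map[region].mean()                  # Bvar(k), Eq. 13
    if not b_avg:
        return foa
    b_max = max(b_avg.values())
    for lab, b in b_avg.items():                               # Eq. 14: within eps% of the maximum
        if b >= eps_percent / 100.0 * b_max:
            foa |= labels == lab
    return foa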

    5 Experiments and discussion

The proposed spatiotemporal visual attention model is implemented in the Simulink environment of MATLAB (R2008a). All algorithms mentioned above are implemented in the C language, except the edge detectors, which are supplied by the Simulink library. Experiments were run on a PC with an Intel(R) Pentium(R) Duo CPU at 2.80 GHz and 1.50 GB of memory. A set of video sequences chosen from the Video Quality Experts Group (VQEG) Phase I 50 Hz datasets [14] is used in the following experiments. Different texture and motion complexities exist in these videos.

Each frame is scaled to CIF (352×288), and the luminance I(x, y) is divided into non-overlapping blocks of 8×8 pixels each. For the texture complexity, the threshold for the number of edge points ne* is set to 16 experimentally, and the weight α for the blocking artifact coefficient map is set to 0.25. For the spatiotemporal visual saliency map, μ is set to 0.8, λ1 is set to 1, λ2 is set to 2, and α1, α2, α3, and α4 are set to 1, 1, 2, and 1, respectively, in this experiment. Because the HVS is usually not very sensitive to temporal features in random-texture regions (where Tc(i, j) < 0.5), the Mi(i, j), Mc(i, j), and Ms(i, j) of these regions are restrained to 0. Regions with a Tc(i, j) value greater than 0.5 are defined as structure-texture regions, and the values of the temporal features there keep their original values. ε, which is used in the FOA extraction, is set to 95 experimentally.

Src3 (Harp) belongs to a type of video with highly complex local textures and moderate average motion, where the camera is zooming. Src3_hrc11_625 is produced by the no. 11 Hypothetical Reference Circuit (HRC), which is coded by MPEG-2 (I frames only) and contains some bit errors [14]. The 72nd frame of this video is shown in Fig. 3a. The left corner of the frame is enlarged in Fig. 3b; an obvious distortion near the hands of the woman is marked with a red circle. The shirt of the man playing the piano and the branch near him have highly complex textures, which appear dark in Fig. 3c; hence, the spatiotemporal saliency in these regions is effectively restrained in Fig. 3d. In contrast, the woman playing the harp has structure-texture, which appears light, and her hands, with high motion intensities, are the most salient feature in Fig. 3d. The threshold is detected near 0.233 (Fig. 3e). Only one connected region with an area larger than 8 blocks is extracted; after being filtered by an 8×8 2-D FIR filter, it is shown in Fig. 3f. The FOA, which covers both the hands and the distortion, is shown in Fig. 3g.

Fig. 3 FOA extracted from the 72nd frame of src3_hrc11_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

Src4 (Moving graphic) belongs to a type of video with low texture complexity, low average motion, and no global motion. Src4_hrc12_625 is produced by the no. 12 HRC, which is coded by MPEG-2 and contains some bit errors [14]. The 73rd frame of this video is shown in Fig. 4a. An obvious distortion is near the top icon in the first row and second column, which is enlarged in Fig. 4b. The threshold is detected near 0.306 (Fig. 4e), and three connected regions with areas larger than 8 blocks are extracted. After being filtered by an 8×8 2-D FIR filter, they are shown in Fig. 4f: the spider and the moving characters are detected because of their high motion intensity, and the distortion near the icon is detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, as shown in Fig. 4g.

Fig. 4 FOA extracted from the 73rd frame of src4_hrc12_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

Src5 (Canoe) has highly complex textures and high average motion, and the camera follows the canoe continuously. Src5_hrc12_625 is also produced by the no. 12 HRC [14]. The 174th frame of this video is shown in Fig. 5a, with an obvious distortion under the oar and around the canoe in the water, which is enlarged in Fig. 5b. The threshold is detected near 0.267 (Fig. 5e), and three connected regions with areas larger than 8 blocks are extracted. After being filtered by an 8×8 2-D FIR filter, they are shown in Fig. 5f: the arm raised by the man is detected because of its high motion intensity, and the distortions under the oar and around the canoe in the water are detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, as shown in Fig. 5g.

Fig. 5 FOA extracted from the 174th frame of src5_hrc12_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

To validate the proposed visual attention model on distorted videos, Walther's Saliency Toolbox 2.1 for images [15] and You's visual attention model for videos [16] are introduced to extract the FOA. In Fig. 6, the 1st column shows the distorted frames: the 72nd frame of src3_hrc11_625, the 73rd frame of src4_hrc12_625, and the 174th frame of src5_hrc12_625. The 2nd column shows the first 4 FOA extracted by Walther's Saliency Toolbox, the 3rd column shows the FOA extracted by You's model, and the last column shows the FOA extracted by the proposed model. Comparing the FOA extracted by the different methods, Walther's Saliency Toolbox and You's model can accurately locate the hands of the woman in Fig. 6a, the spider and moving characters in Fig. 6b, and the raised arm in Fig. 6c, but they cannot identify the seriously distorted region, which attracts human attention. Moreover, You's model does not consider the influence of texture and therefore takes the branch (in src3_hrc11_625) as a visual attention region. The proposed model can not only detect the same regions as the reference models (Fig. 3f, Fig. 4f, and Fig. 5f) while successfully restraining the features in random-texture regions, but also select the region with distortion as the final FOA (Fig. 6).

Fig. 6 Comparison of the FOA extracted by Walther's, You's, and the proposed models: a 72nd frame of src3_hrc11_625, b 73rd frame of src4_hrc12_625, c 174th frame of src5_hrc12_625

The comparison of the time efficiency of FOA extraction by the different visual attention models is shown in Fig. 7. Walther's Saliency Toolbox is designed for images, and the time consumed is reported automatically by the toolbox itself: 320 ms, 405 ms, and 360 ms, respectively, for the frames given in Fig. 6. The times consumed by You's model and the proposed model are recorded by the Profile tool of Simulink, and are respectively about 430 ms and 490 ms per frame. The time consumed by each computational module of the FOA extraction is detailed in Table 1 for You's model and the proposed model. The proposed model needs to compute more features (e.g., the texture complexity, the motion contrast coherence, and the motion spatial coherence) than You's model, and in particular it computes the blocking artifact of the distorted frame, which takes the distortion into account. It takes about 60 ms more per frame than You's model on average.

    6 Conclusions

A spatiotemporal visual attention model for video analysis is proposed, which is directed in both a bottom-up and a top-down manner. The intensity, texture, and motion features are jointly considered to produce spatiotemporal visual attention regions. Meanwhile, the blocking artifact saliency map is detected according to intensity gradient features. An attention selection is applied to identify the visual attention region with the relatively more serious blocking artifact as the FOA. Experimental results show that, compared with Walther's and You's models, the proposed model can not only accurately analyze the spatiotemporal saliency, but also identify the distortions that are more attractive to viewers. However, the time consumed by the proposed FOA extraction is about 490 ms per frame, which is 60 ms more than You's method. Applying a fast motion vector estimation algorithm and simplifying the computation of the temporal saliency are the key points for future studies.

Table 1 Comparison of the time consumed by each computational module of FOA extraction in You's and the proposed models

Modules                     You's model (ms/frame)    Proposed model (ms/frame)
Intensity contrast          0.97                      0.97
Spatial position            1.01                      –
Texture complexity          –                         2.42
Temporal features           6.05                      50.45
Distortion                  –                         14.80
Motion vector estimation    420                       420
Others                      about 1.00                about 1.00
Average totals              about 429.48              about 489.64

Fig. 7 Comparison of the total time consumed for FOA extraction (ms/frame) by Walther's, You's, and the proposed models on the distorted sequences src3_hrc11_625, src4_hrc12_625, and src5_hrc12_625


    Acknowledgements The authors would like to thank the editor and anonymous reviewers for their careful

    reviews and valuable comments.

References

1. Aziz MZ, Mertsching B (2008) Fast and robust generation of feature maps for region-based visual attention. IEEE Trans Image Process 17(5):633–644
2. Chen WH, Wang CW, Wu JL (2007) Video adaptation for small display based on content recomposition. IEEE Trans Circuits Syst Video Technol 17(1):43–58
3. Kalanit GS, Rafael M (2004) The human visual cortex. Annu Rev Neurosci 27:649–677
4. Koch C, Ullman S (1985) Shifts in selection in visual attention: toward the underlying neural circuitry. Hum Neurobiol 4(4):219–227
5. Ma YF, Zhang HJ (2002) A model of motion attention for video skimming. Proc Int Conf Image Processing 1:22–25
6. Niebur E, Koch C (1998) Computational architectures for attention. In: Parasuraman R (ed) The attentive brain. MIT, Cambridge, pp 163–186
7. Rapantzikos K, Tsapatsoulis N, Avrithis Y, Kollias S (2007) Bottom-up spatiotemporal visual attention model for video analysis. IET Image Processing 1(2):237–248
8. Serences JT, Yantis S (2006) Selective visual attention and perceptual coherence. Trends Cognit Sci 10(1):38–45
9. Shih HC, Hwang JN, Huang CL (2009) Content-based attention ranking using visual and contextual attention model for baseball videos. IEEE Trans Multimedia 11(2):244–255
10. Stefan W, Praveen M (2008) The evolution of video quality measurement: from PSNR to hybrid metrics. IEEE Trans Broadcast 54(3):660–668
11. Tang CW (2007) Spatiotemporal visual considerations for video coding. IEEE Trans Multimedia 9(2):231–238
12. Tang CW, Chen CH, Yu YH, Tsai CJ (2006) Visual sensitivity guided bit allocation for video coding. IEEE Trans Multimedia 8(1):11–18
13. Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cogn Psychol 12(1):97–136
14. VQEG (2000) Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. VQEG. http://www.vqeg.org
15. Walther D, Koch C (2006) Modeling attention to salient proto-objects. Neural Netw 19:1395–1407
16. You JY, Liu GZ, Li HL (2007) A novel attention model and its application in video analysis. Appl Math Comput 185:963–975
17. Yuen M, Wu HR (1998) A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing 70:247–278
18. Zheng YY (2008) Research on H.264 region-of-interest coding based on visual perception. PhD thesis, Zhejiang University, China (in Chinese)

Hua Zhang was born in Zhejiang Province, China, in 1980. She received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2003 and 2009, respectively. She is currently a lecturer at Hangzhou Dianzi University, Hangzhou, China. Her major research field is image and video processing.


Xiang Tian was born in Anhui Province, China, in 1979. He received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2007, respectively. He is currently a postdoctoral researcher in the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research field is FPGA-based high-performance computing.

Yaowu Chen was born in Liaoning Province, China, in 1963. He received the Ph.D. degree from Zhejiang University, Hangzhou, China, in 1998. He is currently a professor and the director of the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research fields are embedded systems, networked multimedia systems, and electronic instrumentation systems.