
Video image assessment with a distortion-weighing spatiotemporal visual attention model

Hua Zhang · Xiang Tian · Yaowu Chen

Published online: 29 January 2010
© Springer Science+Business Media, LLC 2010

Abstract For the purpose of extracting attention regions from distorted videos, a distortion-weighing spatiotemporal visual attention model is proposed. Based on spatial and temporal saliency maps, visual attention regions are acquired in a bottom-up manner. Meanwhile, a blocking artifact saliency map is detected according to intensity gradient features. An attention selection is then applied in a top-down manner to identify the visual attention region with the relatively more serious blocking artifact as the Focus of Attention (FOA). Experimental results show that, compared with Walther's and You's models, the proposed model can not only accurately analyze the spatiotemporal saliency based on the intensity, texture, and motion features, but also estimate the blocking artifact of distortions.

Keywords Visual attention model · Focus of Attention (FOA) · Saliency map · Spatiotemporal · Distortion

Abbreviations
HVS   Human Visual System
FOA   Focus of Attention
LGN   Lateral Geniculate Nucleus
hMT+  human Middle Temporal+
IPS   Intra Parietal Sulcus
FEF   Frontal Eye Field
VQEG  Video Quality Experts Group
HRC   Hypothetical Reference Circuits

Multimed Tools Appl (2011) 52:221–233
DOI 10.1007/s11042-010-0470-x

The work was partially presented at the 2nd International Congress on Image and Signal Processing (CISP'09).

H. Zhang · X. Tian (*) · Y. Chen
Institute of Advanced Digital Technology and Instrumentation, Zhejiang University, Hangzhou 310027, People's Republic of China
e-mail: [email protected]

H. Zhang
e-mail: [email protected]

Y. Chen
e-mail: [email protected]


    1 Introduction

Human beings have a remarkable ability to interpret complex scenes in video analysis. Psychophysical evidence suggests that the Human Visual System (HVS) can preprocess simple features in parallel over the entire visual field and devote most of its visual attention to the object-selective region called the Focus of Attention (FOA) [4, 6, 13]. Hence, video analysis tasks such as video summarization [9], video content re-composition [2], and video quality assessment [10] all consider modeling the visual attention system in order to select the FOA.

Many studies of visual attention models have been carried out. A feature-integration theory of attention was proposed by Treisman et al. [13]. It suggested that attention is serially directed to each stimulus in a particular region whenever conjunctions of separable features are needed to characterize the selected objects. A computational visual attention model for images was proposed by Walther et al. [15], built on the biologically plausible architecture proposed by Koch et al. [4]; this model computed multi-scale image features (e.g., intensity, color, and orientation) and combined them into a visual saliency map, but it could only analyze the visual saliency of static images, and temporal features were not taken into consideration. Later, a motion attention model was constructed for video skimming by Ma et al. [5]; in that model, only motion information was used to detect attention regions, so the results were not very accurate because the spatial properties of the attention regions were not taken into account. Recently, several improved visual attention models that incorporate both spatial and temporal features, such as You et al.'s [16] and Rapantzikos et al.'s [7], have been proposed and can locate the FOA more accurately than before.

Nowadays, video resolutions keep increasing, which makes it necessary to compress videos for storage and transmission. As a result, videos are distorted by coding (MPEG, H.26x) or by transmission over source channels with bit errors. However, such distortions are not taken into consideration by current visual attention models. In this paper, a spatiotemporal visual attention model that takes the blocking artifact of distortions into consideration is proposed. Experimental results show that the proposed model can not only accurately analyze the spatiotemporal saliency based on the intensity, texture, and motion features, but also identify the more severe blocking artifacts of distortions.

    2 System model

The HVS controls the FOA in a rapid, bottom-up, saliency-driven, and task-independent manner as well as in a slower, top-down, volition-controlled, and task-dependent manner [6]. Psychophysical and physiological experiments reveal that the HVS, which is hierarchically organized, processes visual information as follows. First, the early visual areas such as the Lateral Geniculate Nucleus (LGN) and V1–V4 in the occipital visual cortex code the low-level features or basic combinations of features [8]. Then, the area called human Middle Temporal+ (hMT+), located in the parietal cortex, is selectively activated by moving versus stationary stimuli and exhibits high contrast sensitivity [3]. Last, the later visual areas such as the Intra Parietal Sulcus (IPS) in the posterior parietal cortex, which is in charge of visual feature selection [8], and the Frontal Eye Field (FEF), which plays a role in generating contralateral saccades [8], together complete the process of visual attention and locate the FOA.

Figure 1 illustrates the framework of the proposed spatiotemporal visual attention model, which is directed by the HVS; the FOA of a distorted video is detected frame by frame. First of all, spatial features, such as intensities and edge orientations, and temporal features, such as motion intensities and motion orientations, are extracted from the distorted frame. Secondly, five saliency maps representing the intensity contrast, the texture complexity, the motion intensity, the motion contrast coherence, and the motion spatial coherence are jointly considered to produce the spatiotemporal saliency map and identify the spatiotemporal visual attention regions. Meanwhile, the severity of the blocking artifact of each block is calculated according to intensity gradient features, and the blocking artifact saliency map is produced. The procedures above are conducted in a bottom-up manner; an attention selection is then applied in a top-down manner to identify the spatiotemporal visual attention region with the relatively more serious blocking artifact as the FOA.

Fig. 1 Framework of the visual attention model
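To make the data flow of Fig. 1 concrete, the sketch below outlines the per-frame processing order as a minimal Python/NumPy skeleton. It is illustrative only: the authors' implementation is written in C under Simulink, and the function and parameter names here (detect_foa_frame and the injected stage callables) are assumptions. Possible implementations of the individual stages are sketched in Sections 3 and 4.

# A minimal per-frame pipeline skeleton following Fig. 1 (illustrative only; not the
# authors' C/Simulink code). Each stage is injected as a callable so the skeleton
# stays self-contained.
from typing import Callable, Tuple
import numpy as np

def detect_foa_frame(frame: np.ndarray,
                     prev_frame: np.ndarray,
                     spatial_stage: Callable[[np.ndarray], Tuple[np.ndarray, np.ndarray]],
                     temporal_stage: Callable[[np.ndarray, np.ndarray],
                                              Tuple[np.ndarray, np.ndarray, np.ndarray]],
                     blocking_stage: Callable[[np.ndarray], np.ndarray],
                     fuse_stage: Callable[..., np.ndarray],
                     select_stage: Callable[[np.ndarray, np.ndarray], np.ndarray]) -> np.ndarray:
    """Return a binary block mask marking the FOA of one distorted frame."""
    ic, tc = spatial_stage(frame)                   # bottom-up: Ic, Tc (Section 3.1)
    mi, mc, ms = temporal_stage(frame, prev_frame)  # bottom-up: Mi, Mc, Ms (Section 3.2)
    b_map = blocking_stage(frame)                   # blocking artifact saliency (Section 3.3)
    sm_st = fuse_stage(ic, tc, mi, mc, ms)          # spatiotemporal saliency map (Section 4)
    return select_stage(sm_st, b_map)               # top-down, distortion-weighted FOA (Section 4)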

    3 Spatiotemporal and distorted saliency maps

    3.1 Spatial saliency maps

Contrasts, such as intensity contrast [1], attract visual attention in the HVS. Humans are usually not sensitive to the local intensity itself, but are easily attracted by regions with higher intensity contrast. Assume that a frame of a distorted video sequence I(x, y) is divided into non-overlapping blocks of N×N pixels each, and let the standard deviation of the intensities in the (i, j)th block stand for the intensity contrast. The intensity contrast saliency map Ic(i, j) then becomes

I_c(i,j) = \sqrt{\frac{1}{N \times N} \sum_{y=jN+1}^{(j+1)N} \sum_{x=iN+1}^{(i+1)N} \bigl(I(x,y) - \bar{I}\bigr)^2}    (1)

where Ī is the mean intensity of the (i, j)th block. Regions with higher intensity contrast have higher values of Ic(i, j).
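For illustration, a minimal NumPy sketch of (1) is given below; the block size N = 8 follows the experiments in Section 5, while the function name and the cropping of partial edge blocks are assumptions rather than the authors' implementation.

import numpy as np

def intensity_contrast_map(frame: np.ndarray, N: int = 8) -> np.ndarray:
    """Ic(i, j): standard deviation of the luminance inside each N x N block (Eq. 1)."""
    H, W = frame.shape
    # Crop so the frame tiles exactly into N x N blocks, then group pixels by block.
    blocks = frame[:H - H % N, :W - W % N].astype(np.float64)
    blocks = blocks.reshape(H // N, N, W // N, N).swapaxes(1, 2)
    # Population standard deviation per block, i.e. sqrt(mean((I - mean)^2)).
    return blocks.std(axis=(2, 3))

# Example: a random 288 x 352 (CIF) luminance frame yields a 36 x 44 saliency map.
if __name__ == "__main__":
    ic = intensity_contrast_map(np.random.randint(0, 256, (288, 352)), N=8)
    print(ic.shape)  # (36, 44)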

As mentioned in [12], humans are usually more attracted by regions with higher spatial contrast and weaker correlations among the intensities of nearby image pixels, which are called structure-texture regions. A structure-texture region is typically composed of consistent long edges, while a random-texture region is composed of small edges in various orientations. The texture complexity saliency map is produced according to the number of edge points and edge orientations in the local blocks.

The Sobel edge detector is applied to the frame I(x, y), and E(x, y) is the result of the edge detection. The gradient orientation of each pixel is computed as

\theta(x,y) = \arctan\frac{G_{ver}(x,y)}{G_{hor}(x,y)}    (2)

where Gver(x, y) and Ghor(x, y) are respectively the vertical and horizontal gradients of the pixel. θ(x, y) is classified into 4 edge orientations:

\theta'(x,y) \in \{0^\circ/180^\circ,\; 45^\circ/225^\circ,\; 90^\circ/270^\circ,\; 135^\circ/315^\circ\}

Then, the number of distinct edge orientations cd and the number of edge points ne (the points where E(x, y) equals 1) in the (i, j)th N×N block can be counted. If ne > ne* (where ne* is a given threshold), the edge flag ce is set to 1; otherwise, ce is set to 0. The texture complexity saliency map Tc(i, j) is defined as

T_c(i,j) = \begin{cases} 0.5, & \text{if } c_d = 0 \\ 1.0, & \text{if } c_d = 1 \\ (2 - c_e)/2, & \text{if } c_d = 2 \\ (1 - c_e)/2, & \text{if } c_d = 3 \\ 0, & \text{otherwise} \end{cases}    (3)

Lastly, Tc(i, j) is smoothed by a 3×3 filter [0, 1, 0; 1, 2, 1; 0, 1, 0]. The closer Tc(i, j) is to 1, the more the texture of the block resembles structure-texture and the more attractive the region is; conversely, the closer Tc(i, j) is to 0, the more the texture resembles random texture and the less attractive the region is. Regions where Tc(i, j) equals 0.5 are flat.
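A possible NumPy/SciPy rendering of (2) and (3) is sketched below. The Sobel gradients, the four orientation classes, and the threshold ne* come from the text; the helper name, the gradient-magnitude threshold used to binarize E(x, y), and the normalization of the smoothing kernel are assumptions.

import numpy as np
from scipy import ndimage

def texture_complexity_map(frame: np.ndarray, N: int = 8,
                           ne_star: int = 16, edge_thresh: float = 100.0) -> np.ndarray:
    """Tc(i, j) from edge points and quantized edge orientations (Eqs. 2-3)."""
    f = frame.astype(np.float64)
    g_hor = ndimage.sobel(f, axis=1)                 # horizontal gradient Ghor
    g_ver = ndimage.sobel(f, axis=0)                 # vertical gradient Gver
    edges = np.hypot(g_hor, g_ver) > edge_thresh     # binary edge map E(x, y) (threshold assumed)
    theta = np.degrees(np.arctan2(g_ver, g_hor)) % 180.0   # orientation, Eq. 2
    bins = np.round(theta / 45.0).astype(int) % 4          # 0/45/90/135 degree classes

    H, W = f.shape
    rows, cols = H // N, W // N
    tc = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            blk = np.s_[i * N:(i + 1) * N, j * N:(j + 1) * N]
            e = edges[blk]
            ne = int(e.sum())                                 # number of edge points
            cd = len(np.unique(bins[blk][e])) if ne else 0    # distinct edge orientations
            ce = 1 if ne > ne_star else 0                     # edge flag
            tc[i, j] = {0: 0.5, 1: 1.0, 2: (2 - ce) / 2, 3: (1 - ce) / 2}.get(cd, 0.0)
    # 3x3 smoothing with the kernel [0 1 0; 1 2 1; 0 1 0] (normalized here).
    kernel = np.array([[0, 1, 0], [1, 2, 1], [0, 1, 0]], dtype=float)
    return ndimage.convolve(tc, kernel / kernel.sum(), mode='nearest')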

    3.2 Temporal saliency maps

Motion is the most salient feature attracting visual attention, especially for video sequences [3]. Many current visual attention models are based on motion features [5, 16]. In the proposed model, temporal saliency maps are built on the basis of motion features in the scene, such as the motion intensity, the motion contrast coherence, and the motion spatial coherence.

Motion vectors estimated by the full search algorithm form the foundation of this process. However, video sequences are often captured while the camera is panning or zooming, so motion features cannot be computed correctly from the original motion vectors. Thus, global motion estimation and motion compensation are indispensable, and a fast global motion estimation method based on symmetrical eliminations and the difference of motion vectors is adopted [18].

After compensation with this global motion model, the motion vectors are ready for computing the temporal saliency maps. Let the motion vector of the block indexed (i, j) be (MVx(i, j), MVy(i, j)); the motion intensity MI(i, j) of the block becomes

MI(i,j) = \sqrt{MV_x(i,j)^2 + MV_y(i,j)^2}    (4)

The motion intensity saliency map Mi(i, j) represents the mean motion intensity inside a spatial window that is (2W+1) blocks wide and centered on the (i, j)th block:

M_i(i,j) = \frac{1}{(2W+1)^2} \sum_{n=j-W}^{j+W} \sum_{m=i-W}^{i+W} MI(m,n)    (5)

The motion contrast coherence saliency map Mc(i, j) measures the contrast of motion intensities within the neighborhood of the (i, j)th block:

M_c(i,j) = \begin{cases} \dfrac{MI_{max} - MI_{min}}{MI_{max}}, & \text{if } MI_{min} \neq 0 \\[4pt] \dfrac{MI_{max}}{MI'_{max}}, & \text{otherwise} \end{cases}    (6)

where MImin and MImax are respectively the minimum and maximum motion intensities in the spatial window, and MI'max is the maximum motion intensity in a larger spatial window that is (10W+1) blocks wide and centered on the (i, j)th block.

The motion spatial coherence saliency map Ms(i, j) is defined as the entropy of the orientations of the motion vectors inside the spatial window [11]:

M_s(i,j) = -\sum_{k=1}^{n_s} p_k \log p_k    (7)

where pk is the probability of the kth of the ns orientation bins. It measures the consistency of the motion vectors in a region.
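The sketch below illustrates (4)–(7), assuming the block motion vectors have already been global-motion compensated; the window half-width W, the number of orientation bins ns, and the use of SciPy window filters are illustrative assumptions rather than the authors' implementation.

import numpy as np
from scipy import ndimage

def temporal_saliency_maps(mvx: np.ndarray, mvy: np.ndarray,
                           W: int = 1, ns: int = 8, eps: float = 1e-12):
    """Return (Mi, Mc, Ms) from compensated block motion vectors (Eqs. 4-7)."""
    mi_raw = np.hypot(mvx, mvy)                                   # MI(i, j), Eq. 4
    win = 2 * W + 1
    Mi = ndimage.uniform_filter(mi_raw, size=win, mode='nearest') # Eq. 5

    # Eq. 6: contrast of MI inside the (2W+1) window, normalized by the local maximum,
    # or by the maximum over a larger (10W+1) window when the local minimum is zero.
    mi_max = ndimage.maximum_filter(mi_raw, size=win, mode='nearest')
    mi_min = ndimage.minimum_filter(mi_raw, size=win, mode='nearest')
    mi_max_big = ndimage.maximum_filter(mi_raw, size=10 * W + 1, mode='nearest')
    Mc = np.where(mi_min != 0,
                  (mi_max - mi_min) / (mi_max + eps),
                  mi_max / (mi_max_big + eps))

    # Eq. 7: entropy of motion-vector orientations inside the window.
    angles = np.arctan2(mvy, mvx)                                 # range [-pi, pi]
    k = np.floor((angles + np.pi) / (2 * np.pi) * ns).astype(int) % ns
    Ms = np.zeros_like(mi_raw, dtype=float)
    for b in range(ns):
        p_b = ndimage.uniform_filter((k == b).astype(float), size=win, mode='nearest')
        Ms -= np.where(p_b > 0, p_b * np.log(p_b + eps), 0.0)
    return Mi, Mc, Ms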

    3.3 Blocking artifact coefficient map

Blocking artifact is the most significant distortion among all coding artifacts [17]. The block-wise discrete cosine transform (DCT) is a popular approach for video coding since it is the foundation of most current video coding standards (e.g., MPEG and H.26x). When the bit rate is low, blocking artifacts caused by coarse quantization of the DCT coefficients appear across block boundaries. Videos are also transmitted with packet losses or bit errors, which can easily cause serious blocking artifacts. The blocking artifact of a frame is detected as follows.

First of all, the Sobel edge detector is applied to the frame I(x, y); Gver(x, y) and Ghor(x, y) are respectively the vertical and the horizontal intensity gradients.

Secondly, I(x, y) is divided into non-overlapping blocks of N×N pixels each. The horizontal and vertical blocking artifacts of the (i, j)th block, Bhor(i, j) and Bver(i, j), are respectively estimated within template blocks as shown in Fig. 2a and Fig. 2b. Each template block is formed by N×N pixels, which is exactly one DCT block. The light gray points are boundary pixels and the dark gray points are inner pixels. The horizontal blocking artifact Bhor(i, j) is acquired as

B_{hor}(i,j) = \begin{cases} \alpha \, G_{hb}(i,j), & \text{if } G_{hin}(i,j) = 0 \\[2pt] \dfrac{G_{hb}(i,j)}{G_{hin}(i,j)}, & \text{otherwise} \end{cases}    (8)

where Ghb(i, j) and Ghin(i, j) are the average gradients of the block boundary and the block interior, respectively.

Finally, Bver(i, j) is estimated in an identical way using the template block shown in Fig. 2b. The blocking artifact saliency map is given as

B(i,j) = \mathrm{mean}\bigl(B_{hor}(i,j),\; B_{ver}(i,j)\bigr)    (9)

Fig. 2 The template blocks for the blocking artifacts: a for Bhor(i, j), b for Bver(i, j)
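A minimal sketch of (8) and (9) follows, assuming 8×8 DCT-aligned blocks and α = 0.25 as in Section 5; which rows and columns count as boundary versus interior pixels is inferred from the description of Fig. 2 and should be read as an assumption.

import numpy as np
from scipy import ndimage

def blocking_artifact_map(frame: np.ndarray, N: int = 8, alpha: float = 0.25,
                          eps: float = 1e-12) -> np.ndarray:
    """B(i, j): block-wise blocking-artifact saliency (Eqs. 8-9).

    Boundary pixels are taken as the outermost rows/columns of each N x N block and
    interior pixels as the rest; this split is an assumption based on Fig. 2.
    """
    f = frame.astype(np.float64)
    g_hor = np.abs(ndimage.sobel(f, axis=1))   # |Ghor|, responds to vertical block boundaries
    g_ver = np.abs(ndimage.sobel(f, axis=0))   # |Gver|, responds to horizontal block boundaries

    H, W = f.shape
    rows, cols = H // N, W // N
    B = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            blk = np.s_[i * N:(i + 1) * N, j * N:(j + 1) * N]
            gh, gv = g_hor[blk], g_ver[blk]
            # Horizontal artifact: boundary = first/last columns, interior = the rest.
            ghb, ghin = gh[:, [0, -1]].mean(), gh[:, 1:-1].mean()
            b_hor = alpha * ghb if ghin < eps else ghb / ghin
            # Vertical artifact: boundary = first/last rows, interior = the rest.
            gvb, gvin = gv[[0, -1], :].mean(), gv[1:-1, :].mean()
            b_ver = alpha * gvb if gvin < eps else gvb / gvin
            B[i, j] = 0.5 * (b_hor + b_ver)    # Eq. 9: mean of the two directions
    return B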

    4 FOA extraction

The spatiotemporal visual saliency map SMst(i, j) is fused from two spatial saliency maps, the intensity contrast Ic(i, j) and the texture complexity Tc(i, j), and three temporal saliency maps, the motion intensity Mi(i, j), the motion contrast coherence Mc(i, j), and the motion spatial coherence Ms(i, j):

SM_{st}(i,j) = \begin{cases} \lambda_1\,\bigl[F_s(i,j) + F_t(i,j)\bigr], & \text{if } M_c(i,j) \text{ or } M_s(i,j) \ge \mu \\[4pt] \lambda_2\,\dfrac{F_s(i,j) + F_t(i,j) + M_i(i,j)}{2}, & \text{otherwise} \end{cases}

F_s(i,j) = \dfrac{\log\bigl[\alpha_1 I_c(i,j) + \alpha_2 T_c(i,j)\bigr]}{2}, \qquad F_t(i,j) = \alpha_3 M_c(i,j) + \alpha_4 M_s(i,j)    (10)

where the five saliency maps are all normalized to the range [0, 1] for computation and comparison. Blocks whose Mc(i, j) or Ms(i, j) is higher than μ (which is set experimentally) are considered regions with complex motions, and the motion intensities of these blocks are lifted to the maximal value 1. λ1 and λ2 are the weights for blocks with different motion complexities, and α1, α2, α3, and α4 are used to adjust the influence of the features.
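Since (10) is reconstructed here from a damaged scan, the grouping of terms in the sketch below is tentative; it only illustrates how the five normalized maps could be fused with the weights λ1, λ2, α1–α4 and the threshold μ reported in Section 5. The min-max normalization helper and the small eps inside the logarithm are assumptions.

import numpy as np

def _normalize(m: np.ndarray) -> np.ndarray:
    """Scale a saliency map to [0, 1] (assumed min-max normalization)."""
    m = m.astype(np.float64)
    rng = m.max() - m.min()
    return (m - m.min()) / rng if rng > 0 else np.zeros_like(m)

def fuse_spatiotemporal(Ic, Tc, Mi, Mc, Ms,
                        mu=0.8, lam1=1.0, lam2=2.0,
                        a1=1.0, a2=1.0, a3=2.0, a4=1.0, eps=1e-6) -> np.ndarray:
    """SMst(i, j) as reconstructed in Eq. 10 (term grouping is an assumption)."""
    Ic, Tc, Mi, Mc, Ms = map(_normalize, (Ic, Tc, Mi, Mc, Ms))
    complex_motion = (Mc >= mu) | (Ms >= mu)       # blocks with complex motion
    Mi = np.where(complex_motion, 1.0, Mi)         # lift their motion intensity to 1
    Fs = np.log(a1 * Ic + a2 * Tc + eps) / 2.0     # spatial term (eps avoids log(0))
    Ft = a3 * Mc + a4 * Ms                         # temporal term
    return np.where(complex_motion,
                    lam1 * (Fs + Ft),
                    lam2 * (Fs + Ft + Mi) / 2.0)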


  • 7/26/2019 Video Image Assessment With a Distortion-weighing

    7/13

    A modified spatiotemporal saliency map SMst(i, j) is acquired by using a 33 median

    filter. As described in [16], the detection of boundary blocks of visual attention regions

    whose SMst(i, j) change abruptly can be achieved by using the Canny edge detection

    operator. The threshold for extracting the spatiotemporal visual attention region is

    Sthreshold avg Sb max Sb ; min Sb =2

    2 11

    where Sb is a set of the visual saliency SMst(i, j) for boundary blocks. Regions where

    SMst(i, j) is higher than the threshold Sthresholdare identified to visual attention regions.

    VARst i;j 1; if SM0st i;j Sthreshold

    0; otherwise

    12

Then, connected regions are extracted from VARst(i, j). The kth connected region VARstk(i, j) (k = 1, 2, ..., K), which has a relatively larger area, is selected from the candidates, and a morphological closing operator is used to fill holes.

Moreover, serious blocking artifacts in distorted videos are always identified as a special kind of motion, so they are included in the spatiotemporal visual attention regions VARstk(i, j). The average blocking artifact of each spatiotemporal visual attention region is calculated by

B_{var}(k) = \mathrm{mean}\bigl(B(s,t)\bigr), \quad (s,t) \in VAR_{st}^{k}(i,j)    (13)

The FOA is defined as a spatiotemporal visual attention region with a relatively serious blocking artifact. There is sometimes more than one visual attention region VARstk(i, j) that includes serious blocking artifacts, and their average blocking artifacts Bvar(k) are very close to each other. Thus, not only the visual attention region with the maximum average blocking artifact (max(Bvar(k))), but also the regions whose Bvar(k) is very close to the maximum, are selected as the final FOA:

VAR_{st\_d} = \bigl\{ VAR_{st}^{l}(i,j) \bigr\}, \quad l \in \bigl\{ k \mid B_{var}(k) \ge \varepsilon\% \cdot \max\bigl(B_{var}(k)\bigr) \bigr\}    (14)

where VARstl(i, j) is the lth visual attention region whose average blocking artifact Bvar(l) is higher than ε percent of the maximum average blocking artifact. ε, which is set experimentally, is very close to 100.
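The sketch below strings together (11)–(14): median filtering, boundary detection with a Canny operator (here scikit-image's canny as a stand-in for the step described in [16]), thresholding, connected-region extraction with hole filling, and selection of the regions whose average blocking artifact is within ε percent of the maximum. The minimum region area of 8 blocks and ε = 95 follow Section 5; the function names and the fallback when no boundary blocks are found are assumptions.

import numpy as np
from scipy import ndimage
from skimage.feature import canny   # stand-in for the Canny step of [16]

def extract_foa(sm_st: np.ndarray, b_map: np.ndarray,
                min_area: int = 8, eps_percent: float = 95.0) -> np.ndarray:
    """Binary FOA mask from the spatiotemporal saliency and blocking maps (Eqs. 11-14)."""
    sm = ndimage.median_filter(sm_st, size=3)                  # modified map SM'st
    boundary = canny(sm)                                       # boundary blocks of attention regions
    sb = sm[boundary]
    if sb.size == 0:                                           # fallback if no boundary is found
        sb = sm.ravel()
    s_threshold = (sb.mean() + (sb.max() + sb.min()) / 2) / 2  # Eq. 11
    var = sm >= s_threshold                                    # Eq. 12
    var = ndimage.binary_closing(ndimage.binary_fill_holes(var))  # close and fill holes

    labels, k = ndimage.label(var)                             # connected regions VARst^k
    foa = np.zeros_like(var, dtype=bool)
    b_avg = {}
    for lab in range(1, k + 1):
        region = labels == lab
        if region.sum() >= min_area:                           # keep relatively large regions
            b_avg[lab] = b_map[region].mean()                  # Bvar(k), Eq. 13
    if not b_avg:
        return foa
    b_max = max(b_avg.values())
    for lab, b in b_avg.items():                               # Eq. 14: within eps% of the maximum
        if b >= eps_percent / 100.0 * b_max:
            foa |= labels == lab
    return foa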

    5 Experiments and discussion

The proposed spatiotemporal visual attention model is implemented in the Simulink environment of MATLAB (R2008a). All algorithms mentioned above are implemented in the C language, except the edge detectors, which are supplied by the Simulink library. Experiments were run on a PC with an Intel(R) Pentium(R) Duo CPU at 2.80 GHz and 1.50 GB of memory. A set of video sequences chosen from the Video Quality Experts Group (VQEG) Phase I 50 Hz datasets [14] is used in the following experiments. Different texture and motion complexities exist in these videos.

Each frame is scaled to CIF (352×288), and the luminance I(x, y) is divided into non-overlapping blocks of 8×8 pixels each. For the texture complexity, the threshold for the number of edge points ne* is set to 16 experimentally, and the weight α for the blocking artifact coefficient map is set to 0.25. For the spatiotemporal visual saliency map, μ is set to 0.8, λ1 is set to 1, λ2 is set to 2, and α1, α2, α3, and α4 are set to 1, 1, 2, and 1, respectively, in this experiment. Because the HVS is usually not very sensitive to temporal features in random-texture regions (where Tc(i, j) < 0.5), the Mi(i, j), Mc(i, j), and Ms(i, j) of these regions are restrained to 0. Regions with a Tc(i, j) value greater than 0.5 are defined as structure-texture regions, and the values of the temporal features there keep their original values. ε, which is used in the FOA extraction, is set to 95 experimentally.

Src3 (Harp) belongs to a type of video with highly complex local textures and moderate average motion, where the camera is zooming. Src3_hrc11_625 is produced by the no. 11 Hypothetical Reference Circuit (HRC), which is coded by MPEG-2 (I frames only) and contains some bit errors [14]. The 72nd frame of this video is shown in Fig. 3a. The left corner of the frame is enlarged in Fig. 3b; an obvious distortion near the hands of the woman is marked with a red circle. The shirt of the man playing the piano and the branch near him have highly complex textures, which appear dark in Fig. 3c; hence, the spatiotemporal saliency in these regions is effectively restrained in Fig. 3d. In contrast, the woman playing the harp has structure-texture, which appears light, and her hands, with high motion intensities, are the most salient feature in Fig. 3d. The threshold is detected near 0.233 (Fig. 3e). Only one connected region with an area larger than 8 blocks is extracted; after being filtered by an 8×8 2-D FIR filter, it is shown in Fig. 3f. The FOA, which covers both the hands and the distortion, is shown in Fig. 3g.

Fig. 3 FOA extracted from the 72nd frame of src3_hrc11_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

Src4 (Moving graphic) belongs to a type of video with low texture complexity, low average motion, and no global motion. Src4_hrc12_625 is produced by the no. 12 HRC, which is coded by MPEG-2 and contains some bit errors [14]. The 73rd frame of this video is shown in Fig. 4a. An obvious distortion is near the top icon in the first row and second column, which is enlarged in Fig. 4b. The threshold is detected near 0.306 (Fig. 4e), and three connected regions with areas larger than 8 blocks are extracted. After being filtered by an 8×8 2-D FIR filter, they are shown in Fig. 4f: the spider and the moving characters are detected because of their high motion intensity, and the distortion near the icon is detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, as shown in Fig. 4g.

Fig. 4 FOA extracted from the 73rd frame of src4_hrc12_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

Src5 (Canoe) has highly complex textures and high average motion, and the camera follows the canoe continuously. Src5_hrc12_625 is also produced by the no. 12 HRC [14]. The 174th frame of this video is shown in Fig. 5a, with an obvious distortion under the oar and around the canoe in the water, which is enlarged in Fig. 5b. The threshold is detected near 0.267 (Fig. 5e), and three connected regions with areas larger than 8 blocks are extracted. After being filtered by an 8×8 2-D FIR filter, they are shown in Fig. 5f: the arm raised by the man is detected because of its high motion intensity, and the distortions under the oar and around the canoe in the water are detected as a special kind of motion with high motion intensity contrast. The FOA of this frame is located at the region with the most serious blocking artifact, as shown in Fig. 5g.

Fig. 5 FOA extracted from the 174th frame of src5_hrc12_625: a distorted frame, b local distortion, c map of Tc(i, j), d map of SMst(i, j), e the threshold plane, f spatiotemporal visual attention regions VARstk(i, j), g FOA VARst_d

To validate the proposed visual attention model on distorted videos, Walther's Saliency Toolbox 2.1 for images [15] and You's visual attention model for videos [16] are introduced to extract the FOA. In Fig. 6, the 1st column shows the distorted frames: the 72nd frame of src3_hrc11_625, the 73rd frame of src4_hrc12_625, and the 174th frame of src5_hrc12_625. The 2nd column shows the first 4 FOA extracted by Walther's Saliency Toolbox, the 3rd column shows the FOA extracted by You's model, and the last column shows the FOA extracted by the proposed model. Comparing the FOA extracted by the different methods, Walther's Saliency Toolbox and You's model can accurately locate the hands of the woman in Fig. 6a, the spider and moving characters in Fig. 6b, and the raised arm in Fig. 6c, but they cannot identify the seriously distorted region, which attracts human attention. Moreover, You's model does not consider the influence of texture and therefore takes the branch (in src3_hrc11_625) as a visual attention region. The proposed model can not only detect the same regions as the reference models (Fig. 3f, Fig. 4f, and Fig. 5f) while successfully restraining the features in random-texture regions, but also select the region with distortion as the final FOA (Fig. 6).

Fig. 6 Comparison of the FOA extracted by Walther's, You's, and the proposed models: a 72nd frame of src3_hrc11_625, b 73rd frame of src4_hrc12_625, c 174th frame of src5_hrc12_625

The comparison of the time efficiency of FOA extraction by the different visual attention models is shown in Fig. 7. Walther's Saliency Toolbox is designed for images, and the time consumed is reported automatically by the toolbox itself: 320 ms, 405 ms, and 360 ms, respectively, for the frames given in Fig. 6. The times consumed by You's model and the proposed model are recorded by the Profile tool of Simulink, and are respectively about 430 ms and 490 ms per frame. The time consumed by each computational module of the FOA extraction is detailed in Table 1 for You's model and the proposed model. The proposed model needs to compute more features (e.g., the texture complexity, the motion contrast coherence, and the motion spatial coherence) than You's model, and in particular it computes the blocking artifact of the distorted frame, which takes the distortion into account. It takes about 60 ms more per frame than You's model on average.

    6 Conclusions

A spatiotemporal visual attention model for video analysis is proposed, which is directed in both a bottom-up and a top-down manner. The intensity, texture, and motion features are jointly considered to produce spatiotemporal visual attention regions. Meanwhile, the blocking artifact saliency map is detected according to intensity gradient features. An attention selection is applied to identify the visual attention region with the relatively more serious blocking artifact as the FOA. Experimental results show that, compared with Walther's and You's models, the proposed model can not only accurately analyze the spatiotemporal saliency, but also identify the distortions that are more attractive to viewers. However, the time consumed by the proposed FOA extraction is about 490 ms per frame, which is 60 ms more than You's method. Applying a fast motion vector estimation algorithm and simplifying the computation of the temporal saliency are the key points for future studies.

Table 1 Comparison of the time consumed by each computational module of FOA extraction in You's and the proposed models

Modules                     You's model (ms/frame)    Proposed model (ms/frame)
Intensity contrast          0.97                      0.97
Spatial position            1.01                      –
Texture complexity          –                         2.42
Temporal features           6.05                      50.45
Distortion                  –                         14.80
Motion vector estimation    420                       420
Others                      about 1.00                about 1.00
Average totals              about 429.48              about 489.64

Fig. 7 Comparison of the total time consumed for FOA extraction (ms/frame) by Walther's, You's, and the proposed models on the distorted sequences src3_hrc11_625, src4_hrc12_625, and src5_hrc12_625


    Acknowledgements The authors would like to thank the editor and anonymous reviewers for their careful

    reviews and valuable comments.

References

1. Aziz MZ, Mertsching B (2008) Fast and robust generation of feature maps for region-based visual attention. IEEE Trans Image Process 17(5):633–644
2. Chen WH, Wang CW, Wu JL (2007) Video adaptation for small display based on content recomposition. IEEE Trans Circuits Syst Video Technol 17(1):43–58
3. Kalanit GS, Rafael M (2004) The human visual cortex. Annu Rev Neurosci 27:649–677
4. Koch C, Ullman S (1985) Shifts in selection in visual attention: toward the underlying neural circuitry. Hum Neurobiol 4(4):219–227
5. Ma YF, Zhang HJ (2002) A model of motion attention for video skimming. Proc Int Conf Image Processing 1:22–25
6. Niebur E, Koch C (1998) Computational architectures for attention. In: Parasuraman R (ed) The attentive brain. MIT, Cambridge, pp 163–186
7. Rapantzikos K, Tsapatsoulis N, Avrithis Y, Kollias S (2007) Bottom-up spatiotemporal visual attention model for video analysis. IET Image Processing 1(2):237–248
8. Serences JT, Yantis S (2006) Selective visual attention and perceptual coherence. Trends Cognit Sci 10(1):38–45
9. Shih HC, Hwang JN, Huang CL (2009) Content-based attention ranking using visual and contextual attention model for baseball videos. IEEE Trans Multimedia 11(2):244–255
10. Stefan W, Praveen M (2008) The evolution of video quality measurement: from PSNR to hybrid metrics. IEEE Trans Broadcast 54(3):660–668
11. Tang CW (2007) Spatiotemporal visual considerations for video coding. IEEE Trans Multimedia 9(2):231–238
12. Tang CW, Chen CH, Yu YH, Tsai CJ (2006) Visual sensitivity guided bit allocation for video coding. IEEE Trans Multimedia 8(1):11–18
13. Treisman AM, Gelade G (1980) A feature-integration theory of attention. Cogn Psychol 12(1):97–136
14. VQEG (2000) Final report from the Video Quality Experts Group on the validation of objective models of video quality assessment. VQEG. http://www.vqeg.org
15. Walther D, Koch C (2006) Modeling attention to salient proto-objects. Neural Netw 19:1395–1407
16. You JY, Liu GZ, Li HL (2007) A novel attention model and its application in video analysis. Appl Math Comput 185:963–975
17. Yuen M, Wu HR (1998) A survey of hybrid MC/DPCM/DCT video coding distortions. Signal Processing 70:247–278
18. Zheng YY (2008) Research on H.264 region-of-interest coding based on visual perception. PhD thesis, Zhejiang University, China (in Chinese)

Hua Zhang was born in Zhejiang Province, China, in 1980. She received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2003 and 2009, respectively. She is currently a lecturer at Hangzhou Dianzi University, Hangzhou, China. Her major research field is image and video processing.


Xiang Tian was born in Anhui Province, China, in 1979. He received the B.Sc. and Ph.D. degrees from Zhejiang University, Hangzhou, China, in 2001 and 2007, respectively. He is currently a postdoctoral researcher in the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research field is FPGA-based high-performance computing.

Yaowu Chen was born in Liaoning Province, China, in 1963. He received the Ph.D. degree from Zhejiang University, Hangzhou, China, in 1998. He is currently a professor and the director of the Institute of Advanced Digital Technologies and Instrumentation, Zhejiang University. His major research fields are embedded systems, networked multimedia systems, and electronic instrumentation systems.