Video Quality Metric improvement using motion and spatial masking

UPTEC F 16002
Examensarbete 30 hp
29 January 2016

Henrik Näkne

Faculty of Science and Technology (Teknisk-naturvetenskaplig fakultet), UTH unit
Visiting address: Ångströmlaboratoriet, Lägerhyddsvägen 1, Hus 4, Plan 0
Postal address: Box 536, 751 21 Uppsala
Telephone: 018 – 471 30 03
Fax: 018 – 471 30 00
Website: http://www.teknat.uu.se/student

Abstract

Video Quality Metric improvement using motion and spatial masking

Henrik Näkne

Objective video quality assessment is of great importance in video compression and other video processing applications. In today's encoders, Peak Signal to Noise Ratio or Sum of Absolute Differences are often used, though these metrics have limited correlation with perceived quality. In this paper, other block-based quality measures are evaluated that show superior performance on compression distortion when correlation with Mean Opinion Scores is evaluated. The major results are that Block-Based Visual Information Fidelity with optical flow and intra-frame Gaussian weighting outperforms PSNR, VIF, and SSIM. Also, a block-based weighted Mean Squared Error method is proposed that performs better than PSNR and SSIM, though not VIF and BB-VIF, with the advantage of high locality, which is useful in video encoding. The aforementioned weighting methods have not been evaluated with SSIM, which is proposed for further study.

ISSN: 1401-5757, UPTEC F16 002
Examiner: Tomas Nyberg
Subject reviewer: Anders Hast
Supervisor: Jack Enhorn


Popular Science Summary (Populärvetenskaplig sammanfattning)

Objective assessment of video quality has many applications, since video quality is a subjective measure that cannot be modelled directly. This is important in, among other areas, video compression, since perceived quality is not the same as the mathematical measures, e.g. absolute differences between pixels, that are easy to compute and are used in today's video encoders. Subjective quality scores can be obtained by letting many people rate videos of varying quality, where the average rating for each video is taken to represent its quality. For obvious reasons, this cannot be done every time the quality of a video sequence is needed. Instead, it is desirable to find objective mathematical measures that correlate with such average ratings. Since this work was carried out with the long-term goal of improving video compression algorithms, average ratings for compressed videos (≈ 600) of varying quality have been used.

In this work, different methods for assessing video quality have been investigated. One approach has been to examine how existing methods, primarily Visual Information Fidelity (VIF), can be improved by exploiting masking effects from motion and texture. Masking effects arise, for example, when an object moves very fast in a video; the object is then not perceived as clearly, and data can be saved by not compressing that part as carefully. In practice, different parts (blocks) of the image have been weighted depending on how much motion they contain, with good results. Furthermore, weighting based on the relative quality of the blocks has been tried, under the assumption that individual blocks of poor quality affect the perceived quality more than blocks of good quality do. These experiments have not turned out satisfactorily, but could not be rejected either, since they gave good results on other data, and since VIF is a volatile quality measure at the block level.

In addition, a method based on Mean Squared Error has been studied. This method has not correlated as well as VIF, but is interesting because it has high locality (independent blocks), which is desirable in video compression since it allows the compression to be parallelized, reducing the time required. The correlation of this method is markedly better than that of the algorithms that are standard today. A method for assessing image quality (digital photographs), the Directional Statistics based Color Similarity Index (DSCSI), has also been tested on video, since this method, unlike most others, takes all colour channels of an image/video into account. Its correlation was, however, poor.


Acknowledgements

I would like to thank my supervisor Jack Enhorn at Ericsson for valuable advice and contributions to completing this project. I would also like to thank my subject reviewer Anders Hast at Uppsala University for advice and input. Finally, I would like to thank Ericsson and the Visualization department for providing me with an interesting project.


Contents

1 Introduction
  1.1 Background
  1.2 Limitations

2 Theory
  2.1 Digital Video
    2.1.1 RGB Colour Space
    2.1.2 YCbCr Colour Space
    2.1.3 CIELAB/S-CIELAB Colour Space
  2.2 Video Quality Metrics
    2.2.1 Visual Information Fidelity
    2.2.2 Block-Based VIF
    2.2.3 Peak Signal to Noise Ratio
  2.3 Evaluation of Video Quality Metrics
    2.3.1 Pearson's Correlation Coefficient
    2.3.2 Spearman's Rank Order Correlation Coefficient
  2.4 Quaternions
    2.4.1 Quaternion Fourier Transform

3 Method
  3.1 OpenCV
  3.2 Temporal Content
  3.3 Saliency Map
  3.4 Fovea filtering
  3.5 Temporal Hysteresis Model
  3.6 Intra-frame Gaussian Weighting
  3.7 Xu-et-al-method
  3.8 BB-VIF with optical flow
  3.9 Directional Statistics based Color Similarity Index
  3.10 Evaluation Method

4 Results
  4.1 Xu-et-al-method
  4.2 DSCSI
  4.3 Optical Flow
  4.4 Saliency Map
  4.5 Fovea filtering
  4.6 Temporal Hysteresis Model
  4.7 Intra-frame Gaussian Weighting
  4.8 Summary of results

5 Discussion

6 Conclusions and Future Research

Bibliography


List of Figures

2.1 Illustration of the block-based VIF values (left) for a compressed frame (right). The BB-VIF value of this frame is 0.1791.
2.2 Graph of the data in the left image in Figure 2.1, sorted by the VIF value of each block to illustrate the volatility of the individual values.
3.1 Illustration of a saliency map, with block-based saliency values (left) for a video frame (right). The saliency values have been mapped from [SA_min, SA_max] to [1, 5].
3.2 The fovea filter function f(r_p) as a function of the viewing distance per pixel, r_p (left), and its block-wise, normalized implementation, f(r_b) (right), where r_b is the viewing distance per block (16 × 16 pixels).
3.3 Illustration of the weighting function used for the Temporal Hysteresis Model. K = 200, which corresponds to a frame rate of 50 when τ = 2 seconds.
3.4 Illustration of the normalized weighting vector used for intra-frame Gaussian filtering, with standard deviation α = 2K − 1.
3.5 Illustration of the block-wise optical flow vectors for a video frame. Green arrows indicate weight 1 and red arrows weight γ in Equation 3.22.
3.6 Enlargement of an area in Figure 3.5.
3.7 Illustration of how SROCC and PCC values correspond to data, here for PSNR (top) and Curve Fitted (see Section 2.3) PSNR (bottom) quality values with respective MOS values from a 480p dataset consisting of 90 videos.


List of Tables

4.1 Performance of the Xu-et-al-method compared with PSNR.
4.2 Performance of the DSCSI method.
4.3 Performances of different measures with and without Optical Flow weighting.
4.4 Performances of different measures with and without the Saliency Map feature. OF refers to Optical Flow.
4.5 Performances with and without Fovea filtering. OF refers to Optical Flow.
4.6 Performance with and without the Temporal Hysteresis Model. OF refers to Optical Flow.
4.7 Performance with and without the Intra-frame Gaussian Weighting.
4.8 Abbreviations for metrics present in Table 4.9.
4.9 Summary of the results of different metrics; explanation of abbreviations in Table 4.8.


BB-VIF   Block Based Visual Information Fidelity
DSCSI    Directional Statistics based Color Similarity Index
DQFT     Discrete Quaternion Fourier Transform
HVS      Human Visual System
ITU      International Telecommunication Union
MSE      Mean Squared Error
MOS      Mean Opinion Score
PCC      Pearson's Correlation Coefficient
PSNR     Peak Signal to Noise Ratio
QFT      Quaternion Fourier Transform
SROCC    Spearman's Rank Order Correlation Coefficient
SSIM     Structural SIMilarity
VIF      Visual Information Fidelity
VQA      Video Quality Assessment
VQM      Video Quality Metric
VQEG     Video Quality Experts Group


Chapter 1

Introduction

The consumption of video material has increased greatly over the last decades. Its role in both information and entertainment distribution has grown, and is expected to continue to do so. By 2019, an estimated 80–90 % of all consumer IP traffic will be video [1], up from ≈ 60 %. This in a time when total IP traffic will exceed one zettabyte ($10^{21}$ bytes, tripled from 2014), of which 80 % is consumer traffic. The term video here includes both streaming services and peer-to-peer (P2P) traffic. Good video compression will therefore become increasingly important in order to optimize the usage of bandwidth.

One of the problems in video encoding is to determine the quality of compressed video in order to optimize the ratio between quality and bit rate. This report is a study investigating how the performance of Video Quality Metrics (VQMs) can be improved to enhance video encoders: primarily by examining if and how optical flow (motion in video) can be used as input to existing VQMs, and secondarily by investigating whether algorithms that use all colour channels can also improve performance.

1.1 Background

One of the problems in video encoding is to determine the quality of compressed video. Whereas it is trivial to tell whether one compressed video is smaller than another in terms of data storage, it is not trivial to tell which one is experienced as having the best quality. This is because video quality is a subjective measure, and the goal of the development of video encoders is to optimize perceived quality per used data bit. The Human Visual System (HVS) is one of the most complex signal-processing systems known, and its properties, which are both biological and psychological, are not fully understood. One way to determine the quality of a compressed video is to have a large enough number of people view the video and rate its quality, thus approximating the subjective opinion of an average human observer, a Mean Opinion Score (MOS). For obvious reasons it is not feasible to do this each time a quality rating of a video is needed, and therefore objective quality metrics that mimic MOS are required. The search for good VQMs is a field of research of its own, Video Quality Assessment (VQA), and the evaluation of such metrics has been standardized by the Video Quality Experts Group (VQEG) [2].

There are different purposes for VQMs; from a digital perspective, the main application is to optimize trade-offs between resources and visual output quality, regardless of whether the content is streamed or stored. Another use is the restoration of damaged video. Today there exist several good VQMs, such as the Structural Similarity Index (SSIM) [3] and Visual Information Fidelity (VIF) [4], but the methods implemented in today's encoders are often Peak Signal to Noise Ratio (PSNR), Mean Squared Error (MSE) or absolute error. This is because of their mathematical convenience and low computational cost, despite the fact that they do not perform as well [5].


It is, however, worth noting that as late as 2000, when VQEG examined the then state-of-the-art methods, none of them were statistically significantly better than PSNR [6].

VQA is also closely related to image quality assessment, a field in which better results have been presented throughout the years. It has even been argued that the problem of image quality assessment has been solved [7]. Since a video is a series of images (frames), all image quality metrics can be applied to individual frames and thus form the basis of a video quality metric. The opposite is in general not possible, since VQA takes into account temporal as well as spatial masking effects.

The rest of this thesis is organized as follows: first, the background to the subject is presented in Section 1.1. The theory the thesis builds upon is then presented in Chapter 2. Thereafter, three main approaches to VQA measures are presented in Chapter 3, together with a description of how they were evaluated. In Chapter 4 the results are presented, and in Chapters 5 and 6 the results are discussed and suggestions for future research are made.

1.2 Limitations

This thesis is limited to the assessment of VQMs for the purpose of video compression. This means that the work and the assumptions made have been done with the intention of solving the problem for compression distortions, rather than the general video quality problem, i.e. a VQM that handles all kinds of distortions well. Within the VQA research field, it is often the general problem that is attempted. This distinguishes the results presented in this report from the results in many of the publications in this area.


Chapter 2

Theory

2.1 Digital Video

A digital video is a set of frames (images) that, played sequentially, simulate motion. It is typically a projection of a three-dimensional scene onto two dimensions. Digitally, a frame is represented as a two-dimensional grid/matrix, where each element, a pixel, represents a colour. The number of pixels in each dimension is the resolution of the video. Examples of resolutions are 480p (480 × 856), 720p (720 × 1280) and 1080p (1080 × 1920). The colour can be represented in different ways: if the frame is black and white, the element has only one value; if the frame is coloured, the colour of the pixel is represented as three intensity values. These three values can have different representations depending on the colour space used. Each colour space spans its own 3D space, where each colour has a unique set of coordinates. One way of compressing an image/video is to reduce the number of bits assigned to each colour value. For example, an 8 times compression is obtained by using three 8-bit values to store a pixel compared to three 64-bit values¹.

2.1.1 RGB Colour Space

The RGB colour space is the most commonly used within computers. In the RGB colour space the colour of a pixel is represented by its primary spectral components of Red, Green and Blue. For example, in this colour space the colour red has the coordinates (1, 0, 0), black (0, 0, 0) and white (1, 1, 1), assuming normalized values.

2.1.2 YCbCr Colour Space

In the YCbCr colour space, a pixel is represented by its luma (Y) and chroma (Cb and Cr) values. The chroma values are the blue (Cb) and red (Cr) differences from the luma value, whereas the Y value is the intensity of the pixel. In other words, displaying only the Y component of an image displays the image in grey scale. The advantage of this representation is that the eye is more sensitive to changes in intensity, and therefore it is common to use fewer bits to store the chroma values. This is done by encoding videos in formats such as 4:2:2 or 4:2:0, in which neighbouring pixels share chroma (not intensity) values to various extents. Historically this format was also useful in the transition from black-and-white to colour TV, since it only required the chroma channels to be "added" to the signal.

¹ The difference between an unsigned char and a double value in C++ on a 64-bit machine.


2.1.3 CIELAB/S-CIELAB Colour Space

CIELAB [8] is a colour space defined by the International Commission on Illumination [9], and it represents a colour value in three components: Lightness (L), red/green difference (a), and blue/yellow difference (b). This representation mimics the colour perception of the human eye in the sense that the Euclidean distance between two points in the CIELAB colour space corresponds to the perceived colour difference between them. The S-CIELAB [10] colour space is a Spatial extension of the CIELAB colour space. Essentially, each colour channel is filtered with different Gaussian kernels to mimic blur effects related to viewing distance; the input to the filtering process is the number of pixels per viewing angle, which can be calculated if the screen resolution, size and viewing distance are known.

2.2 Video Quality Metrics

This section contains a brief explanation of the Visual Information Fidelity (VIF) and PSNR VQMs, upon which the methods and results presented in this report are built and with which they are compared. For information about the other VQMs discussed, see [11] (MOVIE) and [3] (SSIM).

2.2.1 Visual Information Fidelity

The Visual Information Fidelity (VIF) is a VQA model, also applicable to images, presented by Hamid Rahim Sheikh and Alan C. Bovik [4]. The model is based on Natural Scene Statistics [12] and models the reference and test image as a Gaussian Scale Mixture. The metric has two parts: the quantification of how much information can be extracted from the original image, and the quantification of how much of this information can be retrieved from the distorted image. The resulting metric is a ratio, generally in the range [0, 1], with the possibility of taking values > 1 if the distorted image contains more information, e.g. from contrast enhancement. First, the images are decomposed into wavelet sub-bands, then the sub-bands are divided into blocks, and for each block in each sub-band, the mutual information [13, p. 19 ff.] is calculated. The ratio between the sums of the mutual information over all blocks of all sub-bands of the two frames is then calculated, and the average of this ratio over all frames in the video is the video quality:

$$\mathrm{VIF}(I_r, I_d) = \frac{\displaystyle\sum_k \sum_b \log_2\!\left(1 + \frac{g_{k,b}^2 (\sigma^r_{k,b})^2}{(\sigma^d_{k,b})^2 - g_{k,b}^2 (\sigma^r_{k,b})^2 + \sigma_N^2}\right)}{\displaystyle\sum_k \sum_b \log_2\!\left(1 + \frac{(\sigma^r_{k,b})^2}{\sigma_N^2}\right)} \qquad (2.1)$$

where $k$ and $b$ denote sub-bands and blocks, $g_{k,b} = \sigma^{r,d}_{k,b}/(\sigma^r_{k,b})^2$, where $\sigma^{r,d}_{k,b}$ is the covariance between the original and test image in block $b$ of sub-band $k$, $\sigma^r_{k,b}$ and $\sigma^d_{k,b}$ denote the respective standard deviations, and $\sigma_N$ is a parameter to model the HVS. The quality of a video with $N$ frames is

$$Q_{\text{video}} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{VIF}(I_{r,i}, I_{d,i}) \qquad (2.2)$$


2.2.2 Block-Based VIF

It has been discovered at Ericsson that VIF is more accurate (SROCC improvement of 0.0015, see Table 4.9) if the sampling of sub-bands in the numerator of Equation 2.1 is disregarded. This technique has been named Block-Based VIF (BB-VIF), although in practice the difference is not very big, and it can be written as

$$\mathrm{VIF}(I_r, I_d) = \frac{1}{B} \sum_{b=1}^{B} \frac{\log_2\!\left(1 + \frac{g_b^2 (\sigma^r_b)^2}{(\sigma^d_b)^2 - g_b^2 (\sigma^r_b)^2 + \sigma_N^2}\right)}{\displaystyle\sum_k \sum_b \log_2\!\left(1 + \frac{(\sigma^r_{k,b})^2}{\sigma_N^2}\right)} \qquad (2.3)$$

with functions and indexing as in Equation 2.1. An example of the BB-VIF values can be seen in Figure 2.1, and an overview of the values is displayed in Figure 2.2. The volatility of the individual values can be noted. This block-based approach also has the purpose of simulating optimization at the block level, which is how it is done in an encoder, and of allowing VIF to be combined with other algorithms.
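To make the per-block quantities concrete, the following C++ sketch computes the numerator term of Equation 2.3 for a single block, assuming 32-bit float grayscale patches and using OpenCV for the block statistics. The wavelet sub-band decomposition of the full metric is omitted, and the value of sigmaN2 is a placeholder assumption, not the constant used in the thesis implementation.

    #include <opencv2/core/core.hpp>
    #include <cmath>

    // Sketch: the per-block numerator term of Equation 2.3. ref and dst are
    // CV_32F patches from the same position in the reference and distorted
    // frames; sigmaN2 stands in for sigma_N^2 (placeholder value).
    double blockVifTerm(const cv::Mat& ref, const cv::Mat& dst, double sigmaN2)
    {
        cv::Scalar mr, sr, md, sd;
        cv::meanStdDev(ref, mr, sr);   // block mean and std of the reference
        cv::meanStdDev(dst, md, sd);   // block mean and std of the distorted
        double varR = sr[0] * sr[0];
        double varD = sd[0] * sd[0];

        // Covariance sigma_rd = E[r*d] - E[r]*E[d].
        double covRD = cv::mean(ref.mul(dst))[0] - mr[0] * md[0];

        double g = (varR > 1e-10) ? covRD / varR : 0.0;  // g_b = sigma_rd / sigma_r^2
        double denom = varD - g * g * varR + sigmaN2;    // as in Equation 2.3
        if (denom < 1e-10) denom = 1e-10;                // numerical guard (added here)

        return std::log2(1.0 + g * g * varR / denom);
    }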

Figure 2.1: Illustration of the block-based VIF values (left) for a compressed frame (right). The BB-VIF value of this frame is 0.1791.

Figure 2.2: Graph of the data in the left image in Figure 2.1, sorted by the VIF value of each block to illustrate the volatility of the individual values.


2.2.3 Peak Signal to Noise Ratio

Peak Signal to Noise Ratio (PSNR) is a mathematical measure used to estimate the error between an original and a distorted data set, typically used in image and video processing. The measure is based on the Mean Squared Error (MSE) between the two. The MSE for an image $I$ of size $M \times N$, with a noisy approximation $K$, is

$$\mathrm{MSE} = \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \left(I(i,j) - K(i,j)\right)^2. \qquad (2.4)$$

PSNR is in turn defined as

$$\mathrm{PSNR} = 10 \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) \qquad (2.5)$$

where $\mathrm{MAX}_I$ is the maximum value of a pixel; for example, with 8-bit representation $\mathrm{MAX}_I = 255$. The output value is in decibels. For colour images the PSNR is calculated on all colour channels, over which a weighted average of some kind can be applied, often with more importance given to the luminance channel, since the HVS is more sensitive to distortions in that channel.
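A minimal sketch of Equations 2.4 and 2.5 for a single 8-bit grayscale frame pair, using OpenCV matrices. The function name and the infinite-PSNR convention for identical frames are illustrative choices, not taken from the thesis code.

    #include <opencv2/core/core.hpp>
    #include <cmath>
    #include <limits>

    // Sketch: MSE (Equation 2.4) and PSNR (Equation 2.5) for two 8-bit,
    // single-channel frames of equal size.
    double psnr(const cv::Mat& I, const cv::Mat& K, double maxI = 255.0)
    {
        cv::Mat diff;
        cv::absdiff(I, K, diff);        // |I - K|, still 8-bit
        diff.convertTo(diff, CV_64F);   // widen before squaring to avoid overflow
        double mse = cv::mean(diff.mul(diff))[0];
        if (mse <= 0.0)                 // identical frames: PSNR is unbounded
            return std::numeric_limits<double>::infinity();
        return 10.0 * std::log10(maxI * maxI / mse);
    }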

2.3 Evaluation of Video Quality Metrics

As mentioned in Chapter 1, a VQA method is measured by its correlation with MOS. The VQA method is applied to distorted video clips (with known MOS) with a metric value as output. Then, the correlation between the metric values and the MOS is calculated, using two different correlation coefficients: Pearson's Correlation Coefficient (PCC) and Spearman's Rank Order Correlation Coefficient (SROCC). PCC is a measure of how well two sets of data can be seen as linearly dependent, based on their covariance and standard deviations. SROCC is a measure of how well the dependence between two variables can be described by a monotonic function. Within this thesis, metrics have mainly been evaluated by their SROCC after a curve fit to a logistic function [14, p. 161 ff.]. To acquire a test set of distorted video clips, one can either download an existing one from a public database, such as the LIVE Database [15], or create one. Given a set of distorted video clips, their MOS are determined using a group of test persons who rate the quality of the videos in the test set, a process that has been standardized (viewing distances, angles, etc.) by the International Telecommunication Union (ITU); an example of such a recommendation is ITU-T Rec. P.910 [16].

2.3.1 Pearson’s Correlation Coefficient

The PCC, often simply referred to as $\rho$ in statistics since it is so commonly used, is defined as

$$\mathrm{PCC} = \frac{\mathrm{COV}(X,Y)}{\sigma_X \sigma_Y} \qquad (2.6)$$

where the covariance between two data sets is

$$\mathrm{COV}(X,Y) = \langle (X - E(X))(Y - E(Y)) \rangle = \frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y}) \qquad (2.7)$$

and $\sigma$ is the standard deviation, defined as

$$\sigma_X = \sqrt{\mathrm{COV}(X,X)}. \qquad (2.8)$$

2.3.2 Spearman’s Rank Order Correlation Coefficient

The SROCC, $r_s$, between two variables $X$ and $Y$, with $N$ paired samples of each, is defined as

$$r_s = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)} \qquad (2.9)$$

where $d_i$ is the difference in statistical rank between samples $x_i$ and $y_i$. For example, if $x_i$ is the 5th largest $X$-sample and $y_i$ is the 4th largest $Y$-sample, $d_i = 1$.
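Both coefficients are straightforward to compute; the sketch below implements Equations 2.6–2.9 for paired samples, e.g. metric values against MOS. Tie handling is omitted for brevity; with tied samples, fractional ranks should be used instead.

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    // Helper: 1-based statistical ranks of the samples in v (ties ignored).
    static std::vector<double> ranks(const std::vector<double>& v)
    {
        std::vector<size_t> idx(v.size());
        std::iota(idx.begin(), idx.end(), 0);
        std::sort(idx.begin(), idx.end(),
                  [&](size_t a, size_t b) { return v[a] < v[b]; });
        std::vector<double> r(v.size());
        for (size_t rank = 0; rank < idx.size(); ++rank)
            r[idx[rank]] = static_cast<double>(rank + 1);
        return r;
    }

    // PCC (Equation 2.6) for paired samples x and y.
    double pcc(const std::vector<double>& x, const std::vector<double>& y)
    {
        const size_t n = x.size();
        double mx = std::accumulate(x.begin(), x.end(), 0.0) / n;
        double my = std::accumulate(y.begin(), y.end(), 0.0) / n;
        double cov = 0.0, vx = 0.0, vy = 0.0;
        for (size_t i = 0; i < n; ++i) {
            cov += (x[i] - mx) * (y[i] - my);
            vx  += (x[i] - mx) * (x[i] - mx);
            vy  += (y[i] - my) * (y[i] - my);
        }
        return cov / std::sqrt(vx * vy);
    }

    // SROCC (Equation 2.9): rank the samples and sum squared rank differences.
    double srocc(const std::vector<double>& x, const std::vector<double>& y)
    {
        std::vector<double> rx = ranks(x), ry = ranks(y);
        const double n = static_cast<double>(x.size());
        double sumD2 = 0.0;
        for (size_t i = 0; i < x.size(); ++i)
            sumD2 += (rx[i] - ry[i]) * (rx[i] - ry[i]);
        return 1.0 - 6.0 * sumD2 / (n * (n * n - 1.0));
    }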

2.4 Quaternions

This section briefly presents quaternions and the Discrete Quaternion Fourier Transform (DQFT), to provide the theory that is the basis for saliency maps, presented in Section 3.3. Quaternions are a subclass of the hypercomplex numbers (of which the complex numbers $a + bi$ are another subclass), and are used in both applied and theoretical mathematics. This includes image and video processing, as the three imaginary parts allow colour pixels to be represented as one unit, which in turn is convenient for, among other things, filtering operations.

A quaternion $q \in \mathbb{H}$ can be written as $q = a + bi + cj + dk$, where $a, b, c, d \in \mathbb{R}$. Here $i$, $j$ and $k$ are often referred to as the complex components, and they have the properties that $ij = k$, $jk = i$, $ji = -k$ and so on. Furthermore,

$$i^2 = j^2 = k^2 = ijk = -1, \qquad (2.10)$$

which makes the square roots of $-1$ form a unit sphere in the 3-dimensional imaginary space. The inner product of two quaternions $q, p \in \mathbb{H}$ is

$$\langle q, p \rangle = a_q a_p + b_q b_p + c_q c_p + d_q d_p \qquad (2.11)$$

and the absolute value is $|q| = \sqrt{\langle q, q \rangle}$. It should also be noted that a quaternion can be written in polar form using Euler's formula, which generalizes to hypercomplex numbers,

$$q = |q| e^{\mu\Phi} = |q|(\cos\Phi + \mu \sin\Phi) \qquad (2.12)$$

where $\mu$ is a unit pure quaternion (only complex components, with $|\mu| = 1$). The angle $\Phi$ is sometimes referred to as the phase of $q$.

2.4.1 Quaternion Fourier Transform

The Fourier transform also generalizes to quaternions (QFT). For a 2D array $f = f(m,n)$ of size $M \times N$, the two-dimensional Discrete Quaternion Fourier Transform (DQFT) is defined [17] as

$$F[u,v] = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-\mu 2\pi((mv/M) + (nu/N))} f(m,n) \qquad (2.13)$$

and the inverse transform is consequently defined as

$$f[m,n] = \frac{1}{\sqrt{MN}} \sum_{u=0}^{M-1} \sum_{v=0}^{N-1} e^{\mu 2\pi((mv/M) + (nu/N))} F(u,v), \qquad (2.14)$$

where $\mu$ is a unit pure quaternion. Furthermore, a quaternion $q = a + bi + cj + dk$ can be rewritten as

$$q = A + Bj \qquad (2.15)$$

where $A = a + bi$ and $B = c + di$. The generalization also holds for $q = a + b'\mu_1 + c'\mu_2 + d'\mu_3$, where $\mu_1$, $\mu_2$ and $\mu_3$ form a normalized orthogonal space. Since the Fourier transform is a linear operation, the DQFT can be rewritten as

$$F[u,v] = F_1[u,v] + F_2[u,v]\,\mu_2 \qquad (2.16)$$

where

$$F_i[u,v] = \frac{1}{\sqrt{MN}} \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} e^{-\mu 2\pi((mv/M) + (nu/N))} f_i(m,n), \qquad (2.17)$$

$i \in \{1, 2\}$ and $f(m,n) = f_1(m,n) + f_2(m,n)\mu_2$. In practice, the benefit of this rewrite is that existing 2D FFT software can be used to compute DQFTs.
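The following sketch illustrates the decomposition of Equations 2.15–2.17 in the special case μ = i and μ₂ = j, so that each half can be handled by an ordinary complex 2D DFT. The thesis implementation used FFTW; cv::dft is substituted here purely for illustration, and the 1/√(MN) normalization of Equation 2.13 is omitted (OpenCV's DFT is unnormalized).

    #include <opencv2/core/core.hpp>

    // Sketch: DQFT via two complex DFTs (Equations 2.15-2.17), with mu = i and
    // mu2 = j. a, b, c, d are the four quaternion component planes (CV_32F,
    // equal size). The quaternion spectrum is F = F1 + F2 * j, i.e. its four
    // components are (Re F1, Im F1, Re F2, Im F2).
    void dqft(const cv::Mat& a, const cv::Mat& b,
              const cv::Mat& c, const cv::Mat& d,
              cv::Mat& F1, cv::Mat& F2)
    {
        cv::Mat f1, f2;
        cv::Mat planes1[] = { a, b };   // complex plane f1 = a + b*i
        cv::Mat planes2[] = { c, d };   // complex plane f2 = c + d*i
        cv::merge(planes1, 2, f1);
        cv::merge(planes2, 2, f2);
        cv::dft(f1, F1);                // 2-channel input: full complex DFT
        cv::dft(f2, F2);
    }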


Chapter 3

Method

This chapter presents the methods used in this report. Sections 3.2 to 3.6 describe different features (to mimic spatial and temporal masking) and their implementations, and Sections 3.7 and 3.9 describe two VQMs and how they are implemented. These two metrics, together with BB-VIF (Section 2.2.2), have been implemented and tested with different combinations of the features mentioned, and the results for the different combinations are displayed in Chapter 4. Because other features are used together with BB-VIF with Optical Flow weighting, this metric is presented by itself in Section 3.8. The implementation of the metrics has been done in C++ using Visual Studio 2012. An existing Visual Studio project (application), MetricApp, was available, in which VQMs were already implemented as separate classes. For this application, a Python [18] framework had been implemented that calculates the sought correlations for an implemented VQM. The specifications for the computer and software libraries used to perform the simulations presented in Chapter 4 were:

• OS Windows 7 Enterprise

• Processor 2.00GHz CPU E5-2620 (6 cores and 12 threads)

• Memory 8 GB

• OpenCV 2.4.9 [19]

• FFTW 3.3.4 [20]

3.1 OpenCV

OpenCV (Open Source Computer Vision Library) [19] is an "open source computer vision and machine learning software library", which contains optimized implementations of a wide range of image processing algorithms, many of which are state of the art within their respective areas. Since it is released under a BSD licence [21], which makes it "free for both academic and commercial use", it is the go-to library for image processing. OpenCV has C++, C, Python and Java interfaces, and supports Windows, Linux and Mac OS, as well as iOS and Android for cellphones/tablets.

3.2 Temporal Content

To measure the temporal content of a frame, the Optical Flow, the OpenCV implementation of "Two-Frame Motion Estimation Based on Polynomial Expansion" [22], the function calcOpticalFlowFarneback, was used. The constants used for the function call were

    calcOpticalFlowFarneback(I(i), I(i+1), OpticalFlow, 0.5, 3, 15, 3, 5, 1.2, 0)


where I(i) is the i-th frame in the video and OpticalFlow is the output containing the x and y components of the Optical Flow. The remaining inputs are constants, chosen so that the Optical Flow field is calculated with good precision, enough to disregard it as a major source of error. For more information on these parameters, see the OpenCV documentation [23].
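Put together, the flow computation for one frame pair looks roughly as follows. The BGR input format and the grayscale conversion are assumptions about the surrounding code, not something specified in the thesis.

    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>
    #include <opencv2/video/tracking.hpp>

    // Sketch: dense Optical Flow between consecutive frames with the constants
    // quoted above. calcOpticalFlowFarneback expects single-channel input.
    cv::Mat denseFlow(const cv::Mat& frameA, const cv::Mat& frameB)
    {
        cv::Mat grayA, grayB, flow;
        cv::cvtColor(frameA, grayA, CV_BGR2GRAY);
        cv::cvtColor(frameB, grayB, CV_BGR2GRAY);
        // flow becomes a CV_32FC2 map: per-pixel (x, y) displacement.
        cv::calcOpticalFlowFarneback(grayA, grayB, flow,
                                     0.5,   // pyr_scale: scale between pyramid levels
                                     3,     // levels: number of pyramid levels
                                     15,    // winsize: averaging window size
                                     3,     // iterations per pyramid level
                                     5,     // poly_n: neighbourhood for polynomial fit
                                     1.2,   // poly_sigma: Gaussian std for the fit
                                     0);    // flags
        return flow;
    }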

3.3 Saliency Map

In [24], it is proposed that a saliency map can be utilized as part of a temporal weighting scheme. Such a saliency map was introduced using the DQFT, see Section 2.4, with a more explicit explanation in [25]. The saliency map (for one frame) is constructed by performing a 2D DQFT on the frame, represented in quaternion form. Each pixel $(m,n)$ is represented in quaternion form using the optical flow components,

$$q(m,n) = l(m,n) + PE(m,n)\mu_1 + MV_x(m,n)\mu_2 + MV_y(m,n)\mu_3 \qquad (3.1)$$

where $l$ is the luminance component of pixel $(m,n)$, and $MV_x$ and $MV_y$ are the respective Optical Flow components. In the original implementation, $PE$ is the prediction error. However, the Optical Flow implementation [22] used here does not provide such estimations, and consequently this term has been set to zero. This is a source of error that has not been evaluated. The quaternion is then rewritten in the form presented in Equation 2.15, $q = A + B\mu_2$. The DQFT is then performed on each of the components $A$ and $B$ separately, as in Equation 2.16. From the resulting frequency response, the phase spectrum $p(u,v)$, $\Phi$ in Equation 2.12, is extracted, as it contains sufficient information to obtain a saliency map [26]. The IDQFT is then applied to the exponential of the phase spectrum, $e^{\mu_1 p(u,v)}$, and finally a 2D Gaussian filtering is performed on the squared absolute value of the output to create the saliency map. The standard deviation of this Gaussian filter is 8, as proposed in [26]. For an image $I = I(l, MV_x, MV_y)$ the process can be written as

$$\begin{aligned} f &= F(I) \\ p &= \mathrm{phase}(f) \\ SA &= g * \left\| F^{-1}\!\left(e^{\mu_1 p}\right) \right\|^2 \end{aligned} \qquad (3.2)$$

where $g$ is a 2D Gaussian kernel. Block-based weights were computed by first averaging all saliency values within each block and then remapping them from $[SA_{\min}, SA_{\max}]$ (the max and min values over all blocks) to an interval $[\beta_l, \beta_t]$ with the formula

$$\text{new value} = \beta_l + (\beta_t - \beta_l)\,\frac{\text{old value} - SA_{\min}}{SA_{\max} - SA_{\min}}. \qquad (3.3)$$

This rescaling is performed to prevent the saliency weights from being close to zero, and the block-based averaging makes the algorithm more robust to outliers. An example of a saliency map and the corresponding grey-scaled image is seen in Figure 3.1. The actual DQFT and IDQFT were implemented using the FFTW library [20], which is a C library available under the GNU General Public License [27], by implementing the steps described in [17] using the complex DFT provided in the FFTW library.
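The block-weight remapping of Equation 3.3 is a plain linear rescale; a sketch follows. The flat-map fallback for SA_max = SA_min is an added guard, not discussed in the thesis.

    #include <algorithm>
    #include <vector>

    // Sketch: block-wise saliency weights per Equation 3.3. Input is the
    // average saliency of each 16x16 block; values are remapped linearly from
    // [SAmin, SAmax] to [betaL, betaT] (e.g. [1, 5]) so that no weight is
    // close to zero.
    std::vector<double> remapSaliency(std::vector<double> sa,
                                      double betaL, double betaT)
    {
        const double saMin = *std::min_element(sa.begin(), sa.end());
        const double saMax = *std::max_element(sa.begin(), sa.end());
        const double range = saMax - saMin;
        for (size_t i = 0; i < sa.size(); ++i)
            sa[i] = (range > 0.0)
                  ? betaL + (betaT - betaL) * (sa[i] - saMin) / range
                  : betaL;   // flat map: all blocks equally salient
        return sa;
    }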


Figure 3.1: Illustration of a saliency map, with block-based saliency values (left) for a video frame (right). The saliency values have been mapped from $[SA_{\min}, SA_{\max}]$ to $[1, 5]$.

3.4 Fovea filtering

In [28], a method of integrating perceptual aspects into VQA measures is presented. One of these aspects is modelled with a spatial filter, a fovea filter, which mimics properties of the fovea (the part of the eye where the rods and cones are located). These properties reflect that the HVS cannot see an entire frame at once; thus, depending on where the eyes focus, not all distortions can be seen at once, and the attention to distortions decreases further away from the centre of focus. After filtering a frame, the output is, for each pixel, a quality measure given that the focus is on that pixel. In [28], the quality metric for the frame is the smallest quality value over all pixels in the frame, whereas an average over all blocks is implemented here, because the underlying quality metric implemented together with the fovea filter is block based. The filter is radially symmetric, based on a cone density function which is a least-squares curve fit to the cone density in different directions of the eye. The function is defined as

$$f(r) = \begin{cases} 10^{\,a_1 r^3 + a_2 r^2 + a_3 r + a_4} & \text{if } r > 0.16 \\ 200000 & \text{else} \end{cases} \qquad (3.4)$$

where $r$ is measured in degrees of viewing angle. The constants are defined as

$$a_1 = -0.00891, \quad a_2 = 0.107, \quad a_3 = -0.519, \quad a_4 = 5.30. \qquad (3.5)$$

The resulting function can be seen in Figure 3.2. $r$, the degrees of viewing angle, is calculated using trigonometry. This is possible since the viewing distance is defined in the documentation for the video test sets used (see Section 3.10) as 3 times the screen height. Assuming that the eyes of the viewer are placed opposite the centre of the screen, the viewing-angle span from the bottom to the top of the screen is two times the angle from the centre of the screen to the top. That angle is calculated from the inverse tangent in the triangle with sides $H/2$ (opposite) and $3H$ (adjacent),

$$2\arctan\!\left(\frac{H/2}{3H}\right) = 2\arctan\!\left(\frac{1}{6}\right) = 18.9246^{\circ} \qquad (3.6)$$

from which the viewing angle per pixel can be calculated for a given screen resolution and used to construct the filter. Note that since $f(r)$ decreases relatively fast, the filter itself is smaller than the resolution of the assessed video. As mentioned, the underlying quality metric is block based, and thus the filter is transformed into one with a single value per block, to make it applicable to the block-based values. This is done by creating a new block-based filter, $F_b$, with one element per $16 \times 16$ block covered by the pixel-based filter, whose values are

$$F_{b,k} = \max\left(F_p \in \text{block } k\right) \quad \forall k \qquad (3.7)$$

where $F_p$ is the pixel-based filter described previously. Furthermore, $F_b$ is normalized so that

$$\sum_{i=1}^{|F_b|} F_b(i) = 1. \qquad (3.8)$$

The normalization is performed to maintain the mathematical property of the underlying quality metric that if the distorted and original videos are identical, the output metric value is 1.
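A sketch of the curve fit of Equations 3.4–3.5 and the degrees-per-pixel step of Equation 3.6. The function names are illustrative, and the 3H viewing distance is the one taken from the test-set documentation cited above.

    #include <cmath>

    // Sketch: the cone-density curve fit of Equations 3.4-3.5. r is the
    // eccentricity in degrees of viewing angle.
    double foveaDensity(double r)
    {
        const double a1 = -0.00891, a2 = 0.107, a3 = -0.519, a4 = 5.30;
        if (r > 0.16)
            return std::pow(10.0, a1 * r * r * r + a2 * r * r + a3 * r + a4);
        return 200000.0;   // plateau at the centre of the fovea
    }

    // Degrees of viewing angle per pixel for a screen of heightPixels rows,
    // viewed at 3 screen heights (Equation 3.6).
    double degreesPerPixel(int heightPixels)
    {
        const double pi = 3.14159265358979323846;
        const double span = 2.0 * std::atan(1.0 / 6.0) * 180.0 / pi; // ~18.92 deg
        return span / heightPixels;
    }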

Figure 3.2: The fovea filter function $f(r_p)$ as a function of the viewing distance per pixel, $r_p$ (left), and its block-wise, normalized implementation, $f(r_b)$ (right), where $r_b$ is the viewing distance per block (16 × 16 pixels).

3.5 Temporal Hysteresis Model

Another subject of examination is how the HVS handles variations in quality over time. Most VQA measures only account for temporal factors within frames, for example weighting based on Optical Flow. As discussed more extensively in [29], temporal weighting is also performed by the HVS over time. An illustration of this is that exposure to bad quality, for example from packet loss when streaming video content, will be remembered and reduce the experienced quality for a time: a hysteresis effect. This behaviour is not modelled when the video quality is evaluated as an average of the frame-wise values for all frames in a video. Therefore, the temporal hysteresis model presented in [29] was implemented and examined.

Based on empirical studies of time-varying quality experiences, [29] proposes a temporal pooling strategy. The pooling has two components: a memory component, $x$, and a current quality component, $y$. The memory component is defined recursively for each time instant $t_i$, using quality scores $Q(t)$ from the previous $\tau$ seconds,

$$x(t_1) = Q(t_1) \qquad (3.9)$$
$$x(t_i) = \min[Q(t)], \quad t \in \{\max[t_1, t_i - \tau], \ldots, t_i\}. \qquad (3.10)$$

This effectively sets the memory value to the minimum of the quality values from the previous $\tau$ seconds for which quality values exist. The current quality component at time $t_i$ is constructed using the quality values of the frames in the coming $\tau$ seconds. To model the effect of viewers responding strongly to drops in quality, the quality values from the time frame $[t_i, t_i + \tau]$ are sorted in ascending order and then weighted using half a Gaussian weighting function (the descending half). The standard deviation of this Gaussian function is set to $(2K - 1)/12$, where $K$ is the number of elements in the weighting function, i.e. $\tau \cdot$ frame rate. An illustration of this function with $K = 200$ can be seen in Figure 3.3. The algorithm used to calculate the current quality value can be written as

$$\vec{v} = \mathrm{sort}(Q(t)), \quad t \in \{t_i, \ldots, \min(t_i + \tau, T)\} \qquad (3.11)$$
$$y(t_i) = \sum_{k=1}^{K} v_k w_k \qquad (3.12)$$

where $T$ is the length of the video. The final metric value is obtained by a linear combination of the memory and current quality components, controlled by a variable $\alpha$, $0 < \alpha < 1$:

$$Q'(t_i) = \alpha y(t_i) + (1 - \alpha) x(t_i) \qquad (3.13)$$
$$Q_{\text{video}} = \frac{1}{T} \sum_t Q'(t) \qquad (3.14)$$

The constants $\alpha$ and $\tau$ are set to 0.8 and 2 seconds respectively, as in [29].
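A sketch of the pooling of Equations 3.9–3.14 over a vector of per-frame quality scores. Window truncation at the ends of the clip is handled crudely here (the tail weights are simply left unrenormalized), a detail the thesis does not specify.

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Sketch: temporal hysteresis pooling (Equations 3.9-3.14). Q holds one
    // quality score per frame; tau = 2 s and alpha = 0.8 as in [29]. The
    // descending half-Gaussian weights (std (2K-1)/12) mirror Figure 3.3.
    double hysteresisPool(const std::vector<double>& Q, double frameRate,
                          double tau = 2.0, double alpha = 0.8)
    {
        const int N = static_cast<int>(Q.size());
        const int K = static_cast<int>(tau * frameRate);   // frames per window
        std::vector<double> w(K);
        const double sigma = (2.0 * K - 1.0) / 12.0;
        double wSum = 0.0;
        for (int k = 0; k < K; ++k) {
            w[k] = std::exp(-0.5 * (k / sigma) * (k / sigma));
            wSum += w[k];
        }
        for (int k = 0; k < K; ++k) w[k] /= wSum;          // weights sum to 1

        double total = 0.0;
        for (int i = 0; i < N; ++i) {
            // Memory component: worst quality over the previous tau seconds.
            int lo = std::max(0, i - K);
            double x = *std::min_element(Q.begin() + lo, Q.begin() + i + 1);
            // Current component: ascending sort of the coming tau seconds,
            // so the lowest scores receive the largest weights.
            int hi = std::min(N, i + K);
            std::vector<double> v(Q.begin() + i, Q.begin() + hi);
            std::sort(v.begin(), v.end());
            double y = 0.0;
            for (size_t k = 0; k < v.size(); ++k) y += v[k] * w[k];
            total += alpha * y + (1.0 - alpha) * x;        // Equation 3.13
        }
        return total / N;                                  // Equation 3.14
    }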

3.6 Intra-frame Gaussian Weighting

As suggested in [28] (and implicitly whenever weighting is suggested), not all parts of a frame are equally important when mimicking the HVS. One property of the HVS is that more attention is paid to things that stand out; in the context of video compression, an example is pixels/blocks with extraordinarily high distortion. An attempt to mimic these HVS properties was implemented by weighting the blocks within a frame using a Gaussian filter, inspired by the filtering in Section 3.5. The purpose is to give more importance to highly distorted blocks, under the assumption that blocks with high quality are unable to compensate for blocks with low quality. This was implemented by, for each frame in the assessed video, storing the block-based distortion values, sorting them in descending order,² and then weighting them with half a normalized Gaussian function (the descending part) with a given standard deviation $\alpha$. This can be written as

$$Q_{\text{frame}} = \sum_{k \in I} w_k \cdot d_k \qquad (3.15)$$

² Ascending order when high values correspond to low distortion.


Figure 3.3: Illustration of the weighting function used for the Temporal Hysteresis Model. $K = 200$, which corresponds to a frame rate of 50 when $\tau = 2$ seconds.

where $Q_{\text{frame}}$ is the quality value for a frame and $d_k$ is the distortion value for block $k$ in image $I$. The $d_k$ values are sorted in descending order, $d_k \geq d_{k+1}$, so that the most weight is given to blocks with low quality; $w_k$ is the corresponding Gaussian weight. The Gaussian sum has the properties that $\sum_{k}^{K} w_k = 1$ and $w_k > w_{k+1}$. The standard deviation of this Gaussian function, which was subject to experiments, inspired by the standard deviation of the Gaussian function in Section 3.5, was finally set to $2K - 1$. An illustration of this function for $K = 1600$, $3600$ and $8100$ is seen in Figure 3.4. These $K$ values correspond to the number of $16 \times 16$ blocks in frames of resolution 480p, 720p and 1080p.
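A sketch of Equation 3.15 with the half-Gaussian weights described above (σ = 2K − 1). The input is one frame's block distortion values, assumed here to be on a "higher is worse" scale; see the footnote above for quality-oriented measures.

    #include <algorithm>
    #include <cmath>
    #include <functional>
    #include <vector>

    // Sketch: intra-frame Gaussian weighting (Equation 3.15). Block distortion
    // values are sorted in descending order and weighted with the descending
    // half of a normalized Gaussian, so the most distorted blocks dominate.
    double weightedFrameScore(std::vector<double> d)
    {
        const size_t K = d.size();
        std::sort(d.begin(), d.end(), std::greater<double>());
        const double sigma = 2.0 * K - 1.0;
        std::vector<double> w(K);
        double wSum = 0.0;
        for (size_t k = 0; k < K; ++k) {
            w[k] = std::exp(-0.5 * (k / sigma) * (k / sigma));
            wSum += w[k];
        }
        double q = 0.0;
        for (size_t k = 0; k < K; ++k)
            q += d[k] * (w[k] / wSum);   // normalized so the weights sum to 1
        return q;
    }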

3.7 Xu-et-al-method

This VQA measure is based on the method presented in [24]. It has not (to the author's knowledge) been given a name, and the version presented here is referred to as the "Xu-et-al-method". The luminance (Y) channel of each frame in the YCbCr format (see Section 2.1.2) is used, i.e. the colour channels are disregarded. The input to the measure is an original and a distorted (compressed) video clip, from which the Y channels are extracted. The output is the average of the image quality over all frames ($Q_{\text{frame}}$) in the video. The quality of the distorted image is a weighted MSE, computed at block level, using blocks of $16 \times 16$ pixels. For each block $k$ in a frame, a weight, $w_k$, is calculated using information from the original frame, which is multiplied with the MSE of the block to form a quality measure, $Q_k$. The average of $Q_k$ over all blocks in the frame is the quality estimation of that frame,


Figure 3.4: Illustration of the normalized weighting vector used for intra-frame Gaussian filtering, with standard deviation $\alpha = 2K - 1$, for $K = 1600$, $3600$ and $8100$.

$$Q_{\text{frame}} = \frac{1}{N} \sum_{k}^{N} \mathrm{MSE}_k \cdot w_k \qquad (3.16)$$

The block weight consists of a spatial and two temporal components. The spatial weighting factor, $SP_{w_k}$, is constructed using the spatial gradients, $I_x$ and $I_y$, and mimics the HVS in the sense that a textured block is less sensitive to distortion than a plain or edge block. $SP_{w_k}$ is calculated as

$$SP_{w_k} = \sum_{(m,n)} \sqrt{\alpha\left(I_x(m,n)^2 + I_y(m,n)^2\right) + \beta}, \qquad (3.17)$$

with $\alpha = 0.5$ and $\beta = 1$. Larger spatial gradients imply a more complicated block, which in turn generates an HVS masking effect. $I_x$ and $I_y$ are approximated using Scharr filters [30],

$$\frac{\partial}{\partial x} = \frac{1}{32}\begin{pmatrix} -3 & 0 & 3 \\ -10 & 0 & 10 \\ -3 & 0 & 3 \end{pmatrix}, \qquad \frac{\partial}{\partial y} = \frac{1}{32}\begin{pmatrix} -3 & -10 & -3 \\ 0 & 0 & 0 \\ 3 & 10 & 3 \end{pmatrix} \qquad (3.18)$$

and not as in [24], where they are approximated as

$$I_x(m,n) = I(m,n) - I(m-2,n) \qquad (3.19)$$
$$I_y(m,n) = I(m,n) - I(m,n-2) \qquad (3.20)$$

From the temporal perspective, masking effects occur due to motion, and it is argued in [24] that the HVS is not as sensitive to very slow or very fast motion as to intermediate motion. Therefore, the motion strength, $MS_k$, of each block is calculated as

$$MS_k = 0.5\sqrt{\bar{I}_{t,x}^2 + \bar{I}_{t,y}^2}, \qquad (3.21)$$

where $\bar{I}_{t,x}$ and $\bar{I}_{t,y}$ are the averages of the $x$ and $y$ components of the time derivative within a block, approximated using the Farnebäck algorithm [22]. It should be noted that this differs from the original implementation, where $MS_k$ is "weighted by the largest MS value in the frame"; however, it is not specified exactly how this is done. This information is used to assign each block a temporal weighting factor, $MS_{w_k}$, defined as

$$MS_{w_k} = \begin{cases} 1 & \text{if } 0.5 < MS_k < 1.5 \\ \gamma & \text{else} \end{cases} \qquad (3.22)$$

where $0 \leq \gamma \leq 1$. An illustration of this is seen in Figures 3.5 and 3.6, where the vector field is the block-wise average Optical Flow. The green arrows are those with weight 1 and the red ones those with weight $\gamma$. The wheel in the enlarged area in Figure 3.6 is rotating, and it can be seen fairly clearly what "intermediate" motion is. It should also be noted that $MS_k$ is calculated without regard to the frame rate. The constants 0.5 and 1.5 are those for 720p resolution; for other resolutions they are scaled linearly with the side length (assuming a constant aspect ratio), e.g. for 480p the constants are scaled by a factor $480/720$ ($= 0.666\ldots$).

Further temporal weighting is also performed using a Saliency Map, calculated as described in Section 3.3, where each block is assigned a weight, $SA_{w_k}$. All block-based saliency values $SA_{w_k}$ are mapped to $[1, 5]$. Given this spatial and temporal information, $w_k$ is calculated as

$$w_k = \frac{MS_{w_k} \cdot SA_{w_k}}{SP_{w_k}} \qquad (3.23)$$

and the block-based quality value as

$$Q_k = \mathrm{MSE}_k \, w_k. \qquad (3.24)$$

The output metric value is calculated as the average of all frame-wise values, which in turn are the averages of $Q_k$ within each frame,

$$Q_{\text{frame}} = \frac{1}{N} \sum_{k=1}^{N} Q_k \qquad (3.25)$$
$$Q_{\text{video}} = \frac{1}{|\text{frames}|} \sum Q_{\text{frame}} \qquad (3.26)$$

When evaluated without utilizing the information in the saliency map, the calculations are performed in the same way, except that

$$w_k = \frac{MS_{w_k}}{SP_{w_k}}.$$
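The pieces of the weight in Equation 3.23 can be sketched as follows for the saliency-free variant, with Scharr gradients via OpenCV at the 1/32 scale of Equation 3.18. The block-averaged flow components and the resolution-scaled motion limits are passed in; the function name is illustrative.

    #include <cmath>
    #include <opencv2/core/core.hpp>
    #include <opencv2/imgproc/imgproc.hpp>

    // Sketch: the per-block weight w_k = MS_w / SP_w (Equation 3.23 without
    // the saliency term). block is a 16x16 CV_32F patch of the original frame;
    // meanFlowX/meanFlowY are block-averaged Optical Flow components; lo/hi
    // are the resolution-scaled motion limits (0.5 and 1.5 at 720p).
    double xuBlockWeight(const cv::Mat& block,
                         double meanFlowX, double meanFlowY,
                         double gamma, double lo = 0.5, double hi = 1.5)
    {
        // Spatial term (Equation 3.17), alpha = 0.5, beta = 1.
        cv::Mat Ix, Iy;
        cv::Scharr(block, Ix, CV_32F, 1, 0, 1.0 / 32.0);
        cv::Scharr(block, Iy, CV_32F, 0, 1, 1.0 / 32.0);
        cv::Mat tmp = 0.5 * (Ix.mul(Ix) + Iy.mul(Iy)) + 1.0;
        cv::Mat mag;
        cv::sqrt(tmp, mag);
        double sp = cv::sum(mag)[0];

        // Motion strength (Equation 3.21) and temporal weight (Equation 3.22).
        double ms = 0.5 * std::sqrt(meanFlowX * meanFlowX +
                                    meanFlowY * meanFlowY);
        double msW = (ms > lo && ms < hi) ? 1.0 : gamma;

        return msW / sp;
    }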

3.8 BB-VIF with optical flow

Another method assessed is the combination of BB-VIF and block-based Optical Flow, as presented in Section 3.7. This was done by calculating the BB-VIF and the Optical Flow, $I_t$, for each block in each frame, and assigning them weights according to Equation 3.22. In an additional step, the $MS_{w_k}$ are normalized to $MS_{w_k,\mathrm{norm}}$ such that

$$\frac{1}{K} \sum_{k=1}^{K} MS_{w_k,\mathrm{norm}} = 1 \qquad (3.27)$$

before being used as weights. This is done to maintain BB-VIF's mathematical property that the quality measure applied to the original video (reference video equal to original video) has 1 as output, which is not the case if the weighting is not normalized. The output metric value is the average of all frame-wise values, which in turn are

$$Q_{\text{frame}} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{VIF}_k \, MS_{w_k,\mathrm{norm}} \qquad (3.28)$$
$$Q_{\text{video}} = \frac{1}{|\text{frames}|} \sum Q_{\text{frame}} \qquad (3.29)$$
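The normalization of Equation 3.27 amounts to dividing each weight by the frame's mean weight, after which the pooling of Equation 3.28 is a weighted average; a sketch with illustrative names:

    #include <vector>

    // Sketch: Equations 3.27-3.28 for one frame. vif[k] is the BB-VIF value
    // and msw[k] the raw weight (1 or gamma) of block k.
    double bbVifFlowFrame(const std::vector<double>& vif,
                          const std::vector<double>& msw)
    {
        const size_t K = msw.size();
        double mean = 0.0;
        for (size_t k = 0; k < K; ++k) mean += msw[k];
        mean /= K;                             // current mean of the weights

        double q = 0.0;
        for (size_t k = 0; k < K; ++k)
            q += vif[k] * (msw[k] / mean);     // weights now average to 1
        return q / K;
    }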

Figure 3.5: Illustration of the block-wise optical flow vectors for a video frame. Green arrows indicate weight 1 and red arrows weight $\gamma$ in Equation 3.22.

Figure 3.6: Enlargement of an area in Figure 3.5.


3.9 Directional Statistics based Color Similarity Index

Another quality measure that was investigated is the Directional Statistics based Color Similarity Index (DSCSI) [31]. It utilizes all colour channels, which VQA measures generally do not. DSCSI is a full-reference image quality measure which attempts to mimic SSIM [3] by applying similar measures to the colour channels of an image. For the detailed implementation, see [31].

The DSCSI measure is computed in three steps. In the first step, the image is transformed into the S-CIELAB colour space (see Section 2.1.3). In the second step, local colour similarity measures are applied to each of the three colour channels: the hue, chroma and lightness channels. For each channel, two similarity measures are obtained. For the hue channel, the hue mean similarity measure, $H_l$, and the hue dispersion similarity measure, $H_c$, are calculated. For the chroma channel, the chroma mean similarity measure, $C_l$, and the chroma contrast similarity measure, $C_c$, are calculated. And for the lightness channel, the lightness contrast similarity measure, $L_c$, and the lightness structural similarity measure, $L_s$, are calculated. Each of these six measures lies in the range $[0, 1]$. In the third step, these six individual measures are combined into a final measure, first by combining them into two scores, the chromatic similarity measure, $S_C$, and the achromatic similarity measure, $S_A$,

$$S_C = H_l \cdot H_c \cdot C_l \cdot C_c \qquad (3.30)$$
$$S_A = L_c \cdot L_s \qquad (3.31)$$

which in turn are combined into a final quality measure, $Q_{\text{frame}}$, as

$$Q_{\text{frame}} = S_A \cdot (S_C)^{\lambda} \qquad (3.32)$$

where $\lambda$ is a weighting constant. To use this algorithm to evaluate video quality, the average over all frames $f$ in video $V$ was used,

$$Q_{\text{video}} = \frac{1}{|V|} \sum_{f \in V} Q_{\text{frame}} \qquad (3.33)$$

All constants used in the calculations of $H_l$, $H_c$, $C_l$, $C_c$, $L_c$ and $L_s$ are those proposed in Table III in [31], and $\lambda$ was set to 0.8, as also proposed in the original report.
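Given the six channel scores from [31], the final combination of Equations 3.30–3.32 is a one-liner; a sketch with λ = 0.8 as used here:

    #include <cmath>

    // Sketch: the final DSCSI combination (Equations 3.30-3.32). The six
    // channel similarity scores (each in [0, 1]) come from [31].
    double dscsiScore(double Hl, double Hc, double Cl, double Cc,
                      double Lc, double Ls, double lambda = 0.8)
    {
        double sc = Hl * Hc * Cl * Cc;    // chromatic similarity (Eq. 3.30)
        double sa = Lc * Ls;              // achromatic similarity (Eq. 3.31)
        return sa * std::pow(sc, lambda); // Equation 3.32
    }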

3.10 Evaluation Method

The evaluation of the methods presented in this chapter has been done as presented in Section 2.3. As discussed in Section 1.2, this thesis is limited to the study of distortion from video compression. For this purpose, a test set consisting of videos with MOS scores has been produced by Ericsson according to the ITU-T Rec. P.910 [16] recommendation. It contains around 600 distorted videos in YCbCr format, compressed at different bit rates (as well as the original files), with a length of ≈ 10 seconds. They are of different characteristics, containing subsets with HDTV clips and cellphone/tablet video chats, characteristic of when video compression is used. This differentiates the test set from many of the publicly available ones (discussed in Section 2.3). Furthermore, the videos were arranged into 8 subsets in the formats 480p, 720p and 1080p. The fact that different resolutions are evaluated also differentiates this test set from many of the publicly available ones, which often only contain videos of resolution ≈ 480 × 856.


The performance of each algorithm is measured both by its SROCC and PCC, although only the SROCC is used for comparison of performance. This is because a good SROCC implies that a mapping can be done from the measure to a function whose values are proportional to MOS scores, allowing a video coding algorithm to optimize quality gains per used bit in compression. An illustration of SROCC and PCC for a subset of the test set is seen in Figure 3.7.

Figure 3.7: Illustration of how SROCC and PCC values correspond to data, here for PSNR (top) and Curve Fitted (see Section 2.3) PSNR (bottom) quality values with respective MOS values from a 480p dataset consisting of 90 videos. SROCC and PCC of the unmodified data are 0.7993 and 0.8393; for the Curve Fitted data, 0.7993 and 0.8834.


Chapter 4

Results

This chapter contains the results from the evaluation of the different VQA measures. The evaluation has been performed as described in Section 3.10, and the different measures as well as the features mentioned are described in Sections 3.2 to 3.9.

4.1 Xu-et-al-method

In this section, the results for the Xu-et-al-method are presented. The method is described in Section 3.7.

Table 4.1: Performance of the Xu-et-al-method compared with PSNR.

Metric            SROCC    PCC      Comments
PSNR              0.6957   0.7801
Xu-et-al-method   0.7944   0.8519   γ = 0.5

4.2 DSCSI

In this section, the result for the DSCSI method is presented. The method is described in Section 3.9.

Table 4.2: Performance of the DSCSI method.

Metric   SROCC    PCC
PSNR     0.6957   0.7801
DSCSI    0.5403   0.6349

4.3 Optical Flow

The performance of Optical Flow weighting for different configurations is seen in Table 4.3. The scaling with frame rate was implemented by scaling the Optical Flow value ($MS_k$) and the limits in Equation 3.22 with the frame rate. For the PSNR implementation, a block-based average of the PSNR within each block was calculated by weighting the pixel contributions to the MSE sum with a weight ($\gamma$ or 1). For more information, see Sections 2.2.3 and 3.7. The $\gamma$ value has not been optimized for the PSNR algorithm.


Table 4.3: Performances of different measures with and without Optical Flow weighting.

Metric                                             SROCC    PCC      Comments
BB-VIF                                             0.8424   0.8917
BB-VIF with Optical Flow                           0.8556   0.8970   γ = 0.5
BB-VIF with Optical Flow                           0.8544   0.8973   γ = 0.55
BB-VIF with Optical Flow scaled with Frame Rate    0.8449   0.8951   γ = 0.55
PSNR                                               0.6957   0.7801
PSNR with Optical Flow                             0.6967   0.7909   γ = 0.5

4.4 Saliency Map

In Table 4.4 the performances of the Xu-et-al-method and BB-VIF with Optical Flow, with and without the Saliency Map feature described in Section 3.3, are presented.

Table 4.4: Performance of different measures with and without the Saliency Map feature. OF refers to Optical Flow.

Metric                            SROCC    PCC      Comments
Xu-et-al-method                   0.7944   0.8519   γ = 0.5
Xu-et-al-method w. Saliency Map   0.7542   0.8092   γ = 0.6, Saliency map ∈ [1, 2]
BB-VIF w. OF                      0.8556   0.8970   γ = 0.5
BB-VIF w. OF & Saliency Map       0.8550   0.8979   γ = 0.5, Saliency map ∈ [1, 2]
BB-VIF w. OF & Saliency Map       0.8428   0.8878   γ = 0.5, Saliency map ∈ [1, 5]

4.5 Fovea filtering

In Table 4.5 the results for BB-VIF with Optical Flow, with and without the fovea filtering feature described in Section 3.4, are presented.

Table 4.5: Performance with and without Fovea filtering. OF refers to Optical Flow.

Metric                           SROCC    PCC      Comments
BB-VIF w. OF                     0.8544   0.8973   γ = 0.55
BB-VIF w. OF & Fovea Filtering   0.8521   0.8937   γ = 0.55

4.6 Temporal Hysteresis Model

In Table 4.6 the results for BB-VIF with Optical Flow, with and without the temporal hysteresis feature described in Section 3.5, are presented.

Table 4.6: Performance with and without the Temporal Hysteresis Model. OF refers to Optical Flow.

Metric                                SROCC    PCC      Comments
BB-VIF w. OF                          0.8544   0.8973   γ = 0.55
BB-VIF w. OF & Hysteresis modelling   0.8518   0.8886   γ = 0.55


4.7 Intra-frame Gaussian Weighting

In this section, the results for the intra-frame Gaussian weighting method, presented in Section 3.6, are given. In Table 4.7 the results are presented both for the method with and without the Optical Flow feature.

Table 4.7: Performance with and without the Intra-frame Gaussian Weighting.

Metric                                          SROCC    PCC      Comments
BB-VIF                                          0.8427   0.8917
BB-VIF w. Intra-frame Gaussian weighting        0.8434   0.8901
BB-VIF w. OF                                    0.8544   0.8973   γ = 0.55
BB-VIF w. OF & Intra-frame Gaussian weighting   0.8559   0.8977   γ = 0.55
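One way such a weighting can be realized is sketched below, under the assumption that the Gaussian is taken over the block quality values and centred at the worst block, so that blocks close in quality to the most distorted one receive the largest weights; the centring, the function name and the handling of α are assumptions, with α playing the role of the standard deviation discussed in Section 3.6:

    import numpy as np

    def gaussian_weighted_quality(block_q, alpha):
        # block_q: per-block quality values, e.g. BB-VIF values in [0, 1].
        block_q = np.asarray(block_q, dtype=np.float64)
        # Gaussian weights centred at the worst (lowest) block value.
        w = np.exp(-((block_q - block_q.min()) ** 2) / (2.0 * alpha ** 2))
        return float((w * block_q).sum() / w.sum())  # weighted frame quality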

4.8 Summary of results

This section provides a summary of the most relevant results already presented, as well as those for SSIM and MOVIE, for comparison and to provide an overview.

Table 4.8: Abbreviations for the metrics present in Table 4.9.

Abbr.    Metric
PSNR     Peak Signal to Noise Ratio.
SSIM     Structural Similarity index.
MOVIE    MOtion-based Video Integrity Evaluation index.
VIF      Visual Information Fidelity.
BB-VIF   Block Based VIF.
P1       BB-VIF with Optical Flow.
P2       BB-VIF with Optical Flow scaled with frame rate.
P3       BB-VIF with Optical Flow and intra-frame Gaussian weighting.
P4       BB-VIF with Optical Flow and Hysteresis implementation.
P5       BB-VIF with Intra-frame Gaussian weighting.
P6       BB-VIF with Optical Flow and Fovea filtering.
R1       Xu-et-al-method with Saliency Map.
R2       Xu-et-al-method without Saliency Map.
DSCSI    DSCSI, described in Section 3.9.


Table 4.9: Summary of the results of the different metrics; explanations of abbreviations are given in Table 4.8.

Metric   SROCC    PCC      Simulation-specific constants
PSNR     0.6957   0.7801
SSIM     0.7877   0.8521
MOVIE    0.8015   0.8638
VIF      0.8413   0.8973
BB-VIF   0.8427   0.8917
P1       0.8544   0.8973   γ = 0.55
P2       0.8449   0.8951   γ = 0.55
P3       0.8559   0.8977   γ = 0.55, α = (2K − 1)
P4       0.8518   0.8886   γ = 0.55
P5       0.8434   0.8901   α = (2K − 1)
P6       0.8521   0.8937   γ = 0.55
R1       0.7542   0.8092   γ = 0.6
R2       0.7944   0.8519   γ = 0.5
DSCSI    0.5403   0.6349


Chapter 5

Discussion

Since the metric that performed the best from the beginning was BB-VIF, most of the effort was put into enhancing BB-VIF rather than implementing all of the features on other pre-existing metrics. One property of BB-VIF is that while the average measure estimates video quality rather well, the variation between different block values can be rather big (an example of which can be seen in Figure 2.2). This might have been a limitation when evaluating some of the proposed features.

The results can be divided into two parts: the testing of features that improve already existing metrics, and the testing of (two) new metrics. Optical Flow weighting is an example of the former. As seen in Table 4.3, the implementation of Optical Flow weights as presented in Equation 3.22 improved the correlation by a fair amount. This increase of ≈ 0.01 might not seem very good at first. However, the fact that the unmodified BB-VIF already has a correlation of 0.8427 on this data set implies that major improvements in correlation are likely hard to achieve. Since the Optical Flow weighting also improved performance on PSNR (I would like to stress the fact that the weight γ was arbitrarily chosen for the result in Table 4.3), it became the new “standard” when evaluating performance, and therefore most of the other features were implemented on top of the “BB-VIF with Optical Flow” measure. Also, and this applies to all types of weighting, the improvement in performance is restricted by the underlying metric's performance: even with a perfect model of the HVS's temporal and spatial weighting, correlation is limited by how well the metric measures the distortion. Another notable thing with the current implementation of Optical Flow masking is that it improves performance much more when not scaled with the frame rate, which perhaps contradicts intuition. This implies that the HVS is sensitive to the displacement between frames (which the implemented optical flow algorithm measures) rather than the actual movement speed in the frame.

Another feature investigated was the intra-frame Gaussian weighting, the results of which can be seen in Table 4.7. This weighting was implemented to mimic the fact that the attention of the eye is more likely to be put on areas with higher distortion. Performance was improved slightly, which is not insignificant since the feature itself is not very complex. The lack of locality, however, makes it unsuitable for implementation in an encoder. The standard deviation, visualized in Figure 3.4, was quite sensitive, and enhancements in performance were only obtained with a low (2K − 1) standard deviation for the Gaussian. This might be an effect of the previously mentioned property of BB-VIF, that the deviation of the block-based values is quite big; when too much importance is given to highly distorted blocks, the (weighted) average no longer predicts quality as well.

The Temporal Hysteresis Model, presented in Section 3.5, did not improve performance; see Table 4.6 (slightly worse, 0.8518 compared to 0.8544). This was unexpected since the results presented in [29] were good. It is possible that one reason for this lies in the posed problem. As mentioned in Section 1.2, the problem investigated here is limited to distortion from compression, whereas [29] improves performance on a test set with videos subject to packet losses (together with compression distortion). This will produce videos with more drastic changes in quality than those in the test set used here, which might be what increases performance there. The subsets in the test set used for this report (see Section 3.10) have quite high MOS scores (3.88 > average > 3.29), and a good SROCC is harder to achieve on such a test set compared to one with a big spread in MOS. These results should not be considered enough to disregard Temporal Hysteresis modelling. Nonetheless, for the purpose of evaluating compression distortion it has not proven useful.
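For reference, a minimal sketch of temporal pooling in the spirit of the hysteresis model in [29] is given below: the pooled score at each frame mixes the current frame quality with the worst recent quality, mimicking viewers' lingering memory of quality drops. The window length, the mixing weight and the function name are illustrative assumptions, not the parameters of Section 3.5:

    import numpy as np

    def hysteresis_pool(frame_q, window=30, w_mem=0.8):
        # frame_q: per-frame quality values; returns one pooled video score.
        frame_q = np.asarray(frame_q, dtype=np.float64)
        pooled = np.empty_like(frame_q)
        for t in range(len(frame_q)):
            memory = frame_q[max(0, t - window):t + 1].min()  # worst recent value
            pooled[t] = w_mem * memory + (1.0 - w_mem) * frame_q[t]
        return float(pooled.mean())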

Fovea filtering, proposed in [28], was examined as it provided an interesting approach to spatial weighting, and the results presented in the article were good. However, as seen in Table 4.5, the results were slightly worse with fovea filtering than without. The fact that the results are so similar is not very surprising considering how this implementation differs from the original: an average of all blocks is used as the quality estimate for each frame, rather than the worst pixel/block. Theoretically, with periodic boundaries, the result of such an average would be the same as without any filtering at all. The averaging was done because of the property mentioned above, the large deviation in block-wise quality values, which in practice means that there almost always are blocks with quality values close to zero (quality value ∈ [0, 1]); this made the original method useless. Had there been more time, this method should have been examined further, for example with another way of extracting a frame-quality value.

The saliency map was part of the VQA measure presented in [24] (the Xu-et-al-method, although not first presented there), which did not turn out to work very well. Therefore it was (for the purposes of this report) disregarded after testing it with both the Xu-et-al-method and BB-VIF; the results can be seen in Table 4.4. Because of the poor performance using the implementation in [24], the mapping described in Section 3.3 was implemented and tested. This was done because the saliency values could differ by ratios as big as 10 000 within a single frame (the smaller values being very small, close to zero). As can be seen in Table 4.4, the results are better the smaller the top/low ratio is (less importance given to the saliency map), which implies that the SROCC is converging to the SROCC for the methods without the saliency map. These results contradict those presented in [25] and [24], where the saliency map improves performance. The only way the calculations differ (up to the mapping/no mapping) is that no predicted error is present here, which makes one of the imaginary parts zero in the quaternion representation (see Section 3.3); the good precision obtained using the Farnebäck algorithm should however justify this. It is possible that the method used to calculate the optical flow in the original report (the method is not specified) produces a flow with properties more suited for the implementation in [24], although this is not likely. Implementation-wise, the DQFT has been compared to existing MATLAB implementations [32] of the DQFT to verify the transformation.
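As an illustration of the mapping, raw saliency values spanning several orders of magnitude can be compressed into a bounded weight range such as [1, 2] or [1, 5]. The linear form below is an assumption, since the mapping of Section 3.3 is not reproduced here, and the function name is illustrative:

    import numpy as np

    def map_saliency(saliency, top=2.0):
        # Rescale raw saliency values into block weights in [1, top].
        s = np.asarray(saliency, dtype=np.float64)
        s = (s - s.min()) / (s.max() - s.min() + 1e-12)  # normalise to [0, 1]
        return 1.0 + (top - 1.0) * s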

The Xu-et-al-method is an altered version of the VQA measure presented in [24], with MSE as a base on top of which a weight is assigned to mimic spatial and temporal masking effects. These masking effects were the main reason why it was examined; the results can be seen in Table 4.1. Although the SROCC is not as good as for the methods based on BB-VIF, it still provides a substantial improvement compared to PSNR (of ≈ 0.1). It is also slightly better than SSIM and slightly worse than MOVIE, see Table 4.9. The reason why the performance is better without using a saliency map has, as mentioned, not been accounted for. This does however reduce the complexity of the metric, and the block quality value calculation is isolated compared to when a saliency map is used. If this is a desired property, as in video encoders (this metric is proposed for this usage), this is an interesting method. This method has not been combined with the other features, as a good SROCC was prioritized at the time.

The DSCSI algorithm has been assessed because the results in [31] seemed promising, and it is also based on all three colour channels rather than being applied to the luminance channel only, which is the case for the other metrics investigated. The results, seen in Table 4.9, are very poor. It is even outperformed by PSNR, which is one of the most basic VQA methods available. There can be several contributing sources to this poor performance. Firstly, DSCSI is an image quality measure and not a VQA measure, and it is not certain that a good image quality metric will be good for videos. Furthermore, it is possible that the colour distortions that DSCSI captures are not (as) present in compression. An indication that this might be a factor is that of the 6 measures that compose DSCSI, the Hue and Chroma similarities typically differ from 1 only in the third digit (i.e. 0.99xxx), whereas the Lightness similarity varies more.


Chapter 6

Conclusions and Future Research

The results presented in this report show that the performance of the examined VQA measures can be improved by using the optical flow weighting scheme seen in Section 3.7. The fact that optical flow can be used to mimic masking effects of the HVS is in itself nothing new; what is notable is that it can be used to improve the metric that performs the best on compression distortion, as BB-VIF performs better than SSIM and MOVIE (which uses motion masking), both of which are considered state of the art today. The way the weighting has been implemented here also differs from how it was implemented in [24], making this method local, i.e. the blocks are independent of each other, which is well suited for parallelization, for example in an encoder.

No VQM that exploits the colour channels in a satisfactory way has been found. The DSCSI image quality metric was implemented; its performance was however poor, even worse than that of PSNR.

As the time frame of this thesis allowed for further investigations, it was decided that other ways of improving VQA measures that had been encountered and seemed interesting should be examined. Further improvement has been obtained using the intra-frame Gaussian weighting proposed in Section 3.6; this has only been tested on the BB-VIF measure. Other features did not provide improvement, although the decreases in performance for the Temporal Hysteresis Modelling and the Fovea filtering are not significant enough to disregard them completely, as they have proven useful in other works ([29] and [28]).

Furthermore, a variant of the metric in [24], the Xu-et-al-method, has been proposed and tested. The results of this metric are not as good as those for BB-VIF with optical flow and intra-frame Gaussian weighting; the fact that it outperforms PSNR and SSIM (although not MOVIE), together with its locality, does however make it interesting for encoding purposes.

Future studies should determine whether the weighting mechanisms proposed and studied here can be used together with other existing VQMs, such as SSIM or the Xu-et-al-method, especially if the block-based values of the tested metric are not as volatile as those in BB-VIF. This is particularly interesting if the purpose is encoding. If proceeding with the methods proposed in this report, it might also be of interest to try to optimize the constants further, as the execution times of the correlation tests have been a limiting factor.


Bibliography

[1] Cisco. Cisco Visual Networking Index: Forecast and Methodology, 2014-2019 White Paper. url: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html (visited on 10/10/2015).

[2] Video Quality Experts Group. Video Quality Experts Group Home Page. 2015. url: http://www.its.bldrdoc.gov/vqeg/vqeg-home.aspx (visited on 11/09/2015).

[3] Zhou Wang et al. “Image quality assessment: from error visibility to structural similarity”. English. In: IEEE Transactions on Image Processing 13.4 (2004), pp. 600–612.

[4] H. R. Sheikh, A. C. Bovik, and G. de Veciana. “An information fidelity criterion for image quality assessment using natural scene statistics”. English. In: IEEE Transactions on Image Processing 14.12 (2005), pp. 2117–2128.

[5] Nikolay Ponomarenko et al. “Metrics performance comparison for color image database”. In: Fourth international workshop on video processing and quality metrics for consumer electronics. Vol. 27. 2009.

[6] Ann M. Rohaly et al. “Video Quality Experts Group: current results and future directions”. English. In: vol. 4067. 2000. Chap. 1. issn: 0277-786X.

[7] Electronic Imaging Plenaries 2011. Video Presentation by Al Bovik, "New Dimensions in Visual Quality". url: river-valley.zeeba.tv/media/conferences/sda-2011/0201-Al-Bovik (visited on 12/15/2015).

[8] Commission Internationale de l’Eclairage. CIE L*A*B* Color Space. 2015. url: http://www.cie.co.at/index.php/index.php?i_ca_id=485 (visited on 11/09/2015).

[9] Commission Internationale de l’Eclairage. International Commission on Illumination. 2015. url: http://www.cie.co.at/ (visited on 12/07/2015).

[10] X. Zhang and B. A. Wandell. “A spatial extension of CIELAB for digital color-image reproduction”. English. In: Journal of the Society for Information Display 5.1 (1997), pp. 61–63.

[11] K. Seshadrinathan and A. C. Bovik. “Motion Tuned Spatio-Temporal Quality Assessment of Natural Videos”. English. In: IEEE Transactions on Image Processing 19.2 (2010), pp. 335–350.

[12] E. P. Simoncelli and B. A. Olshausen. “Natural image statistics and neural representation”. English. In: Annual review of neuroscience 24 (2001), p. 1193.

[13] S. Edwards. “Thomas M. Cover and Joy A. Thomas, Elements of Information Theory (2nd ed.), John Wiley & Sons, Inc. (2006)”. English. In: Information Processing & Management 44.1 (2008), pp. 400–401.

[14] International Telecommunication Union, Geneva, Switzerland. Tutorial to Objective perceptual assessment of video quality: Full reference television. 2004. url: https://www.itu.int/ITU-T/studygroups/com09/docs/tutorial_opavc.pdf.


[15] Laboratory for Image and Video Engineering (LIVE) at The University of Texas atAustin. url: http://live.ece.utexas.edu.

[16] International Telecommunication Union, Geneva, Switzerland. Recommendation ITU-T Rec. P.910, "Subjective video quality assessment methods for multimedia applications". url: https://www.itu.int/rec/T-REC-P.910/en (visited on 12/18/2015).

[17] T. A. Ell and S. J. Sangwine. “Hypercomplex Fourier Transforms of Color Images”. English. In: IEEE Transactions on Image Processing 16.1 (2007), pp. 22–35.

[18] Python Software Foundation. Python Language Reference, version 2.7.10. 2015. url: https://www.python.org (visited on 12/14/2015).

[19] Itseez. Open CV Kernel Description. 2015. url: http://opencv.org (visited on 10/07/2015).

[20] Matteo Frigo and Steven G. Johnson. “The Design and Implementation of FFTW3”. In: Proceedings of the IEEE 93.2 (2005). Special issue on “Program Generation, Optimization, and Platform Adaptation”, pp. 216–231.

[21] Open Source Initiative. The BSD 3-Clause License. url: http://opensource.org/licenses/BSD-3-Clause (visited on 10/07/2015).

[22] Gunnar Farnebäck. “Two-frame motion estimation based on polynomial expansion”. English. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 2749 (2003), pp. 363–370.

[23] Itseez. calcOpticalFlowFarneback Documentation. 2015. url: http://docs.opencv.org/2.4/modules/video/doc/motion_analysis_and_object_tracking.html (visited on 12/09/2015).

[24] Long Xu et al. “Visual quality metric for perceptual video coding”. English. In: IEEE, 2013, pp. 1–5.

[25] Lin Ma, Songnan Li, and King N. Ngan. “Motion trajectory based visual saliency for video quality assessment”. English. In: 2011, pp. 233–236. issn: 1522-4880.

[26] C. Guo, Q. Ma, and L. Zhang. “Spatio-temporal Saliency detection using phase spectrum of quaternion fourier transform”. English. In: 2008, pp. 1–8. isbn: 978-1-4244-2243-2.

[27] GNU General Public License. url: http://www.gnu.org/licenses/gpl-3.0.html.

[28] Marcus Barkowsky et al. “Perceptually motivated spatial and temporal integration of pixel based video quality measures”. English. In: ACM, 2007, pp. 1–7. isbn: 1595937609; 9781595937605.

[29] Kalpana Seshadrinathan and Alan C. Bovik. “Temporal hysteresis model of time varying subjective video quality”. English. In: 2011, pp. 1153–1156. issn: 1520-6149.

[30] Joachim Weickert and Hanno Scharr. “A Scheme for Coherence-Enhancing Diffusion Filtering with Optimized Rotation Invariance”. English. In: Journal of Visual Communication and Image Representation 13.1 (2002), pp. 103–118.

[31] Dohyoung Lee and Konstantinos N. Plataniotis. “Towards a Full-Reference Quality Assessment for Color Images Using Directional Statistics”. English. In: IEEE Transactions on Image Processing 24.11 (2015), pp. 3950–3965.


[32] Nicolas Le Bihan and Steve Sangwine. Quaternion toolbox for Matlab, Version2.0.0. url: http://sourceforge.net/projects/qtfm/.