
Video is a Cube: Multidimensional Analysis and Video Quality Metrics

Christian Keimel, Martin Rothbucher, Hao Shen and Klaus Diepold
Institute for Data Processing, Technische Universität München

Arcisstr. 21, 80333 München

Abstract—Quality of Experience is becoming increasingly important in signal processing applications. Taking inspiration from chemometrics, we provide an introduction to the design of video quality metrics using data analysis methods that differ from traditional approaches: they do not require a complete understanding of the human visual system. We use multidimensional data analysis, an extension of well-established data analysis techniques, allowing us to better exploit higher-dimensional data. In the case of video quality metrics, it enables us to exploit the temporal properties of video more properly: the complete three-dimensional structure of the video cube is taken into account in the metrics' design. Starting with the well-known principal component analysis and an introduction to the notation of multi-way arrays, we then present their multidimensional extensions, delivering better quality prediction results. Although we focus on video quality, the presented design principles can easily be adapted to other modalities and to even higher-dimensional datasets.

QUALITY OF EXPERIENCE (QoE) is a relatively new concept in signal processing that aims to describe how video, audio and multi-modal stimuli are perceived by human observers. In the field of video quality assessment, researchers are often interested in how the overall experience is influenced by different video coding technologies, transmission errors or general viewing conditions. The focus is no longer on measurable physical quantities, but rather on how the stimuli are subjectively experienced and whether they are perceived to be of acceptable quality from a subjective point of view.

QoE is in contrast to the well-established Quality of Service (QoS). There, we measure the signal fidelity, i.e. how much a signal is degraded during processing by noise or other disturbances. This is usually done by comparing the distorted with the original signal, which then gives us a measure of the signal's quality. To understand why QoS is not sufficient for capturing the subjective perception of quality, let us take a quick look at the most popular metric in signal processing for measuring QoS, the mean squared error (MSE). It is known that the MSE does not correlate very well with the human perception of quality, as we just determine the difference between pixel values in both images. The example in Fig. 1 illustrates this problem. Both images on the left have the same MSE with respect to the original image. Yet, we perceive the upper image, distorted by coding artefacts, to be of worse visual quality than the lower image, where we just changed the contrast slightly. Further discussions of this problem can be found in [1].

I. HOW TO MEASURE QUALITY OF EXPERIENCE

How then can we measure QoE? The most direct way is to conduct tests with human observers, who judge the visual quality of video material and thus provide information about the subjectively perceived quality. However, we face a problem in real life: these tests are time consuming and quite expensive. The reason for this is not only that a limited number of subjects can take part in a test at the same time, but also that a multitude of different test cases have to be considered. Apart from these more logistical problems, subjective tests are usually not suitable if the video quality is required to be monitored in real time.

To overcome this difficulty, video quality metrics are designed and used. The aim is to approximate the human quality perception as well as possible with objectively measurable properties of the videos. Obviously, there is no single measurable quantity that by itself can represent the perceived QoE. Nevertheless, we can determine some aspects which are expected or shown to have a relation to the perception of quality and use these to design an appropriate video quality metric.

II. DESIGN OF VIDEO QUALITY METRICS - THE TRADITIONAL APPROACH

In the traditional approach, quality metrics aim to implement the spatial and temporal characteristics of the human visual system (HVS) as faithfully as possible. Many aspects of the HVS are not sufficiently understood, and therefore a comprehensive model of the HVS can hardly be built. Nevertheless, at least parts of the HVS can be described sufficiently well to utilize these properties in video quality metrics. In general, there are two different ways to exploit these properties according to Winkler [2]: either a psychophysical or an engineering approach.

The psychophysical approach relies primarily on a (partial) model of the HVS and tries to exploit known psychophysical effects, e.g. masking effects, adaptation and contrast sensitivity. One advantage of this approach is that we are not limited to a specific coding technology or application scenario, as we implement an artificial observer with properties of the HVS. In Daly's visual differences predictor [3], for example, the adaptation of the HVS to different light levels is taken into account, followed by an orientation-dependent contrast sensitivity function; finally, models of the HVS's different detection mechanisms are applied.


Fig. 1: Images with same MSE, but different visual quality (left) and how a model is built with data analysis: subjective testing and feature extraction for each video sequence (right)

This full-reference predictor then delivers a measure for the perceptual difference in the same areas of two images. Further representatives of this approach are Lubin's visual discrimination model [4], the Sarnoff Just Noticeable Difference (JND) [5] and Winkler's perceptual distortion metric [6].

In the engineering approach, the properties of the HVS are not implemented directly; instead, features known to be correlated with the perceived visual quality are determined. These features are then extracted and used in the video quality metric. As no in-depth understanding of the HVS is needed, this type of metric is commonly used in current research. In contrast to the psychophysical approach, however, we are limited to predefined coding technologies or application scenarios, as we do not construct an artificial observer, but rather derive the features from artifacts introduced by the processing of the videos. One example of such a feature is blocking, as seen in Fig. 1. This feature is well known, as it is especially noticeable in highly compressed video. It is caused by block-based transforms such as the Discrete Cosine Transform (DCT) or the integer transform in the current video encoding standards MPEG-2 and H.264/AVC. With this feature, we exploit the knowledge that human perception is sensitive to edges, and therefore assume that artificial edges introduced by the encoding result in a degraded perceived quality. Usually, more than one feature is extracted, and these features are then combined into one value by using assumptions about the HVS. Typical representatives of this approach are the widely used Structural Similarity (SSIM) index by Wang et al. [7], the video quality metric by Wolf and Pinson [8] and Watson's digital video quality metric [9]. Moreover, we can distinguish between full-reference, reduced-reference and no-reference metrics, where we have either the undistorted original, some meta information about the undistorted original or only the distorted video available, respectively. We refer to [10] for further information on the HVS, and to [11], [12] for an overview of current video quality metrics.
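
As an illustration of such a feature, consider the following toy sketch (our own simplified example, not a metric from the literature): it measures the average luminance jump across the vertical boundaries of 8 × 8 coding blocks in a grayscale frame, a quantity that grows with visible blocking.

```python
import numpy as np

def blockiness(frame, block=8):
    """Toy blocking feature: mean absolute luminance difference across
    the vertical boundaries of the coding blocks in a grayscale frame."""
    diff = np.abs(np.diff(frame.astype(float), axis=1))  # horizontal gradients
    return diff[:, block - 1::block].mean()              # only at block borders
```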

In general, the exploitation of more properties of the HVS, or of their corresponding features, in a metric allows us to model the perception of quality better. However, since the HVS is not understood completely and, consequently, no explicit model of the HVS describing all its aspects is available, it is not obvious how the features should be combined. But do we really need to know a priori how to combine the features?

III. AN ALTERNATIVE METHOD: DATA ANALYSIS

Sometimes it is helpful to look at other disciplines. Video quality estimation is not the only application area in which we want to quantify something that is not directly accessible to measurement. Similar problems often occur in chemistry and related research areas. In food science, for example, researchers face a comparable problem: they want to quantify the taste of samples, but taste is not directly measurable. The classic example is the determination of the hot chocolate mixture that tastes best. One can measure the milk, sugar or cocoa content, but there is no a-priori physical model that allows us to define the resulting taste. To solve this problem, a data-driven approach is applied: instead of making explicit assumptions about the overall system and the relationship between the dependent variable, e.g. taste, and the influencing variables, e.g. milk, sugar and cocoa content, the input and output variables are analyzed. In this way we obtain models purely via the analysis of the data.

In chemistry, this is known as chemometrics and has been applied successfully to many problems in this field over the last three decades. It provides a powerful tool to tackle the analysis and prediction of systems that are understood only to a limited degree. So far, this method is not well known in the context of video quality assessment, or even multimedia quality assessment in general. A good introduction to chemometrics can be found in [13].

By applying this multivariate data analysis to video quality, we now consider the HVS as a black box and therefore do not assume a complete understanding of it. The input corresponds to features we can measure and the output of the box to the perceived visual quality obtained in subjective tests. First, we extract m features from an image or video frame I, resulting in a 1 × m row vector x. While this is similar to the engineering approach described in the previous section, an important difference is that we make no assumptions about the relationships between the features themselves, nor about how they are combined into a quality value.

In general, we should not limit the number of selected features unnecessarily. Or, to quote Martens and Martens [13]: "Beware of wishful thinking!" As we do not have a complete understanding of the underlying system, it can be fatal to exclude features before conducting any analysis merely because we consider them to be irrelevant. On the other hand, data that can be objectively extracted, like the features in our case, is usually cheap, or in any case less expensive to generate than subjective data gained in tests. If some features are irrelevant to the quality, we will find out during the analysis. Of course, it is only sensible to select features that have some verified, or at least suspected, relation to the human perception of visual quality. For example, we could measure the room temperature, but it is highly unlikely that room temperature has any influence in our case.

For n different video sequences, we extract a corresponding feature vector x for each sequence and thus get an n × m matrix X, where each row describes a different sequence or sample and each column a different feature, as shown in Fig. 1. We generate a subjective quality value for each of the n sequences by subjective testing and get an n × 1 column vector y that represents our ground truth. Based on this dataset, a model can be generated to explain the subjectively perceived quality with objectively measurable features. Our aim is now to find an m × 1 column vector b that relates the features in X to our ground truth in y, i.e. provides the weights with which each feature contributes to the corresponding visual quality. This process is called calibration or training of the model, and the sequences used are the training set. We can then use b to predict the quality of new, previously unknown sequences. The benefit of this approach is that we are able to combine totally different features into one metric without knowing beforehand their proportional contribution to the overall perception of quality.
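
In code, such a dataset is simply a pair of arrays whose shapes follow the definitions above; a minimal sketch with placeholder data:

```python
import numpy as np

n, m = 52, 16                  # n sequences (samples), m features per sequence
X = np.random.rand(n, m)       # feature matrix: one row per video sequence
y = 1 + 4 * np.random.rand(n)  # subjective quality (MOS) per sequence, 1 to 5
```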

IV. CLASSIC AND WELL KNOWN: LINEAR REGRESSION

One classic approach to estimate the weight vector b is via a simple multiple linear regression model, i.e.

y = Xb + ε,  (1)

where ε is the error term. Without loss of generality, the data matrix X can be assumed to be centered, namely with zero means, and consequently the video quality values y are also centered.

Using least squares estimation, we obtain an estimate of b as

b = (X^T X)^+ X^T y,  (2)

where Z^+ denotes the Moore-Penrose pseudo-inverse of a matrix Z. We use the pseudo-inverse, as we cannot assume that the columns of X representing the different features are linearly independent; X^T X can therefore be rank deficient. For an unknown video sequence V_U and the corresponding feature vector x_u, we are then able to predict its visual quality y_u with

y_u = x_u b.  (3)

Yet, this simple approach has a drawback: we implicitly assume in the estimation of the weights that all features are equally important. Clearly, this will not always be the case, as some features may have a larger variance than others.
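
As a minimal NumPy sketch of the calibration step (2) and the prediction step (3); the function names are ours, and X and y are assumed to be the feature matrix and ground truth defined above:

```python
import numpy as np

def train_linear(X, y):
    """Least squares estimate b = (X^T X)^+ X^T y on centered data, cf. (2)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean            # center features and quality
    b = np.linalg.pinv(Xc.T @ Xc) @ Xc.T @ yc
    return b, x_mean, y_mean

def predict_linear(x_u, b, x_mean, y_mean):
    """Predict the visual quality of an unknown sequence, cf. (3)."""
    return (x_u - x_mean) @ b + y_mean
```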

V. AN IMPROVEMENT: PRINCIPAL COMPONENT REGRESSION

We can address the aforementioned issue by selecting the weights in the model so that they take into account the influence of the individual features on the variance in the feature matrix X. We are therefore looking for so-called latent variables that are not directly represented by the measured features themselves, but rather by a hidden combination of them. In other words, we aim to reduce the dimensionality of our original feature space to a more compact representation, better fitting our latent variables. One well-known method is the Principal Component Analysis (PCA), which extracts the latent variables as the Principal Components (PCs). The variance of the PCs is expected to preserve the variance of the original data. We then perform a regression on some of these PCs, leading to the principal component regression (PCR). As PCA is a well-known method, we just briefly recap some basics.

Let X be a (centered) data matrix and define r = min{n, m}. Using a singular value decomposition (SVD), we obtain the following factorization:

X = UDP^T,  (4)

where U is an n × r matrix with r orthonormal columns, P is an m × r matrix with r orthonormal columns and D is an r × r diagonal matrix. P is called the loadings matrix and its columns p_1, ..., p_r are called loadings; they represent the eigenvectors of X^T X. Furthermore, we define the scores matrix

T = UD = XP. (5)

The basic idea behind PCR is to approximate X by using only the first g columns of T and P, representing the g largest eigenvalues of X^T X and thus the first g PCs. We hereby assume that the combination of the g largest eigenvalues describes the variance in our data matrix X sufficiently and that we can therefore discard the smaller eigenvalues. If g is smaller than r, the model can be built with a reduced rank. Usually we aim to explain at least 80-90% of the variance in X, but other selection criteria are possible as well.
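
This selection criterion can be read off the singular values directly; a small sketch (our own helper function, assuming d holds the singular values of the centered X):

```python
import numpy as np

def choose_g(d, threshold=0.9):
    """Smallest g such that the first g PCs explain at least `threshold`
    of the variance; the eigenvalues of X^T X are the squared singular
    values in d."""
    explained = np.cumsum(d**2) / np.sum(d**2)
    return int(np.searchsorted(explained, threshold)) + 1
```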

Our regression model with the first g PCs can thus be written as

y = T_g c,  (6)

where T_g represents the matrix with the first g columns of T and c the (unknown) weight vector. Once again, we perform a multiple linear regression. We estimate c with the least squares method as

c = (T_g^T T_g)^{-1} T_g^T y.  (7)

Fig. 2: Temporal pooling: the t feature vectors x of a video's t frames are combined by a pooling function into a single 1 × m feature vector x

In the end, we are interested in the weights, so that we can directly calculate the visual quality. Therefore we determine the estimated weight vector b as

b = P_g c,  (8)

with P_g representing the matrix with the first g columns of P. We can then predict the visual quality for an unknown video sequence V_U with the corresponding feature vector x_u using (3).
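
A hedged NumPy sketch of the complete PCR model building of (4)-(8), with g retained principal components (names are ours; prediction then works exactly as in the linear regression sketch above):

```python
import numpy as np

def train_pcr(X, y, g):
    """Principal component regression, cf. (4)-(8)."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean                                   # centered data matrix
    U, d, Pt = np.linalg.svd(Xc, full_matrices=False)  # X = U D P^T, cf. (4)
    P_g = Pt[:g].T                                    # first g loadings
    T_g = Xc @ P_g                                    # scores, cf. (5)
    c = np.linalg.solve(T_g.T @ T_g, T_g.T @ (y - y_mean))  # cf. (7)
    b = P_g @ c                                       # feature-space weights, cf. (8)
    return b, x_mean, y_mean                          # predict via (3)
```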

PCA was first used in the design of video quality metrics by Miyahara in [14]. We refer to [15] for further information on PCA and PCR. A more sophisticated method often used in chemometrics is partial least squares regression (PLSR). This method takes the variance in the subjective quality vector y into account as well as the variance in the feature matrix X. PLSR has been used in the design of video quality metrics in e.g. [16]. Further information on PLSR itself can be found in [13] and [17].

VI. VIDEO IS A CUBE

The temporal dimension is the main difference between still images and video. In the previous sections we assumed that we extract the feature vector from only one image or one video frame, i.e. a two-dimensional matrix. In other words, video was considered to be just a simple extension of still images. This omission is not unique to this article so far; the temporal dimension is quite often neglected in contributions in the field of video quality metrics. The additional dimension is usually managed by temporal pooling: either the features themselves are temporally pooled into one feature value for the whole video sequence, or the metric is applied to each frame of the video separately and the metric's values are then pooled temporally over all frames to gain one value, as illustrated in Fig. 2.

Pooling is mostly done by averaging, but other simple statistical functions are also employed, such as the standard deviation, 10/90% percentiles, median or minimum/maximum. Even if a metric considers not only the current frame, but also preceding or succeeding frames, e.g. with a 3D filter [18] or spatio-temporal tubes [19], the overall pooling is still done with one of the above functions. But this arbitrary pooling, especially averaging, obscures the influence of temporal distortions on the human perception of quality, as intrinsic dependencies and structures in the temporal dimension are disregarded. The importance of the temporal properties of video features in the design of video quality metrics was recently shown in [20].
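
To make the pooling step concrete, a small sketch (our own illustration): a t × m matrix of per-frame features is collapsed into a single 1 × m feature vector by a statistical function applied along the temporal axis.

```python
import numpy as np

def temporal_pooling(frame_features, func=np.mean):
    """Pool a t x m matrix of per-frame features into one 1 x m vector."""
    return func(frame_features, axis=0)

frame_features = np.random.rand(250, 16)        # t = 250 frames, m = 16
pooled_mean = temporal_pooling(frame_features)            # averaging
pooled_std = temporal_pooling(frame_features, np.std)     # standard deviation
pooled_p90 = np.percentile(frame_features, 90, axis=0)    # 90% percentile
```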

Omitting the temporal pooling step and introducing the additional temporal dimension directly into the design of video quality metrics can improve the prediction performance. We therefore propose to consider video in its natural three-dimensional structure as a video cube. Extending the data analysis approach, we add an additional dimension to our dataset and thus arrive at multidimensional data analysis, an extension of the two-dimensional data analysis. In doing so, we gain a better understanding of the video's properties and will thus be able to interpret the extracted features better. We no longer employ an a-priori temporal pooling step, but use the whole video cube to generate the prediction model for the visual quality, and thus consider the temporal dimension of video more appropriately.

VII. TENSOR NOTATION

Before moving on to the multidimensional data analysis, we briefly introduce the notation for handling multi-way arrays or tensors.

In general, our video cube can be represented as a three-way u × v × t array V(:,:,:), where u and v are the frame dimensions and t is the number of frames. Similarly, we can extend the two-dimensional feature matrix X into the temporal dimension as an n × m × t three-way array or feature cube. Both are shown in Fig. 3. In this work, we denote by X(i,j,k) the (i,j,k)-th entry of X, by X(i,j,:) the vector of X with a fixed pair (i,j), referred to as a tensor fiber, and by X(i,:,:) the matrix of X with a fixed index i, referred to as a tensor slice. The different fibers and slices are shown in Fig. 4. For more information about tensors and multi-way arrays see [21]; for multi-way data analysis refer to [22], [23].
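
In NumPy this colon notation maps directly onto array slicing (a small sketch of our own; note that NumPy indices are 0-based while the notation above is 1-based):

```python
import numpy as np

X = np.arange(5 * 4 * 3).reshape(5, 4, 3)  # an n x m x t feature cube

entry = X[0, 1, 2]     # the entry X(1, 2, 3)
fiber = X[2, 2, :]     # tube (mode-3) fiber X(3, 3, :)
slice_ = X[0, :, :]    # horizontal slice X(1, :, :)
```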

VIII. UNFOLDING TIME

The easiest way to apply conventional data analysis methods to tensor data is to represent the tensors as matrices, i.e. to transform the elements of a tensor or multi-way array into the entries of a matrix. Such a process is known as unfolding, matricization or flattening.

Fig. 3: Video cube (u × v × t) and feature cube (n × m × t)


Fig. 4: Tensor notation: column (mode-1), row (mode-2) and tube (mode-3) fibres, e.g. X(:,2,1), X(1,:,2) and X(3,3,:) (left), and horizontal, lateral and frontal slices, e.g. X(1,:,:), X(:,2,:) and X(:,:,1) (right)


In our setting, we are interested in the temporal dimension and therefore perform the mode-1 unfolding of our three-way array X. We thus obtain a new n × (m · t) matrix X_unfold, whose columns are the rearranged mode-1 fibers of X. For simplicity, we assume that the temporal order is maintained in X_unfold. The structure of this new matrix is shown in Fig. 5. We then perform a PCR on this matrix as described previously and obtain a model of the visual quality. Finally, we can predict the visual quality of an unknown video sequence V_U with its feature vector x_u by using (3).

Fig. 5: Unfolding of the feature cube: the frontal slices X(:,:,1), X(:,:,2), …, X(:,:,t) are placed side by side to form an n × (m·t) matrix

Note that x_u is now of dimension 1 × (m · t). One disadvantage during the model building step with PCR is that the SVD must be performed on a rather large matrix. Depending on the number of frames in the video sequence, the time needed for model building can increase by a factor of 10³ or more. But more importantly, we still lose some information about the variance, as unfolding destroys the temporal structure.
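
A sketch of this mode-1 unfolding, assuming the feature cube is stored as an n × m × t NumPy array:

```python
import numpy as np

def unfold_mode1(X):
    """Rearrange an n x m x t feature cube into an n x (m*t) matrix by
    placing the frontal slices X(:,:,1), ..., X(:,:,t) side by side,
    preserving the temporal order (cf. Fig. 5)."""
    n, m, t = X.shape
    return np.concatenate([X[:, :, k] for k in range(t)], axis=1)
```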

IX. 2D PRINCIPAL COMPONENT REGRESSION

Instead of unfolding, we can include the temporal dimension directly in the data analysis by performing a multidimensional data analysis. We use the two-dimensional extension of the PCA (2D-PCA), recently proposed by Yang et al. [24], in combination with a least squares regression as 2D-PCR. For a video sequence with t frames, we can carve the n × m × t feature cube into t slices, where each slice represents one frame. Without loss of generality, we can compute the covariance or scatter matrix as

X_Sct = (1/t) ∑_{i=1}^{t} X(:,:,i)^T X(:,:,i),  (9)

where, by abusing the notation, X(:,:,i) denotes the centered data matrix of the i-th slice. X_Sct therefore describes the average covariance over the temporal dimension t. The SVD is then performed on X_Sct to extract the PCs, similar to the one-dimensional PCR previously described in (4).

Instead of a scores matrix T, we now have a three-way n × g × t scores array T(:,:,:), with each slice defined as

T(:,:,i) = X(:,:,i) P.  (10)

Similar to (7), we then estimate a g × 1 × t array of prediction weights, one weight vector for each slice, using the first g principal components as

C(:,:,i) = (T_g(:,:,i)^T T_g(:,:,i))^+ T_g(:,:,i)^T y,  (11)

before expressing the weights in our original feature space with an m × 1 × t three-way array

B(:,:,i) = P_g C(:,:,i),  (12)

comparable to (8) for the one-dimensional PCR. Note that the weights are now represented by a (rotated) matrix.

A quality prediction for the i-th slice can then be performed in the same manner as in (3), i.e.

y_u(i) = X_u(:,:,i) B(:,:,i),  (13)

where X_u represents the 1 × m × t feature array of one sequence and y_u(i) the entries of the 1 × t predicted quality vector. We can now use this quality prediction individually for each slice or generate one quality value for the whole video sequence by pooling. 2D-PCR has so far been used for video quality metrics in [25].
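
A compact sketch of the 2D-PCR training and prediction steps (9)-(13) under the conventions above; the function names are ours, and the ground truth y is used for every slice:

```python
import numpy as np

def train_2dpcr(X, y, g):
    """2D-PCR, cf. (9)-(12): X is an n x m x t feature cube, y an n-vector."""
    n, m, t = X.shape
    Xc = X - X.mean(axis=0)                        # center every slice
    # average covariance (scatter) matrix over time, cf. (9)
    S = sum(Xc[:, :, i].T @ Xc[:, :, i] for i in range(t)) / t
    P_g = np.linalg.svd(S)[2][:g].T                # first g PCs (loadings)
    yc = y - y.mean()
    B = np.zeros((m, t))                           # one weight vector per slice
    for i in range(t):
        T_g = Xc[:, :, i] @ P_g                    # slice scores, cf. (10)
        c = np.linalg.pinv(T_g.T @ T_g) @ T_g.T @ yc   # cf. (11)
        B[:, i] = P_g @ c                          # feature-space weights, cf. (12)
    return B

def predict_2dpcr(X_u, B):
    """Per-slice prediction, cf. (13); X_u is the m x t feature array of one
    sequence (centering with the training means is omitted for brevity)."""
    return np.array([X_u[:, i] @ B[:, i] for i in range(B.shape[1])])

# one value for the whole sequence, e.g. by averaging the slice predictions:
# quality = predict_2dpcr(X_u, B).mean()
```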

X. CROSS VALIDATION

In general, data analysis methods require a training phase or a training set. One important aspect is to employ a separate data set for the validation of the designed metric: using the same data set for training and validation will usually give us misleading results. Not surprisingly, our metric performs excellently with its training data; for unknown video sequences, on the other hand, the prediction quality could be very bad.

Fig. 6: Cross validation: splitting up the data set into a training and a validation set (left), building different models with each combination (center) and quality prediction of a metric (right)

But as mentioned previously, the data in video quality metrics is usually expensive to generate, as we have to conduct subjective tests. Hence we cannot really afford to use only a subset of all available data for the model building: the more training data we have, the better the prediction abilities of our metric will be.

This problem can be partially avoided by performing a cross validation, e.g. leave-one-out. This allows us to use all available data for training, but also to use the same data set for the validation of the metric. Assuming we have i video sequences with different content, we use i − 1 video sequences for training and the left-out sequence for validation; all in all, we eventually get i models. This is illustrated in Fig. 6. The general model can then be obtained by different methods, e.g. by averaging the weights over all models or by selecting the model with the best prediction performance. For more information on cross validation in general we refer to [13] and [26].
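
A sketch of leave-one-out cross validation, reusing the illustrative train_linear and predict_linear functions from the earlier linear regression sketch:

```python
import numpy as np

def leave_one_out(X, y, train, predict):
    """Train n models, each with one sequence left out, and collect the
    prediction for each left-out sequence."""
    n = X.shape[0]
    predictions = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i              # all sequences except the i-th
        model = train(X[mask], y[mask])
        predictions[i] = predict(X[i], *model)
    return predictions

# e.g.: predictions = leave_one_out(X, y, train_linear, predict_linear)
```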

XI. USING DATA ANALYSIS: AN EXAMPLE METRIC

But do more dimensions really help us in designing better video quality metrics? In order to compare the approaches to data analysis presented in this work, we design in this section, as an example, a simple metric for estimating the visual quality of coded video with each method.

The cheapest objective data available for an encoded video can be found directly in its bitstream. Even though we do not know a priori which of the bitstream's properties are more, and which are less, important, we can safely assume that they are related in some way to the perceived visual quality. How they are related will be determined by data analysis. In this example, we use videos encoded with the popular H.264/AVC standard, currently used in many applications from high-definition HDTV to internet-based IPTV. For each frame, we extract 16 different features describing the partitioning into different block sizes and types, the properties of the motion vectors and lastly the quantization, similar to the metric proposed in [27]. Each frame is thus represented by a 1 × 16 feature vector x. Note that no further preprocessing of the bitstream features was done. Alternatively, one can also extract features independent of the used coding technology, e.g. blocking or blurring, as described in [16].

Certainly, we also need subjective quality values for these encoded videos as ground truth in order to perform the data analysis. Different methodologies, as well as the requirements on the test set-up and equipment for obtaining this data, are described in international standards, e.g. ITU-R BT.500 or ITU-T P.910. Another possibility is to use existing, publicly available datasets containing both the encoded videos and the visual quality values. One advantage of using such datasets is that different metrics can be compared more easily.

For this example, we will use a dataset provided by IT-IST [28]. It consists of eleven videos in CIF resolution (352 × 288) at a frame rate of 30 frames per second, as shown in Fig. 7. They cover a wide range of different content types, at bitrates from 64 kbit/s to 2000 kbit/s, providing a wide visual quality range with in total n = 52 data points, leading to a 52 × 1 quality vector y. According to [28], the test was conducted using the DCR double stimulus method described in ITU-T P.910: for each data point, the test subjects were shown the undistorted original video, followed by the distorted encoded video, and were then asked to assess the impairment of the coded video with respect to the original on a discrete five-point mean opinion score (MOS) scale from 1, very annoying, to 5, imperceptible. For more information on H.264/AVC in general we refer to [29], and for the H.264/AVC feature extraction to [27]. A comprehensive list of publicly available datasets is provided at [30].

XII. MORE DIMENSIONS ARE REALLY BETTER

Finally, we compare the four video quality metrics, each designed with one of the presented methods. Using a cross validation approach, we design eleven different models for each method. Each model is trained on ten video sequences, and the left-out sequence is then used for the validation of the model built with this training set. Hence, we can measure the prediction performance of the models for unknown video sequences.

The performance of the different models is compared by calculating the Pearson correlation and the Spearman rank order correlation between the subjective visual quality and the quality predictions.


Fig. 7: Test videos from top to bottom, left to right: City, Coastguard, Container, Crew, Football, Foreman, Mobile, Silent, Stefan, Table Tennis and Tempete.

The Pearson correlation gives an indication of the prediction accuracy of the model, while the Spearman rank order correlation indicates how much the ranking of the sequences changes between the predicted and the subjective quality. Additionally, we determine the root mean squared error (RMSE) between prediction and ground truth, as well as the percentage of predictions that fall outside the quality scale from 1 to 5.
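
These measures are straightforward to compute; a short sketch using scipy.stats for the correlations:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def evaluate(y_true, y_pred):
    """Pearson and Spearman correlation, RMSE, and the ratio of
    predictions outside the MOS scale from 1 to 5."""
    pearson = pearsonr(y_true, y_pred)[0]
    spearman = spearmanr(y_true, y_pred)[0]
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    outside = np.mean((y_pred < 1) | (y_pred > 5))
    return pearson, spearman, rmse, outside
```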

By comparing the results in Fig. 8 and Table I, we can see that a better inclusion of the temporal dimension in the model building helps to improve the prediction quality. Note that this improvement was achieved very easily: we did nothing but change the data analysis method. In each step we exploit the variation in our data better, first just within our temporally pooled features with the step from multiple linear regression to PCR, then with the step into the third dimension with unfolding and 2D-PCR.

Fig. 8: Comparison of the presented methods (linear regression, PCR, unfolding + PCR and 2D-PCR): subjective quality vs. predicted quality on a mean opinion score (MOS) scale from 1 to 5, worst to best quality.

TABLE I: Performance measurements: Pearson correlation, Spearman rank order correlation and RMSE. Additionally, the ratio of quality predictions outside of the given scale.

Method              Pearson correlation   Spearman correlation   RMSE   Outside scale
linear regression   0.72                  0.72                   1.04   15%
PCR                 0.80                  0.82                   0.81   13%
unfolding + PCR     0.75                  0.83                   0.82   10%
2D-PCR              0.89                  0.94                   0.59   6%

XIII. SUMMARY

In this work, we provide an introduction to the world of data analysis and especially to the benefits of multidimensional data analysis in the design of video quality metrics. We have seen in our example that, even with a very basic metric, using multidimensional data analysis can increase the performance of predicting the Quality of Experience significantly. Although the scope of this introduction covered only the quality of video, the proposed methods can obviously be extended to more dimensions and/or other areas of application. It is interesting to note that the dimensions need not necessarily be spatial or temporal, but may also represent different modalities or perhaps even a further segmentation of the existing feature spaces.

REFERENCES

[1] Z. Wang and A. Bovik, "Mean squared error: Love it or leave it? A new look at signal fidelity measures," IEEE Signal Processing Magazine, vol. 26, no. 1, pp. 98–117, Jan. 2009.

[2] S. Winkler, Digital Video Image Quality and Perceptual Coding. CRC Press, 2006, ch. Perceptual Video Quality Metrics - A Review, pp. 155–179.

[3] S. J. Daly, Digital Images and Human Vision. MIT Press, 1993, ch. The visible differences predictor: An algorithm for the assessment of image fidelity, pp. 179–206.

[4] J. Lubin, Vision Models for Target Detection and Recognition. World Scientific Publishing, 1995, ch. A visual discrimination model for imaging system design and evaluation, pp. 245–283.

[5] J. Lubin and D. Fibush, Sarnoff JND vision model, T1A1.5 Working Group, ANSI T1 Standards Committee Std., 1997.

[6] S. Winkler, Digital Video Quality - Vision Models and Metrics. Wiley & Sons, 2005.

[7] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600–612, Apr. 2004.

[8] S. Wolf and M. H. Pinson, "Spatial-temporal distortion metric for in-service quality monitoring of any digital video system," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Multimedia Systems and Applications II, vol. 3845, Nov. 1999, pp. 266–277.

[9] A. B. Watson, "Toward a perceptual video-quality metric," in Society of Photo-Optical Instrumentation Engineers (SPIE) Conference Human Vision and Electronic Imaging III, vol. 3299, Jan. 1998, pp. 139–147.

[10] B. A. Wandell, Foundations of Vision. Sinauer Associates, 1996.

[11] H. R. Wu and K. R. Rao, Eds., Digital Video Image Quality and Perceptual Coding. CRC Press, 2006.

[12] Z. Wang and A. C. Bovik, Modern Image Quality Assessment, ser. Synthesis Lectures on Image, Video, and Multimedia Processing. Morgan & Claypool Publishers, 2006.

[13] H. Martens and M. Martens, Multivariate Analysis of Quality. Wiley & Sons, 2001.

[14] M. Miyahara, "Quality assessments for visual service," IEEE Communications Magazine, vol. 26, no. 10, pp. 51–60, 81, Oct. 1988.

[15] I. Jolliffe, Principal Component Analysis. Springer, 2002.

[16] T. Oelbaum, C. Keimel, and K. Diepold, "Rule-based no-reference video quality evaluation using additionally coded videos," IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 294–303, Apr. 2009.

[17] F. Westad, K. Diepold, and H. Martens, "QR-PLSR: Reduced-rank regression for high-speed hardware implementation," Journal of Chemometrics, vol. 10, pp. 439–451, 1996.

[18] K. Seshadrinathan and A. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, Feb. 2010.

[19] A. Ninassi, O. Le Meur, P. Le Callet, and D. Barba, "Considering temporal variations of spatial visual distortions in video quality assessment," IEEE Journal of Selected Topics in Signal Processing, vol. 3, no. 2, pp. 253–265, Apr. 2009.

[20] C. Keimel, T. Oelbaum, and K. Diepold, "Improving the prediction accuracy of video quality metrics," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2010), Mar. 2010, pp. 2442–2445.

[21] A. Cichocki, R. Zdunek, A. H. Phan, and S.-I. Amari, Nonnegative Matrix and Tensor Factorizations: Applications to Exploratory Multi-way Data Analysis and Blind Source Separation. Wiley & Sons, 2009.

[22] A. Smilde, R. Bro, and P. Geladi, Multi-way Analysis: Applications in the Chemical Sciences. Wiley & Sons, 2004.

[23] P. M. Kroonenberg, Applied Multiway Data Analysis. Wiley & Sons, 2008.

[24] J. Yang, D. Zhang, A. Frangi, and J. Yang, "Two-dimensional PCA: A new approach to appearance-based face representation and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 131–137, Jan. 2004.

[25] C. Keimel, M. Rothbucher, and K. Diepold, "Extending video quality metrics to the temporal dimension with 2D-PCR," in Image Quality and System Performance VIII, S. P. Farnand and F. Gaykema, Eds., vol. 7867. SPIE, Jan. 2011.

[26] E. Anderssen, K. Dyrstad, F. Westad, and H. Martens, "Reducing over-optimism in variable selection by cross-model validation," Chemometrics and Intelligent Laboratory Systems, vol. 84, no. 1-2, pp. 69–74, 2006.

[27] C. Keimel, J. Habigt, M. Klimpke, and K. Diepold, "Design of no-reference video quality metrics with multiway partial least squares regression," in IEEE International Workshop on Quality of Multimedia Experience (QoMEX 2011), Sep. 2011, pp. 49–54.

[28] T. Brandão and M. P. Queluz, "No-reference quality assessment of H.264/AVC encoded video," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 11, pp. 1437–1447, Nov. 2010.

[29] T. Wiegand, G. J. Sullivan, G. Bjøntegaard, and A. Luthra, "Overview of the H.264/AVC Video Coding Standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, Jul. 2003.

[30] S. Winkler. (2011, Jul.) Image and video quality resources. [Online]. Available: http://stefan.winkler.net/resources.html
