
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 22, NO. 7, JULY 2013 2711

Sparse/DCT (S/DCT) Two-Layered Representation of Prediction Residuals for Video Coding

Je-Won Kang, Moncef Gabbouj, Fellow, IEEE, and C.-C. Jay Kuo, Fellow, IEEE

Abstract—In this paper, we propose a cascaded sparse/DCT (S/DCT) two-layer representation of prediction residuals, and implement this idea on top of the state-of-the-art High Efficiency Video Coding (HEVC) standard. First, a dictionary is adaptively trained to contain featured patterns of residual signals so that a large portion of the energy in a structured residual can be efficiently coded via sparse coding. It is observed that the sparse representation alone is less effective in R-D performance at higher bit rates due to the side-information overhead. To overcome this problem, the DCT representation is cascaded at the second stage and applied to the remaining signal to improve coding efficiency. The two representations successfully complement each other. Experimental results demonstrate that the proposed algorithm outperforms the HEVC reference codec HM 5.0 under the Common Test Conditions.

Index Terms—ρ-domain rate model, discrete cosine transform (DCT), high efficiency video coding (HEVC), multilayered coding, overcomplete-dictionary-based video coding, residual coding, sparse representation.

I. INTRODUCTION

THE sparse representation expresses a signal as a linear combination of very few elementary atoms in a dictionary. It has been intensively studied in recent years and applied to various image/video processing applications [1]–[3]. For example, it was used in [4] for image decomposition in morphological component analysis. The two-dimensional (2D) separable discrete cosine transform (DCT) has been used in practical image/video coding standards for several decades. The new video coding standard, High Efficiency Video Coding (HEVC), also adopts the 2D DCT as the core transform [5]. In this work, we propose a novel two-layered transform scheme that consists of two transforms in cascade, namely, a sparse-representation-based transform and the 2D DCT.

In the JPEG image coding standard, the DCT is applied to image sources, which can be well approximated by a first-order stationary Markov model with strong inter-pixel correlation. However, in the context of video coding, the DCT is applied to prediction residuals after inter or intra prediction. The residual signals often contain strong gradient components caused by motion-compensated prediction errors along edges or object boundaries, and exhibit a more dynamic correlation between neighboring pixels [6], [7]. The 1D horizontal and vertical sinusoidal bases cannot efficiently represent these slant features and thus lead to undesired degradation such as ringing artifacts.

Manuscript received July 22, 2012; revised December 18, 2012; accepted March 24, 2013. Date of publication April 4, 2013; date of current version May 16, 2013. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Joan Serra-Sagrista.

J.-W. Kang is with the Multimedia R&D and Standard Team, Qualcomm Technologies, Inc., San Diego, CA 92121 USA (e-mail: [email protected]).

M. Gabbouj is with the Department of Signal Processing, Tampere University of Technology, Tampere 33720, Finland (e-mail: [email protected]).

C.-C. J. Kuo is with the Ming Hsieh Department of Electrical Engineering and Signal and Image Processing Institute, University of Southern California, Los Angeles, CA 90089-2564 USA (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2013.2256917

Several modified DCT schemes have been proposed to address the non-stationary property of prediction residuals, taking the slant features of residual signals into account. One idea was to apply an adaptive transform [8], [9], which included directionally oriented basis functions, to directional components. In [10], the position and size of a transform block could be varied to localize prediction errors. Several contributions were made in the development of the transform core experiment (CE) on supplementary unitary transforms during HEVC standardization. The proposals were categorized into mode-dependent transforms and secondary transforms. In the current HEVC design, the 1D DCT and 1D DST are adaptively applied to the intra 4 × 4 transform unit (TU) based on the intra prediction mode [11]. In addition, a mode-dependent KLT and a boundary-adaptive transform were proposed in the CE [12], [13]. In [14], the 1D vertical/horizontal transforms were selectively skipped to yield a better coding gain. As a secondary transform, a rotational transform was proposed to move transform coefficients to low-frequency positions to facilitate entropy coding [15].

To allow a sparse representation, a dictionary usually contains a larger number of atoms than the dimension of the signal space. The Matching Pursuit (MP) algorithm [16] is often used as a greedy method for atom selection. Mathematical models created from a union of orthonormal bases, such as the wavelet and discrete Fourier transforms, have been used to build dictionaries. Directional and redundant wavelet packets in a dual-tree discrete wavelet transform were proposed in [17] to adapt to image contents, offering better coding performance than JPEG2000. Research on the use of an overcomplete dictionary for video coding was first conducted in [18], where a dictionary was built with modulated Gabor atoms. In [19], a sparsity constraint was imposed in the training process for a more compact representation with reduced side information. However, the coding performance of these methods was shown to be comparable with state-of-the-art codecs only at very low bit rates. Their coding performance degrades at higher bit rates due to the rapidly increasing amount of side information, e.g., the indices of atoms in a dictionary.

1057-7149/$31.00 © 2013 IEEE

In this work, we propose a two-layer transform with sparse representation and DCT (S/DCT), where the two transforms are applied in cascade, since a single transform is not efficient for coding all general types of prediction residuals. First, the sparse representation is exploited to encode the structured components of the residual signal, such as object boundaries. The remaining residual signal is coded by the DCT. It is worthwhile to point out that certain lossy-plus-residual schemes were proposed in [2], [20] in a similar spirit.

The rest of this paper is organized as follows. Related prior art is reviewed in Section II. The S/DCT algorithm is described in Section III, and its rate-distortion (R-D) analysis is presented in Section IV. Experimental results are given in Section V to demonstrate the effectiveness of the proposed S/DCT algorithm. Finally, concluding remarks and future work are given in Section VI.

II. REVIEW OF RESIDUAL CODING WITH SPARSE REPRESENTATION

It is an attractive idea to use a sparse representation to approximate a signal with significantly fewer atoms from a dictionary. Typically, the number of atoms in a dictionary D is considerably larger than the signal dimension N. Since the sparse representation is a key ingredient of the proposed algorithm, we give a quick review of the subject in this section.

For an input 2D signal y ∈ R^N (e.g., a block of N = n × n pixels), y can be represented as a superposition of a set of orthonormal basis functions denoted by T = [t_1, . . . , t_N]. Mathematically, it can be written as

y = ∑_{i=1}^{N} α_i t_i (1)

where α_i is a transform coefficient obtained as the inner product of y and t_i, denoted by ⟨y, t_i⟩.

Instead of (1), y can also be represented by elements, called atoms, from an overcomplete dictionary D ∈ R^{N×K}, where K > N. One often seeks a sparse solution for y by solving the following error-constrained optimization problem:

min ‖b‖₀ subject to ‖y − Db‖₂ ≤ δ (2)

where D contains K columns, b ∈ R^K is the coefficient vector for the columns of D, ‖b‖₀ is the l₀ norm of b, and δ is an error tolerance for the approximation. Typically, both D and b are unknowns.

A natural question arising from (2) is how to select the dictionary D to facilitate the sparse representation. The K-SVD algorithm [21] provides an effective technique to adapt a dictionary to training samples. To ensure a sparse solution, we add a sparsity constraint on b in (2), demanding that b have no more than C non-zero elements (called the C-sparsity), which leads to the following:

min_{b,D} ‖y − Db‖₂ subject to ‖b‖₀ ≤ C. (3)

The K-SVD algorithm offers an iterative procedure to update the coefficient vector b and the dictionary matrix D, and provides a good approximation of y for a small enough C [21]. The K-SVD is adopted for dictionary training in the proposed S/DCT algorithm.
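The sparse-coding step of Eq. (3) for a fixed dictionary can be sketched with a minimal Orthogonal Matching Pursuit: greedily pick the atom most correlated with the current residual, then re-fit all selected coefficients by least squares. The random dictionary below is purely illustrative; K-SVD would alternate this step with atom updates.

```python
import numpy as np

# Minimal OMP sketch for the sparsity-constrained problem of Eq. (3)
# with a *fixed* dictionary D. The random 2-overcomplete dictionary
# and the 2-sparse test signal are illustrative, not trained data.

def omp(D, y, C):
    """Return a coefficient vector b with at most C non-zeros."""
    residual, support = y.copy(), []
    b = np.zeros(D.shape[1])
    for _ in range(C):
        k = int(np.argmax(np.abs(D.T @ residual)))   # best-matching atom
        if k not in support:
            support.append(k)
        # Orthogonal step: least-squares re-fit on the selected atoms.
        coeffs, *_ = np.linalg.lstsq(D[:, support], y, rcond=None)
        residual = y - D[:, support] @ coeffs
    b[support] = coeffs
    return b

rng = np.random.default_rng(1)
D = rng.standard_normal((16, 32))            # K = 2N (2-overcomplete)
D /= np.linalg.norm(D, axis=0)               # unit-norm atoms
b_true = np.zeros(32)
b_true[[3, 17]] = [2.0, -1.5]                # exactly 2-sparse signal
y = D @ b_true

b = omp(D, y, C=4)
assert np.linalg.norm(y - D @ b) < 1e-8      # exact recovery here
```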

The sparse representation of motion-compensated residual signals was studied in dictionary-based video coding [19], [22]. The dictionary can be designed to contain typical structural patterns of a residual signal, e.g., directional components, learned from real prediction residuals. As a result, it offers a coding gain over conventional video codecs at low bit rates. However, due to rapidly increasing side information (i.e., the indices of atoms in D), the coding efficiency drops at high bit rates, when a larger number of atoms is needed. In [23], [24], the quality of compressive-sensed video was enhanced at the decoder by on-line learning from neighboring blocks in previously coded frames. Although there is no need to transmit side information, the decoder complexity increases significantly as a result of on-line learning.

Motivated by these observations, we propose a novel two-layered transform that allows a more efficient representation of prediction residuals in the next section. The sparse representation removes a large portion of the energy in the residual image with a few coefficients. Then, the DCT is adaptively used to transform the remaining residuals.

III. PROPOSED S/DCT SCHEME

A. Overview of S/DCT Scheme

A block diagram of the proposed S/DCT coding system is shown in Fig. 1. On top of the HEVC video coding scheme, the proposed algorithm includes two cascaded transforms applied to residual signals. The signals decomposed by the sparse representation and the DCT are denoted by y_F and y_S, respectively, in Fig. 1(a), and the quantized signals are y′_F and y′_S.

The residual signals undergo a two-step approximation. For the sparse representation, denoted by T1, we search for a sparse representation of y based on an off-line trained dictionary, using Orthogonal Matching Pursuit (OMP) [25] as the forward transform. The dictionary is well adapted to prediction residuals, which take the form of directional components and contain high DCT-frequency components. The sparse representation produces a coarse approximation of y using a few coefficients, denoted by y′_F. The input signal to the 2D DCT, y_S, is the quantized error between the source and the reconstruction obtained from the sparse representation. At the decoder end, y is reconstructed as the sum of y′_F and y′_S. Several coding modes are developed for multiple representations of y, and the best one is selected via R-D optimization. The mode information is transmitted to the decoder end.

A CABAC-based entropy coder is modified to encode the OMP coefficients and the associated side information. For the coding of DCT coefficients, the same entropy coding scheme as in the HM 5.0 reference software [26] is used. The main change is the addition of the sparse-representation-based transform, quantization, and coding. Indices of atoms and quantized coefficients are transmitted and used to reconstruct y′_F in the decoder. More details are given below.

Fig. 1. Block diagram of the proposed S/DCT technique. (a) Encoder block diagram. (b) Decoder block diagram.

Fig. 2. Samples of representative residual signals trained by the K-SVD algorithm.

B. Dictionary Training

A dictionary is created by training on prediction residual samples from an HM 5.0 encoder. As introduced in Eq. (3), the K-SVD [21] is employed to generate the dictionary, since it is known to provide a better approximation of a signal source than other clustering algorithms (e.g., k-means clustering) with the same number of atoms. The dictionary is learned off-line, and the same dictionary is stored in both the encoder and the decoder. Thus, the encoder does not need to transmit an atom, only its index. The dictionary is fixed in the codec for encoding video content. Nevertheless, an application-specific dictionary can be developed for particular applications, such as gaming or medical video, by changing the residual samples used in training.

We use block-adaptive dictionary training. Fig. 2 shows the dictionaries learned to transform an 8 × 8 TU and a 4 × 4 TU. Each image pattern represents an atom in a dictionary. Because the number of atoms is twice the corresponding signal dimension, the dictionary is called 2-overcomplete. For example, the dictionary for the 4 × 4 TU contains 32 atoms for representing a 4 × 4 block. HEVC also provides 16 × 16 and 32 × 32 TUs. However, the proposed scheme is not applied to TUs of larger sizes in our work, since the associated OMP complexity is too high. In sum, two different dictionaries, for the 8 × 8 TU and the 4 × 4 TU respectively, are stored in the codec. We address the complexity issue in Sec. V.

The DCT yields high-frequency components when representing line or curved patterns, which are often observed along object boundaries after prediction. A dictionary is learned to encode these structured patterns before the DCT is applied. To allow more efficient training, we add a classifier that rejects residual samples composed primarily of random noise and/or low-frequency components. The block-adaptive training scheme is shown in Fig. 3. The classifier operates on residuals obtained from the HM 5.0 encoder with the following considerations.

Fig. 3. Training residual samples with a classifier to create a block-adaptive dictionary.

First, the classifier accepts an input residual block only if its energy exceeds a threshold. It is observed in [6], [7] that the energy in a residual block is highly correlated with the strength of the gradient. Second, the DCT alone is efficient enough to encode a chroma component due to its homogeneous signal characteristics; thus, we apply S/DCT only to luma blocks, and chroma blocks are not included in the training process. Third, the size of a TU is considered. Once a particular TU size is selected as the best mode by the encoder, residual blocks of that size are used in the training process. For instance, if an 8 × 8 prediction unit and 4 × 4 TUs are selected as the best coding modes, as shown in Fig. 3, the four 4 × 4 residual blocks (marked by the red curves) are used to train the dictionary for the 4 × 4 TU.

It is shown in [27] that the solution of the L1 (or any other norm) optimization problem is identical to the L0 solution of the sparse representation if the number of atoms in the solution is less than a certain threshold. The threshold is determined as the reciprocal of the mutual incoherence, and ranges from 1 to √N [27], [28]. In the current context, we set C in Eq. (3) according to the block size (e.g., C = 4 for the 4 × 4 TU and C = 8 for the 8 × 8 TU) for dictionary training, based on experiments.


Fig. 4. Normalized energy distribution of the top k atoms of the 4 × 4 DCT and the sparse representation with and without a classifier.

In the following, we give a simple example to demonstrate the superior energy-compaction capability of a trained dictionary over the 4 × 4 DCT. In this test, we set the sparsity to 4 to create the dictionary, and we use different sequences for training and testing. The detailed training configuration is explained in Sec. V. The normalized energy values of the largest k atoms, where k = 1, 2, 3, . . ., of the 4 × 4 DCT and the sparse representation are shown in Fig. 4. We see from Fig. 4 that the two largest components of the sparse representation capture more than 80% of the energy of the whole block.
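The underlying energy-compaction argument can be illustrated with toy blocks (not the paper's test data): a smooth block is compacted into very few 2-D DCT coefficients, while a diagonal-edge residual, the kind of structured pattern a trained dictionary targets, spreads its energy across many DCT coefficients.

```python
import numpy as np

# Toy illustration of the energy-compaction idea behind Fig. 4:
# fraction of total energy captured by the top-k coefficients of an
# orthonormal 4x4 2-D DCT. The smooth ramp and the diagonal-edge
# block below are synthetic examples, not data from the paper.

def dct_matrix(N):
    """Orthonormal 1-D DCT-II matrix; columns are basis vectors."""
    n, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    T = np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    T[:, 0] *= np.sqrt(1.0 / N)
    T[:, 1:] *= np.sqrt(2.0 / N)
    return T

def top_k_energy(block, k):
    """Energy fraction in the k largest 2-D DCT coefficients."""
    T = dct_matrix(block.shape[0])
    coeffs = T.T @ block @ T                 # separable 2-D DCT
    e = np.sort(coeffs.ravel() ** 2)[::-1]
    return e[:k].sum() / e.sum()

smooth = np.add.outer(np.arange(4.0), np.arange(4.0))  # smooth ramp
edge = np.fliplr(np.tri(4)) * 2 - 1          # +/-1 anti-diagonal step

# A smooth block is compacted into very few DCT coefficients...
assert top_k_energy(smooth, 3) > 0.99
# ...while a diagonal edge spreads energy across many coefficients.
assert top_k_energy(edge, 3) < 0.8
```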

C. Two-Layered Transform and Its Residual Analysis

Consider an overcomplete dictionary D with K atoms, i.e., D = [d_1, . . . , d_K]. A signal y ∈ R^N is represented by a linear combination of the column vectors of D with a coefficient vector b = [β_1, . . . , β_K]^T. With an initial residual r^(0) = y, the Orthogonal Matching Pursuit (OMP) process builds the sparse approximation stepwise. In stage k, OMP finds the atom d_{i_k} ∈ R^N that is best correlated with r^(k−1). The problem is simply solved by

i_k ∈ arg max_{i∈I} ⟨r^(k−1), d_i⟩ (4)

where I is an index set, ⟨·, ·⟩ is the inner product, and the quantized coefficients Q(β_{i_1}), Q(β_{i_2}), . . . , Q(β_{i_k}) are obtained from the least-squares minimization over the atoms selected up to stage k. Mathematically, we have

arg min_{i,β} ‖y − ∑_{j=1}^{k} Q(β_{i_j}) d_{i_j}‖² (5)

where d_{i_j} is the column vector whose absolute inner product is maximum, and Q(β_{i_j}) is the quantized coefficient at each step j. The quantization is coupled with OMP in each stage so that the quantization error from the previous stages is considered in the selection of the kth atom. This modification improves the behavior of OMP. As a result, y can be approximated using C non-zero atoms in the form

y = ∑_{k=1}^{C} Q(β_{i_k}) d_{i_k} + r^(C). (6)

To encode r^(C), we adopt the DCT representation. As a result, we represent the residual signal with atoms from two dictionaries:

y ≈ ∑_{k=1}^{C′} Q(β_{i_k}) d_{i_k} + ∑_{j=1}^{N} Q(α_j) t_j (7)

where N is the signal dimension, C′ is an integer between 1 and C, and t_j is a DCT basis function. This leads to the proposed two-layered transform. Note that C′ is not fixed but is controlled by the encoder so that each block is coded more efficiently. The idea is to provide a more flexible representation for general source characteristics.
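The two-layer representation of Eq. (7) can be sketched end to end. The sketch below assumes a uniform mid-tread quantizer with step q and a random unit-norm dictionary; both are illustrative stand-ins for the codec's trained dictionary and quantizer design.

```python
import numpy as np

# Sketch of Eq. (7): a few quantized dictionary atoms (layer 1, via a
# greedy matching-pursuit pass as in Eq. (6)) plus a quantized 2-D DCT
# of the remainder (layer 2). The uniform quantizer and the random
# dictionary are illustrative assumptions, not the paper's design.

def dct_matrix(N):
    n, k = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
    T = np.cos(np.pi * (2 * n + 1) * k / (2 * N))
    T[:, 0] *= np.sqrt(1.0 / N)
    T[:, 1:] *= np.sqrt(2.0 / N)
    return T

def quantize(x, q):
    return q * np.round(x / q)

rng = np.random.default_rng(2)
N, K, C, q = 16, 32, 4, 0.25                 # 4x4 block, 2-overcomplete
D = rng.standard_normal((N, K))
D /= np.linalg.norm(D, axis=0)
y = rng.standard_normal(N)

# Layer 1: greedy atom selection with quantized coefficients (Eq. (6));
# the residual carries the quantization error into the next stage.
r, y_f = y.copy(), np.zeros(N)
for _ in range(C):
    k = int(np.argmax(np.abs(D.T @ r)))
    beta_q = quantize(D[:, k] @ r, q)
    y_f += beta_q * D[:, k]
    r = r - beta_q * D[:, k]

# Layer 2: 2-D DCT of the remaining residual, then quantization.
T = dct_matrix(4)
alpha_q = quantize(T.T @ r.reshape(4, 4) @ T, q)
y_s = (T @ alpha_q @ T.T).ravel()

y_rec = y_f + y_s                            # decoder-side sum, Eq. (7)
assert np.max(np.abs(y - y_rec)) <= 2 * q    # bounded by DCT Q error
```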

To explain the proposed S/DCT scheme, we show the transform/quantization and entropy coding modules of the encoder and the decoder in Fig. 1(a) and (b), respectively. At the encoder end, the input residual is decomposed into y_F and y_S, as shown in Fig. 1(a). For T1, the dictionary-based transform is used to build y_F. As a result of the transform, the indices of the selected atoms, I_{C′}, and the quantized transform coefficients, Q(β), are transmitted. Furthermore, y is coarsely reconstructed as y_F and quantized to y′_F. The encoder can control y_F by adjusting C′; how to decide the number of atoms to be transmitted is shown in the next section. For T2, the remaining residual is successively transformed by the DCT and quantized, yielding quantized DCT coefficients Q(α). All data to be transmitted are fed into an entropy coder, whose design is explained in the next subsection. At the decoder end, the reconstruction process is done in the reverse order and can be processed in parallel to reduce latency, as shown in Fig. 1(b). The transmitted bit streams are parsed for I_{C′}, Q(α), and Q(β) to yield the reconstructed signals y′_F and y′_S. Finally, the sum of these two reconstructed signals is the desired output y′.

D. Entropy Coder Design

The same CABAC as in the HM 5.0 reference software is used to encode the quantized DCT coefficients [29]. They are scanned diagonally to form a 1D array. The CABAC encodes: i) the last position of the non-zero coefficients; ii) a significance map indicating the positions of non-zero coefficients; and iii) the quantized level values. In last-position coding, the (x, y)-coordinate of the position in a TU is coded directly. Significance-map coding is performed in several stages: a 4 × 4 sub-block is scanned first and, if the sub-block includes at least one non-zero entry, the non-zeros are further scanned within the sub-block. For level coding, greater-than-one and greater-than-two flags are used to encode the level values, with their signs, efficiently.

Fig. 5. Histograms of (a) indices of selected atoms and (b) magnitudes of quantized coefficients.

For the entropy coding of the sparse representation, we modify the CABAC to encode the indices of atoms and their quantized coefficients. Statistical properties of the symbols are shown in Fig. 5 to motivate the encoder design. For index coding, we adopt a fixed-length code (FLC), which is what Huffman coding yields here, since the histogram of atom indices is approximately uniform, as shown in Fig. 5(a). The binarization of the indices depends on the TU size and the dictionary training: for an overcomplete dictionary of size K, the binarization length is log₂(K). For level coding, we perform a chi-square test and find that the Laplacian distribution provides a good fit to the histogram curve in Fig. 5(b). Thus, a progressive binarization (i.e., a truncated unary code combined with an Exponential-Golomb code) is employed to encode the levels, and the result is then fed to the CABAC.

IV. RATE-DISTORTION ANALYSIS

A. Rate and Distortion Models

We perform a rate-distortion (R-D) analysis of the sparse transform and the DCT. We adopt the ρ-domain analysis [30], which estimates the bit rate from the portion of quantized non-zero coefficients, since the rate model of the proposed sparse representation can be established more conveniently in this domain.

We use μ to denote the ratio of the number of non-zero coefficients to the sample size. R_F(μ) is a bit-rate estimator for the sparse representation, and H_I^F and H_M^F are the entropies of the atom indices and coefficients, respectively. Then, we have

R_F = ω_I H_I^F + ω_M H_M^F (8)

where ω_I and ω_M are two scaling factors. As a result of the uniform distribution of atom indices, we get

H_I^F = −∑_{i=1}^{C} p_I(i) log₂ p_I(i) = μ log₂ K (9)

where K is the dictionary size, C is the number of non-zero coefficients, and the probability mass function (pmf) of the indices is 1/K. The side information is linearly proportional to μ, as shown in (9).
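A worked instance of Eq. (9): with a uniform index pmf over a dictionary of size K, the entropy of one index is exactly log₂(K) bits, so the index side information grows linearly with the number of transmitted atoms.

```python
import numpy as np

# Worked instance of Eq. (9): uniformly distributed atom indices over
# a dictionary of size K cost log2(K) bits each, so the index side
# information is linear in the number of non-zero coefficients.

K = 32                                       # 2-overcomplete 4x4 dictionary
p = np.full(K, 1.0 / K)                      # uniform index pmf
h_per_index = -np.sum(p * np.log2(p))        # entropy of one index
assert np.isclose(h_per_index, np.log2(K))   # = 5 bits for K = 32
```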

We use p_M to denote the pmf of the quantized coefficients and m_k the quantized value of the kth atom in the sparse representation. p_M(m_k) can be approximated by

p_M(m_k) = Pr(m_k − qδ < M′ < m_{k+1} − qδ) ≈ f_{M′}(m_k) q (10)

where M′ is a Laplacian random variable with pdf f_{M′}, as supported by Fig. 5(b), q is the quantization step size, and δ accounts for the rounding effect. Then, the rate model of the quantized coefficients of the atoms can be written as

H_M^F = −∑_{k=1}^{L} p_M(m_k) log₂ p_M(m_k)
 ≈ −∑_{k=1}^{L} f_{M′}(m_k) q log₂( f_{M′}(m_k) q )
 = −∑_{k=1}^{L} (q/(2σ)) e^{−|m_k|/σ} log₂( (q/(2σ)) e^{−|m_k|/σ} )
 = −log₂ q − ∑_{k=1}^{L} (1/(2σ)) e^{−|m_k|/σ} log₂( (1/(2σ)) e^{−|m_k|/σ} ) q
 ≈ −log₂ q + H_{M′}^F (11)

where L is the number of intervals, σ > 0 is the scale parameter of the Laplacian pdf, and H_{M′}^F is defined by

H_{M′}^F = −∫_{−∞}^{∞} f_{M′} log₂ f_{M′} dm = −E[log₂ f_{M′}] = 1 + ln(2σ) (12)

and is called the differential entropy [31]. The R-q model in (11) is converted to an R-μ model using

μ = Pr(m_k ≠ 0 | q, σ) = 1 − ∫_{−(1−δ)q}^{(1−δ)q} (1/(2σ)) e^{−|m|/σ} dm = e^{−(1−δ)q/σ}. (13)
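Eq. (13) can be sanity-checked numerically: for a Laplacian source and a dead zone of width 2(1 − δ)q around zero, the fraction of non-zero quantized coefficients is e^{−(1−δ)q/σ}. The parameter values below are illustrative.

```python
import numpy as np

# Monte Carlo check of Eq. (13): for a Laplacian source with scale
# sigma and a dead zone |x| < (1 - delta) * q mapping to zero, the
# fraction of non-zero quantized coefficients is
# mu = exp(-(1 - delta) * q / sigma). Parameter values are illustrative.

rng = np.random.default_rng(3)
sigma, q, delta = 1.0, 1.0, 1.0 / 6.0
x = rng.laplace(scale=sigma, size=1_000_000)

mu_empirical = np.mean(np.abs(x) >= (1 - delta) * q)
mu_model = np.exp(-(1 - delta) * q / sigma)      # Eq. (13)

assert abs(mu_empirical - mu_model) < 5e-3
```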

With (11) and (13), we obtain

H_M^F = log₂( (1 − δ) / (σ ln(1/μ)) ) + 1 + ln(2σ) = log₂( 1 / ln(1/μ) ) + S (14)

where S is a constant. Finally, the rate model R_F is given by

R_F(μ) = ω_I μ log₂ K + ω_M ( log₂( 1 / ln(1/μ) ) + S ). (15)

Since μ is small as a result of the sparse representation, the second term on the right-hand side of Eq. (15) grows slowly with respect to μ compared with the first term. The rate model can therefore be further simplified as

R_F(μ) = θμ + ξ (16)

where θ and ξ are model parameters. Meanwhile, He et al. showed that the rate of the DCT can also be estimated by a linear function of the number of non-zero coefficients [30], [32]. We apply this model to the DCT coefficients as follows:

R_S(μ) = ημ + ζ (17)

where η and ζ are the model parameters. The two rate models in (16) and (17) are compared in Fig. 6.

Fig. 6. Comparison of the derived rate models for sparse coding and DCT coding in a 4 × 4 TU, with two different variance values: (a) σ < 10 and (b) 10 ≤ σ.

For the distortion model, we assume that the source x follows the Laplacian distribution with parameter σ. For a uniform quantizer with step size q and rounding offset δ, with a dead zone, the distortion can be derived as

D = ∑_{i=−∞}^{∞} ∫_{(i−δ)q}^{(i+1−δ)q} |x − iq|² f(x) dx
 = ∫_{−(1−δ)q}^{(1−δ)q} x² f(x) dx + 2 ∑_{i=1}^{∞} ∫_{(i−δ)q}^{(i+1−δ)q} |x − iq|² f(x) dx
 = ( 2σ² − 2σ² e^{−q/σ} + (2δq − q − 2σ) q e^{(δ−1)q/σ} ) / ( 1 − e^{−q/σ} ) (18)

which allows the distortion to be estimated with respect to μ using Eq. (13).
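The closed form in Eq. (18) can be verified against a Monte Carlo simulation of the dead-zone quantizer: each sample x in [(i − δ)q, (i + 1 − δ)q) is reconstructed as iq. Parameter values are illustrative.

```python
import numpy as np

# Monte Carlo check of Eq. (18): Laplacian source, uniform quantizer
# with step q, rounding offset delta, and a dead zone (reconstruction
# i*q for (i - delta)*q <= x < (i + 1 - delta)*q, so |x| < (1 - delta)*q
# maps to zero). Parameter values are illustrative.

rng = np.random.default_rng(4)
sigma, q, delta = 1.0, 1.0, 1.0 / 6.0
x = rng.laplace(scale=sigma, size=2_000_000)

# Dead-zone quantizer applied symmetrically around zero.
levels = np.sign(x) * np.floor(np.abs(x) / q + delta)
d_empirical = np.mean((x - levels * q) ** 2)

# Closed form of Eq. (18).
d_model = (2 * sigma**2 - 2 * sigma**2 * np.exp(-q / sigma)
           + (2 * delta * q - q - 2 * sigma) * q
           * np.exp((delta - 1) * q / sigma)) / (1 - np.exp(-q / sigma))

assert abs(d_empirical - d_model) / d_model < 0.02
```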

B. Rate-Distortion Optimization

The estimated rate curves for the sparse representation and the DCT are shown in Fig. 6. We see from Fig. 6 that the rate model for the sparse representation increases faster than that for the DCT. This is due to the overhead of coding the atom indices, which is proportional to the number of non-zero coefficients. We also show curves for two different variance values in Fig. 6(a) and (b). A larger variance demands more coding bits, yet the gap between the two curves is smaller. When the number of non-zero coefficients is smaller, as in Fig. 6(b), the coding bits for the sparse representation and the DCT are closer. This indicates the efficiency of the sparse representation in a non-stationary block, since it offers improved energy compaction, as shown in Fig. 4.

Following this line of thought, we aim to find the optimal combination of the two representations mathematically. The rate-distortion optimization can be formulated based on Eq. (7), where the sparse representation builds a coarser version of a signal using atoms of sparsity C′. The remaining signal is coded with the DCT. An example of signal decomposition is shown in Fig. 7, where d_i denotes the atom that has the i-th largest inner product and r_i is the remainder at the i-th step. The DCT is applied to the remainder. The problem is to find the optimal i that minimizes the distortion D subject to the rate constraint R ≤ R_B.

We can convert the constrained optimization problem into an unconstrained optimization problem using the Lagrangian multiplier method. That is, we define a new cost function of the form

J = D + λ × R (19)

where R is the estimated bit-rate, including the coding bits for the DCT and the sparse representation, D is the MSE between the original signal y and the reconstructed signal y′ as shown in Fig. 1, and λ is a parameter called the Lagrangian multiplier. Since we want to minimize the cost with the optimal combination of the sparse representation and the DCT, the solution (i.e., an optimal index set I_C′) is given by

I_C′ = arg min_{I_C′ ∈ S_C} J (20)

where S_C denotes the collection of all possible index sets created from D within the given sparsity C.

However, we do not have a good model to estimate λ, and the computational complexity is too high to search for it on the fly. In practice, the following two ad hoc rules are found to be useful in selecting a good S/DCT scheme.

1) The number of selected atoms in the sparse representation is smaller than n/2, where n is the size of a TU.
2) The DCT alone can provide good R-D performance for a residual signal with a low variance value, so the sparse representation module can be by-passed.

We provide several coding modes based on these rules. The modes are incorporated into the codec, and they can be selected by the R-D optimization process in the HM software. The coding modes are presented in Table I. For mode 0, sparse coding is skipped and the residuals go through the 2D DCT only, as in HEVC. For modes greater than 0, the number of atoms is specified by the table. For example, for mode 3 in an 8 × 8 TU, a decoder knows that two atoms are used. Based on the mode representation, the contents can be coded or skipped by one specific transform. For instance, for mode 2, the syntax for the DCT coefficients is disabled. The coded block flag (CBF) in HEVC is used to mark non-zero entities in the two transforms. The proposed syntax is shown in Fig. 8.
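The codewords of Table I form a prefix code, so a decoder can identify the mode bit by bit. The following sketch is an illustrative reading of the table, not the actual HM syntax parsing:

```python
# Codeword-to-mode tables from Table I, keyed by TU size (4x4 or 8x8).
MODE_CODEWORDS = {
    4: {"0": 0, "10": 1, "11": 2},
    8: {"0": 0, "10": 1, "110": 2, "111": 3},
}

def parse_mode(bitstream, tu_size):
    """Read bits until a valid codeword is matched; return (mode, bits consumed)."""
    table = MODE_CODEWORDS[tu_size]
    prefix = ""
    for bit in bitstream:
        prefix += bit
        if prefix in table:           # prefix code: first match is the codeword
            return table[prefix], len(prefix)
    raise ValueError("truncated bitstream")
```

Because the code is prefix-free, mode 0 costs a single bit, which favors the DCT-only path in smooth regions.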

In HEVC, a residual quad-tree (RQT) [33] is employed to divide a CU into multiple TUs for the transform recursively. In the original RQT, the 2D DCT is performed in each leaf node as shown in Fig. 9. In the proposed algorithm, the R-D cost of each mode is computed in the leaf node of the RQT, and the optimal mode that provides the minimum Lagrangian cost is selected. We define the cost as

J_HM = D + λ_HM × (R_O + R_F + R_S) (21)

where R_O, R_F, and R_S are the actual coding bits for the overhead, the sparse representation (including the coding bits for the index and OMP coefficients), and the DCT, respectively, λ_HM is a function of QP as specified by HM 5.0, and D is the distortion. The minimization in Eq. (21) provides a sub-optimal but tractable solution to the problem in Eq. (20).
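The leaf-node mode decision of Eq. (21) amounts to evaluating each candidate mode and keeping the one with the minimum Lagrangian cost. A minimal sketch, with hypothetical distortion and rate numbers (not measured values):

```python
def select_mode(candidates, lam):
    """candidates: list of (mode, distortion, R_O, R_F, R_S); lam: lambda_HM.
    Returns the mode minimizing J_HM = D + lam * (R_O + R_F + R_S), Eq. (21)."""
    best_mode, best_cost = None, float("inf")
    for mode, dist, r_o, r_f, r_s in candidates:
        cost = dist + lam * (r_o + r_f + r_s)
        if cost < best_cost:
            best_mode, best_cost = mode, cost
    return best_mode, best_cost

# Hypothetical (mode, D, R_O, R_F, R_S) tuples for the 8 x 8 TU modes of Table I:
cands = [(0, 120.0, 1, 0, 40),
         (1, 80.0, 2, 18, 25),
         (2, 95.0, 3, 20, 0),
         (3, 70.0, 3, 34, 0)]
mode, cost = select_mode(cands, lam=2.0)
```

With these placeholder numbers, mode 2 (sparse only, DCT by-passed) wins despite not having the lowest distortion, illustrating how the λ-weighted rate term steers the decision.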

V. EXPERIMENTAL RESULTS

A. Dictionary Training

We create two dictionaries to process an 8 × 8 TU and a 4 × 4 TU. The same dictionary trained off-line is incorporated


Fig. 7. Signal decomposition example.

TABLE I

PROPOSED CODING MODES WITH THE S/DCT AND THEIR CODEWORDS

TU size | Mode | SR      | DCT     | Codeword | No. of atoms
4 × 4   | 0    | By-pass | Use     | 0        | 0
        | 1    | Use     | Use     | 10       | 1
        | 2    | Use     | By-pass | 11       | 1
8 × 8   | 0    | By-pass | Use     | 0        | 0
        | 1    | Use     | Use     | 10       | 1
        | 2    | Use     | By-pass | 110      | 1
        | 3    | Use     | By-pass | 111      | 2

Fig. 8. Syntax information for the proposed S/DCT.

into an encoder and a decoder so that it does not need to be transmitted. Since the dictionaries are designed for the purpose of encoding general video contents, the training set contains a number of video sequences with various resolutions. They are "Sun Flower," "Pedestrian," and "Blue Sky" of HD resolution (1920 × 1080), "Crew," "Soccer," and "City" of 4CIF resolution (704 × 576), and "Foreman," "Tempete," "Waterfall," and "Ice" of CIF resolution. Although the resolutions are different, the sample block sizes are either 8 × 8 or 4 × 4, as described in Sec. III-B. A great number of sample blocks (∼68,000) are used for training, since the computational complexity is of less concern in the off-line training process. The training video and the test video are two disjoint sets.

We denote the size of a dictionary, D, by K (i.e., D ∈ R^{N×K}) and the dimension of a block by N (= n × n pixels). We call D a γ-overcomplete dictionary with γ = K/N. We study the coding performance and the computational complexity of the training cost as a function of γ in Fig. 10, where the five data points are obtained with γ = 1, 2, 4, 8, and 16. Its y-axis is the BD-rate reduction [34] used to measure the average difference in coding bits. A negative (or positive) value means a reduction (or an increase) of coding bits as compared with the HM 5.0 reference software. Its x-axis is the complexity measured in units of T, which is the coding complexity when γ = 1. A dictionary of a smaller size will have a poorer approximation capability, while a dictionary of a larger size

Fig. 9. Residual quad-tree structure for a transform in HEVC.

will have higher overhead for index coding. Thus, a better choice for γ is 2 or 4. We adopt the 2-overcomplete dictionary for the experiments in Sec. V-B.
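The BD-rate metric [34] used on the y-axis of Fig. 10 is commonly computed by fitting cubic polynomials of log-rate versus PSNR for the two codecs and integrating their difference over the overlapping PSNR range. The sketch below follows that standard recipe; it is an illustration, not the authors' script:

```python
import numpy as np

def bd_rate(rates_ref, psnr_ref, rates_test, psnr_test):
    """Bjontegaard delta-rate [34]: average bit-rate difference (%) between two
    R-D curves, via cubic fits of log-rate as a function of PSNR."""
    lr_ref, lr_test = np.log(rates_ref), np.log(rates_test)
    p_ref = np.polyfit(psnr_ref, lr_ref, 3)
    p_test = np.polyfit(psnr_test, lr_test, 3)
    # integrate over the PSNR range common to both curves
    lo = max(min(psnr_ref), min(psnr_test))
    hi = min(max(psnr_ref), max(psnr_test))
    int_ref = np.polyval(np.polyint(p_ref), hi) - np.polyval(np.polyint(p_ref), lo)
    int_test = np.polyval(np.polyint(p_test), hi) - np.polyval(np.polyint(p_test), lo)
    avg_diff = (int_test - int_ref) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100  # negative => bit-rate saving
```

As a sanity check, a test codec whose rates are uniformly 5% below the reference at equal PSNR yields a BD-rate of about −5%.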

The computational complexity of the OMP process is O(C T_D + C(K + N) + C³) [35], [36], where C is the sparsity and T_D is the time for updating a residual in an iteration. By setting C and K to √N/4 and 2N, respectively, we have a complexity of O(n³), where n = √N. There are several efficient ways to implement OMP [36], [37]. Here, we adopt the technique in [36] to speed up the process. The idea is to use a pre-computed kernel matrix (instead of computing residuals) in all update stages so that the complexity of the first term can be saved. This is particularly useful when fewer dictionaries are needed for a large data set. Nevertheless, larger TU sizes such as 16 × 16 and 32 × 32 still demand higher complexity due to dictionaries of a larger size, while their coding performance improvement is not significant. In Fig. 12, we show the coding performance of the proposed algorithm (S/DCT) and its extension to the 16 × 16 and 32 × 32 TUs, denoted by "S/DCT(16)" and "S/DCT(32)," respectively. As a result, we choose to apply the S/DCT technique only to TUs of size 8 × 8 or 4 × 4.

We compare dictionary training with and without the block-adaptive scheme illustrated in Fig. 3. The block-adaptive scheme provides atoms containing sharper edge patterns, as shown in Fig. 11(a). In contrast, without the block-adaptive scheme, the atoms become flatter, as shown in Fig. 11(b).

B. R-D Performance Evaluation

In this section, we conduct experiments to show the R-D performance of the proposed algorithm, which is implemented


Fig. 10. BD-rate reduction and the computational complexity for a γ-overcomplete dictionary with γ = 1, 2, 4, 8, and 16.

Fig. 11. Trained dictionary sets: (a) with and (b) without the block-adaptive scheme, where the contrast of the atoms is adjusted for ease of visualization.

on top of the HM 5.0 reference software [26]. The test and training sets are disjoint. We compare simulation results obtained with the same software and configuration. In the following, we first consider the common test conditions and then other test conditions.

1) Common Test Conditions: Here, we adopt the "Main" configuration in the common test conditions. Several coding tools such as Asymmetric Motion Prediction (AMP) and the Sample Adaptive Offset (SAO) are enabled, and coding parameters such as the depth of the RQT are specified in the configuration. We also use the "Low Delay B (LDB)," "Low Delay P slices only (LDP)," and "Random Access (RA)" tests as different Group-of-Pictures (GOP) structures. They are, respectively, the IBBB..., IPPP..., and hierarchical B coding structures. All coding tools and parameters in the "Main" configuration are well defined in the common test conditions and HEVC software reference configurations [38]. We use test sequences with various resolutions from class B, class C, and class D, which are commonly used for the RA, LDB, and LDP tests. The resolution of each class is shown in Table II. The QP values are set to 22, 27, 32, and 37.

As shown in Table II, the proposed S/DCT algorithm outperforms HM 5.0 in R-D performance. The BD-rate saving (where a negative value in the table indicates a positive bit-rate saving) is about 0.99%, 1.35%, and 0.53% on average for the LDB-Main, LDP-Main, and RA-Main cases, respectively. The R-D performance of the proposed S/DCT and HM5.0 is compared in Fig. 12. We see from the R-D curves that the coding efficiency of the proposed algorithm is better over a broad range of bit-rates. The S/DCT algorithm performs best in the LDP-Main among the three configurations. The GOP structure of the LDP yields stronger

Fig. 12. R-D curves comparing the coding performance of the proposed method ("S/DCT"), HM5.0, the method of [19] ("S or DCT"), and two other dictionary-based video coding schemes without support from the DCT, using a 2-x overcomplete dictionary ("S(2-x) only") and a 4-x overcomplete dictionary ("S(4-x) only"). The proposed algorithm with larger TU sizes is presented as "S/DCT(16)" and "S/DCT(32)." The LDP configuration was used with the four QPs (22, 27, 32, and 37), and the number of coded frames is 100.

directional features in boundary edges due to the unidirectional prediction of P slices, and the S/DCT algorithm adapts well to such non-stationary regions. In comparison, bi-directional prediction tends to be more accurate, which may limit S/DCT's efficiency. The proposed algorithm yields better coding performance for lower-resolution video such as WVGA and WQVGA. The BD-rate saving is up to 1.95% in class D, since TUs of a smaller size are selected. For example, the BD-rate saving for "BQSquare" (2.9%) is significantly higher than for video of higher resolution. As to the complexity, the S/DCT scheme has a complexity of about twice that of the benchmark because of the OMP process and mode selection.

The orange region in Fig. 13(a) adopts both the sparse-based transform and the DCT, while the gray region in Fig. 13(a) adopts the DCT only. The motion-compensated prediction residuals are shown in Fig. 13(b). We see that the orange region in Fig. 13(a) lies in the oblique part.

We compare the R-D performance of three other dictionary-based video coding schemes in Fig. 12. Either DCT


TABLE II

BD-RATE [34] (IN THE UNIT OF %) OF THE PROPOSED ALGORITHM WITH HM 5.0 AS THE BENCHMARK, WHERE A NEGATIVE VALUE REPRESENTS A SAVING AS COMPARED WITH THE BENCHMARK. THE TEST SEQUENCES ARE NATURAL VIDEO CLIPS FROM CLASSES B, C, AND D. THE TEST CONDITIONS AND CODING PARAMETERS ARE DEFINED IN [38]

Resolution                  | Sequences       | LDB-Main | LDP-Main | RA-Main
WQVGA (416 × 240, Class D)  | BasketballPass  | −1.6%    | −2.0%    | −1.0%
                            | BQSquare        | −2.3%    | −2.9%    | −1.5%
                            | BlowingBubbles  | −1.5%    | −2.1%    | −0.9%
                            | RaceHorses      | −0.6%    | −0.8%    | −0.2%
                            | Average (D)     | −1.50%   | −1.95%   | −0.9%
WVGA (832 × 480, Class C)   | BasketballDrill | −1.1%    | −1.4%    | −0.6%
                            | BQMall          | −1.8%    | −2.3%    | −0.8%
                            | PartyScene      | −1.5%    | −1.8%    | −0.7%
                            | RaceHorses      | −0.4%    | −0.7%    | −0.2%
                            | Average (C)     | −1.20%   | −1.55%   | −0.58%
HD (1920 × 1080, Class B)   | Kimono          | −0.2%    | −0.5%    | −0.0%
                            | ParkScene       | −0.3%    | −0.5%    | −0.1%
                            | Cactus          | −0.3%    | −0.6%    | −0.1%
                            | BasketballDrive | −0.5%    | −0.7%    | −0.3%
                            | BQTerrace       | −0.8%    | −1.2%    | −0.5%
                            | Average (B)     | −0.42%   | −0.70%   | −0.20%
                            | Total Average   | −0.99%   | −1.35%   | −0.53%
                            | Encoding Time   | 189%     | 211%     | 176%

Fig. 13. (a) The orange region adopts both the sparse-based transform and the DCT, while the gray regions adopt the DCT only, and (b) prediction residuals after motion compensation. The residual signals are scaled by four for ease of visualization. The captured frame is from the "BlowingBubbles" sequence.

or a sparse representation is used in the method proposed in [19] for residual coding, while a two-layered representation is used in the proposed S/DCT. The coding performance of the method in [19] is denoted by "S or DCT" in Fig. 12. We see that the coding gain of "S or DCT" is comparable with the proposed S/DCT scheme at lower bit-rates. However, its

TABLE III

BD-RATE OF THE PROPOSED ALGORITHM WITH HM 5.0 THAT DISABLES LARGE TU SIZES AS THE BENCHMARK

Types           | LDB-Main | LDP-Main | RA-Main
HD (Class B)    | −2.32%   | −2.66%   | −1.43%
WVGA (Class C)  | −1.80%   | −2.25%   | −1.17%
WQVGA (Class D) | −1.38%   | −1.80%   | −0.71%
Total Average   | −1.83%   | −2.24%   | −1.10%

coding gain is not as efficient at higher bit-rates due to the smaller number of selected sparse coding modes. Furthermore, we present the coding performance when the sparse representation is applied alone (by disabling the DCT as the second transform). The 2-x and 4-x overcomplete dictionaries are also included in Fig. 12 for performance comparison. They are denoted by "S(2-x) only" and "S(4-x) only." We see that they are not efficient at higher bit-rates. The same phenomenon was reported in [19]. This can be explained by the increased overhead to encode the indices of atoms. Following the idea of the merge mode in HEVC [39], a coding mode that merges indices may help reduce the side information.

2) Other Test Conditions: Superior coding performance of the S/DCT can be observed in test conditions that differ from the common test conditions, as explained below. We modify the anchor software so that larger TU sizes are never selected. In this situation, every TU is required to select one of the coding modes presented in Table I. As shown in Table III, the BD-rate saving is about 1.83%, 2.24%, and 1.10% on average for the LDB-Main, LDP-Main, and RA-Main cases, respectively. Note that the proposed


Fig. 14. Visual comparison of reconstructed images in the "SlideShow" sequence: (a) S/DCT and (b) HM 5.0, where the PSNR values are 37.42 and 37.35 dB, respectively.

Fig. 15. Visual comparison of reconstructed images in the "Spinning Calendar" sequence: (a) S/DCT and (b) HM 5.0, where the PSNR values are 29.0 and 28.8 dB, respectively.

algorithm yields an impressive coding gain in the LDP-Main configuration. The coding gain becomes relatively higher for HD video sequences because the disabled larger TUs could be useful for coding larger blocks.

Furthermore, it is worthwhile to emphasize that the proposed algorithm offers significantly better perceptual quality. That is, because the trained dictionary contains more adaptive patterns for representing directional components, the reconstructed video has less ringing artifact than the benchmark, HM 5.0. Representative frames of the "SlideShow" sequence and the "Spinning Calendar" sequence are shown in Fig. 14 and Fig. 15, where (a) is coded with the S/DCT scheme and (b) is coded with HM 5.0. Although the PSNR difference between them is small, the ringing artifact is reduced significantly by the S/DCT scheme.

C. Coding With Content-Adaptive Dictionary

We have so far adopted a set of dictionaries targeting general natural video. It is, however, possible to develop a content-adaptive encoder for specific applications such as gaming or medical video. This can be accomplished by incorporating an application-specific dictionary into the codec.

TABLE IV

BD-RATE SAVING OF THE PROPOSED SCHEME FOR CLASS F VIDEO WITH A CONTENT-ADAPTIVE DICTIONARY AGAINST HM 5.0

Types   | Sequences           | LDB-Main | LDP-Main | RA-Main
Class F | BasketballDrillText | −0.9%    | −1.2%    | −0.3%
        | ChinaSpeed          | −6.9%    | −8.5%    | −2.4%
        | SlideEditing        | −1.3%    | −1.7%    | −0.4%
        | SlideShow           | −4.4%    | −6.8%    | −0.9%
        | Average             | −3.38%   | −4.55%   | −1.00%

Such a dictionary can be trained off-line using video clips from the target application.

To give an example, we consider the screen-content sequences in class F [38]. They have characteristics quite different from natural videos, e.g., characters and graphic elements, depicting game play or the editing of PowerPoint slides. The test and training sequence sets are disjoint for the evaluations. The proposed scheme provides a better coding gain than HM 5.0, as shown in Table IV. For the LDB-Main, LDP-Main, and RA-Main tests, the BD-rate saving is 3.38%, 4.55%, and 1.00%, respectively. On top of that, we can clearly see from Fig. 12 that the proposed method provides a superior coding gain in "ChinaSpeed" compared with "BQSquare." The remarkable coding gain for class F video is due to the use of an adaptive dictionary tailored to screen contents in S/DCT.

VI. CONCLUSION

An efficient video coding scheme, called the S/DCT scheme, that uses two-layered transforms with a sparse representation and the DCT in cascade was proposed in this work. With the sparse representation, more energy compaction could be achieved with fewer coefficients using a block-adaptive dictionary.


A signal was coarsely approximated with the sparse representation, and the DCT was used as a complement. It was demonstrated by experimental results that the proposed algorithm outperforms the HEVC reference codec HM5.0 in terms of both the objective measure and subjective perception. It is an interesting research topic to develop a fast algorithm to reduce the computational complexity and to design an optimal bit allocation between the sparse-based transform and the DCT.

ACKNOWLEDGMENT

The authors would like to thank L. Feng and S. Chachada from the University of Southern California, Los Angeles, CA, USA, and K. Ugur, J. Lainema, A. Hallapuro, and O. Bici from the Nokia Research Center, Tampere, Finland, for their dedicated help with this work.

REFERENCES

[1] M. Elad and M. Aharon, "Image denoising via sparse and redundant representations over learned dictionaries," IEEE Trans. Image Process., vol. 15, no. 12, pp. 3736–3745, Dec. 2006.

[2] J. Starck, M. Elad, and D. L. Donoho, "Image decomposition via the combination of sparse representations and a variational approach," IEEE Trans. Image Process., vol. 14, no. 10, pp. 1570–1582, Oct. 2005.

[3] S. S. Chen, D. L. Donoho, and M. A. Saunders, "Atomic decomposition by basis pursuit," SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1999.

[4] L. W. Kang, C. W. Lin, and Y. H. Fu, "Automatic single-image-based rain streaks removal via image decomposition," IEEE Trans. Image Process., vol. 21, no. 4, pp. 1742–1755, Apr. 2012.

[5] Transform Design for HEVC With 16-Bit Intermediate Data Representation, Standard JCTVC-E243, Mar. 2011.

[6] W. Niehsen and M. Brunig, "Covariance analysis of motion-compensated frame difference," IEEE Trans. Circuits Syst. Video Technol., vol. 9, no. 4, pp. 536–539, Jun. 1999.

[7] K. C. Hui and W. C. Siu, "Extended analysis of motion compensated frame difference for block-based motion prediction error," IEEE Trans. Image Process., vol. 16, no. 5, pp. 1232–1245, May 2007.

[8] B. Zeng and J. Fu, "Directional discrete cosine transforms: A new framework for image coding," IEEE Trans. Circuits Syst. Video Technol., vol. 18, no. 3, pp. 305–313, Mar. 2008.

[9] R. Cohen, S. Klomp, A. Vetro, and H. Sun, "Direction-adaptive transforms for coding prediction residuals," in Proc. 17th IEEE Int. Conf. Image Process., Sep. 2010, pp. 185–188.

[10] C. Zhang, K. Ugur, J. Lainema, A. Hallapuro, and M. Gabbouj, "Video coding using spatially varying transform," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 2, pp. 127–139, Feb. 2011.

[11] Mode-Dependent DCT/DST Without 4×4 Full Matrix Multiplication for Intra Prediction, Standard JCTVC-E125 CE7, Mar. 2011.

[12] Boundary-Dependent Transform for Inter-Predicted Residue, Standard JCTVC-H0309 CE7, Feb. 2012.

[13] Mode Dependent Intra Residual Coding, Standard JCTVC-E098 CE7, Mar. 2011.

[14] Transform Skip Mode, Standard JCTVC-F077, Jul. 2011.

[15] Experimental Results for the ROT, Standard JCTVC-G304 CE7, Nov. 2011.

[16] S. Mallat and Z. Zhang, "Matching pursuits with time-frequency dictionaries," IEEE Trans. Signal Process., vol. 41, no. 12, pp. 3397–3415, Dec. 1993.

[17] J. Yang, Y. Yang, W. Xu, and Q. Dai, "Image coding using dual-tree discrete wavelet transform," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1555–1569, Sep. 2008.

[18] R. Neff, A. Zakhor, and M. Vetterli, "Very low bit rate video coding using matching pursuit," in Proc. Conf. Visual Commun. Image Process., Sep. 1994, pp. 47–60.

[19] J.-W. Kang, R. Cohen, A. Vetro, and C.-C. J. Kuo, "Efficient dictionary based video coding with reduced side information," in Proc. IEEE Int. Symp. Circuits Syst., May 2011, pp. 109–112.

[20] S.-H. Kim, J.-W. Kang, and C.-C. J. Kuo, "Improved H.264/AVC lossless intra coding with two-layered residual coding (TRC)," IEEE Trans. Circuits Syst. Video Technol., vol. 21, no. 7, pp. 1005–1010, Jul. 2011.

[21] M. Aharon, M. Elad, and A. Bruckstein, "K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation," IEEE Trans. Signal Process., vol. 54, no. 11, pp. 4311–4322, Nov. 2006.

[22] M. Turkan and C. Guillemot, "Sparse approximation with adaptive dictionary for image prediction," in Proc. IEEE Int. Conf. Image Process., Oct. 2009, pp. 25–28.

[23] H. W. Chen, L. W. Kang, and C. Lu, "Dictionary learning based distributed compressive video sensing," in Proc. Picture Coding Symp., Dec. 2010, pp. 210–213.

[24] Y. Liu, M. Li, and D. A. Pados, "Motion-aware decoding of compressed-sensed video," IEEE Trans. Circuits Syst. Video Technol., vol. 23, no. 3, pp. 438–444, Dec. 2012.

[25] J. A. Tropp, "Greed is good: Algorithmic results for sparse approximation," IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.

[26] Software Development and HM Software Technical Evaluation (AHG3), Standard JCTVC-H003 JCT-VC AHG, Feb. 2012.

[27] D. L. Donoho and M. Elad, "Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization," in Proc. Nat. Acad. Sci. USA, Mar. 2003, pp. 2197–2202.

[28] D. L. Donoho and X. Huo, "Uncertainty principles and ideal atomic decomposition," IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001.

[29] J. Sole, R. Joshi, N. Nguyen, T. Ji, M. Karczewicz, G. Clare, F. Henry, and A. Duenas, "Transform coefficient coding in HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1765–1777, Dec. 2012.

[30] Z. He and S. K. Mitra, "A linear source model and a unified rate control algorithm for DCT video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 11, pp. 970–982, Nov. 2002.

[31] T. M. Cover and J. A. Thomas, Elements of Information Theory. New York, NY, USA: Wiley, 2006.

[32] J. Li, M. Gabbouj, and J. Takala, "Zero-quantized inter DCT coefficient prediction for real-time video coding," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 2, pp. 249–259, Feb. 2012.

[33] M. Winken, P. Helle, D. Marpe, H. Schwarz, and T. Wiegand, "Transform coding in the HEVC test model," in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 3693–3696.

[34] G. Bjontegaard, Calculation of Average PSNR Differences Between RD-Curves, document VCEG-M33, ITU-T SG16 Q.6, Austin, TX, USA, Mar. 2001.

[35] J. Tropp, "Computational methods for sparse solution of linear inverse problems," Proc. IEEE, vol. 98, no. 6, pp. 948–958, Jun. 2010.

[36] R. Rubinstein, M. Zibulevsky, and M. Elad, "Efficient implementation of the K-SVD algorithm using batch orthogonal matching pursuit," Dept. Comput. Sci., Technion, Haifa, Israel, Tech. Rep. CS-2008-08, Apr. 2008.

[37] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, "Online learning for matrix factorization and sparse coding," J. Mach. Learn. Res., vol. 11, pp. 19–60, Mar. 2010.

[38] Common Test Conditions and Software Reference Configurations, Standard JCTVC-I1100, Apr. 2012.

[39] P. Helle, S. Oudin, B. Bross, D. Marpe, O. Bici, K. Ugur, J. Jung, G. Clare, and T. Wiegand, "Block merging for quadtree-based partitioning in HEVC," IEEE Trans. Circuits Syst. Video Technol., vol. 22, no. 12, pp. 1720–1731, Dec. 2012.

Je-Won Kang received the B.S. and M.S. degrees in electrical engineering and computer science from Seoul National University, Seoul, Korea, in 2006 and 2008, respectively, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2012.

He is currently with the Multimedia R&D and Standards team at Qualcomm Technologies, Inc., San Diego, CA. His current research interests include image, video, and 3D data compression and their standardization, multimedia communication, and machine learning.


Moncef Gabbouj (F'11) received the B.S. degree in electrical engineering from Oklahoma State University, Stillwater, OK, USA, in 1985, and the M.S. and Ph.D. degrees in electrical engineering from Purdue University, West Lafayette, IN, USA, in 1986 and 1989, respectively.

He is an Academy Professor with the Academy of Finland from 2011 to 2015. He was a Visiting Professor at several universities, including the Hong Kong University of Science and Technology, Hong Kong, Purdue University, and the University of Southern California, Los Angeles, CA, USA. He is a Professor with the Department of Signal Processing, Tampere University of Technology, Tampere, Finland, and the Director of the Multimedia Research Group. He was the Head of the Department of Signal Processing from 2002 to 2007. He is a co-author of over 500 publications, including two books. His current research interests include multimedia, content-based search, nonlinear signal processing, voice conversion, and video communication.

Dr. Gabbouj was a recipient of the 2012 Nokia Foundation Visiting Professor Award, the 2005 Nokia Foundation Recognition Award, and several best paper awards.

C.-C. Jay Kuo (F'99) received the B.S. degree from National Taiwan University, Taipei, Taiwan, in 1980, and the M.S. and Ph.D. degrees from the Massachusetts Institute of Technology, Cambridge, MA, USA, in 1985 and 1987, respectively, all in electrical engineering.

He is currently the Director of the Media Communications Laboratory and a Professor of electrical engineering, computer science, and mathematics with the University of Southern California, Los Angeles, CA, USA, and the President of the Asia-Pacific Signal and Information Processing Association. He is the co-author of over 210 journal papers, 850 conference papers, and ten books. His current research interests include digital image/video analysis and multimedia data compression.

Dr. Kuo is a Fellow of the American Association for the Advancement of Science and the International Society for Optical Engineers.