
Neural Image Compression via Non-Local Attention Optimization and Improved Context Modeling

Tong Chen, Haojie Liu, Zhan Ma, Qiu Shen, Xun Cao, and Yao Wang

Abstract—This paper proposes a novel Non-Local Attention optimization and Improved Context modeling-based image compression (NLAIC) algorithm, which is built on top of the deep neural network (DNN)-based variational auto-encoder (VAE) structure. Our NLAIC 1) embeds non-local network operations as non-linear transforms in the encoders and decoders for both the image and the latent representation probability information (known as hyperprior) to capture both local and global correlations, 2) applies an attention mechanism to generate masks that are used to weigh the features, implicitly adapting the bit allocation for feature elements based on their importance, and 3) implements improved conditional entropy modeling of the latent features using joint 3D convolutional neural network (CNN)-based autoregressive contexts and hyperpriors. Towards practical application, additional enhancements are also introduced to speed up processing (e.g., parallel 3D CNN-based context prediction), reduce memory consumption (e.g., sparse non-local processing) and alleviate implementation complexity (e.g., a unified model for variable rates without re-training). The proposed model outperforms existing learned and conventional (e.g., BPG, JPEG2000, JPEG) image compression methods on the Kodak and CLIC datasets, offering state-of-the-art compression efficiency for both PSNR and MS-SSIM distortion metrics.

Index Terms—Non-local network, attention mechanism, conditional probability prediction, variable-rate model, end-to-end learning

I. INTRODUCTION

Light reflected from object surfaces travels across the 3D environment and finally reaches the sensor plane of the camera, or the retina of our Human Visual System (HVS), as a projected 2D image representing the natural scene. Nowadays, images spread everywhere via social networking (e.g., Facebook, WeChat), professional photography sharing (e.g., Flickr), online advertisement (e.g., Google Ads) and so on, mostly in standard-compliant formats compressed using JPEG [1], JPEG2000 [2], H.264/AVC [3] or the High Efficiency Video Coding (HEVC) [4] intra-picture coding based image profile (a.k.a., Better Portable Graphics - BPG, https://bellard.org/bpg/), etc. A better compression method1

is always desired to preserve higher image quality with fewer bits. This would save file storage at the Internet scale (e.g., > 350 million images submitted and shared to Facebook per day), and enable faster and more efficient image sharing/exchange with better quality of experience (QoE).

T. Chen and H. Liu contributed equally to this work.
1Since the focus of this paper is lossy compression, for simplicity, we use "compression" to represent "lossy compression" throughout this work, unless pointed out specifically.

Fundamentally, image coding/compression tries to exploit signal redundancy and represent the original pixel samples (in RGB or another color space such as YCbCr) in a compact and high-fidelity format. This is also referred to as source coding [5]. Conventional transform coding (e.g., JPEG, JPEG 2000) or hybrid transform/prediction coding (e.g., intra coding of H.264/AVC and HEVC) is utilized. Typical transforms are the Discrete Cosine Transform (DCT) [6], the Wavelet Transform [7], and so on. The transforms referred to here usually have fixed bases, designed in advance by presuming knowledge of the source signal distribution. On the other hand, intra prediction usually leverages local [8] and global correlations [9] to exploit the redundancy. Since intra prediction can be expressed as a linear superposition of causal samples, it can be treated as an alternative transform as well. Lossy compression is then achieved by applying quantization to the transform coefficients followed by adaptive entropy coding. Thus, a typical image compression pipeline can be simply illustrated by "transform", "quantization" and "entropy coding" performed consecutively.

Instead of applying the handcrafted components used in existing image compression methods, such as the DCT, scalar quantization, etc., recently emerged machine learning based image compression algorithms [10], [11], [12] leverage the autoencoder structure, which transforms raw pixels into compressible latent features via stacked convolutional neural networks (CNNs) in a nonlinear manner [13]. These latent features are subsequently quantized and entropy coded by further exploiting statistical redundancy. Recent works have revealed that compression efficiency can be improved by exploring conditional probabilities via the contexts of autoregressive spatial-channel neighbors and hyperpriors [12], [14], [10], [15] for the compression of features. Typically, rate-distortion optimization (RDO) [16] is fulfilled by minimizing the Lagrangian cost J = R + λD when performing end-to-end learning. Here, R is the entropy rate, and D is the distortion measured by mean squared error (MSE), multiscale structural similarity (MS-SSIM) [17], or even feature or adversarial loss [18], [19].

However, existing methods still present several limitations. For example, most of the operations, such as stacked convolutions, are performed locally with a limited receptive field, even with pyramidal decomposition. Furthermore, latent features are mostly treated with equal importance in both the spatial and channel dimensions, without considering the diverse visual sensitivities to different content at different frequencies [20]. Thus, attempts have been made in [14], [12] to exploit importance maps on top of the quantized latent feature vectors for adaptive bit allocation, but still only at the bottleneck layer. These methods usually signal the importance maps explicitly. If the importance map is not embedded explicitly, coding efficiency will be slightly affected because of the probability estimation error reported in [12].

Fig. 1: Non-Local Attention optimization and Improved Context modeling-based image compression - NLAIC. (a) NLAIC: a variational autoencoder with embedded non-local attention optimization in the main and hyperprior encoders and decoders (e.g., Em, Eh, Dm, and Dh). "Conv 5×5×192 s2" indicates a convolution layer using a kernel of size 5×5, 192 output channels, and a stride of 2 (in decoders Dm and Dh, "Conv" indicates transposed convolution). NLAM represents the Non-Local Attention Modules. "Q" stands for quantization, "AE" and "AD" are arithmetic encoding and decoding, P denotes the probability model serving arithmetic coding, and k1 in the context model means a 3D convolution kernel of size 1×1×1. (b) NLAM: the main branch consists of three ResBlocks. The mask branch combines non-local modules with ResBlocks for attention mask generation. The details of the ResBlock are shown in the dashed frame. (c) Non-local network (NLN): H×W×C denotes the size of feature maps with height H, width W and channel C. ⊕ is the add operation and ⊗ is matrix multiplication.

In this paper, our NLAIC introduces non-local processing blocks into the variational autoencoder (VAE) structure to capture both local and global correlations among pixels. An attention mechanism is also embedded to generate more compact representations for both latent features and hyperpriors. A simple rectified linear unit (ReLU) is applied for nonlinear activation. Different from the existing methods in [14], [12], our non-local attention masks are applied at different layers (not only to the quantized features at the bottleneck), to mask and adapt intelligently through the end-to-end learning framework. We

also improve the context modeling of the entropy engine for better latent feature compression, by using a masked 3D CNN (i.e., 5×5×5) based prediction to approximate more accurate conditional statistics.

Even with coding efficiency outperforming most existing traditional image compression standards, recent learning-based methods [21], [10], [12] are still far from massive adoption in practice. For practical application, compression algorithms need to be carefully evaluated and justified by their coding performance, computational and space complexity (e.g., memory consumption), hardware implementation friendliness, etc. Few studies [11], [22] have been developed along this line for practical learned image compression coders. In this paper, we propose additional enhancements to simplify the proposed NLAIC, including 1) a unified network model for variable bitrates


using quality scaling factors; 2) sparse non-local processing for memory reduction; and 3) parallel 3D masked CNN based context modeling for improved computational throughput. All of these attempts greatly reduce the space and time complexity of the proposed NLAIC, with negligible sacrifice of coding efficiency.

Our NLAIC outperforms all existing learned and traditional image compression methods, offering state-of-the-art coding efficiency in terms of rate-distortion performance, with distortion measured by both MS-SSIM and PSNR (Peak Signal-to-Noise Ratio). Experiments have been executed using common test datasets such as Kodak [23] and the CLIC testing samples that are widely studied in [10], [21], [12]. When compared with the same JPEG anchor, our NLAIC shows a BD-Rate gain of 64.39%, followed by Minnen2018 [21] at 59.84%, BPG (YCbCr 4:4:4) HM at 59.46%, Balle2018 [10] at 56.19%, and JPEG2000 at 38.02%.

Additional ablation studies have been conducted to analyze different components of the proposed NLAIC, including the impacts of sparse non-local processing, parallel 3D context modeling, the unified multi-rate model, non-local operations, etc., on coding performance and system complexity (e.g., time and space). These investigations further demonstrate the efficiency of our NLAIC for potential practical applications.

Contributions. We highlight the novelties of this paper below:
• We are the first to introduce non-local operations into the compression framework to capture both local and global correlations among the pixels in the original image and the latent features.

• We apply an attention mechanism together with the non-local operations to generate implicit importance masks at various layers to guide adaptive processing. These masks essentially help allocate more bits to important areas that are critical for rate-distortion efficiency.

• We employ a single-layer masked 3D CNN to exploit the spatial-channel correlations in the latent features, the output of which is then concatenated with the hyperpriors to estimate the conditional statistics of the latent features, enabling more efficient entropy coding.

• We introduce simplifications of the original NLAIC, including sparse non-local processing, parallel masked 3D CNN-based context modeling, and a unified model for variable rates, to reduce computational complexity and memory storage and to improve implementation friendliness for practical applications.

The remainder of this paper proceeds as follows: Section II gives a brief review of related works, and the proposed NLAIC is discussed in Section III, followed by the proposed complexity-reduction options in Section IV. Section V evaluates the coding efficiency of the proposed method in comparison to traditional image codecs and recently emerged learning-based approaches, while Section VI presents ablation studies that examine the impact of the various components of the proposed NLAIC framework on coding efficiency and complexity. Finally, concluding remarks and future work are given in Section VII.

II. RELATED WORK

In this section, we review prior works related to non-local operations in image/video processing, the attention mechanism, and learned image compression algorithms.

Non-local Operations. Most traditional filters (such as Gaussian and mean filters) process the data locally, using a weighted average of spatially neighboring pixels. This usually produces over-smoothed reconstructions. Classical non-local methods for image restoration (e.g., low-rank modeling [24], joint sparsity [25] and non-local means [26]) have shown superior efficiency for quality improvement by exploiting non-local correlations. Recently, non-local operations have been extended using DNNs for video classification [27] and image restoration (e.g., denoising, artifact removal and super-resolution) [15], [28], yielding significant performance improvements. It is also worth pointing out that non-local operations have been applied in intra coding, such as the intra block copy in HEVC-based screen content compression [9], which allows block search within the current frame to exploit non-local correlations.

Self Attention. The self-attention mechanism was popularized in deep learning based natural language processing (NLP) [29], [30], [31]. It can be described as a mapping strategy which queries a set of key-value pairs to produce an output. For example, Vaswani et al. [31] proposed multi-headed attention methods which are extensively used for machine translation. For low-level vision tasks [28], [14], [12], the self-attention mechanism gives the generated features spatially adaptive activation and enables adaptive information allocation with emphasis on more challenging areas (i.e., rich textures, saliency, etc.).

In image compression, quantized attention masks are used for adaptive bit allocation, e.g., Li et al. [14] use three layers of local convolutions and Mentzer et al. [12] select one of the quantized features. Unfortunately, these methods require extra explicit signaling overhead; when the explicit signaling is disabled, probability estimation errors are induced. Our model adopts an attention mechanism that is close to [14], [12] but applies multiple layers of non-local as well as convolutional operations to automatically generate attention masks from the input image. The attention masks are applied directly to the intermediate latent features to generate the final latent features to be coded. Thus, there is no need to spend extra bits to code the masks.

Image Compression Architectures. DNN-based image compression generally relies on well-known autoencoders. One direction is based on recurrent neural networks (RNNs) (e.g., convolutional LSTMs) [32], [33], [34]. Note that these works only require a single network model for variable rates, without resorting to re-training. An explicit spatially adaptive bit allocation for compression control was suggested in [34] to account for content variations and obtain better quality at the same bit rate. More advanced bit allocation schemes, such as attention driven approaches, will be discussed later.

In another avenue, (non-recurrent) CNN-based approaches [35], [36], [11], [10], [12], [14], [37], [21], [38] have recently attracted more attention in both industry and academia.


Among them, the variational autoencoder (VAE) has proven to be an effective structure for compression, initially reported in [35]. Significant advances have been developed progressively in the main components, including non-linear transforms (such as convolutions plus generalized divisive normalization - GDN [35]), differentiable quantization (such as uniform-noise approximated quantization - UNAQ [35], and soft-to-hard quantization [36]), and conditional entropy probability modeling following Bayesian generative rules (for example, via hyperpriors [10], and joint 2D PixelCNN-based [39] autoregressive contexts and hyperpriors [21]). Importance map-based bit rate control is studied in [14], [12] to apply more weight to important areas. In addition to the common MSE or MS-SSIM losses used in practice, we have witnessed other loss function designs for improving image quality, such as feature-based or adversarial losses [37], [40], [19].

III. NLAIC: NON-LOCAL ATTENTION OPTIMIZED IMAGE COMPRESSION

Referring to the VAE structure shown in Fig. 1(a), we can formulate the problem as

$\min J = R_{\hat{X}} + R_{\hat{Z}} + \lambda \cdot d\{Y, \hat{Y}\}$,  (1)

$\hat{X} = Q\{W_{e_K} \circledast (\cdots (W_{e_2} \circledast (W_{e_1} \circledast Y)))\}$,  (2)

$\hat{Y} = W_{d_K} \circledast (\cdots (W_{d_2} \circledast (W_{d_1} \circledast \hat{X})))$,  (3)

$R_{\hat{X}} = -\mathbb{E}\{\log_2 p_{\hat{X}|\hat{Z}}(\hat{x}_n \mid \hat{x}_{n-1}, \ldots, \hat{x}_0, \hat{Z})\}$,  (4)

$R_{\hat{Z}} = -\mathbb{E}\{\log_2 p_{\hat{Z}}(\hat{z}_n)\}$.  (5)

We wish to find appropriate parameters of the transforms Wek and Wdk (k ∈ [1, K]), the quantization Q{} and the conditional entropy coding for better compression efficiency, in an end-to-end learning fashion. The distortion d{} between the original input Y and the reconstructed image Ŷ is measured by either MSE or negative MS-SSIM in the current study; other distortion measurements, such as feature loss [18], can be applied in this learning framework as well. The bitrate is estimated using the expected entropy of the latent and hyperprior features. The bitrate for hyperpriors R_Ẑ is based on their own probability distribution (e.g., (5)), while the bitrate for latent features R_X̂ is derived by exploring the probability conditioned on both autoregressive neighbors in the feature maps (e.g., causal spatial-channel neighbors) and the hyperpriors (e.g., (4)). $\circledast$ denotes the convolution operation.
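As a concrete illustration of Eqs. (1)-(5), the following minimal PyTorch sketch assembles the Lagrangian rate-distortion cost from an MSE distortion term and per-element likelihoods produced by the entropy models. The class name, tensor shapes and the way likelihoods are obtained are illustrative assumptions, not the exact NLAIC implementation.

```python
import torch
import torch.nn as nn

class RateDistortionLoss(nn.Module):
    """Lagrangian cost J = R_X + R_Z + lambda * d(Y, Y_hat) of Eq. (1).

    The rate terms are derived from per-element likelihoods that the entropy
    models (Eqs. (4)-(5)) would produce; names here are illustrative.
    """
    def __init__(self, lmbda: float = 0.01):
        super().__init__()
        self.lmbda = lmbda
        self.mse = nn.MSELoss()

    def forward(self, y, y_hat, latent_likelihoods, hyper_likelihoods):
        num_pixels = y.shape[0] * y.shape[2] * y.shape[3]
        # Bits per pixel for the latent features (R_X) and hyperpriors (R_Z).
        r_x = -torch.log2(latent_likelihoods).sum() / num_pixels
        r_z = -torch.log2(hyper_likelihoods).sum() / num_pixels
        distortion = self.mse(y_hat, y)
        return r_x + r_z + self.lmbda * distortion


# Usage sketch with dummy tensors (real likelihoods come from the context model).
if __name__ == "__main__":
    y = torch.rand(1, 3, 64, 64)
    y_hat = y + 0.01 * torch.randn_like(y)
    lat_lik = torch.rand(1, 192, 4, 4).clamp(1e-9, 1.0)
    hyp_lik = torch.rand(1, 192, 1, 1).clamp(1e-9, 1.0)
    loss = RateDistortionLoss(lmbda=0.01)(y, y_hat, lat_lik, hyp_lik)
    print(float(loss))
```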

Figure 1(a) illustrates the detailed network structures and associated parameter settings of the five components of our NLAIC system. Our NLAIC is built on a variational autoencoder structure [10], with non-local attention modules (NLAM) as basic units in both the main and hyperprior encoder-decoder pairs (i.e., Em, Dm, Eh and Dh). Em together with the quantization Q generates the quantized latent features, and Dm decodes the features into the reconstructed image. Eh and Dh are applied to provide side information z about the probability distribution of the quantized latent features (known as hyperpriors), to enable efficient entropy coding. The hyperpriors as well as the autoregressive spatial-channel neighbors of the latent features are then processed through the conditional context model P to perform conditional probability estimation for entropy coding of the quantized latent features.

The NLAM module is shown in Fig. 1(b) and explained in Sections III-A and III-B below.

A. Non-local Network Processing

Our NLAM adopts the Non-Local Network (NLN) proposed in [27] as a basic block, as shown in Figs. 1(b) and 1(c). The NLN had previously been used mainly for image/video processing rather than compression. It computes the output at the i-th position, Y_i, as a weighted average of the transformed feature values of the input X:

$Y_i = \frac{1}{C(X)} \sum_{\forall j} f(X_i, X_j)\, g(X_j)$,  (6)

where i is the location index of the output vector Y and j enumerates all accessible positions of the input X. X and Y share the same size. The function f(·) computes the correlation between X_i and X_j, and g(·) derives the representation of the input at position j. C(X) is a normalization factor used to generate the final response, set as $C(X) = \sum_{\forall j} f(X_i, X_j)$. Note that a variety of forms of f(·) have already been discussed in [27]. In this work, we directly use the embedded Gaussian function for f(·) for simplicity, i.e.,

$f(X_i, X_j) = e^{\theta(X_i)^{T} \phi(X_j)}$.  (7)

Here, θ(X_i) = W_θ X_i and φ(X_j) = W_φ X_j, where W_θ and W_φ denote cross-channel transforms implemented using 1×1 convolutions in our framework. The weights f(X_i, X_j) are further normalized using a softmax operation. The operation defined in Eq. (6) can then be written in matrix form [27] as:

$Y = \mathrm{softmax}(X^{T} W_\theta^{T} W_\phi X)\, g(X)$.  (8)

In addition, a residual connection can be added for better convergence, as suggested in [27] and shown in Fig. 1(c), i.e.,

$Z = W_z Y + X$,  (9)

where W_z is also a linear 1×1 convolution across all channels, and Z is the final output vector, which augments the input X with both global and local correlations via the above NLN.
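For reference, a minimal PyTorch sketch of this embedded-Gaussian non-local block (Eqs. (6)-(9)) is given below, assuming 1×1 convolutions for θ, φ, g and W_z and a halved embedding channel count; the layer names and channel choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local network (NLN), Eqs. (6)-(9)."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2                       # embedding channels (assumed C/2)
        self.theta = nn.Conv2d(channels, inter, 1)  # theta(X) = W_theta X
        self.phi = nn.Conv2d(channels, inter, 1)    # phi(X)   = W_phi   X
        self.g = nn.Conv2d(channels, inter, 1)      # g(X)
        self.w_z = nn.Conv2d(inter, channels, 1)    # W_z

    def forward(self, x):
        b, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # B x HW x C/2
        phi = self.phi(x).flatten(2)                       # B x C/2 x HW
        g = self.g(x).flatten(2).transpose(1, 2)           # B x HW x C/2
        # f(X_i, X_j) = exp(theta^T phi), normalized row-wise by softmax (Eq. (8)).
        attn = F.softmax(theta @ phi, dim=-1)               # B x HW x HW
        y = (attn @ g).transpose(1, 2).reshape(b, -1, h, w)  # weighted average, Eq. (6)
        return self.w_z(y) + x                               # residual connection, Eq. (9)
```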

B. Non-local Attention Module (NLAM)

Note that importance (attention) maps have been studied in [14], [12] to adaptively allocate information to quantized latent features. For instance, we can give more bits to edge areas and fewer bits elsewhere, resulting in better visual quality at a similar bit rate. Such adaptive allocation can be implemented by using an explicit mask encapsulated in the compressed stream with additional bits. Note that the explicit mask can be replaced by implicit signaling as well, but then with prediction-error-induced coding efficiency degradation, as reported in [12]. In addition, the existing mask generation methods, operating only at the bottleneck layer [14], [12], are too simple to handle areas with more complex content characteristics.

As shown in Fig. 1(b), the entire NLAM presents three branches.


Fig. 2: Visualization of attention masks generated by NLAM. Brighter means more attention. The attention masks have the same size as the latent features, with multiple channels; channels 4 and 19 are shown to illustrate the spatial and channel attention mechanism (values are in the range (0, 1)). "Sum of attention" denotes the attention maps accumulated over all channels.

The main (or feature) branch uses conventional stacked convolutional networks (e.g., three residual blocks, a.k.a. ResBlocks, in this work [41]) to generate features, while the mask branch applies the NLN, followed by another three ResBlocks, one 1×1 convolution and a non-linear sigmoid activation to produce a joint spatial-channel attention mask M, i.e.,

$M = \mathrm{sigmoid}(F_{NLN}(X))$,  (10)

where M denotes the attention mask and X is the input feature vector. F_NLN(·) represents the operations using the NLN, ResBlocks and 1×1 convolution in Fig. 1(b). This attention mask M, with each element 0 < M_k < 1, M_k ∈ R, is multiplied element-wise with the corresponding elements of the feature maps from the main branch to perform adaptive processing. Another residual connection (cf. Fig. 1(c)) is added for faster convergence [41].

In comparison to the existing masking methods [14], [12], our NLAM uses attention masks only implicitly. Furthermore, multiple NLAMs (e.g., two pairs of NLAMs in the main encoder-decoder, and one pair in the hyperprior encoder-decoder) are embedded to exploit the non-local and local correlations at multiple scales for accurate mask generation, rather than performing mask generation only at the bottleneck layer as in [14], [12]. NLAMs at different layers provide attention masks with multiple levels of granularity. A visualization of the attention masks generated by our NLAM is shown in Fig. 2. As reported in later experiments, NLAMs embedded at various layers offer noticeable compression performance improvement.
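The NLAM structure described above can be sketched as follows, assuming a simple Conv-ReLU-Conv ResBlock and reusing the NonLocalBlock sketch from Section III-A; the exact layer widths are assumptions for illustration only.

```python
import torch
import torch.nn as nn

# NonLocalBlock is the embedded-Gaussian NLN sketched in Section III-A above.

class ResBlock(nn.Module):
    """Conv-ReLU-Conv with a skip connection (dashed frame in Fig. 1(b))."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


class NLAM(nn.Module):
    """Non-Local Attention Module: main-branch features re-weighted by a mask."""
    def __init__(self, channels: int):
        super().__init__()
        self.main = nn.Sequential(*[ResBlock(channels) for _ in range(3)])
        self.mask = nn.Sequential(
            NonLocalBlock(channels),                # NLN, Eqs. (8)-(9)
            *[ResBlock(channels) for _ in range(3)],
            nn.Conv2d(channels, channels, 1),
            nn.Sigmoid(),                           # attention mask M, Eq. (10)
        )

    def forward(self, x):
        # Element-wise re-weighting of the main branch, plus an outer residual.
        return x + self.main(x) * self.mask(x)
```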

Since batch normalization (BN) is sensitive to the data distribution, we avoid any BN layers and use only one ReLU in our ResBlock, a choice justified through our experimental observations.

Fig. 3: 3D Masked Convolution. A 3×3×3 masked convolution is exemplified for exploring contexts of spatial and channel neighbors jointly. The current pixel (purple cube) is predicted from the causal (already processed) pixels in 3D space: yellow for neighbors from the previous channel, green for vertical neighbors, and blue for horizontal neighbors. Unprocessed pixels (white cubes) and the current pixel itself are masked with zeros.

Note that in existing learned image compression methods, particularly those with superior performance [35], [10], [21], [42], GDN [35] activation has proven more efficient than ReLU, tanh, sigmoid, leakyReLU, etc. This may be due to the fact that GDN captures global information across all feature channels at the same pixel location. In contrast, our NLAIC shows that the simple ReLU function works effectively without resorting to GDN, owing to the fact that the proposed NLAM captures non-local correlations efficiently.

C. Conditional Entropy Rate Modeling

The previous sections presented our novel NLAM scheme for transforming the input pixels into more compact latent features. This section details the entropy rate modeling that is critical for the overall rate-distortion efficiency.

1) Context Modeling Using Hyperpriors: Similar to [10], a non-parametric, fully factorized density model is trained for the hyperpriors z, described as:

$p_{z|\psi}(z \mid \psi) = \prod_i \big(p_{z_i|\psi^{(i)}}(\psi^{(i)}) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(z_i)$,  (11)

where ψ^(i) represents the parameters of each univariate distribution p_{z_i|ψ^(i)}.

For the quantized latent features x, each element x_i can be modeled as a conditional Gaussian distribution:

$p_{x|z}(x \mid z) = \prod_i \big(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(x_i)$,  (12)

where µ_i and σ_i are predicted using the hyperpriors z. We evaluate the bits of x and z using:

$R_x = -\sum_i \log_2\big(p_{x_i|z_i}(x_i \mid z_i)\big)$,  (13)

$R_z = -\sum_i \log_2\big(p_{z_i|\psi^{(i)}}(z_i \mid \psi^{(i)})\big)$.  (14)

Usually, we take z as the side information for estimating µ_i and σ_i; z occupies only a very small fraction of the bits, as revealed in Section VI and shown in Fig. 13.
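As a hedged sketch of Eqs. (12)-(13), the likelihood of an integer-quantized element under a Gaussian convolved with a unit-width uniform can be computed as a CDF difference over its quantization bin; the function and variable names below are illustrative.

```python
import torch

def gaussian_conditional_bits(x_hat, mu, sigma, eps: float = 1e-9):
    """Bits for quantized latents under N(mu, sigma^2) * U(-1/2, 1/2), Eqs. (12)-(13).

    The likelihood of an integer-quantized element equals the Gaussian mass
    falling inside its quantization bin [x_hat - 1/2, x_hat + 1/2].
    """
    dist = torch.distributions.Normal(mu, sigma)
    likelihood = dist.cdf(x_hat + 0.5) - dist.cdf(x_hat - 0.5)
    return -torch.log2(likelihood.clamp_min(eps)).sum()

# Example: a toy 1x4 feature vector with predicted means/scales from hyperpriors.
x_hat = torch.tensor([0.0, 1.0, -2.0, 0.0])
mu = torch.tensor([0.1, 0.8, -1.5, 0.0])
sigma = torch.tensor([0.5, 1.0, 2.0, 0.3])
print(gaussian_conditional_bits(x_hat, mu, sigma))  # total bits R_x
```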


2) Context Modeling Using Joint Autoregressive Spatial-Channel Neighbors and Hyperpriors: Local image neighbors usually present high correlations. PixelCNNs and PixelRNNs [39] have been used for effective modeling of the probabilistic distribution of images using local neighbors in an autoregressive way. This was further extended to adaptive context modeling in compression frameworks, with noticeable improvement [33]. For example, Minnen et al. [21] proposed to extract autoregressive information via a 2D 5×5 masked convolution at each feature channel. Such neighbor information is then combined with hyperpriors using stacked 1×1 convolutions for probability estimation. The model in [21] was reported as the first learning-based method with better PSNR than BPG-YUV444 at the same bit rate.

In NLAIC, we use a 5×5×5 masked 3D convolution to exploit the spatial and cross-channel correlations jointly. This 5×5×5 convolutional kernel shares the same parameters across all channels, reducing the model complexity for implementation. For simplicity, a 3×3×3 example is shown in Fig. 3.

Traditional 2D PixelCNNs as in [21] need to search for a well structured channel order to exploit the conditional probability efficiently. Instead, our proposed 3D masked convolutions implicitly capture the correlations across adjacent channels. Compared with the 2D masked CNN used in [21], our 3D CNN-based approach significantly reduces the number of network parameters for conditional context modeling by enforcing the same convolutional kernel across the entire spatial-channel space. Leveraging the additional contexts from spatial-channel neighbors in an autoregressive fashion, we obtain a better conditional Gaussian distribution to model the entropy:

$p_x(x_i \mid x_1, \ldots, x_{i-1}, z) = \prod_i \big(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(x_i)$,  (15)

where x_1, x_2, ..., x_{i−1} denote the causal (and possibly reconstructed) elements prior to the current x_i in the feature maps. µ_i and σ_i are estimated from these causal samples and the hyperpriors z.
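A minimal sketch of such a masked 3D convolution is given below: kernel weights at the current position and at all not-yet-decoded positions (in channel-row-column raster order) are zeroed before each forward pass. The class and helper names are illustrative, not the exact NLAIC context model.

```python
import torch
import torch.nn as nn

def causal_mask_3d(kc: int = 5, kh: int = 5, kw: int = 5) -> torch.Tensor:
    """Binary mask for a kc x kh x kw masked 3D convolution (Fig. 3).

    Positions strictly before the kernel center in (channel, row, column)
    raster order are kept (1); the center and all later positions are zeroed.
    """
    mask = torch.zeros(kc, kh, kw)
    c0, h0, w0 = kc // 2, kh // 2, kw // 2
    for c in range(kc):
        for h in range(kh):
            for w in range(kw):
                if (c, h, w) < (c0, h0, w0):   # lexicographic "causal" order
                    mask[c, h, w] = 1.0
    return mask

class MaskedConv3d(nn.Conv3d):
    """Conv3d whose kernel is multiplied by the causal mask at every forward."""
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.register_buffer("mask", causal_mask_3d(*self.kernel_size))

    def forward(self, x):
        return nn.functional.conv3d(
            x, self.weight * self.mask, self.bias,
            self.stride, self.padding, self.dilation, self.groups)

# A single input channel treats the latent feature channels as the depth axis,
# so the same kernel is shared across the whole spatial-channel volume.
ctx = MaskedConv3d(1, 24, kernel_size=5, padding=2)
y = ctx(torch.randn(1, 1, 32, 16, 16))   # (batch, 1, C, H, W)
```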

IV. EXTENSIONS OF NLAIC FOR COMPLEXITY REDUCTION

This section introduces a series of enhancements to reduce the complexity of the proposed NLAIC. Sparse NLAM is applied to reduce memory consumption, while parallel masked 3D convolution is used to reduce computational complexity. Another improvement is a unified neural model for variable bitrates via quality scaling factors, which greatly alleviates the model complexity in deployment.

A. Sparse NLAM

Referring to the NLN described above, a naïve implementation typically requires a large amount of memory to host a correlation matrix of size HW × HW, where H and W are the height and width of the input feature map. Sparsity has been widely applied in image processing by leveraging local pixel similarity. This motivates us to perform maxpooling-based downsampling to reduce the size of the correlation matrix.

Fig. 4: Sparse NLN. Downsampling is utilized to scale down the memory consumption of the correlation matrix.

We set the downsampling factor s to balance coding efficiency against memory consumption, as shown in Fig. 4. Our experimental studies showed that s = 8 (i.e., the correlation matrix is downsampled by a factor of 64×) achieves a good trade-off. In practice, the factor s can be chosen according to the memory constraints imposed by the underlying system.
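A possible realization of this sparse NLN, assuming max pooling of the φ and g branches by the factor s so that the correlation matrix shrinks from HW×HW to HW×(HW/s²), is sketched below; layer names and channel choices are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseNonLocalBlock(nn.Module):
    """Non-local block with max-pooled keys/values (Fig. 4).

    Pooling phi and g by a factor s shrinks the correlation matrix from
    HW x HW to HW x (HW / s^2), cutting its memory footprint by s^2.
    """
    def __init__(self, channels: int, s: int = 8):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.w_z = nn.Conv2d(inter, channels, 1)
        self.pool = nn.MaxPool2d(s)              # the DOWNSAMPLE boxes in Fig. 4

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)          # B x HW x C/2
        k = self.pool(self.phi(x)).flatten(2)                 # B x C/2 x HW/s^2
        v = self.pool(self.g(x)).flatten(2).transpose(1, 2)   # B x HW/s^2 x C/2
        attn = F.softmax(q @ k, dim=-1)                        # B x HW x HW/s^2
        y = (attn @ v).transpose(1, 2).reshape(b, -1, h, w)
        return self.w_z(y) + x

# Toy usage: 192-channel feature map, s = 8.
out = SparseNonLocalBlock(192, s=8)(torch.randn(1, 192, 16, 16))
```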

B. Parallel 3D Masked CNN based Context Modeling

Both 2D and 3D masked convolutions [39] can be extended to model the conditional probability distribution of the quantized latent features pixel by pixel, as in Fig. 3. Although masked convolutions can leverage neighboring pixels to predict the current pixel efficiently, they usually incur a large computational penalty because of the strictly sequential pixel-by-pixel processing, keeping the compression framework far from practical application.

Recalling the example in Fig. 3, parallelism is mainly broken by the use of the left (horizontal) neighbor (highlighted in blue) in the masked convolution, due to the raster-scan processing order. With this design, H×W×C convolutions are required to process all elements of the feature maps, i.e., a computational complexity of O(H×W×C). Here, C represents the number of channels of the quantized latent features.

One simplification is to remove the dependency on the left neighbors, using only the vertical and channel neighbors. The convolutions for each row can then be performed in parallel by W processors, each computing H×C convolutions. Theoretically this reduces the processing time to O(H×C).

We can further remove the vertical neighbors in Fig. 3, so that the convolutions for all the pixels in each feature-map channel can be computed simultaneously using H×W processors, reducing the processing time to O(C). Simulations in the ablation studies and in Fig. 12 show that the performance impact is negligible when we use such parallel 3D convolution-based context modeling.
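The three context configurations can be illustrated by the mask variants below: dropping the left neighbors (ctx#2) and additionally the vertical neighbors (ctx#1, channel neighbors only) progressively removes the dependencies that force sequential decoding. The helper is a sketch with illustrative names.

```python
import torch

def context_mask_3d(kc=5, kh=5, kw=5, use_vertical=True, use_left=False):
    """Causal 3D mask variants for parallel context modeling (Sec. IV-B).

    use_left=False drops the horizontal (left) neighbors; additionally setting
    use_vertical=False keeps only previously decoded channels, so every spatial
    position of a channel can be predicted in parallel (O(C) sequential steps).
    """
    mask = torch.zeros(kc, kh, kw)
    c0, h0, w0 = kc // 2, kh // 2, kw // 2
    mask[:c0] = 1.0                       # all neighbors from previous channels
    if use_vertical:
        mask[c0, :h0, :] = 1.0            # rows above in the current channel
    if use_left:
        mask[c0, h0, :w0] = 1.0           # left neighbors in the current row
    return mask

# ctx#2 (default NLAIC, no left neighbor): rows of a channel decode in parallel.
print(context_mask_3d().sum())
# ctx#1 (channel neighbors only): the whole channel plane decodes in parallel.
print(context_mask_3d(use_vertical=False).sum())
```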


Fig. 5: Visualization at Variable Rates. (a) Convolutional kernels at bitrates of 0.14 bpp (first row), 0.41 bpp (second row), and 1.11 bpp (last row); (b)-(d) feature maps at the corresponding bitrates of 0.14, 0.41, and 1.11 bpp.

C. A Unified Model for Variable Rates

Deep learning-based image coders are usually trained by minimizing a rate-distortion criterion, i.e., R + λD, where λ is a hyperparameter controlling the trade-off between R and D. Most studies [10], [21], [12] train a separate model for each target bitrate by varying λ. Using different models for different rates not only brings significant memory consumption to host/cache them, but also introduces model switching overhead when performing bitrate adaptation in encoder optimization. Thus, a single, unified model covering a reasonable range of bitrates is highly desirable in practice.

A few attempts have been made to pursue a single unified model for variable rates without re-training for each individual bitrate, such as [32], [33], [34], [43], all of which compress bit-planes progressively. These methods inherently offer bit-depth scalability. From the scalability perspective, a layered network design [44] was proposed to produce fine-grained bit rates in a unified framework, but it is actually a concatenation of a base-layer model trained at a low bit rate and a set of refinement models trained for successive refinements. Recently, Dumas et al. [45] tried to learn the transform and quantization jointly in a simple exemplified network, to offer quantization-independent compression of the luminance image.

As aforementioned, adapting λ yields the best coding efficiency for a specific bitrate target. To examine how the learned filters and feature maps change with the bit rate, we first train a model at a high bitrate (≈ 1 bpp), and then retrain additional models at a variety of bitrates (by using increasingly larger λ) initialized from this high-bitrate model. The corresponding feature maps and convolutional kernels are presented in Fig. 5. Both kernels and feature maps keep almost the same pattern across the variable-rate models, with only the intensity of each element scaled.

Fig. 6: Quality Scaling Factor. Feature maps are scaled using a scaling factor (SF) before quantization, and an inverse scaling factor (ISF) is applied to the entropy-decoded elements. A fixed context model is used for entropy probability estimation.

This suggests that we can simply apply a set of scaling factors to adapt the last feature maps to be coded, trained for the high-bitrate scenario, to other bitrate ranges, without retraining the entire model for each individual rate. This is analogous to the scaling operations [46] used in H.264/AVC or HEVC for transformed coefficients prior to quantization.

Thus, we propose a set of quality scaling factors (sf) embedded in the autoencoder, as shown in Fig. 6, for bit rate adaptation. For an input image Y, the encoder E generates the corresponding feature maps X0 = E(Y) at a specific bit rate, typically a high bit rate R0. Scaling factors (sf ∈ {a0, b0}, {a1, b1}, ..., {an, bn}) are devised to linearly scale each of the n channels of the input features towards a new bitrate. Thus, for the feature maps Xnew at the new bitrate target, we have:

$B_i = Q\{X_i^{new}\} = Q\{SF\{X_i^{0}\}\} = Q\{a_i \cdot X_i^{0} + b_i\}$.  (16)

B_i is the vector of quantized elements of the i-th channel for entropy coding, and will be inversely scaled,

$X_i = ISF\{B_i\} = \frac{B_i - b_i}{a_i}$,  (17)

prior to being fed into the decoder network D for the reconstructed image Ŷ = D(X).

Note that entropy context modeling P is actively employed in learned image compression to accurately capture the probability distribution of feature map elements for rate-distortion optimization and entropy coding. Given that the elements in the feature maps are scaled and biased, we need to adjust the quantization step in the probability calculation accordingly, i.e.,

$p(x) = \prod_i \big(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2}, \tfrac{1}{2})\big)(x_i)$  (18)

$\Rightarrow \prod_i \big(\mathcal{N}(\mu_i, \sigma_i^2) * \mathcal{U}(-\tfrac{1}{2a_i}, \tfrac{1}{2a_i})\big)(x_i^{new})$.  (19)
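Under one plausible reading of Eqs. (16)-(19), the scaling, the inverse scaling and the adjusted quantization bin width used in the likelihood evaluation can be sketched as follows; all function names and the toy tensors are illustrative assumptions.

```python
import torch

def encode_with_sf(x0, a, b):
    """Eq. (16): B_i = Q{a_i * X0_i + b_i}; a, b hold one (a_i, b_i) pair per channel."""
    scaled = a.view(1, -1, 1, 1) * x0 + b.view(1, -1, 1, 1)
    return torch.round(scaled)                      # Q{.}: rounding as the quantizer

def decode_with_isf(bits, a, b):
    """Eq. (17): X_i = (B_i - b_i) / a_i, applied before the decoder network."""
    return (bits - b.view(1, -1, 1, 1)) / a.view(1, -1, 1, 1)

def scaled_likelihood(x, mu, sigma, a):
    """Eqs. (18)-(19): with the fixed context model, the quantization bin width
    seen by the entropy model becomes 1/a_i (the uniform support shrinks)."""
    half_bin = 0.5 / a.view(1, -1, 1, 1)
    dist = torch.distributions.Normal(mu, sigma)
    return dist.cdf(x + half_bin) - dist.cdf(x - half_bin)

# Toy example: 4-channel features from a model trained at a high rate,
# rescaled towards a lower target rate with per-channel factors a_i (b_i = 0).
x0 = torch.randn(1, 4, 8, 8)
a, b = torch.tensor([0.5, 0.6, 0.4, 0.7]), torch.zeros(4)
bits = encode_with_sf(x0, a, b)
x_rec = decode_with_isf(bits, a, b)
lik = scaled_likelihood(x_rec, torch.zeros_like(x0), torch.ones_like(x0), a)
```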

Though we have exemplified the quality scaling factor using an illustrative autoencoder in Fig. 6, it can easily be extended to other, more complex network structures. In practice, our NLAIC implements both main and hyperprior encoders and decoders. However, our simulations have revealed that hyperpriors only occupy a small overhead of an entire compressed bitstream, e.g., 2%-8% (the larger number at smaller bpp). Thus, in view of low complexity and practical application, we only apply scaling factors in the main codec, leaving the hyperprior codec fixed and trained at the highest bit rate.

Fig. 7: Rate-distortion Efficiency. Illustrative comparisons are given using the public Kodak data. (a) Distortion measured by MS-SSIM (dB); here −10 log10(1 − d) is used to represent raw MS-SSIM (d) on a dB scale. (b) PSNR used for distortion evaluation. (c) BD-Rate gains (%) of each method over the JPEG anchor, with distortion measured by PSNR.

V. EXPERIMENTAL STUDIES

This section presents a comprehensive performance evaluation. Training and measurement follow the common practices used by other compression algorithms [47], [2], [10], [21] for a fair comparison.

A. Training

We use the COCO [48] and CLIC [49] datasets to train our NLAIC framework. We randomly crop images into 192×192×3 patches for learning. The well-known RDO process is applied for end-to-end training at various bit rates via L = λ·d(Ŷ, Y) + Rx + Rz, where d(·) is the distortion between the reconstructed image Ŷ and the original image Y. Both negative MS-SSIM and MSE are used as distortion loss functions in this work, marked as "MS-SSIM opt." and "MSE opt.", respectively. Rx and Rz represent the estimated bit rates of the latent features and the hyperpriors, respectively. All components of our NLAIC are trained together. We set the learning rate (LR) for Em, Dm, Eh, Dh and P to 3e−5 at the beginning, but for P, the LR is clipped to 1e−5 after 30 epochs. The batch size is set to 16 and the entire model is trained on 4 GPUs in parallel.
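One way to realize this learning-rate schedule is sketched below, assuming an Adam optimizer (the paper does not specify the optimizer) and placeholder parameter groups standing in for the codec modules and the context model P.

```python
import torch

# Placeholder parameter groups; in practice these come from Em, Dm, Eh, Dh (group 0)
# and the conditional context model P (group 1).
codec_params = [torch.nn.Parameter(torch.randn(8))]
context_params = [torch.nn.Parameter(torch.randn(8))]

optimizer = torch.optim.Adam([
    {"params": codec_params, "lr": 3e-5},    # main and hyperprior encoders/decoders
    {"params": context_params, "lr": 3e-5},  # context model P
])

for epoch in range(50):
    # ... run the RDO training loop minimizing L = lambda*d + R_x + R_z here ...
    if epoch == 30:
        optimizer.param_groups[1]["lr"] = 1e-5  # clip the LR of P after 30 epochs
```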

B. Rate-Distortion Efficiency

We evaluate our NLAIC models by comparing the rate-distortion performance averaged over the publicly available Kodak dataset. Figure 7 shows the performance when distortion is measured by MS-SSIM and PSNR, respectively. MS-SSIM and PSNR are widely used in image and video compression tasks. PSNR represents pixel-level distortion while MS-SSIM describes structural similarity. MS-SSIM is reported to offer higher correlation with human perception, especially at low bit rates [17]. As can be seen, our NLAIC provides state-of-the-art performance with a noticeable gain over the other leading methods, such as Minnen2018 [21] and Balle2018 [10].

Objective Measurement. As shown in Fig. 7(a), using MS-SSIM for both the loss and the final distortion measurement, and in Fig. 7(b), using MSE for the loss and PSNR for the final distortion, NLAIC is ranked first in both cases, offering state-of-the-art coding efficiency. Figure 7(c) compares the average BD-Rate reductions of various methods over the legacy JPEG encoder. Our NLAIC model shows 64.39% and 11.97% BD-Rate [50] reductions against JPEG (4:2:0) and BPG (YCbCr 4:4:4) HM, respectively. Here BPG HM is compiled with the HEVC HM reference software, which has slightly better performance than the x265 used in default BPG.
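For reference, the reported quantities can be reproduced as in the short sketch below: PSNR from a given MSE, and raw MS-SSIM mapped to a dB scale via −10 log10(1 − d) as used in Fig. 7(a); the example values are illustrative only.

```python
import math

def psnr_from_mse(mse: float, peak: float = 255.0) -> float:
    """Peak Signal-to-Noise Ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(peak ** 2 / mse)

def ms_ssim_db(d: float) -> float:
    """Map raw MS-SSIM d in [0, 1) to a dB scale, as plotted in Fig. 7(a)."""
    return -10.0 * math.log10(1.0 - d)

print(psnr_from_mse(42.0))   # ~31.9 dB for an 8-bit image with MSE = 42
print(ms_ssim_db(0.9877))    # ~19.1 dB for a raw MS-SSIM of 0.9877
```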

Subjective Evaluation. We also evaluate our method on the BSD500 [51] dataset, which is widely used in image restoration problems. Figure 8 shows the results of different image codecs at similar bit rates. Our NLAIC provides the best subjective quality at a relatively smaller bit rate. In practice, some bit rate points cannot be reached exactly by BPG and JPEG, so we choose the closest one to match our NLAIC bit rate.

Fig. 8: Subjective Evaluation. Visual comparison among JPEG (4:2:0), BPG (4:4:4), NLAIC MSE opt., NLAIC MS-SSIM opt. and the original image, from left to right. Example 1: (a) JPEG 0.3014 bpp, PSNR 21.23, MS-SSIM 0.8504; (b) BPG 0.3464 bpp, PSNR 24.84, MS-SSIM 0.9270; (c) NLAIC MSE opt. 0.2929 bpp, PSNR 24.71, MS-SSIM 0.9277; (d) NLAIC MS-SSIM opt. 0.3087 bpp, PSNR 23.57, MS-SSIM 0.9551; (e) original. Example 2: (a) JPEG 0.2127 bpp, PSNR 25.17, MS-SSIM 0.8629; (b) BPG 0.1142 bpp, PSNR 31.97, MS-SSIM 0.9581; (c) NLAIC MSE opt. 0.1276 bpp, PSNR 34.63, MS-SSIM 0.9738; (d) NLAIC MS-SSIM opt. 0.1074 bpp, PSNR 32.54, MS-SSIM 0.9759; (e) original. Our method achieves the best visual quality, containing more texture without blocky or blurring artifacts.

C. Complexity Analysis

We perform all tests on an NVIDIA P100 GPU with the PyTorch toolbox. The trained model has a size of about 262 MB. For an input image of size 512×768×3, the model requires 291.8G FLOPs for the encoder, 353.2G FLOPs for the decoder and 3.46G FLOPs for the context model. Forward encoding requires 6172 MB of running memory and takes about 438 ms. At the decoder side, the default line-by-line decoding with the context model takes most of the decoding time, which can be remarkably decreased by channel-wise parallel context modeling, as discussed in Section IV-B.

VI. ABLATION STUDIES

We further analyze our NLAIC in the following aspects to understand the capability of our system in practice.

Impacts of Loss Functions. Considering that the MS-SSIM loss optimized results demonstrate much lower PSNR at high bit rates in Fig. 7, we visualize decompressed images at high bit rates using models optimized for PSNR and MS-SSIM loss, as shown in Fig. 9. We find that the MS-SSIM loss optimized results exhibit worse detail than the PSNR loss optimized models at high bit rates. This may be because pixel distortion becomes more significant at high bit rates, whereas structural similarity puts more weight on the fairly low bit rate range. It would be interesting to explore a better metric combining the advantages of PSNR at high bit rates and MS-SSIM at low bit rates for overall optimal efficiency.

Impacts of Contexts. To understand the contribution of context modeling using spatial-channel neighbors, we offer an alternative implementation, referred to as the "NLAIC baseline", that only uses the hyperpriors to estimate the means and variances of the latent features (see Eq. (12)). In contrast, the default NLAIC uses both hyperpriors and previously coded pixels of the latent feature maps (see Eq. (15)).

Referring to Fig. 7(a), even our "NLAIC baseline" outperforms the existing methods when using MS-SSIM as the loss and evaluation measure. When using MSE as the loss and PSNR as the distortion measurement, the "NLAIC baseline" is slightly worse than the model in [21], which, like our full "NLAIC", uses contexts from both hyperpriors and autoregressive neighbors jointly, but better than the work [10] that only uses hyperpriors for context modeling, for a fair comparison.


Fig. 9: Impacts of Loss Functions. Illustrative reconstructions from PSNR (MSE) and MS-SSIM loss optimized compression: (a) MS-SSIM opt., 0.8743 bpp, PSNR 31.49, MS-SSIM 0.9956; (b) MSE opt., 0.8798 bpp, PSNR 35.12, MS-SSIM 0.9935; (c) original; (d) MS-SSIM opt., 0.6056 bpp, PSNR 28.84, MS-SSIM 0.9879; (e) MSE opt., 0.6045 bpp, PSNR 30.43, MS-SSIM 0.9815; (f) original.

Fig. 10: Impacts of NLAM. Coding efficiency (PSNR vs. bpp) when gradually removing NLAM components and re-training the model, comparing NLAIC (MSE opt.), NLAIC baseline (MSE opt.), its "remove first", "remove main" and "remove all" variants, and Balle (2018).

We further compare the conditional context modeling efficiency of the model variants in Fig. 11. As can be seen, with the embedded NLAM and joint context modeling, our NLAIC produces more compact latent features and smaller normalized feature prediction errors, both contributing to its leading coding efficiency.

In this work, we first train the "NLAIC baseline" models. To train the full NLAIC model, one way is to fix the main and hyperprior encoders and decoders of the baseline model and update only the conditional context model P. Compared with the "NLAIC baseline", such transfer-learning-based "NLAIC" provides a 3% bit rate reduction at the same distortion. Alternatively, we can use the baseline model as the starting point and refine all the modules of the "NLAIC" system. In this way, "NLAIC" offers more than 9% bit rate reduction over the "NLAIC baseline" at the same quality. Thus, we choose the latter for better performance.

Impacts of NLAM. To further delineate the gain due to the newly introduced NLAM, we remove the mask branches of the NLAM pairs gradually and retrain our framework for performance evaluation. For this study, we use the baseline context modeling (hyperpriors only) in all cases, and use MSE as the loss function and PSNR as the final distortion measurement, as shown in Fig. 10. For illustrative understanding, we also provide two anchors, i.e., "Balle2018" [10] and "NLAIC". However, to see the degradation caused by gradually removing the mask branches of the NLAMs, one should compare against the NLAIC baseline curve.

Removing the mask branches of the first NLAM pair in the main encoder-decoder (referred to as "remove first") yields a PSNR drop of about 0.1 dB compared to the "NLAIC baseline" at the same bit rate. The PSNR drop is enlarged noticeably when removing all NLAM mask branches in the main encoder-decoder (a.k.a., "remove main"). The worst performance occurs when the NLAM mask branches in the hyperprior encoder-decoder are disabled as well, resulting in a traditional variational autoencoder without non-local characteristics (i.e., "remove all").

Impacts of Parallel Context Modeling. Parallel 3D CNN-based context modeling was introduced in Section IV-B to speed up computational throughput by removing the prediction dependency on the left, or the left and upper, neighbors. First, we found that the left neighbors have a negligible effect on performance; hence, we simply use contexts without left neighbors in the default NLAIC. As shown in Fig. 12, we examined three scenarios for conditional probability modeling: 1) hyperprior only (a.k.a., NLAIC baseline), 2) hyperprior + ctx#1 (contexts with channel neighbors only), and 3) NLAIC (hyperprior + ctx#2, contexts without left neighbors), leading to 6.76%, 3.73% and ≈ 0% BD-Rate losses (measured with MS-SSIM) for cases 1), 2) and 3), respectively, when compared with the NLAIC variant having full access to complete causal information for probability modeling. On the other hand, the computational complexity is greatly reduced by enforcing the parallel context modeling.

Hyperpriors z. The hyperpriors z make a noticeable contribution to the overall compression performance [21], [10]. Their percentage of the total bitstream decreases as the overall bit rate increases, as shown in Fig. 13. The percentage of z for the MSE loss optimized model is higher than for MS-SSIM loss optimization, but is still much less than the bits consumed by the latent features.

Fig. 11: Prediction error of different models at similar bit rates. Column-wise, from left to right: the latent features, the predicted mean, the predicted scale, the normalized prediction error (i.e., (feature − mean)/scale) and the distribution of the normalized prediction error. Each row represents a different model (e.g., various combinations of NLAM components and context prediction). These plots show that with NLAM and joint contexts from hyperpriors and autoregressive neighbors, the latent features capture more information (indicated by a larger dynamic range), which leads to larger predicted means and scales (standard deviations of the features). Meanwhile, the distribution of the final normalized feature prediction error remains compact, which leads to the lowest bit rate.

Impacts of the Unified Model for Variable Rates. A unified model for variable rates is highly desirable for practical application, avoiding model re-training for each individual rate. As discussed in Sec. IV-C, we have developed quality scaling factors to re-use a model trained for the high-bit-rate scenario directly at another set of bitrates.

Since the original feature maps are shifted, scaled and quantized, the scaling and inverse scaling operations may induce variations of the data distribution, leading to degradation of coding efficiency at lower bitrates; usually, the coding efficiency degrades more as the bitrate moves further from the training point. Thus, we suggest applying three models instead of a single one to avoid an unexpected performance gap at ultra-low bitrates (each model being trained at a high bitrate), as shown in Fig. 14, to cover the respective low, medium and high bitrate ranges. For each bitrate range, its model is trained at the highest bitrate of that range; note that the bitrate ranges may overlap. Simulations have revealed that our approach with three unified models for variable rates still outperforms BPG by a significant margin, and offers efficiency comparable to Minnen2018 [21], which requires model retraining for each bitrate. Note that the bitrate of the hyperpriors is fixed over the entire bitrate range, which causes the performance loss at low bitrates. We can further apply quality factors to the hyperprior encoder and decoder as well, to narrow the gap between our unified model and models that are trained separately.

Fig. 12: Impacts of Parallel Context Modeling. Coding efficiency for a variety of conditional probability predictions: hyperprior only (a.k.a., NLAIC baseline), hyperprior + ctx#1 (contexts with channel neighbors only), and hyperprior + ctx#2 (contexts without left neighbors).

Fig. 13: Percentage of z. Bits consumed by z in the entire bitstream. The percentage of z for the MSE loss optimized method is noticeably higher than for the scenario using the MS-SSIM loss.

VII. CONCLUDING REMARKS AND FUTURE WORKS

In this paper, we proposed a neural image compression algorithm using non-local attention optimization and improved context modeling (NLAIC).


[Fig. 14 plot: MS-SSIM (dB) versus bit per pixel (bpp) for NLAIC (MS-SSIM opt.), Minnen (2018), BPG (YCbCr 4:4:4), HM, NLAIC QF low, NLAIC QF mid, and NLAIC QF high.]

Fig. 14: RD performance of multi-bitrate generation using the unified models with quality scaling factors.

Our NLAIC method achieves state-of-the-art performance for both MS-SSIM and PSNR evaluations at the same bit rate, when compared with existing image compression methods, including the well-known BPG, JPEG2000 and JPEG, as well as the most recent learning-based schemes [21], [10].

Key novelties of NLAIC are non-local operations as non-linear transforms to capture both local and global correlations for latent features and hyperpriors, implicit attention mechanisms to assign more weight to salient image areas, and conditional entropy modeling using joint 3D CNN-based spatial-channel neighbors and hyperpriors. For practical applications, sparse non-local operations, parallel 3D CNN-based context modeling, and a unified model for variable rates are introduced to reduce the space, time, and implementation complexity.
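For readers unfamiliar with the non-local operation that underpins the NLAM blocks, the sketch below is a generic embedded-Gaussian non-local block in the style of Wang et al. [27]; it is a simplified stand-in rather than the exact NLAIC layer (the channel sizes and the residual connection here are assumptions), but it shows how every spatial position is weighted by its responses to all other positions.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Generic embedded-Gaussian non-local operation (after Wang et al. [27])."""
    def __init__(self, channels: int):
        super().__init__()
        inter = channels // 2
        self.theta = nn.Conv2d(channels, inter, 1)   # query embedding
        self.phi = nn.Conv2d(channels, inter, 1)     # key embedding
        self.g = nn.Conv2d(channels, inter, 1)       # value embedding
        self.out = nn.Conv2d(inter, channels, 1)     # restore channel count

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)  # (N, HW, C')
        k = self.phi(x).flatten(2)                    # (N, C', HW)
        v = self.g(x).flatten(2).transpose(1, 2)      # (N, HW, C')
        attn = torch.softmax(q @ k, dim=-1)           # responses over all positions
        y = (attn @ v).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                        # residual connection

x = torch.randn(1, 64, 32, 32)
print(NonLocalBlock(64)(x).shape)                     # torch.Size([1, 64, 32, 32])
```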

For future study, we would like to extend our framework to end-to-end video compression with more priors acquired from spatial and temporal information. Another interesting avenue is to further simplify the models for embedded systems, e.g., via fixed-point implementation, platform-friendly network structures, etc.

VIII. ACKNOWLEDGEMENT

We are very grateful for the constructive comments from anonymous reviewers to improve this manuscript.

REFERENCES

[1] G. K. Wallace, "The JPEG still picture compression standard," Communications of the ACM, vol. 34, no. 4, pp. 30–44, 1991.
[2] D. T. Lee, "JPEG 2000: Retrospective and new developments," Proceedings of the IEEE, vol. 93, no. 1, pp. 32–41, Jan 2005.
[3] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 560–576, July 2003.
[4] G. J. Sullivan, J. R. Ohm, W. J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649–1668, Dec 2012.
[5] T. Berger, "Rate-distortion theory," Wiley Encyclopedia of Telecommunications, 2003.
[6] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Transactions on Computers, vol. 100, no. 1, pp. 90–93, 1974.
[7] B.-F. Wu and C.-F. Lin, "A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec," IEEE Transactions on Circuits and Systems for Video Technology, vol. 15, no. 12, pp. 1615–1628, 2005.
[8] J. Lainema, F. Bossen, W.-J. Han, J. Min, and K. Ugur, "Intra coding of the HEVC standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1792–1801, 2012.
[9] X. Xu, S. Liu, T. Chuang, Y. Huang, S. Lei, K. Rapaka, C. Pang, V. Seregin, Y. Wang, and M. Karczewicz, "Intra block copy in HEVC screen content coding extensions," IEEE Journal on Emerging and Selected Topics in Circuits and Systems, vol. 6, no. 4, pp. 409–419, Dec 2016.
[10] J. Balle, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," arXiv preprint arXiv:1802.01436, 2018.
[11] O. Rippel and L. Bourdev, "Real-time adaptive image compression," arXiv preprint arXiv:1705.05823, 2017.
[12] F. Mentzer, E. Agustsson, M. Tschannen, R. Timofte, and L. Van Gool, "Conditional probability models for deep image compression," in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1, no. 2, 2018, p. 3.
[13] J. Balle, "Efficient nonlinear transforms for lossy image compression," in 2018 Picture Coding Symposium (PCS). IEEE, 2018, pp. 248–252.
[14] M. Li, W. Zuo, S. Gu, D. Zhao, and D. Zhang, "Learning convolutional networks for content-weighted image compression," arXiv preprint arXiv:1703.10553, 2017.
[15] D. Liu, B. Wen, Y. Fan, C. C. Loy, and T. S. Huang, "Non-local recurrent network for image restoration," in Advances in Neural Information Processing Systems, 2018, pp. 1680–1689.
[16] G. J. Sullivan, T. Wiegand et al., "Rate-distortion optimization for video compression," IEEE Signal Processing Magazine, vol. 15, no. 6, pp. 74–90, 1998.
[17] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398–1402.
[18] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[19] C. Huang, H. Liu, T. Chen, S. Pu, Q. Shen, and Z. Ma, "Extreme image compression via multiscale autoencoders with generative adversarial optimization," 2019.
[20] D. Marr, Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. New York, NY: Henry Holt and Co., Inc., 1982.
[21] D. Minnen, J. Balle, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, 2018, pp. 10794–10803.
[22] J. Balle, N. Johnston, and D. Minnen, "Integer networks for data compression with latent-variable models," 2018.
[23] "Kodak lossless true color image suite," http://r0k.us/graphics/kodak/, Dec. 15th 2016.
[24] S. Gu, L. Zhang, W. Zuo, and X. Feng, "Weighted nuclear norm minimization with application to image denoising," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2862–2869.
[25] J. Mairal, F. Bach, J. Ponce, G. Sapiro, and A. Zisserman, "Non-local sparse models for image restoration," in 2009 IEEE 12th International Conference on Computer Vision (ICCV). IEEE, 2009, pp. 2272–2279.
[26] A. Buades, B. Coll, and J.-M. Morel, "A non-local algorithm for image denoising," in 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05), vol. 2. IEEE, 2005, pp. 60–65.
[27] X. Wang, R. Girshick, A. Gupta, and K. He, "Non-local neural networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7794–7803.
[28] Y. Zhang, K. Li, K. Li, B. Zhong, and Y. Fu, "Residual non-local attention networks for image restoration," 2018.
[29] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[30] O. Firat, K. Cho, and Y. Bengio, "Multi-way, multilingual neural machine translation with a shared attention mechanism," arXiv preprint arXiv:1601.01073, 2016.


[31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.
[32] G. Toderici, S. M. O'Malley, S. J. Hwang, D. Vincent, D. Minnen, S. Baluja, M. Covell, and R. Sukthankar, "Variable rate image compression with recurrent neural networks," arXiv preprint arXiv:1511.06085, 2015.
[33] G. Toderici, D. Vincent, N. Johnston, S. J. Hwang, D. Minnen, J. Shor, and M. Covell, "Full resolution image compression with recurrent neural networks," in CVPR. IEEE, 2017, pp. 5435–5443.
[34] N. Johnston, D. Vincent, D. Minnen, M. Covell, S. Singh, T. Chinen, S. Jin Hwang, J. Shor, and G. Toderici, "Improved lossy image compression with priming and spatially adaptive bit rates for recurrent networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4385–4393.
[35] J. Balle, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," arXiv preprint arXiv:1611.01704, 2016.
[36] E. Agustsson, F. Mentzer, M. Tschannen, L. Cavigelli, R. Timofte, L. Benini, and L. V. Gool, "Soft-to-hard vector quantization for end-to-end learning compressible representations," in Advances in Neural Information Processing Systems, 2017, pp. 1141–1151.
[37] H. Liu, T. Chen, Q. Shen, T. Yue, and Z. Ma, "Deep image compression via end-to-end learning," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2018.
[38] H. Liu, T. Chen, Q. Shen, and Z. Ma, "Practical stacked non-local attention modules for image compression," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[39] A. v. d. Oord, N. Kalchbrenner, and K. Kavukcuoglu, "Pixel recurrent neural networks," arXiv preprint arXiv:1601.06759, 2016.
[40] E. Agustsson, M. Tschannen, F. Mentzer, R. Timofte, and L. Van Gool, "Generative adversarial networks for extreme learned image compression," arXiv preprint arXiv:1804.02958, 2018.
[41] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[42] H. Liu, T. Chen, P. Guo, Q. Shen, and Z. Ma, "Gated context model with embedded priors for deep image compression," arXiv preprint arXiv:1902.10480, 2019.
[43] Z. Zhang, Z. Chen, J. Lin, and W. Li, "Learned scalable image compression with bidirectional context disentanglement network," arXiv preprint arXiv:1812.09443, 2018.
[44] C. Jia, Z. Liu, Y. Wang, S. Ma, and W. Gao, "Layered image compression using scalable auto-encoder," arXiv preprint arXiv:1904.00553, 2019.
[45] T. Dumas, A. Roumy, and C. Guillemot, "Autoencoder based image compression: can the learning be quantization independent?" in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 1188–1192.
[46] H. S. Malvar, A. Hallapuro, M. Karczewicz, and L. Kerofsky, "Low-complexity transform and quantization in H.264/AVC," IEEE Transactions on Circuits and Systems for Video Technology, vol. 13, no. 7, pp. 598–603, July 2003.
[47] Overview of JPEG, https://jpeg.org/jpeg/, 2018.
[48] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[49] Challenge on learned image compression 2018. [Online]. Available: http://www.compression.cc/challenge/
[50] G. Bjontegaard, "Calculation of average PSNR differences between RD-curves," VCEG-M33, 2001.
[51] The Berkeley segmentation dataset and benchmark. [Online]. Available: https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/