
Spatially-Adaptive Residual Networks for Efficient Image and Video Deblurring

Kuldeep Purohit, A. N. Rajagopalan
Indian Institute of Technology Madras, India

[email protected], [email protected]

Abstract

In this paper, we address the problem of dynamic scene deblurring in the presence of motion blur. Restoration of images affected by severe blur necessitates a network design with a large receptive field, which existing networks attempt to achieve through simple increments in the number of generic convolution layers, the kernel size, or the scales at which the image is processed. However, increasing the network capacity in this manner comes at the expense of a larger model size and slower inference, while ignoring the non-uniform nature of blur. We present a new architecture composed of spatially adaptive residual learning modules that implicitly discover the spatially varying shifts responsible for non-uniform blur in the input image and learn to modulate the filters. This capability is complemented by a self-attentive module which captures non-local relationships among the intermediate features and enhances the receptive field. We then incorporate a spatiotemporal recurrent module in the design to also facilitate efficient video deblurring. Our networks can implicitly model the spatially-varying deblurring process, while dispensing with multi-scale processing and large filters entirely. Extensive qualitative and quantitative comparisons with prior art on benchmark dynamic scene deblurring datasets clearly demonstrate the superiority of the proposed networks via a reduction in model size and significant improvements in accuracy and speed, enabling almost real-time deblurring.

1. Introduction

Though computational imaging has made tremendous progress over the years, handling motion blur in captured content remains a challenge. Motion blur is caused by motion of objects in the scene or the camera during sensor exposure. Apart from significantly degrading visual quality, the distortions arising from blur lead to a considerable performance drop for many vision tasks [39]. There exist a few commercially available cameras which can capture frames at a high frame-rate and thus experience less blur, but they suffer from noise at high resolution and are quite expensive.

[Figure 1: scatter plot of PSNR (dB) versus runtime for an HD image (seconds, log scale from 10^-2 to 10^3), comparing MSCNN, DeblurGAN, SRN, Sun et al., MotionFlow, SVRNN, and Ours.]

Figure 1. Our network outperforms all existing approaches on the dynamic scene deblurring benchmark [26] across all factors: accuracy, inference time, and compactness (the diameter of each circle is proportional to the corresponding model size).

Motion deblurring is a challenging problem in computer vision due to its ill-posed nature. The past decade has witnessed significant advances in deblurring, wherein major efforts have gone into designing priors that are apt for recovering the underlying undistorted image and the camera trajectory [47, 29, 11, 34, 8, 17, 18, 19, 28, 30, 40, 49]. An exhaustive survey of uniform blind deblurring algorithms can be found in [21]. A few approaches [4, 32, 33] have proposed hybrid algorithms where a Convolutional Neural Network (CNN) estimates the blur kernel, which is then used in an alternative optimization framework for recovering the latent image.

However, these methods have been developed based on a rather strong constraint that the scene is planar and that the blur is governed only by camera motion. This precludes commonly occurring blur in most practical settings. Real-world blur arises from various sources including moving objects, camera shake, and depth variations, causing different pixels to acquire different motion trajectories. A class of algorithms employs segmentation methods to relax the static and fronto-parallel scene assumption by independently restoring different blurred regions in the scene [13]. However, these methods depend heavily on an accurate segmentation map. A few methods [37, 12] circumvent the segmentation stage by training CNNs to estimate locally linear blur kernels and feeding them to a non-uniform deblurring algorithm based on a patch-level prior. However, they are limited in their capability when it comes to general dynamic scenes.

The aforementioned methods are not end-to-end systems and share a severe disadvantage: they involve iterative, time-intensive, and cumbersome optimization schemes at the network output to obtain the final deblurred result. The use of fully convolutional CNNs to directly estimate the latent sharp image [26, 27, 20, 38] has recently advanced the state-of-the-art in non-uniform motion deblurring. This has the advantage of enabling dynamic scene deblurring at low latency, by circumventing the iterative optimization stage involving the fitting of hand-designed motion models.

Conventional approaches for video deblurring are based on image deblurring techniques (using priors on the latent sharp frames and the blur kernels) which remove uniform blur [3, 52] and non-uniform blur caused by rotational camera motion [22, 7, 50, 53]. However, these approaches are applicable only under the strong assumption of static scenes and the absence of depth-dependent distortions. The work in [45] proposed a segmentation-based approach to address different blurs in foreground and background regions. Kim et al. [14] further relaxed the constraint on the scene motion by parameterizing the spatially varying blur kernel using optical flow.

With the introduction of labeled realistic motion blur datasets [36], deep learning based approaches have been proposed to estimate sharp video frames in an end-to-end manner. Deep Video Deblurring (DVD) [36] is the first such work to address generalized video deblurring, wherein a neural network accepts a stack of neighboring blurry frames for deblurring. They perform off-line stabilization of the blurred frames before feeding them to the network, which learns to exploit the information from multiple frames to deblur the central frame. Nevertheless, when images are heavily blurred, this method introduces temporal artifacts that become more visible after stabilization. A few methods have also been proposed for burst image deblurring [44, 1], which utilize a number of observations with independent blurs to restore a scene, but are not trained for general video deblurring. Online Video Deblurring (OVD) [15] presents a faster design for video deblurring which does not require frame alignment. It utilizes temporal connections to increase the receptive field of the network. Although OVD can handle large motion blur without adding computational overhead, it falls short in accuracy and is not real-time.

There are two major limitations shared by prior deblurring works. Firstly, the filters of a generic CNN are spatially invariant (with a spatially-uniform receptive field), which is suboptimal for modeling the process of dynamic scene deblurring and limits accuracy. Secondly, existing methods achieve a high receptive field through networks with a large number of parameters and a high computational footprint, making them unsuitable for real-time applications. As the only other work of this kind, [54] recently proposed a design composed of multiple CNNs and Recurrent Neural Networks (RNNs) to learn spatially varying weights for deblurring. However, their performance is inferior to the state-of-the-art [38] in several aspects. Reaching a trade-off among inference time, restoration accuracy, and receptive field is a non-trivial task, which we address in this paper. We investigate a position- and motion-aware CNN architecture, which can efficiently handle multiple image segments undergoing motion with different magnitudes and directions.

Following recent developments, we adopt an end-to-end learning based approach to directly estimate the restored sharp image. For single image deblurring, we build a fully convolutional architecture equipped with filter-transformation and feature-modulation capabilities suited to the task of motion deblurring. Our design hinges on the fact that motion blur is essentially an aggregation of various spatially varying transformations of the image, and a network that implicitly adapts to the location and direction of such motion is a better candidate for the restoration task. Next, we address the problem of video deblurring, wherein we extend our single image deblurring network to exploit the redundancy across consecutive frames of a video to guide the process. To this end, we introduce spatio-temporal recurrence at the frame and feature level to efficiently restore sequences of blurred frames.

Our network contains various layers to spatially transform intermediate filters as well as the feature maps. Its advantages over prior art are three-fold: 1. It is fully convolutional and parametrically efficient: deblurring can be achieved with just a single forward pass through a compact network. 2. Its components can be easily introduced into other architectures and trained in an end-to-end manner using conventional loss functions. 3. The transformations estimated by the network are dynamic and hence can be meaningfully interpreted for any test image.

The efficiency of our architecture is demonstrated through comprehensive comparisons with the state-of-the-art on image and video deblurring benchmarks. While a majority of image and video deblurring networks contain > 7 million parameters, our model achieves superior performance at only a fraction of this size, while being computationally more efficient, resulting in real-time deblurring of images on a single GPU.

The major contributions of our work are:

• We propose an efficient motion deblurring network architecture built using a cascade of dynamic convolutional modules that facilitate position-specific filter transformation.

• Our network benefits from estimating image-dependent spatial attention maps to process local features jointly with their globally distributed interdependencies. This adaptively enlarges the receptive field of the network in accordance with the blur present in the scene.

• We extend our architecture to perform video deblurring by incorporating a spatio-temporal recurrent module to learn the propagation of relevant features along the temporal direction.

• Extensive evaluations on image and video deblurring benchmarks demonstrate that our architectures are faster, more accurate, and contain fewer parameters than existing art, enabling deblurring in real-time.

2. Proposed Architectures

An existing technique for accelerating various image processing operations is to down-sample the input image, execute the operation at low resolution, and up-sample the output [5]. However, this approach discounts the importance of resolution, rendering it unsuitable for image restoration tasks where the high-frequency content of the image is of prime importance (deblurring, super-resolution).

Another efficient design is a CNN with a fixed but very large receptive field (comparable to very high-resolution images), e.g., the cascaded dilated network [6], which was proposed to accelerate various image-to-image tasks. However, simple dilated convolutions are not appropriate for restoration tasks (as shown in [24] for image super-resolution). After several layers of dilated filtering, the output only considers a fixed sparse sampling of input locations, resulting in a significant loss of information.

Until recently, the driving force behind performance improvement in deblurring was the use of a large number of layers, larger filters, and multi-scale processing, which gradually increases the "fixed" receptive field. Not only is this a suboptimal design, it is also difficult to scale, since the effective receptive field of deep CNNs is much smaller than the theoretical one (as investigated in [25]).

We claim that a better alternative is to design a convolutional network whose receptive field is adaptive to input image instances. We show that the latter approach is a far better choice due to its task-specific efficacy and utility for computationally limited environments, and it delivers consistent performance across diverse magnitudes of blur. We first explain the need for a network with asymmetric filters. Given a 2D image I and a blur kernel K, the motion blur process can be formulated as

B[x, y] = \sum_{m,n=-M/2}^{M/2} K[m, n]\, I[x - m, y - n]    (1)

[Figure 2: block diagram. INPUT → feature extraction (convolution layers and residual blocks with channel widths n, 2n, 4n) → deformable residual modules and self-attention module → reconstruction (residual blocks and transposed convolution layers with widths 4n, 2n, n) → DEBLURRED IMAGE.]

Figure 2. The proposed deblurring network and its components.

where B is the blurred image, [x, y] represents the pixel coordinates, and M × M is the size of the blur kernel. At any given location [x, y], the sharp intensity can be represented as

I[x, y] = \frac{B[x, y]}{K[0, 0]} - \frac{\sum_{m,n=-M/2,\,(m,n)\neq(0,0)}^{M/2} K[m, n]\, I[x - m, y - n]}{K[0, 0]},    (2)

which is a 2D infinite impulse response (IIR) model. Recursive expansion of the second term would eventually lead to an expression which contains values from only the blurred image and the kernel:

I[x, y] = \frac{B[x, y]}{K[0, 0]} - \frac{\sum_{m,n=-M/2,\,(m,n)\neq(0,0)}^{M/2} K[m, n]\, B[x - m, y - n]}{K[0, 0]^2} + \frac{\sum_{m,n=-M/2,\,(m,n)\neq(0,0)}^{M/2} \sum_{i,j=-M/2,\,(i,j)\neq(0,0)}^{M/2} K[m, n]\, K[i, j]\, I[x - m - i, y - n - j]}{K[0, 0]^2}    (3)

The dependence of I[x, y] on a large number of locations in B shows that the deconvolution process requires infinite signal information. If we assume that the boundary of the image is zero, Eq. 3 is equivalent to applying an inverse filter to B. As visualized in [54], the non-zero region of such an inverse deblurring filter is typically much larger than the blur kernel. Thus, if we use a CNN to model the process, a large receptive field should be considered to cover the pixel positions that are necessary for deblurring. Eq. 3 also shows that only a few coefficients (namely K[m, n] for m, n ∈ [−M/2, M/2]) need to be estimated by the deblurring model, provided we can find an appropriate operation to cover a large enough receptive field.

For this theoretical analysis, we temporarily assume that the motion blur kernel K is linear (an assumption used in a few prior deblurring works [37, 12]). Now, consider an image B which is affected by motion blur in the horizontal direction (without loss of generality), implying K[m, n] = 0 for m ≠ 0 (non-zero values present only in the middle row of the kernel). For such a case, Eq. 3 translates to

I[x, y] = \frac{B[x, y]}{K[0, 0]} - \frac{\sum_{n=-M/2,\,n\neq 0}^{M/2} K[0, n]\, B[x, y - n]}{K[0, 0]^2} + \frac{\sum_{n=-M/2,\,n\neq 0}^{M/2} \sum_{j=-M/2,\,j\neq 0}^{M/2} K[0, n]\, K[0, j]\, I[x, y - n - j]}{K[0, 0]^2} - \dots    (4)

It can be seen that for this case, I[x, y] can be expressed as a function of only one row of pixels in the blurred image B, which implies that for a horizontal blur kernel, the deblurring filter is also purely horizontal. We use this observation to state a hypothesis that holds for any motion blur kernel: "Deblurring filters are directional/asymmetric in shape". This is because motion blur kernels are known to be inherently directional. Such an operation can be efficiently learnt by a CNN with adaptive and asymmetric filters, and this forms the basis for our work.
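To make this concrete, the following small NumPy check (our own illustrative sketch, not part of the paper) builds a purely horizontal blur kernel, computes a Wiener-style regularized inverse filter in the Fourier domain, and verifies that the energy of that inverse filter is confined to a single row; the regularization constant eps and the circular-boundary assumption are ours.

import numpy as np

# Sanity check of the hypothesis: the inverse (deblurring) filter of a
# purely horizontal motion-blur kernel is itself purely horizontal.
N = 64
kernel = np.zeros((N, N))
kernel[0, :7] = 1.0 / 7.0                  # 1x7 horizontal box blur in row 0
kernel = np.roll(kernel, -3, axis=1)       # centre its support at column 0

K = np.fft.fft2(kernel)
eps = 1e-3                                  # Wiener-style regularizer (illustrative)
inv_filter = np.fft.ifft2(np.conj(K) / (np.abs(K) ** 2 + eps)).real

row_energy = (inv_filter ** 2).sum(axis=1)
print(row_energy[0] / row_energy.sum())     # ~1.0: all energy lies in one row

The printed ratio is essentially 1.0, i.e., the regularized inverse filter spreads along the same single direction as the blur, even though its spatial support is much wider than that of the kernel.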

Inspired by the success of deblurring works that utilize networks composed of residual blocks to directly regress to the sharp image [27, 26, 20, 38], we build our network over a residual encoder-decoder structure. Such a structure was adopted in the Scale Recurrent Network (SRN) [38], which is the current state-of-the-art in deblurring. We differentiate our design from SRN in terms of compactness and computational footprint. While SRN is composed of 5 × 5 conv filters, we employ only 3 × 3 filters for economy. Unlike [38], our single image deblurring network does not contain recurrent units, and most importantly, our approach does not involve multi-scale processing; the input image undergoes only a single pass through the network. Understandably, these changes can drastically reduce the inference time of the network but also decrease the model's representational capacity and receptive field in comparison to SRN, with potential for a significant reduction in deblurring performance. In what follows, we describe our proposed architecture, which matches the efficiency of the above network while significantly improving representational capacity and performance.

In our proposed Spatially-Adaptive Residual Network (SARN), the encoder sub-network progressively transforms the input image into feature maps with smaller spatial size and more channels. Our spatially adaptive modules (the Deformable Residual Module (DRM) and the Self-Attention (SA) module) operate on the output of the encoder, where the spatial resolution of the features is the smallest, which leads to minimal additional computation. The resulting features are fed to the decoder, wherein they are passed through a series of Res-Blocks and deconvolution layers to reconstruct the output image. A schematic of the proposed architecture is shown in Fig. 2, where n (= 32) represents the number of channels in the first feature map. Next, we describe the proposed modules in detail.
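For orientation, a rough PyTorch sketch of this encoder-bottleneck-decoder layout is given below. It is a sketch under our own assumptions: the layer counts and downsampling choices are illustrative, the plain residual blocks at the bottleneck merely stand in for the DRMs and the SA module of Secs. 2.1 and 2.2, and the name SARNSkeleton is ours, not the authors' released code.

import torch.nn as nn

class ResBlock(nn.Module):
    """Plain 3x3 residual block (the paper uses only 3x3 filters)."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, x):
        return x + self.body(x)

def down(cin, cout):
    """Stride-2 convolution: halves the spatial size, increases channels."""
    return nn.Sequential(nn.Conv2d(cin, cout, 3, 2, 1), nn.ReLU(inplace=True))

def up(cin, cout):
    """Transposed convolution: doubles the spatial size, reduces channels."""
    return nn.Sequential(nn.ConvTranspose2d(cin, cout, 4, 2, 1), nn.ReLU(inplace=True))

class SARNSkeleton(nn.Module):
    """Hypothetical skeleton of the single-image network of Fig. 2 (n = 32).
    The bottleneck ResBlocks stand in for the DRMs and the SA module."""
    def __init__(self, n=32):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, n, 3, padding=1), *[ResBlock(n) for _ in range(3)],
            down(n, 2 * n), *[ResBlock(2 * n) for _ in range(3)],
            down(2 * n, 4 * n))
        self.bottleneck = nn.Sequential(*[ResBlock(4 * n) for _ in range(6)])
        self.decoder = nn.Sequential(
            *[ResBlock(4 * n) for _ in range(3)], up(4 * n, 2 * n),
            *[ResBlock(2 * n) for _ in range(3)], up(2 * n, n),
            *[ResBlock(n) for _ in range(3)], nn.Conv2d(n, 3, 3, padding=1))

    def forward(self, x):
        return self.decoder(self.bottleneck(self.encoder(x)))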

Figure 3. Schematic of our deformable residual module.

2.1. Deformable Residual Module (DRM)

CNNs operate on fixed locations in a regular grid, which limits their ability to model unknown geometric transformations. Spatial Transformer Networks (STN) [16] introduced spatial transformation learning into CNNs, wherein an image-dependent global parametric transformation is estimated and applied to the feature map. However, such warping is computationally expensive and the transformation is considered to be global across the whole image, which is not the case for motion in dynamic and 3D scenes, where different regions are affected by different magnitudes and directions of motion. Hence, we adopt deformable convolutions [9], which enable local transformation learning in an efficient manner. Unlike regular convolutional layers, the deformable convolution [9] also learns to estimate the shapes of convolution filters conditioned on an input feature map. While maintaining filter weights invariant to the input, a deformable convolution layer first learns a dense offset map from the input, and then applies it to the regular feature map for re-sampling.

As shown in Fig. 3, our DRM has the additional capability to learn the positions of the sampling grid used in the convolution. A regular convolution layer is present to estimate the features, and another convolution layer estimates 2D filter offsets for each spatial location. These channels (the feature maps containing red arrows in Fig. 3) represent the estimated 2D offset of each input. The 2D offsets are encoded in the channel dimension, i.e., a convolution layer of k × k filters is paired with an offset-predicting convolution layer of 2k^2 channels. These offsets determine the shifting of the k^2 filter locations along the horizontal and vertical axes. As a result, the regular convolution filter operates on an irregular grid of pixels. Since the offsets can be fractional, bilinear interpolation is used to sample from the input feature map. All parts of our network are trainable end-to-end, since bilinear sampling and the grid generation of the warping module are both differentiable [31]. The offsets are initialized to 0. Finally, the additive link grants the benefits of reusing common features with low redundancy.

The convolution operator slides a filter or kernel over the input feature map X to produce the output feature map Y. For each sliding position p_b, a regular convolution with filter weights W, bias term b, and stride 1 can be formulated as

Y = W \ast X + b,
y_{p_b} = \sum_{c} \sum_{p_n \in R} w_{c,n} \cdot x_{c,\,p_b + p_n} + b    (5)

where c is the index of the input channel, p_b is the base position of the convolution, n = 1, ..., N with N = |R|, and p_n ∈ R enumerates the locations in the regular grid R. The center of R is denoted as p_m, which is always equal to (0, 0) under the assumption that both the height and width of the kernel are odd numbers; this assumption is suitable for most CNNs. m is the index of the central location in R.

The deformable convolution augments all the sampling locations with learned offsets {∆p_n | n = 1, ..., N}. Each offset has a horizontal component and a vertical component. In total, 2N offset parameters need to be learned for each sliding position. Equation (5) then becomes

y_{p_b} = \sum_{p_n \in R} w_n \cdot x_{H(p_n)} + b    (6)

where H(p_n) = p_b + p_n + ∆p_n is the learned sampling position on the input feature map. The input channel c in (5) is omitted in (6) for notational clarity, because the same operation is applied in every channel.

The receptive field and the spatial sampling locations are thus adapted according to the scale, shape, and location of the degradation. The presence of a cascade of DRMs imparts higher accuracy to the network while delivering higher parameter efficiency than state-of-the-art deblurring approaches. Although the focus of our work is a compact network design, it also provides an effective way to further increase the network capacity, since replacing normal Res-Blocks with DRMs is much more efficient than going deeper or wider. In our final network, 6 DRMs are present in the mid-level of the network.
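A minimal PyTorch sketch of such a module is shown below, assuming torchvision's DeformConv2d as the deformable convolution operator; the module name, single-layer layout, and activation choice are illustrative rather than the authors' exact implementation.

import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableResidualModule(nn.Module):
    """Sketch of a DRM: a 3x3 deformable convolution whose per-pixel
    sampling offsets are predicted by a companion convolution, wrapped
    in an additive (residual) link."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # 2 offsets (dx, dy) per kernel element at every spatial location,
        # i.e. a k x k filter is paired with a 2*k*k-channel offset map.
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=padding)
        # Offsets start at zero, so training begins from a regular grid.
        nn.init.zeros_(self.offset_conv.weight)
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(channels, channels, kernel_size,
                                        padding=padding)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        offsets = self.offset_conv(x)                    # (B, 2*k*k, H, W)
        return x + self.relu(self.deform_conv(x, offsets))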

2.2. Self-Attention Module (SA)

Recent deblurring works [26, 38] have emphasized the advantages of multi-scale processing. It efficiently captures different scales of motion blur and increases the receptive field of the network. Although it facilitates local growth in the receptive field, it does not leverage the relationship between two distant locations in the scene. While this coarse-to-fine approach helps to handle different magnitudes of blur, it cannot leverage the relationship among blurred regions from a global perspective, which is also essential for the restoration task at hand. In this work, we employ a better strategy: attention-based learnable non-local connections among features at different spatial locations.

Figure 4. A schematic of our self-attention module.

Trainable attention over features for modeling long-range dependencies has shown its benefits in several tasks spanning language [23, 35, 41] and vision [42, 51], but has not been explored for image restoration. Our work is inspired by the recent work of [51], which utilizes non-local attention to connect different scene regions and uses it to improve image generation quality.

Our SA module selectively aggregates the features at each position by a weighted sum of the features at all positions. This efficient design ensures that similar features are connected to each other regardless of their spatial distances, which helps in directly connecting regions with similar blur. It has two advantages: First, it overcomes the issue of limited receptive field, as any pixel has access to features at every other pixel in the image. Second, it implicitly acts as a gate for propagating only relevant information across the layers. These properties make it suitable for deblurring, since blur affecting various scene edges is often correlated.

As illustrated in Fig. 4, given a local feature A ∈ R^{C×H×W}, we first feed it into two 1×1 convolution layers to generate two new feature maps B and C, respectively, where {B, C} ∈ R^{C×H×W}. Next, we reshape them to R^{C×N}, where N = H × W is the number of features. We then perform matrix multiplication between the transpose of C and B, and apply a softmax layer to calculate the spatial attention map S ∈ R^{N×N}:

s_{ji} = \frac{\exp(B_i \cdot C_j)}{\sum_{i=1}^{N} \exp(B_i \cdot C_j)}    (7)

where s_{ji} measures the i-th position's impact on the j-th position. Note that the similarity between the feature representations of any two positions contributes to greater correlation (higher attention) between them.

Finally, the feature A is passed through another 1 × 1 convolution layer to generate a new feature map D ∈ R^{C×H×W}, which is reshaped to R^{C×N}. Then we perform a matrix multiplication between D and the transpose of S to obtain an enhanced feature map, which is added to A to obtain the final output E ∈ R^{C×H×W} as follows:

E_j = A_j + \sum_{i=1}^{N} s_{ji}\, D_i    (8)

It can be inferred from Eq. 8 that the resulting feature E at each position is a weighted sum of the features at all positions and the original features. Therefore, it has global context and selectively aggregates contexts according to the spatial attention map, causing similar features to reinforce each other and irrelevant features to be subdued.

We found that placing the SA module at the beginning of our spatially adaptive sub-network delivered the best performance, as compared to deeper positions. We suspect that such early enhancement of features helps a large portion of our network to be non-locally aware.
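The computation in Eqs. (7)-(8) is compact enough to write out directly; below is a minimal PyTorch sketch with illustrative names, where the three 1×1 convolutions produce the feature maps B, C, and D of the text.

import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Sketch of the SA module: non-local attention over all spatial
    positions of a feature map A, followed by a residual addition."""
    def __init__(self, channels):
        super().__init__()
        self.conv_b = nn.Conv2d(channels, channels, 1)
        self.conv_c = nn.Conv2d(channels, channels, 1)
        self.conv_d = nn.Conv2d(channels, channels, 1)

    def forward(self, a):
        bsz, ch, h, w = a.shape
        n = h * w
        feat_b = self.conv_b(a).view(bsz, ch, n)
        feat_c = self.conv_c(a).view(bsz, ch, n)
        feat_d = self.conv_d(a).view(bsz, ch, n)
        # s_ji = softmax_i(B_i . C_j), Eq. (7): N x N spatial attention map
        attn = torch.softmax(torch.bmm(feat_c.transpose(1, 2), feat_b), dim=2)
        # E_j = A_j + sum_i s_ji D_i, Eq. (8): weighted sum over all positions
        out = torch.bmm(feat_d, attn.transpose(1, 2)).view(bsz, ch, h, w)
        return a + out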

2.3. Video Deblurring through Spatio-temporal Recurrence

A natural extension of single image deblurring is video deblurring. However, video deblurring is a more structured problem, as it can utilize information distributed across multiple observations to mitigate the ill-posedness of deblurring. Existing learning-based approaches [36, 15] have proposed generic encoder-decoder architectures to aggregate information from neighboring frames. At each time step, DVD [36] accepts a stack of neighboring blurred frames as input to the network, while OVD [15] accepts intermediate features extracted from past frames.

We present an effective technique which elegantly extends our efficient single image deblurring design to restore a sequence of blurred frames. The proposed network encourages recurrent information propagation along the temporal direction at the feature level as well as the frame level to achieve temporal consistency and improve restoration quality. For feature propagation, our network employs Convolutional Long Short-Term Memory (LSTM) modules [46], which are known to efficiently process spatio-temporal data and perform gated feature propagation. The process can be expressed as

f_i = \mathrm{Net}_E(B_i, I_{i-1}),
h_i, g_i = \mathrm{ConvLSTM}(h_{i-1}, f_i; \theta_{LSTM}),
I_i = \mathrm{Net}_D(g_i; \theta_D),    (9)

where i represents the frame index, Net_D is the decoder part of our network with parameters θ_D, and Net_E is the portion before the decoder. θ_LSTM is the set of parameters in the ConvLSTM. The hidden state h_i contains useful information about intermediate results and blur patterns, which are passed to the network processing the next frame, thus assisting it in sharp feature aggregation.

Unlike [36, 15], our framework also employs recurrence at the frame level, wherein previously deblurred estimates are provided at the input to the network that processes subsequent frames. This naturally encourages temporally consistent results by allowing the network to assimilate a large number of previous frames without increased computational demands. Our network accepts 5 frames at each time step (early fusion), of which 2 frames are deblurred estimates from the past and 2 are blurred frames from the future. As discussed in [2], such early fusion allows the initial layers to assimilate complementary information from neighboring frames and improves restoration quality.
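A simplified sketch of the resulting inference loop is given below. For brevity it fuses only the current blurred frame with one previous deblurred estimate, whereas the actual network performs early fusion of five frames; net_e, conv_lstm, and net_d are hypothetical stand-ins for the encoder, ConvLSTM cell, and decoder of Eq. (9).

def deblur_sequence(blurred_frames, net_e, conv_lstm, net_d):
    """Sketch of frame- and feature-level recurrence (Eq. 9).
    blurred_frames: list of (B, C, H, W) tensors; the three callables are
    hypothetical encoder, ConvLSTM cell, and decoder modules."""
    restored, hidden = [], None
    prev_estimate = blurred_frames[0]      # bootstrap the frame-level recurrence
    for frame in blurred_frames:
        # f_i = Net_E(B_i, I_{i-1}): fuse the current blurred frame
        # with the previous deblurred estimate
        feats = net_e(frame, prev_estimate)
        # h_i, g_i = ConvLSTM(h_{i-1}, f_i): feature-level recurrence
        hidden, gated = conv_lstm(hidden, feats)
        # I_i = Net_D(g_i): decode the restored frame
        prev_estimate = net_d(gated)
        restored.append(prev_estimate)
    return restored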

3. Experimental Setup

We conduct our experiments on a PC with an Intel Xeon E5 CPU, 256GB RAM, and an NVIDIA Titan X GPU.

Single Image Deblurring: Since the prime application of our work is efficient deblurring of general dynamic scenes, we provide quantitative and qualitative evaluation of the efficacy of our network on the dynamic scene deblurring benchmark [26], while comparing our performance with conventional and learning-based works. This dataset is obtained from 240 fps videos captured using a GoPro camera and contains diverse 3D scenes captured in the presence of significant object and camera motion. Following the same train-test split as in [26], we use 2103 pairs for training and 1111 pairs for evaluation. Training is done for 1 × 10^6 iterations using the Adam optimizer with learning rate 0.0001 and a batch size of 4.

Video Deblurring: We utilize the dataset of [36] for training our video deblurring network, using the same train-test split as [36], namely 50 blur-sharp video pairs for training and 10 pairs for testing. Each video pair contains 100 frames, from which we extract training sets containing 8 blurred frames. Training is done for 3 × 10^6 iterations using the Adam optimizer with learning rate 0.0001 and a batch size of 4. The L1 loss was used for training. The test images are of resolution 1280 × 720.
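As a concrete illustration of these settings, a minimal PyTorch training step is sketched below (Adam with learning rate 1e-4, batch size 4, L1 loss). The one-layer placeholder model and the random 256×256 crops are purely illustrative stand-ins for the actual network and data pipeline; the crop size is our assumption.

import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, 3, padding=1)       # placeholder for the deblurring network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.L1Loss()

blurred = torch.rand(4, 3, 256, 256)        # batch of 4 blurred crops (illustrative)
sharp = torch.rand(4, 3, 256, 256)          # corresponding sharp targets

optimizer.zero_grad()
loss = criterion(model(blurred), sharp)     # L1 loss between restored and sharp images
loss.backward()
optimizer.step()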

4. Experimental Results

In this section, we carry out quantitative and qualitative comparisons of our architectures with state-of-the-art methods for image as well as video deblurring tasks.

4.1. Image Deblurring

Due to the complexity of the blur present in general dynamic scenes, conventional deblurring approaches based on a uniform blur model struggle to perform well [26]. However, we compare with conventional non-uniform deblurring approaches by Xu et al. [48] and Whyte et al. [43] (proposed for static scenes) and [13] (proposed for dynamic scenes). Further, we compare with state-of-the-art end-to-end learning based methods [26, 20, 54, 38]. The source codes and trained models of competing methods are publicly available on the authors' websites, except for [13] and [54], whose results have been reported in previous works [54, 38]. Public implementations with default parameters were used to obtain qualitative results on selected test images.


(a) Blurred Image (b) Blurred patch (c) Whyte et al. [43] (d) Nah et al. [26] (e) DeblurGAN [20] (f) SRN [38] (g) Ours

Figure 5. Visual comparisons of deblurring results on test images from the GoPro dataset [26]. Key blurred patches are shown in (b), while zoomed-in patches from the deblurred results are shown in (c)-(g). (Best viewed in high resolution.)

Table 1. Performance comparison of our method with existing deblurring algorithms on the single image deblurring benchmark dataset [26].

Method    | Xu [48] | Whyte [43] | Kim [13] | Sun [37] | MBMF [12] | MS-CNN [26] | DeblurGAN [20] | SRN [38] | SVRNN [54] | SARN (Ours)
PSNR (dB) | 21      | 24.6       | 23.64    | 24.64    | 26.4      | 29.08       | 28.7           | 30.26    | 29.19      | 31.13
SSIM      | 0.7407  | 0.8458     | 0.8239   | 0.843    | 0.8632    | 0.914       | 0.858          | 0.934    | 0.931      | 0.947
Time (s)  | 3800    | 700        | 3600     | 1500     | 1200      | 6           | 1              | 0.4      | 1          | 0.02
Size (MB) | -       | -          | -        | 54.1     | 41.2      | 55          | 50             | 28       | 37.1       | 11.2

Table 2. Performance comparison of our method with existing video deblurring approaches on the benchmark dataset [36].

Method    | WFA [10] | DVD [36] Noalign | DVD [36] Flow | OVD [15] | Ours-Multi | Ours-Recurrent
PSNR (dB) | 28.35    | 30.05            | 30.05         | 29.95    | 30.60      | 31.15
Time (s)  | 15       | 0.7              | 5             | 0.3      | 0.02       | 0.05
Size (MB) | -        | 61.2             | 61.2          | 11.0     | 11.2       | 12.4

Quantitative Evaluation: Quantitative comparisons using PSNR and SSIM scores obtained on the GoPro testing set are presented in Table 1. Since traditional methods cannot model the combined effects of general camera shake and object motion [48, 43] or forward motion and depth variations [13], they fail to faithfully restore most of the images in the test set. The below-par performance of [37, 12] can be attributed to the fact that they use synthetic and simplistic blur kernels to train their CNN and employ traditional deconvolution methods to estimate the sharp image, which severely limits their applicability to general dynamic scenes. On the other hand, the method of [20] trains a network containing instance-normalization layers using a mixture of deep-feature losses and adversarial losses, but leads to suboptimal performance on images containing large blur. The methods [26, 38] use a multi-scale strategy to improve the capability to handle large blur, but fail in challenging situations. One can note that the proposed SARN significantly outperforms all prior works, including the spatially varying RNN based approach [54]. As compared to the state-of-the-art [38], our network offers an improvement of ∼0.9 dB.

Qualitative Evaluation: Visual comparisons on different dynamic and 3D scenes are given in Fig. 5. It shows that the results of prior works suffer from incomplete deblurring or ringing artifacts. In contrast, our network is able to restore scene details more faithfully due to its effectiveness in handling large dynamic blur and preserving sharpness. Importantly, our method fares significantly better in terms of model size and inference time (70% smaller and 20× faster than the nearest competitor [38] on a single GPU). An additional advantage over [48, 43] is that our model waives off the requirement of parameter tuning during the test phase.

4.2. Video Deblurring

Quantitative Evaluation: To demonstrate the superiority of our model, we compare the performance of our network with that of state-of-the-art video deblurring approaches on 10 test videos from the benchmark [36]. Specifically, we compare our models with the conventional model of [10], two versions of DVD [36], and OVD [15]. Source codes of competing methods are publicly available on the authors' websites, except for [10], whose results have been reported in [36]. Table 2 shows quantitative comparisons between our method and competing methods. We also include a baseline 'Ours-Multi', which refers to a version of our network that takes a stack of 5 consecutive blurred frames as input (the configuration of DVD-Noalign). 'Ours-Recurrent' refers to our final network involving recurrence at the frame as well as feature level. The results indicate that our method significantly outperforms prior methods (∼1 dB higher).

Qualitative Evaluation: Fig. 6 contains visual comparisons with [14, 36, 15] on different test frames from the qualitative and quantitative subsets of [36] which suffer from complex blur due to large motion. Although the traditional method [14] models pixel-level blur using optical flow as a cue, it fails to completely deblur many scenes due to its simplistic assumptions on the kernels and the image properties. Learning-based methods [36, 15] fare better than [14] in several cases but still lead to artifacts in deblurring due to their suboptimal network design. Our method generates sharper results and faithfully restores the scenes, while yielding significant improvements on images affected by large blur.


(a) Blurred Image (b) Blurred patch (c) Kim et al. [14] (d) DVD [36] (e) OVD [15] (f) Ours

Figure 6. Visual comparisons of video deblurring results on two test frames from the DVD dataset [36]. Key blurred patches are shown in (b), while zoomed-in patches from the deblurred results are shown in (c)-(f). (Best viewed in high resolution.)

Table 3. Quantitative comparisons of different versions of our single image deblurring network on the GoPro test set [26].

DRMs      | 0     | 3     | 6     | 6
SA        | yes   | no    | no    | yes
PSNR (dB) | 30.64 | 30.69 | 31.05 | 31.13
Size (MB) | 10.7  | 10.9  | 11.2  | 11.2


5. Ablation Studies

In this section, we analyze the effect of different modules of our network on training and testing performance. The number of DRMs introduced in place of normal Res-Blocks (in the central part of our network) is one of the key hyper-parameters of our deblurring network. To study its effect, we designed and trained 2 versions of the network wherein the number of DRMs is 3 and 6, respectively. Fig. 7(a) shows comparisons of the convergence plots of these models. It can be observed that the training performance as well as the quantitative results get better with an increase in the number of DRMs, as they introduce additional filter adaptability into the network. Fig. 7(a) also shows the performance of our model which contains SA but no DRMs; its performance is expectedly lower than that of the other models. Finally, the best training performance is delivered by our final model, which contains 6 DRMs and SA, showing that the advantages of the DRMs and the SA block are complementary and their union leads to a superior model. These improvements are also reflected in the quantitative values reported in Table 3. We chose to keep 1 SA module in our network, since the performance improvement beyond it was marginal and one module serves as a good balance between restoration accuracy and processing time.

The table also shows that our modules are lightweight, since their inclusion has only a marginal effect on the model size. Note that although the proposed network is already quite efficient, replacing the standard convolutions in our network with grouped and/or separable convolutions could lead to a further reduction in model size and inference time. We leave this analysis for future work.

The filter transformations estimated in the DRMs are one of the key ingredients of our deblurring network. To evaluate their importance, we designed and trained 3 versions of the network wherein we changed the offsets in the DRMs: no offset, a single global offset, and spatially varying offsets.

[Figure 7: training-loss curves. (a) Loss vs. iterations (×21000) for models with only SA, with 3 DRMs, with 6 DRMs, and with SA and 6 DRMs. (b) Loss vs. iterations (×27000) for models with no offset, a single global offset, and spatially varying offsets.]

Figure 7. Network analysis through comparison of the training performance of various ablations of our models.

In the second case, we force our network to estimate a single shift for all the spatial locations in the features. Fig. 7(b) shows comparisons of the convergence plots of these models. The drastic improvement in the training performance demonstrates the importance of the spatially adaptive filter-offset learning capability.

Additional experimental details and qualitative comparisons are provided in the supplementary material.

6. Conclusions

We proposed efficient image and video deblurring architectures composed of convolutional modules that enable spatially adaptive feature learning through filter transformations and feature attention over the spatial domain, namely the deformable residual module (DRM) and the self-attention (SA) module. The DRMs implicitly address the shifts responsible for the local blur in the input image, while the SA module non-locally connects spatially distributed blurred regions. The presence of these modules awards higher capacity to our compact network without any notable increase in model size. Our network's key strengths are a large receptive field and a spatially varying adaptive filter learning capability, whose effectiveness is also demonstrated for video deblurring through a recurrent extension of our network. Experiments on dynamic scene deblurring benchmarks showed that our approach performs favorably against prior art and facilitates real-time deblurring. We believe our spatially-aware design can be utilized for other image processing and vision tasks as well, and we shall explore them in the future.


References

[1] M. Aittala and F. Durand. Burst image deblurring using permutation invariant convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 731–747, 2018.
[2] J. Caballero, C. Ledig, A. Aitken, A. Acosta, J. Totz, Z. Wang, and W. Shi. Real-time video super-resolution with spatio-temporal networks and motion compensation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4778–4787, 2017.
[3] J.-F. Cai, H. Ji, C. Liu, and Z. Shen. Blind motion deblurring using multiple images. Journal of Computational Physics, 228(14):5057–5071, 2009.
[4] A. Chakrabarti. A neural approach to blind motion deblurring. In European Conference on Computer Vision, pages 221–235. Springer, 2016.
[5] J. Chen, A. Adams, N. Wadhwa, and S. W. Hasinoff. Bilateral guided upsampling. ACM Transactions on Graphics (TOG), 35(6):203, 2016.
[6] Q. Chen, J. Xu, and V. Koltun. Fast image processing with fully-convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2497–2506, 2017.
[7] S. Cho, H. Cho, Y.-W. Tai, and S. Lee. Registration based non-uniform motion deblurring. In Computer Graphics Forum, volume 31, pages 2183–2192. Wiley Online Library, 2012.
[8] S. Cho and S. Lee. Fast motion deblurring. In ACM Transactions on Graphics (TOG), volume 28, page 145. ACM, 2009.
[9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 764–773, 2017.
[10] M. Delbracio and G. Sapiro. Hand-held video deblurring via efficient Fourier aggregation. IEEE Transactions on Computational Imaging, 1(4):270–283, 2015.
[11] R. Fergus, B. Singh, A. Hertzmann, S. T. Roweis, and W. T. Freeman. Removing camera shake from a single photograph. In ACM Transactions on Graphics (TOG), volume 25, pages 787–794. ACM, 2006.
[12] D. Gong, J. Yang, L. Liu, Y. Zhang, I. Reid, C. Shen, A. Hengel, and Q. Shi. From motion blur to motion flow: a deep learning solution for removing heterogeneous motion blur. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[13] T. Hyun Kim, B. Ahn, and K. Mu Lee. Dynamic scene deblurring. In Proceedings of the IEEE International Conference on Computer Vision, pages 3160–3167, 2013.
[14] T. Hyun Kim and K. Mu Lee. Generalized video deblurring for dynamic scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5426–5434, 2015.
[15] T. Hyun Kim, K. Mu Lee, B. Scholkopf, and M. Hirsch. Online video deblurring via dynamic temporal blending network. In Proceedings of the IEEE International Conference on Computer Vision, pages 4038–4047, 2017.
[16] M. Jaderberg, K. Simonyan, A. Zisserman, et al. Spatial transformer networks. In Advances in Neural Information Processing Systems, pages 2017–2025, 2015.
[17] N. Joshi, R. Szeliski, and D. J. Kriegman. PSF estimation using sharp edge prediction. In Computer Vision and Pattern Recognition (CVPR), 2008 IEEE Conference on, pages 1–8. IEEE, 2008.
[18] D. Krishnan and R. Fergus. Fast image deconvolution using hyper-Laplacian priors. In Advances in Neural Information Processing Systems, pages 1033–1041, 2009.
[19] D. Krishnan, T. Tay, and R. Fergus. Blind deconvolution using a normalized sparsity measure. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 233–240. IEEE, 2011.
[20] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas. DeblurGAN: Blind motion deblurring using conditional adversarial networks. arXiv preprint arXiv:1711.07064, 2017.
[21] W.-S. Lai, J.-B. Huang, Z. Hu, N. Ahuja, and M.-H. Yang. A comparative study for single image blind deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1701–1709, 2016.
[22] Y. Li, S. B. Kang, N. Joshi, S. M. Seitz, and D. P. Huttenlocher. Generating sharp panoramas from motion-blurred videos. In Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on, pages 2424–2431. IEEE, 2010.
[23] Z. Lin, M. Feng, C. N. d. Santos, M. Yu, B. Xiang, B. Zhou, and Y. Bengio. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130, 2017.
[24] P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo. Multi-level wavelet-CNN for image restoration. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 773–782, 2018.
[25] W. Luo, Y. Li, R. Urtasun, and R. Zemel. Understanding the effective receptive field in deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 4898–4906, 2016.
[26] S. Nah, T. H. Kim, and K. M. Lee. Deep multi-scale convolutional neural network for dynamic scene deblurring. In CVPR, volume 1, page 3, 2017.
[27] T. Nimisha, A. K. Singh, and A. Rajagopalan. Blur-invariant deep learning for blind-deblurring. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
[28] J. Pan, Z. Hu, Z. Su, and M.-H. Yang. Deblurring text images via L0-regularized intensity and gradient prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2901–2908, 2014.
[29] J. Pan, Z. Lin, Z. Su, and M.-H. Yang. Robust kernel estimation with outliers handling for image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2800–2808, 2016.
[30] J. Pan, D. Sun, H. Pfister, and M.-H. Yang. Blind image deblurring using dark channel prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1628–1636, 2016.


[31] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
[32] C. J. Schuler, H. Christopher Burger, S. Harmeling, and B. Scholkopf. A machine learning approach for non-blind image deconvolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1067–1074, 2013.
[33] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Scholkopf. Learning to deblur. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(7):1439–1451, 2016.
[34] Q. Shan, J. Jia, and A. Agarwala. High-quality motion deblurring from a single image. In ACM Transactions on Graphics (TOG), volume 27, page 73. ACM, 2008.
[35] T. Shen, T. Zhou, G. Long, J. Jiang, S. Pan, and C. Zhang. DiSAN: Directional self-attention network for RNN/CNN-free language understanding. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[36] S. Su, M. Delbracio, J. Wang, G. Sapiro, W. Heidrich, and O. Wang. Deep video deblurring for hand-held cameras. In CVPR, volume 2, page 6, 2017.
[37] J. Sun, W. Cao, Z. Xu, and J. Ponce. Learning a convolutional neural network for non-uniform motion blur removal. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 769–777, 2015.
[38] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia. Scale-recurrent network for deep image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8174–8182, 2018.
[39] I. Vasiljevic, A. Chakrabarti, and G. Shakhnarovich. Examining the impact of blur on recognition by convolutional networks. arXiv preprint arXiv:1611.05760, 2016.
[40] S. Vasu and A. Rajagopalan. From local to global: Edge profiles to camera motion in blurred images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4447–4456, 2017.
[41] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.
[42] X. Wang, R. Girshick, A. Gupta, and K. He. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
[43] O. Whyte, J. Sivic, A. Zisserman, and J. Ponce. Non-uniform deblurring for shaken images. International Journal of Computer Vision, 98(2):168–186, 2012.
[44] P. Wieschollek, M. Hirsch, B. Scholkopf, and H. Lensch. Learning blind motion deblurring. In Proceedings of the IEEE International Conference on Computer Vision, pages 231–240, 2017.
[45] J. Wulff and M. J. Black. Modeling blurred video with layers. In European Conference on Computer Vision, pages 236–252. Springer, 2014.
[46] S. Xingjian, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-c. Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In Advances in Neural Information Processing Systems, pages 802–810, 2015.
[47] L. Xu and J. Jia. Two-phase kernel estimation for robust motion deblurring. In European Conference on Computer Vision, pages 157–170. Springer, 2010.
[48] L. Xu, S. Zheng, and J. Jia. Unnatural L0 sparse representation for natural image deblurring. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1107–1114, 2013.
[49] Y. Yan, W. Ren, Y. Guo, R. Wang, and X. Cao. Image deblurring via extreme channels prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4003–4011, 2017.
[50] H. Zhang and L. Carin. Multi-shot imaging: joint alignment, deblurring and resolution-enhancement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2925–2932, 2014.
[51] H. Zhang, I. Goodfellow, D. Metaxas, and A. Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
[52] H. Zhang, D. Wipf, and Y. Zhang. Multi-image blind deblurring using a coupled adaptive sparse prior. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1051–1058, 2013.
[53] H. Zhang and J. Yang. Intra-frame deblurring by leveraging inter-frame camera motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4036–4044, 2015.
[54] J. Zhang, J. Pan, J. Ren, Y. Song, L. Bao, R. W. Lau, and M.-H. Yang. Dynamic scene deblurring using spatially variant recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2521–2529, 2018.