Veritatem Dies Aperit - Temporally Consistent Depth Prediction Enabled...
Transcript of Veritatem Dies Aperit - Temporally Consistent Depth Prediction Enabled...
Veritatem Dies Aperit - Temporally Consistent Depth Prediction Enabled by a
Multi-Task Geometric and Semantic Scene Understanding Approach
Amir Atapour-Abarghouei1 Toby P. Breckon1,2
1Department of Computer Science – 2Department of Engineering
Durham University, UK
{amir.atapour-abarghouei,toby.breckon}@durham.ac.uk
Abstract
Robust geometric and semantic scene understanding is
ever more important in many real-world applications such
as autonomous driving and robotic navigation. In this pa-
per, we propose a multi-task learning-based approach ca-
pable of jointly performing geometric and semantic scene
understanding, namely depth prediction (monocular depth
estimation and depth completion) and semantic scene seg-
mentation. Within a single temporally constrained recur-
rent network, our approach uniquely takes advantage of a
complex series of skip connections, adversarial training and
the temporal constraint of sequential frame recurrence to
produce consistent depth and semantic class labels simul-
taneously. Extensive experimental evaluation demonstrates
the efficacy of our approach compared to other contempo-
rary state-of-the-art techniques.
1. Introduction
As scene understanding grows in popularity due to its
applicability in many areas of interest for industry and
academia, scene depth has become ever more important
as an integral part of this task. Whilst in many cur-
rent autonomous driving solutions, imperfect stereo cam-
era set-ups or expensive LiDAR sensors are used to capture
depth, research has recently focused on refining estimated
depth with corrupted or missing regions in post-processing,
rendering it more useful in any downstream applications
[6, 78, 84]. Moreover, monocular depth estimation has re-
ceived significant attention within the research community
as a cheap and innovative alternative to other more expen-
sive and performance-limited technologies [8, 24, 29, 87].
Pixel-level image understanding, namely semantic seg-
mentation, also plays an important role in many vision-
based systems. Significant success has been achieved us-
ing Convolutional Neural Networks (CNN) in this field
[10, 17, 53, 66, 70] and many others such as image classifi-
cation [54], object detection [88] and alike in recent years.
Veritatem Dies Aperit: Time discovers the truth.
Figure 1: Exemplar results of the proposed approach.
RGB: input colour image; MDE: Monocular Depth Esti-
mation; GSS: Generated Semantic Segmentation.
In this work, we propose a model capable of semantically
understanding a scene by jointly predicting depth and pixel-
wise semantic classes (Figure 1). The network performs
semantic segmentation (Section 3.3) along with monocular
depth estimation (i.e., predicting scene depth based on a sin-
gle RGB image) or depth completion (i.e., completing miss-
ing regions of existing depth sensed through other imperfect
means, Section 3.2). Our approach performs these tasks
within a single model (Figure 2 (A)) capable of two sepa-
rate scene understanding objectives requiring low-level fea-
ture extraction and high-level inference, which leads to im-
proved and deeper representation learning within the model
[41]. This is empirically demonstrated via the notably im-
proved results obtained for each individual task when per-
formed simultaneously in this manner.
Within the current literature, many techniques focus on
individual frames to spatially accomplish their objectives,
ignoring temporal consistency in video sequences, one of
the most valuable sources of information widely available
within real-world applications. In this work, we propose
a feedback network that at each time step takes the output
generated at the previous time step as a recurrent input. Fur-
thermore, using a pre-trained optical flow estimation model,
we ensure the temporal information is explicitly considered
by the overall model during training (Figure 2 (A)).
In recent years, skip connections have been proven to
3373
Figure 2: Overall training procedure of the model (A) and the detailed outline of the generator architecture (B).
be very effective when the input and output of a CNN share
similar high-level spatial features [60, 66, 73, 79]. We make
use of a complex network of skip connections through-
out the architecture to guarantee that no high-level spatial
features are lost during training as the features are down-
sampled. In short, our main contributions are as follows:
• Depth Prediction - via a supervised multi-task model
adversarially trained using complex skip connections
that can predict depth (monocular depth estimation and
depth completion) having been trained on high-quality
synthetic training data [67] (Section 3.2).
• Semantic Segmentation - via the same multi-task
model, which is capable of performing the task of se-
mantic scene segmentation as well as the aforemen-
tioned depth estimation/completion (Section 3.3).
• Temporal Continuity - temporal information is explic-
itly taken into account during training using both recur-
rent network feedback and gradients from a pre-trained
frozen optical flow network.
This leads to a novel scene understanding approach capa-
ble of temporally consistent geometric depth prediction and
semantic scene segmentation whilst outperforming prior
work across the domains of monocular depth estimation
[8, 25, 29, 49, 83, 87], completion [9, 36, 50, 82] and se-
mantic segmentation [10, 17, 40, 52, 53, 59, 74, 75, 86].
2. Related Work
We consider relevant prior work over three distinct areas,
semantic segmentation (Section 2.1), monocular depth esti-
mation (Section 2.2), and depth completion (Section 2.3).
2.1. Semantic Segmentation
Within the literature, promising results have been
achieved using fully-convolutional networks [53], saved
pooling indices [10], skip connections [66], multi-path re-
finement [48], spatial pyramid pooling [85], attention mod-
ules focusing on scale or channel [18, 81] and others.
Temporal information in videos has also been used to
improve segmentation accuracy or efficiency. [26] proposes
a spatio-temporal LSTM based on frame features for higher
accuracy. Labels are propagated in [58] using gated re-
current units. In [27], features from preceding frames are
warped via flow vectors to reinforce the current frame fea-
tures. On the other hand, [69] reuses previous frame fea-
tures to reduce computation. In [89], an optical flow net-
work [23] is used to propagate features from key frames to
the current one. Similarly, [77] uses an adaptive key frame
scheduling policy to improve both accuracy and efficiency.
Additionally, [47] proposes an adaptive feature propagation
module that employs spatially variant convolutions to fuse
the frame features, thus further improving efficiency. Even
though the main objective of this work is not semantic seg-
mentation, it can be demonstrated that when the main ob-
jective (depth prediction) is performed alongside semantic
segmentation, the results are superior to when the tasks are
performed individually (Table 1).
2.2. Monocular Depth Estimation
Estimating depth from a single colour image is very de-
sirable as unlike stereo correspondence [68], structure from
motion [16] and alike [1, 71], it leads to a system with re-
duced size, weight, power and computational requirements.
For instance, [11] employs sparse coding to estimate depth,
while [24, 25] generates depth from a two-scale network
trained on RGB and depth. Other supervised models such
as [45, 46] have also achieved impressive results despite the
scarcity of ground truth depth for supervision.
Recent work has led to the emergence of new techniques
that calculate disparity by reconstructing corresponding
views within a stereo correspondence framework without
ground truth depth. The work by [76] learns to generate the
right view from the left image used as the input while pro-
ducing an intermediary disparity map. Likewise, [29] uses
bilinear sampling [39] and left/right consistency incorpo-
rated into training for better results. In [87], depth and cam-
era motion are estimated by training depth and pose predic-
tion networks, indirectly supervised via view synthesis. The
model in [44] is supervised by sparse ground truth depth and
the model is then enforced within a stereo framework via an
image alignment loss to output dense depth.
Additionally, contemporary supervised approaches such
as [8] have taken to using synthetic depth data to produce
sharp and crisp depth outputs. In this work, we also utilize
synthetic data [67] in a directly supervised training frame-
work to perform the task of monocular depth estimation.
3374
MethodDepth Error (lower, better) Depth Accuracy (higher, better) Segmentation (higher, better)
Abs. Rel. Sq. Rel. RMSE RMSE log σ < 1.25 σ < 1.252 σ < 1.253 Accuracy IoU
Two Models 0.245 1.513 6.323 0.274 0.803 0.856 0.882 0.604 0.672
One Model 0.208 1.402 6.026 0.269 0.836 0.901 0.926 0.748 0.764
Table 1: Comparison of depth prediction and segmentation tasks performed in one single network and two separate networks.
Figure 3: Comparing the results of the approach on synthetic test set when the model is trained with and without temporal
consistency. RGB: input colour image; GTD: Ground Truth Depth; GTS: Ground Truth Segmentation; TS: Temporal
Segmentation; TD: Temporal Depth; NS: Non-Temporal Segmentation; ND: Non-Temporal Depth.
2.3. Depth Completion
While colour image inpainting has been a long-standing
and well-established field of study [3, 13, 21, 62, 72, 80], its
use within the depth modality is considerably less effective
[6]. There have been a variety of depth completion tech-
niques in the literature including those utilizing smoothness
priors [33], exemplar-based depth inpainting [7], low-rank
matrix completion [78], object-aware interpolation [5], ten-
sor voting [43], Fourier-based depth filling [9], background
surface extrapolation [55, 57], learning-based approaches
using deep networks [4, 84], and alike [12, 19, 51]. How-
ever, prior work does not include any work focusing on en-
forcing temporal continuity in a learning-based approach.
3. Proposed Approach
Our approach is designed to perform two tasks using a
single joint model: depth estimation/completion (Section
3.2) and semantic segmentation (Section 3.3). This has been
made possible using a synthetic dataset [67] in which both
ground truth depth and pixel-wise segmentation labels are
available for video sequences of urban driving scenarios.
3.1. Overall Architecture
Our single network takes three different inputs produc-
ing two separate outputs for two tasks - depth prediction
and semantic segmentation. Moreover, temporal informa-
tion is explicit in our formulation, as one of the inputs at
every time step is an output from the previous time step via
recurrence. The network comprises three different compo-
nents: the input streams (Figure 2 (B) - left), in which the
inputs are encoded, the middle stream (Figure 2 (B) - mid-
dle), which fuses the features and begins the decoding pro-
cess, and finally the output streams (Figure 2 (B) - right), in
which the results are generated.
As seen in Figure 2 (A), two of the inputs are RGB or
RGB-D images (depending on whether monocular depth es-
timation to create depth, or depth completion to fill holes
within an existing depth image, is the focus) from the cur-
rent and previous time steps. The two input streams that de-
code these share their weights. The third input is the depth
generated at the previous time step. The middle section of
the network fuses and decodes the input features and finally
the output streams produce the results (scene depth and seg-
mentation). Every layer of the network contains two convo-
lutions, batch normalization [37] and PReLU [31].
Following recent successes of approaches using skip
connections [60, 66, 73, 79], we utilize a series of skip con-
nections within our architecture (Figure 2 (B)). Our inputs
and outputs, despite containing different types of informa-
tion (RGB, depth and pixel-wise class labels), relate to con-
secutive frames from the same scene and therefore, share
high-frequency information such as certain object bound-
aries, structures, geometry and alike, ensuring skip connec-
tions can be of significant value in improving the results.
By combining two separate objectives (predicting depth and
pixel-wise class labels) within our network, in which the
input streams and middle streams are fully trained on both
tasks, the results are better than when two separate networks
are individually trained to perform the same tasks (Table 1).
Even though the entire network is trained as one entity,
in our discussions, the parts of the network responsible for
predicting depth will be referred to as G1 and the portions
involved in semantic segmentation G2. These two modules
are essentially the same except for their output streams.
3.2. Depth Estimation / Completion
We consider depth prediction as a supervised image-to-
image translation problem, wherein an input RGB image
(for depth estimation) or RGB-D image (with the depth
channel containing holes for depth completion) is translated
to a complete depth image. More formally, a generative
model (G1) approximates a mapping function that takes as
its input an image x (RGB or RGB-D with holes) and out-
puts an image y (complete depth image) G1 : x → y.
The initial solution would be to minimize the Euclidean
distance between the pixel values of the output (G1(x))and the ground truth depth (y). This simple reconstruc-
tion mechanism forces the model to generate images that
3375
Figure 4: Comparing the performance of the approach with differing components of the loss function removed.
MethodDepth Error (lower, better) Depth Accuracy (higher, better) Segmentation (higher, better)
Abs. Rel. Sq. Rel. RMSE RMSE log σ < 1.25 σ < 1.252 σ < 1.253 Accuracy IoU
T/R 0.991 1.964 7.393 0.402 0.598 0.684 0.698 0.156 0.335
T/R/A 0.851 1.798 6.826 0.368 0.692 0.750 0.778 0.341 0.435
T/R/A/SC 0.655 1.616 6.473 0.278 0.753 0.812 0.838 0.669 0.738
T/R/A/SC/S 0.412 1.573 6.256 0.258 0.793 0.875 0.887 0.693 0.741
N/R/A/SC/S 0.534 1.602 6.469 0.275 0.758 0.820 0.856 0.614 0.681
T/R/A/SC/S/OF 0.208 1.402 6.026 0.269 0.836 0.901 0.926 0.748 0.764
Table 2: Numerical results with different components of loss. T: Temporal training; T: Non-Temporal training; R: Recon-
struction loss; A: Adversarial loss; SC: Skip Connections; S: Smoothing loss; OF: Optical Flow.
are structurally and contextually close to the ground truth.
For monocular depth estimation, this reconstruction loss is:
Lrec = ||G1(x)− y||1, (1)
where x is the input image, G1(x) is the output and y the
ground truth. For depth completion, however, the input x
is a four-channel RGB-D image with the depth containing
holes that would occur during depth sensing. Since we use
synthetic data [67], we only have access to hole-free pixel-
perfect ground truth depth. While one could naıvely cut
out random sections of the depth image to simulate holes,
as other approaches have done [62, 80], we opt for creat-
ing realistic and semantically meaningful holes with char-
acteristics of those found in real-world images [6]. A sepa-
rate model is thus created and tasked with predicting where
holes would be by means of pixel-wise segmentation. A
number of stereo images (30, 000) [28] are used to train
the hole prediction model by calculating the disparity us-
ing Semi-Global Matching [34] and generating a hole mask
(M ) which indicates which image regions contain holes.
The left RGB image is used as the input and the generated
mask as the ground truth label, with cross-entropy as the
loss function.
When our main model is being trained to perform depth
completion, the hole mask generated by the hole prediction
network is employed to create the depth channel of the input
RGB-D image. Subsequently, the reconstruction loss is:
Lrec = ||(1−M)⊙G1(x)− (1−M)⊙ y||1, (2)
where ⊙ is the element-wise product operation and x the
input RGB-D image in which the depth channel is y ⊙M .
Experiments with an L2 loss returned similar results.
However, the sole use of a reconstruction loss would
lead to blurry outputs since monocular depth estimation and
depth completion are multi-modal problems, i.e., several
plausible depth outputs can correctly correspond to a re-
gion of an RGB image. This multi-modality results in the
generative model (G1) averaging all possible modes rather
than selecting one, leading to blurring effects in the out-
put. To prevent this, adversarial training [30] has become
prevalent within the literature [8, 22, 38, 62, 80] since it
forces the model to select a mode from the distribution re-
sulting in better quality outputs. In this vein, our depth gen-
eration model (G1) takes x as its input and produces fake
samples G1(x) = y while a discriminator (D) is adversari-
ally trained to distinguish fake samples y from ground truth
samples y. The adversarial loss is thus as follows:
Ladv = minG1
maxD
Ex,y∼Pd(x,y)
[logD(x, y)]+
Ex∼Pd(x)
[log(1−D(x,G1(x)))],(3)
where Pd is the data distribution defined by y = G1(x),with x being the generator input and y the ground truth.
Additionally, a smoothing term [29, 32] is utilized to en-
courage the model to generate more locally-smooth depth
outputs. Output depth gradients (∂G1(x)) are penalized us-
ing L1 regularization, and an edge-aware weighting term
based on input image gradients (∂x) is used since image
gradients are stronger where depth discontinuities are most
likely found. The smoothing loss is therefore as follows:
Ls = |∂G1(x)|e||∂x||, (4)
where x is the input and G1(x) the depth output. The gra-
dients are summed over vertical and horizontal axes.
3376
Figure 5: Results on CamVid [14] (left) and Cityscapes [20]
(right). RGB: input colour image; GTS: Ground Truth Seg-
mentation; GS: Generated Segmentation; GD: Generated
Depth.
Method IoU Method IoU
CRF-RNN [86] 62.5 DeepLab [17] 63.1
Pixel-level Encoding [74] 64.3 FCN-8s [53] 65.3
DPN [52] 66.8 Our Approach 67.0
Table 3: Segmentation on the Cityscapes [20] test set.
Another important consideration is ensuring the depth
outputs are temporally consistent. While the model is ca-
pable of implicitly learning temporal continuity when the
output at each time step is recurrently used as the input at
the next time step, we incorporate a light-weight pre-trained
optical flow network [65], which utilizes a coarse-to-fine
spatial pyramid to learn residual flow at each scale, into our
pipeline to explicitly enforce consistency in the presence of
camera/scene motion. At each time step n, the flow between
the ground truth depth frames n and n−1 is estimated using
our pre-trained optical flow network [65] as well as the flow
between generated outputs from the same frames. The gra-
dients from the optical flow network (F ) are used to train
the generator (G1) to capture motion information and tem-
poral continuity by minimizing the End Point Error (EPE)
between the produced flows. Hence, the last component of
our loss function is:
LVn= ||F (G1(xn), G1(xn−1))− F (yn, yn−1)||2, (5)
where x and y are input and ground truth depth images re-
spectively and n the time step. While we utilize ground
truth depth as inputs to the optical flow network, colour im-
ages can also be equally viable inputs. However, since our
training data contains noisy environmental elements (e.g.,
lighting variations, rain, etc.), using the sharp and clean
depth images leads to more desirable results.Within the final decoder used exclusively for depth pre-
diction, outputs are produced at four scales, following [29].Each scale output is twice the spatial resolution of its pre-vious scale. The overall depth loss is therefore the sum oflosses calculated at every scale c:
Ldepth =
4∑
c=1
(λrecLrec + λadvLadv + λsLs + λV LVn). (6)
The weighting coefficients (λ) are empirically selected
(Section 3.4). These loss components, used to optimize
Method IoU Method IoU
SegNet-Basic [10] 46.4 DeconvNet [59] 48.9
SegNet [10] 50.2 Bayesian SegNet-Basic [40] 55.8
Reseg [75] 58.8 Our Approach 59.1
Table 4: Segmentation on the CamVid [14] test set.
depth fidelity, are used alongside the semantic segmentation
loss, explained in Section 3.3.
3.3. Semantic Segmentation
As semantic segmentation is not the primary focus of
our approach, but only used to enforce deeper and better
representation learning within our model, we opt for a sim-
ple and efficient fully-supervised training procedure for our
segmentation (G2). The RGB or RGB-D image is used as
the input and the network outputs class labels. Pixel-wise
softmax with cross-entropy is used as the loss function, with
the loss summed over all the pixels within a batch:
Pk(x) =eak(x)
∑Kk′=1 e
ak′ (x)
, (7)
Lseg = −log(Pl(G2(x))), (8)
where G2(x) denotes the network output for the segmenta-
tion task, ak(x) is the feature activation for channel k, K is
the number of classes, Pk(x) is the approximated maximum
function and l is the ground truth label for image pixels. The
loss is summed for all pixels within the images.
Finally, since the entire network is trained as one unit,
the joint loss function is as follows:
L = Ldepth + λrecLseg. (9)
with coefficients selected empirically (Section 3.4).
3.4. Implementation Details
Synthetic data [67] consisting of RGB, depth and class
labels are used for training. The discriminator follows the
architecture of [64], and the optical flow network [65] is
pre-trained on the KITTI dataset [56]. Experiments with the
Sintel dataset [15] returned similar, albeit slightly inferior,
results. The discriminator uses convolution-BatchNorm-
leaky ReLU (slope = 0.2) modules. The dataset [67]
contains numerous sequences some spanning thousands of
frames. However, a feedback network taking in high-
resolution images (512× 128) back-propagating over thou-
sands of time steps is intractable to train. Empirically, we
found training over sequences of 10 frames offers a rea-
sonable trade-off between accuracy and training efficiency.
Mini-batches are loaded in as tensors containing two se-
quences of 10 frames each, resulting in roughly 10, 000batches overall. All implementation is done in PyTorch
[61], with Adam [42] providing the best optimization (β1 =
3377
Figure 6: Results of our approach applied to KITTI [2, 56]. RGB: input colour image; GTD: Ground Truth Depth; MDE:
Monocular Depth Estimation; GTS: Ground Truth Segmentation; GS: Generated Segmentation.
Figure 7: Our results on locally captured data. SD: Depth
via Stereo Correspondence; DC: Depth Completion; MDE:
Monocular Depth Estimation; S: Semantic Segmentation.
0.5, β2 = 0.999, α = 0.0002). The weighting coef-
ficients in the loss function are empirically chosen to be
λrec = 1000, λadv = 100, λs = 10, λV = 1, λseg = 10.
4. Experimental Results
We assess our approach using ablation studies and both
qualitative and quantitative comparisons with state-of-the-
art methods applied to publicly available datasets [2, 14, 20,
28, 56]. We also utilize our own synthetic test set and data
captured locally to further evaluate the approach.
4.1. Ablation Studies
A crucial part of our work is demonstrating that every
component of the approach is integral to the overall per-
formance. We train our model to perform two tasks based
on the assumption that the network is forced to learn more
about the scene if different objectives are to be accom-
plished. We demonstrate this by training one model per-
forming both tasks and two separate models focusing on
each and conducting tests on randomly selected synthetic
sequences [67]. As seen in Table 1, both tasks (monocular
depth estimation and semantic segmentation) perform better
when the model is trained on both. Moreover, since the seg-
mentation pipeline does not receive any explicit temporal
Method PSNR SSIM Method PSNR SSIM
Holes 33.73 0.372 GTS [36] 31.47 0.672
ICA [82] 31.01 0.488 GIF [50] 44.57 0.972
FDF [9] 46.13 0.986 Ours 47.45 0.991
Table 5: Structural integrity analysis post depth completion.
supervision (from the optical flow network) and its temporal
continuity is only enforced by the input and middle streams
trained by the depth pipeline, when the two pipelines are
disentangled, the segmentation results become far worse
than the depth results (Table 1).
Figure 3 depicts the quality of the outputs when the
model is a feedback network trained temporally compared
to our model when the output depth from the previous time
step is not used as the input during training. We can clearly
see that both depth and segmentation results are of higher
fidelity when temporal information is used during training.
Additionally, our depth prediction pipeline uses sev-
eral loss functions. We employ the same test sequences
to evaluate our model trained as different components are
removed. Table 2 demonstrates the network temporally
trained with all the loss components (T/R/A/SC/S/OF) out-
performs models trained without specific ones. Qualita-
tively, we can see in Figure 4 that the results are far better
when the network is fully trained with all the components.
Specifically, the set of skip connections used in the network
make a significant difference in the quality of the outputs.
4.2. Semantic Segmentation
Segmentation is not the focus of this work and is mainly
used to boost the performance of depth prediction. How-
ever, we extensively evaluate our segmentation pipeline
which outperforms several well-known comparators. We
utilize Cityscapes [20] and CamVid [14] test sets for our
performance evaluation despite the fact that our model is
solely trained on synthetic data and without any domain
adaptation should not be expected to perform well on nat-
urally sensed real-world data. The effective performance
of our segmentation points to the generalization capabilities
of our model. When tested on CamVid [14], our approach
produces better results compared to well-established tech-
niques such as [10, 40, 59, 75] despite the lower quality
3378
Figure 8: Comparison of various completion methods applied to the synthetic test set. RGB: input colour image; GTD:
Ground Truth Depth; DH: Depth Holes; FDF: Fourier based Depth Filling [9]; GLC: Global and Local Completion [36];
ICA: Inpainting with Contextual Attention [82]; GIF: Guided Inpainting and Filtering [50].
MethodError Metrics (lower, better) Accuracy Metrics (higher, better)
Abs. Rel. Sq. Rel. RMSE RMSE log σ < 1.25 σ < 1.252 σ < 1.253
Train Set Mean [28] 0.403 0.530 8.709 0.403 0.593 0.776 0.878
Eigen et al. [25] 0.203 1.548 6.307 0.282 0.702 0.890 0.958
Liu et al. [49] 0.202 1.614 6.523 0.275 0.678 0.895 0.965
Zhou et al. [87] 0.208 1.768 6.856 0.283 0.678 0.885 0.957
Godard et al. [29] 0.148 1.344 5.927 0.247 0.803 0.922 0.964
Zhan et al. [83] 0.144 1.391 5.869 0.241 0.803 0.928 0.969
Our Approach 0.193 1.438 5.887 0.234 0.836 0.930 0.958
Table 6: Numerical comparison of monocular depth estimation over the KITTI [28] data split in [25]. All comparators are
trained and tested on the same dataset (KITTI [28]) while our approach is trained on [67] and tested using [28].
of the input images as seen in Table 4. As for Cityscapes
[20], the test set does not contain video sequences, but
our temporal model still outperforms approaches such as
[17, 52, 53, 74, 86], as demonstrated in Table 3.
Examples of the segmentation results over both datasets
are seen in Figure 5. Additionally, we also use the KITTI
semantic segmentation data [2] in our tests and as shown in
Figure 6, our approach produces high fidelity semantic class
labels despite including no domain adaptation.
4.3. Depth Completion
Evaluation for depth completion ideally requires dense
ground truth scene depth. However, no such dataset exists
for urban driving scenarios, which is why we utilize ran-
domly selected previously unseen synthetic data with avail-
able dense depth images to assess the results. Our model
generates full scene depth and the predicted depth values
for the missing regions of the depth image are subsequently
blended in with the known regions of the image using [63].
Figure 8 shows a comparison of our results against other
contemporary approaches [9, 36, 50, 82]. As seen from the
enlarged sections, our approach produces minimal artefacts
(blurring, streaking, etc.) compared to the other techniques.
To evaluate the structural integrity of the results post com-
pletion, we also numerically assess the performance of our
approach and the comparators. As seen in Table 5, our ap-
proach quantitatively outperforms the comparators as well.
While blending [63] might work well for colour images
with a connected missing region, significant quantities of
small and large holes in depth images can lead to undesir-
able artefacts such as stitch mark or burning effects post
blending. Examples of artefacts can be seen in Figure 7,
which demonstrates the results of the approach applied to
locally captured data. This is further discussed in Section 5.
4.4. Monocular Depth Estimation
As the main focus of our model, our monocular depth
estimation model is evaluated against contemporary state-
of-the-art approaches [8, 25, 29, 49, 83, 87]. Following the
conventions of the literature, we use the data split suggested
in [25] as the test set. These images are selected from ran-
dom sequences and do not follow a temporally sequential
pattern, while our full approach requires video sequences
as its input. As a result, we apply our approach to all the
sequences from which the images are chosen but the evalu-
ation itself is only performed on the 697 test images.
For numerical assessment, the generated depth is cor-
rected for the differences in focal length between the train-
ing [67] and testing data [28]. As seen in Table 6, our ap-
proach outperforms [25, 49, 87] across all metrics and stays
3379
Figure 9: Comparing the results of the approach against [87, 29, 44, 8]. Images have been adjusted for better visualization.
RGB: input colour image; GTD: Ground Truth Depth; DEV: Depth and Ego-motion from Video [87]; LRC: Left-Right
Consistency [29]; SSE: Semi-supervised Estimation [44]; EST: Estimation via Style Transfer [8]; GS: Generated Segmen-
tation.
competitive with [29, 83]. It is important to note that all of
these comparators are trained on the same dataset as the one
used for testing [28] while our approach is trained on syn-
thetic data [67] without domain adaptation and has not seen
a single image from [28]. Additionally, none of the other
comparators is capable of producing temporally consistent
outputs as all of them operate on a frame level. As this can-
not be readily illustrated via still images within Figures 8
and 9, we kindly invite the reader to view the supplemen-
tary video material accompanying the paper.
We also assess our model using the data split of KITTI
[56] and qualitatively evaluate the results, since the ground
truth images in [56] are of higher quality than the laser data
and provide CAD models as replacements for the cars in the
scene. As shown in Figure 6, our method produces sharp
and crisp depth outputs with segmentation results in which
object boundaries and thin structures are well preserved.
5. Limitations and Future Work
Even though our approach can generate temporally con-
sistent depth and segmentation by utilizing a feedback net-
work, this can lead to error propagation, i.e., when an er-
roneous output is generated at one time step, the invalid
values will continually propagate to future frames. This
can be resolved by exploring the use of 3D convolutions or
regularization terms aimed at penalizing propagated invalid
outputs. Moreover, as mentioned in Section 4.3, blending
the depth output into the known regions of the depth [63]
produces undesirable artefacts in the results. This can be
rectified by incorporating the blending operation into the
training procedure. In other words, the blending itself will
take place before the supervisory signal is back-propagated
through the network during training, which would force
the network to learn these artefacts, removing any need for
post-processing. As for our segmentation component, no
explicit temporal consistency enforcement or class balanc-
ing is performed, which has lead to frame-to-frame flick-
ering and lower accuracy with unbalanced classes (e.g.,
pedestrians, cyclists). By improving segmentation, the en-
tire model can benefit from a performance boost. Most of
all, the use of domain adaptation [8, 35] can significantly
improve all results since despite its generalization capabili-
ties, the model is only trained on synthetic data and should
not be expected to perform just as well on naturally-sensed
real-world images.
6. Conclusion
We propose a multi-task model capable of performing
depth prediction and semantic segmentation in a tempo-
rally consistent manner using a feedback network that takes
as its recurrent input the output generated at the previous
time step. Using a series of dense skip connections, we
ensure that no high-frequency spatial information is lost
during feature down-sampling within the training process.
We consider the task of depth prediction within the areas
of depth completion and monocular depth estimation, and
therefore train models based on both objectives within the
depth prediction component. Using extensive experimen-
tation, we demonstrate that our model achieves much bet-
ter results when it performs depth prediction and segmen-
tation at the same time compared to two separate networks
performing the same tasks. The use of skip connections is
also shown to be significantly effective in improving the re-
sults for both depth prediction and segmentation tasks. Al-
though certain isolated issues remain, experimental evalu-
ation demonstrates the efficacy of our approach compared
to contemporary state-of-the-art methods tackling the same
problem domains [17, 29, 36, 40, 53, 82, 83, 87].
We kindly invite the readers to refer to the video:
https://vimeo.com/325161805 for more information and
larger improved-quality result images.
3380
References
[1] Austin Abrams, Christopher Hawley, and Robert Pless. He-
liometric stereo: Shape from sun position. Euro. Conf. Com-
puter Vision, pages 357–370, 2012.
[2] Hassan Alhaija, Siva Mustikovela, Lars Mescheder, Andreas
Geiger, and Carsten Rother. Augmented reality meets com-
puter vision: Efficient data generation for urban driving
scenes. Int. J. Computer Vision, 126(9):961–972, 2018.
[3] Pablo Arias, Gabriele Facciolo, Vicent Caselles, and
Guillermo Sapiro. A variational framework for exemplar-
based image inpainting. Computer Vision, 93(3):319–347,
2011.
[4] Amir Atapour-Abarghouei, Samet Akcay, Gregoire Payen de
La Garanderie, and Toby Breckon. Generative adversarial
framework for depth filling via wasserstein metric, cosine
transform and domain transfer. Pattern Recognition, 91:232–
244, 2019.
[5] Amir Atapour-Abarghouei and Toby Breckon. Depthcomp:
Real-time depth image completion based on prior semantic
scene segmentation. In British Machine Vision Conference,
pages 1–13, 2017.
[6] Amir Atapour-Abarghouei and Toby Breckon. A compara-
tive review of plausible hole filling strategies in the context
of scene depth image completion. Computers and Graphics,
72:39–58, 2018.
[7] Amir Atapour-Abarghouei and Toby Breckon. Extended
patch prioritization for depth filling within constrained
exemplar-based RGB-D image completion. In Int. Conf. Im-
age Analysis and Recognition, pages 306–314, 2018.
[8] Amir Atapour-Abarghouei and Toby Breckon. Real-time
monocular depth estimation using synthetic data with do-
main adaptation via image style transfer. In IEEE Conf. Com-
puter Vision and Pattern Recognition, pages 1–12, 2018.
[9] Amir Atapour-Abarghouei, Gregoire Payen de La Garan-
derie, and Toby Breckon. Back to butterworth - a Fourier
basis for 3D surface relief hole filling within RGB-D im-
agery. In Int. Conf. Pattern Recognition, pages 2813–2818,
2016.
[10] Vijay Badrinarayanan, Alex Kendall, and Roberto Cipolla.
SegNet: A deep convolutional encoder-decoder architecture
for image segmentation. IEEE Trans. Pattern Analysis and
Machine Intelligence, 39(12):2481–2495, 2017.
[11] Mohammad Haris Baig, Vignesh Jagadeesh, Robinson Pi-
ramuthu, Anurag Bhardwaj, Wei Di, and Neel Sundaresan.
Im2Depth: Scalable exemplar based depth transfer. In Win-
ter Conf. Applications of Computer Vision, pages 145–152,
2014.
[12] Marcelo Bertalmio, Andrea Bertozzi, and Guillermo Sapiro.
Navier-stokes, fluid dynamics, and image and video inpaint-
ing. In IEEE Conf. Computer Vision and Pattern Recogni-
tion, volume 1, pages I–I, 2001.
[13] Toby Breckon and Robert Fisher. A hierarchical extension to
3D non-parametric surface relief completion. Pattern Recog-
nition, 45:172–185, 2012.
[14] Gabriel Brostow, Julien Fauqueur, and Roberto Cipolla. Se-
mantic object classes in video: A high-definition ground
truth database. Pattern Recognition Letters, 30(2):88–97,
2009.
[15] Daniel Butler, Jonas Wulff, Garrett Stanley, and Michael
Black. A naturalistic open source movie for optical flow
evaluation. In Euro. Conf. Computer Vision, pages 611–625,
2012.
[16] P. Cavestany, A.L. Rodriguez, H. Martinez-Barbera, and T.P.
Breckon. Improved 3D sparse maps for high-performance
structure from motion with low-cost omnidirectional robots.
In Int. Conf. Image Processing, pages 4927–4931, 2015.
[17] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos,
Kevin Murphy, and Alan L Yuille. Deeplab: Semantic im-
age segmentation with deep convolutional nets, atrous con-
volution, and fully connected CRFs. IEEE Trans. Pattern
Analysis and Machine Intelligence, 40(4):834–848, 2018.
[18] Liang-Chieh Chen, Yi Yang, Jiang Wang, Wei Xu, and
Alan L Yuille. Attention to scale: Scale-aware semantic im-
age segmentation. In Computer Vision and Pattern Recogni-
tion, pages 3640–3649, 2016.
[19] Weihai Chen, Haosong Yue, Jianhua Wang, and Xingming
Wu. An improved edge detection algorithm for depth map in-
painting. Optics and Lasers in Engineering, 55:69–77, 2014.
[20] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo
Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe
Franke, Stefan Roth, and Bernt Schiele. The cityscapes
dataset for semantic urban scene understanding. In IEEE
Conf. Computer Vision and Pattern Recognition, pages
3213–3223, 2016.
[21] Ding Ding, Sundaresh Ram, and Jeffrey Rodriguez. Percep-
tually aware image inpainting. Pattern Recognition, 83:174–
184, 2018.
[22] Alexey Dosovitskiy and Thomas Brox. Generating im-
ages with perceptual similarity metrics based on deep net-
works. In Advances in Neural Information Processing Sys-
tems, pages 658–666, 2016.
[23] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip
Hausser, Caner Hazirbas, Vladimir Golkov, Patrick Van
Der Smagt, Daniel Cremers, and Thomas Brox. Flownet:
Learning optical flow with convolutional networks. In Int.
Conf. Computer Vision, pages 2758–2766, 2015.
[24] David Eigen and Rob Fergus. Predicting depth, surface nor-
mals and semantic labels with a common multi-scale convo-
lutional architecture. In Int. Conf. Computer Vision, pages
2650–2658, 2015.
[25] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map
prediction from a single image using a multi-scale deep net-
work. In Advances in Neural Information Processing Sys-
tems, pages 2366–2374, 2014.
[26] Mohsen Fayyaz, Mohammad Hajizadeh Saffar, Moham-
mad Sabokrou, Mahmood Fathy, Reinhard Klette, and Fay
Huang. STFCN: Spatio-temporal FCN for semantic video
segmentation. In Asian Conf. Computer Vision Workshop,
pages 493–509, 2016.
[27] Raghudeep Gadde, Varun Jampani, and Peter V Gehler. Se-
mantic video CNNs through representation warping. In Int.
Conf. Computer Vision, pages 4463–4472, 2017.
3381
[28] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel
Urtasun. Vision meets robotics: The KITTI dataset. Robotics
Research, pages 1231–1237, 2013.
[29] Clement Godard, Oisin Mac Aodha, and Gabriel J. Bros-
tow. Unsupervised monocular depth estimation with left-
right consistency. In IEEE Conf. Computer Vision and Pat-
tern Recognition, pages 6602 – 6611, 2017.
[30] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing
Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and
Yoshua Bengio. Generative adversarial nets. In Advances in
Neural Information Processing Systems, pages 2672–2680,
2014.
[31] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
Delving deep into rectifiers: Surpassing human-level perfor-
mance on ImageNet classification. In Int. Conf. Computer
Vision, pages 1026–1034, 2015.
[32] Philipp Heise, Sebastian Klose, Brian Jensen, and Alois
Knoll. Pm-huber: Patchmatch with huber regularization for
stereo matching. In Int. Conf. Computer Vision, pages 2360–
2367, 2013.
[33] Daniel Herrera, Juho Kannala, Janne Heikkila, et al. Depth
map inpainting under a second-order smoothness prior. In
Scandinavian Conf. Image Analysis, pages 555–566, 2013.
[34] Heiko Hirschmuller. Stereo processing by semi-global
matching and mutual information. IEEE Trans. Pattern Anal-
ysis and Machine Intelligence, 30:328–341, 2008.
[35] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu,
Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell.
CyCADA: Cycle-consistent adversarial domain adaptation.
In Int. Conf. Machine Learning, pages 1–13, 2018.
[36] Satoshi Iizuka, Edgar Simo-Serra, and Hiroshi Ishikawa.
Globally and locally consistent image completion. ACM
Trans. Graphics, 36(4):107, 2017.
[37] Sergey Ioffe and Christian Szegedy. Batch normalization:
Accelerating deep network training by reducing internal co-
variate shift. In Int. Conf. Machine Learning, pages 1–9,
2015.
[38] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei Efros.
Image-to-image translation with conditional adversarial net-
works. In IEEE Conf. Computer Vision and Pattern Recog-
nition, pages 5967–5976, 2017.
[39] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al.
Spatial transformer networks. In Advances in Neural Infor-
mation Processing Systems, pages 2017–2025, 2015.
[40] Alex Kendall, Vijay Badrinarayanan, and Roberto Cipolla.
Bayesian SegNet: Model uncertainty in deep convolutional
encoder-decoder architectures for scene understanding. In
British Machine Vision Conference, pages 1–12, 2017.
[41] Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task
learning using uncertainty to weigh losses for scene geome-
try and semantics. In IEEE Conf. Computer Vision and Pat-
tern Recognition, pages 1–10, 2018.
[42] Diederik Kingma and Jimmy Ba. Adam: A method for
stochastic optimization. In Int. Conf. Learning Representa-
tions, pages 1–15, 2014.
[43] Mandar Kulkarni and Ambasamudram Rajagopalan. Depth
inpainting by tensor voting. J. Optical Society of America A,
30(6):1155–1165, 2013.
[44] Yevhen Kuznietsov, Jorg Stuckler, and Bastian Leibe. Semi-
supervised deep learning for monocular depth map predic-
tion. In IEEE Conf. Computer Vision and Pattern Recogni-
tion, pages 6647–6655, 2017.
[45] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Fed-
erico Tombari, and Nassir Navab. Deeper depth prediction
with fully convolutional residual networks. In Int. Conf. 3D
Vision, pages 239–248, 2016.
[46] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel,
and Mingyi He. Depth and surface normal estimation from
monocular images using regression on deep features and hi-
erarchical CRFs. In IEEE Conf. Computer Vision and Pattern
Recognition, pages 1119–1127, 2015.
[47] Yule Li, Jianping Shi, and Dahua Lin. Low-latency video
semantic segmentation. In IEEE Conf. Computer Vision and
Pattern Recognition, pages 5997–6005, 2018.
[48] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian D
Reid. RefineNet: Multi-path refinement networks for high-
resolution semantic segmentation. In IEEE Conf. Computer
Vision and Pattern Recognition, volume 1, pages 5168–5177,
2017.
[49] Fayao Liu, Chunhua Shen, Guosheng Lin, and Ian Reid.
Learning depth from single monocular images using deep
convolutional neural fields. IEEE Trans. Pattern Analysis
and Machine Intelligence, 38(10):2024–2039, 2016.
[50] Junyi Liu, Xiaojin Gong, and Jilin Liu. Guided inpainting
and filtering for kinect depth maps. In Int. Conf. Pattern
Recognition, pages 2055–2058, 2012.
[51] Miaomiao Liu, Xuming He, and Mathieu Salzmann. Build-
ing scene models by completing and hallucinating depth and
semantics. In Euro. Conf. Computer Vision, pages 258–274,
2016.
[52] Ziwei Liu, Xiaoxiao Li, Ping Luo, Chen-Change Loy, and
Xiaoou Tang. Semantic image segmentation via deep parsing
network. In Int. Conf. Computer Vision, pages 1377–1385,
2015.
[53] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully
convolutional networks for semantic segmentation. In IEEE
Conf. Computer Vision and Pattern Recognition, pages
3431–3440, 2015.
[54] Emmanuel Maggiori, Yuliya Tarabalka, Guillaume Charpiat,
and Pierre Alliez. Convolutional neural networks for large-
scale remote-sensing image classification. IEEE Trans. Geo-
science and Remote Sensing, 55(2):645–657, 2017.
[55] Kiyoshi Matsuo and Yoshimitsu Aoki. Depth image en-
hancement using local tangent plane approximations. In
IEEE Conf. Computer Vision and Pattern Recognition, pages
3574–3583, 2015.
[56] Moritz Menze, Christian Heipke, and Andreas Geiger. Ob-
ject scene flow. Photogrammetry and Remote Sensing, pages
60–76, 2018.
[57] Suryanarayana M Muddala, Marten Sjostrom, and Roger
Olsson. Depth-based inpainting for disocclusion filling. In
3DTV Conf.: The True Vision-Capture, Transmission and
Display of 3D Video, pages 1–4, 2014.
[58] David Nilsson and Cristian Sminchisescu. Semantic video
segmentation by gated recurrent flow propagation. In IEEE
3382
Conf. Computer Vision and Pattern Recognition, pages 1–11,
2018.
[59] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han.
Learning deconvolution network for semantic segmentation.
In Int. Conf. Computer Vision, pages 1520–1528, 2015.
[60] Emin Orhan and Xaq Pitkow. Skip connections eliminate
singularities. In Int. Conf. Learning Representations, pages
1–11, 2018.
[61] Adam Paszke, Sam Gross, Soumith Chintala, Gregory
Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban
Desmaison, Luca Antiga, and Adam Lerer. Automatic dif-
ferentiation in PyTorch. In Advances in Neural Information
Processing Systems, pages 1–4, 2017.
[62] Deepak Pathak, Philipp Krahenbuhl, Jeff Donahue, Trevor
Darrell, and Alexei Efros. Context encoders: Feature learn-
ing by inpainting. In IEEE Conf. Computer Vision and Pat-
tern Recognition, pages 2536–2544, 2016.
[63] Patrick Perez, Michel Gangnet, and Andrew Blake. Pois-
son image editing. In Graphics, volume 22, pages 313–318,
2003.
[64] Alec Radford, Luke Metz, and Soumith Chintala. Un-
supervised representation learning with deep convolu-
tional generative adversarial networks. arXiv preprint
arXiv:1511.06434, 2015.
[65] Anurag Ranjan and Michael J Black. Optical flow estimation
using a spatial pyramid network. In IEEE Conf. Computer
Vision and Pattern Recognition, pages 2720–2729, 2017.
[66] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net:
Convolutional networks for biomedical image segmenta-
tion. In Int. Conf. Medical Image Computing and Computer-
Assisted Intervention, pages 234–241, 2015.
[67] German Ros, Laura Sellart, Joanna Materzynska, David
Vazquez, and Antonio Lopez. The SYNTHIA Dataset: A
large collection of synthetic images for semantic segmenta-
tion of urban scenes. In IEEE Conf. Computer Vision and
Pattern Recognition, pages 3234–3243, 2016.
[68] Daniel Scharstein and Richard Szeliski. A taxonomy and
evaluation of dense two-frame stereo correspondence algo-
rithms. Int. J. Computer Vision, 47:7–42, 2002.
[69] Evan Shelhamer, Kate Rakelly, Judy Hoffman, and Trevor
Darrell. Clockwork ConvNets for video semantic segmenta-
tion. In Euro. Conf. Computer Vision, pages 852–868, 2016.
[70] Marijn F Stollenga, Wonmin Byeon, Marcus Liwicki, and
Juergen Schmidhuber. Parallel multi-dimensional LSTM,
with application to fast biomedical volumetric image seg-
mentation. In Advances in Neural Information Processing
Systems, pages 2998–3006, 2015.
[71] Michael W Tao, Pratul P Srinivasan, Jitendra Malik, Szymon
Rusinkiewicz, and Ravi Ramamoorthi. Depth from shading,
defocus, and correspondence using light-field angular coher-
ence. In IEEE Conf. Computer Vision and Pattern Recogni-
tion, pages 1940–1948, 2015.
[72] Alexandru Telea. An image inpainting technique based on
the fast marching method. Graphics Tools, 9(1):23–34, 2004.
[73] Tong Tong, Gen Li, Xiejie Liu, and Qinquan Gao. Image
super-resolution using dense skip connections. In Int. Conf.
Computer Vision, pages 4809–4817, 2017.
[74] Jonas Uhrig, Marius Cordts, Uwe Franke, and Thomas Brox.
Pixel-level encoding and depth layering for instance-level
semantic labeling. In German Conf. Pattern Recognition,
pages 14–25, 2016.
[75] Francesco Visin, Marco Ciccone, Adriana Romero, Kyle
Kastner, Kyunghyun Cho, Yoshua Bengio, Matteo Mat-
teucci, and Aaron Courville. Reseg: A recurrent neural
network-based model for semantic segmentation. In Com-
puter Vision and Pattern Recognition Workshops, pages 41–
48, 2016.
[76] Junyuan Xie, Ross Girshick, and Ali Farhadi. Deep3D: Fully
automatic 2D-to-3D video conversion with deep convolu-
tional neural networks. In Euro. Conf. Computer Vision,
pages 842–857, 2016.
[77] Yu-Syuan Xu, Tsu-Jui Fu, Hsuan-Kung Yang, and Chun-Yi
Lee. Dynamic video segmentation network. In IEEE Conf.
Computer Vision and Pattern Recognition, pages 6556–
6565, 2018.
[78] Hongyang Xue, Shengming Zhang, and Deng Cai. Depth im-
age inpainting: Improving low rank matrix completion with
low gradient regularization. IEEE Trans. Image Processing,
26(9):4311–4320, 2017.
[79] Jin Yamanaka, Shigesumi Kuwashima, and Takio Kurita.
Fast and accurate image super resolution by deep CNN with
skip connection and network in network. In Neural Informa-
tion Processing, pages 217–225, 2017.
[80] Raymond Yeh∗, Chen Chen∗, Teck Yian Lim, Schwing
Alexander, Mark Hasegawa-Johnson, and Minh Do. Se-
mantic image inpainting with deep generative models. In
IEEE Conf. Computer Vision and Pattern Recognition, pages
6882–6890, 2017.
[81] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao,
Gang Yu, and Nong Sang. Learning a discriminative feature
network for semantic segmentation. In IEEE Conf. Computer
Vision and Pattern Recognition, pages 1–10, 2018.
[82] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and
Thomas Huang. Generative image inpainting with contex-
tual attention. In IEEE Conf. Computer Vision and Pattern
Recognition, pages 1–15, 2018.
[83] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera,
Kejie Li, Harsh Agarwal, and Ian Reid. Unsupervised learn-
ing of monocular depth estimation and visual odometry with
deep feature reconstruction. In IEEE Conf. Computer Vision
and Pattern Recognition, pages 340–349, 2018.
[84] Yinda Zhang and Thomas Funkhouser. Deep depth comple-
tion of a single RGB-D image. In IEEE Conf. Computer
Vision and Pattern Recognition, pages 175–185, 2018.
[85] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang
Wang, and Jiaya Jia. Pyramid scene parsing network. In
IEEE Conf. Computer Vision and Pattern Recognition, pages
2881–2890, 2017.
[86] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-
Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang
Huang, and Philip Torr. Conditional random fields as recur-
rent neural networks. In Int. Conf. Computer Vision, pages
1529–1537, 2015.
[87] Tinghui Zhou, Matthew Brown, Noah Snavely, and David G.
Lowe. Unsupervised learning of depth and ego-motion from
3383
video. In IEEE Conf. Computer Vision and Pattern Recogni-
tion, pages 6612–6619, 2017.
[88] Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen
Wei. Flow-guided feature aggregation for video object detec-
tion. In Int. Conf. Computer Vision, pages 408–417, 2017.
[89] Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen
Wei. Deep feature flow for video recognition. In IEEE Conf.
Computer Vision and Pattern Recognition, pages 4141–
4150, 2017.
3384