
IN DEGREE PROJECT COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2018

Depth prediction by deep learning

VALENTIN FIGUÉ

KTH ROYAL INSTITUTE OF TECHNOLOGY
SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE

Double Degree in Computer Science
Date: November 8, 2018
Supervisor: Mårten Björkman
Examiner: Danica Kragic
Swedish title: Djupförutsägelse genom deep learning
School of Electrical Engineering and Computer Science


    Abstract

Knowing the depth information is of critical importance in scene understanding for several industrial applications, such as self-driving cars. While depth inference from a single still image has taken a prominent place in recent studies thanks to deep learning methods, practical cases often offer useful additional information that should be considered early in the design of the architecture, in order to improve the quality and robustness of the estimates.

Hence, this thesis proposes a deep fully convolutional network that exploits the information of either stereo pairs or monocular temporal sequences, along with a novel training procedure that takes multi-scale optimization into account.

Indeed, this thesis found that using multi-scale information throughout the network is of prime importance for accurate depth estimation and greatly improves performance, leading to new state-of-the-art results both on synthetic data using Virtual KITTI and on real images with the challenging KITTI dataset.


    Sammanfattning

Att känna till djupet i en bild är av avgörande betydelse för scenförståelse i flera industriella tillämpningar, exempelvis för självkörande bilar. Bestämning av djup utifrån enstaka bilder har fått en alltmer framträdande roll i studier på senare år, tack vare utvecklingen inom deep learning. I många praktiska fall tillhandahålls ytterligare information som är högst användbar, vilket man bör ta hänsyn till då man designar en arkitektur för att förbättra djupuppskattningarnas kvalitet och robusthet.

I detta examensarbete presenteras därför ett så kallat djupt fullständigt faltningsnätverk, som tillåter att man utnyttjar information från tidssekvenser både monokulärt och i stereo, samt nya sätt att optimalt träna nätverken i multipla skalor.

I examensarbetet konstateras att information från multipla skalor är av synnerlig vikt för noggrann uppskattning av djup och för avsevärt förbättrad prestanda, vilket resulterat i nya state-of-the-art-resultat på syntetiska data från Virtual KITTI såväl som på riktiga bilder från det utmanande KITTI-datasetet.

Contents

1 Introduction
  1.1 Context
  1.2 Societal Impact
  1.3 Ethical consideration
  1.4 Objectives
  1.5 Validation of the results

2 Theoretical Background
  2.1 Definition of the problem
  2.2 3D vision
    2.2.1 Mathematical Notations
    2.2.2 Stereoscopic vision
    2.2.3 Temporal monocular vision
  2.3 Deep Learning
    2.3.1 Composition of a deep convolutional neural network
    2.3.2 Experimental implementation

3 Related Work
  3.1 ResNet
  3.2 Prediction from a single image
    3.2.1 Eigen Network
    3.2.2 Laina Network
  3.3 Stereoscopic depth inference
    3.3.1 Kuznietsov network
    3.3.2 Godard Network
  3.4 Sequential inference
    3.4.1 Vijayanarasimhan Network
  3.5 Extension to similar problems
    3.5.1 FlowNet
  3.6 Synthesis

4 Approach
  4.1 MSDOS-Net Architecture
    4.1.1 Multi-Scale Coarse-to-Fine (MSCF) module
    4.1.2 Coarse-to-Fine Inference
  4.2 Multi-Scale Training Approach

5 Experimentation and results
  5.1 Implementation Details
  5.2 Evaluation Metrics
  5.3 Virtual KITTI
  5.4 KITTI
    5.4.1 Comparison with the State-of-the-art
    5.4.2 Generalization to temporal sequences

6 Conclusion

Bibliography

Chapter 1

    Introduction

    1.1 Context

During the last few years, many new industrial projects have arisen, for instance in robotics or self-driving cars, where the core of the problem is to retrieve spatial information from a single image, a pair of images, or a sequence of images. Recovering dense depth information from a pair of images is a non-trivial, essential task in computer vision that has been explored for several decades [7], [17]. However, all these classical approaches require a pair of stereoscopic images and achieve only intermediate performance.

The advent of deep learning methods has improved this field of research in terms of performance. Indeed, thanks to the availability of large amounts of RGB-D (color and depth) data collected with dedicated depth sensors, and to the emergence of synthetic data, several deep networks nowadays achieve impressive results in depth prediction. The effectiveness of deep learning approaches even allows solving, to some extent, ill-posed problems such as depth prediction from a single image.

Indeed, most current deep architectures do not use a pair of images as classical approaches do, but infer depth from a single image. However, multiple-image acquisition tends to become a standard in vision-based applications and systems. Personal and public image collections are continuously growing, with ever more redundancy and overlap between images. Even shooting a simple photograph with a smartphone often implies acquiring multiple frames before combining them, sometimes from different sensors. That is why this thesis focuses on developing a novel deep architecture for depth prediction from a pair of images.

    1.2 Societal Impact

Improving depth prediction can have a huge societal impact. Indeed, it can easily be adapted for self-driving cars, or even for improving classical cars. With one or two cameras on the car, the solution developed in this thesis can build the depth of the scene in front of the vehicle. This depth map can be used to spatially localize obstacles such as pedestrians, other cars, or even dogs crossing the road. Thanks to this detection, the trajectory of the vehicle can be adapted to avoid collisions, which can help reduce the number of accidents. The solution can also be adapted to help blind or partially sighted persons in everyday life.

    1.3 Ethical consideration

One of the keys to deep learning approaches lies in the amount of data and the accuracy of those data for the given problem. However, if we consider the self-driving car problem, for example, in order to detect pedestrians crossing the road efficiently, there must be enough representations of pedestrians in the data. From an ethical point of view, the collection of those data can be done only with the agreement of the pedestrians. This is why one should always verify, when using a dataset, that all the agreements have been collected. If the agreements have not been collected, the faces can be blurred out.

    1.4 Objectives

Given the performance of recent deep learning methods and their very recent results, this thesis first focuses on a review of the different preexisting approaches for depth prediction from a single image and from a pair or a sequence of images. This review aims to compare the different architectures and to highlight how they differ from classical approaches, in order to understand from a qualitative point of view why they are so efficient.

Once those approaches have been synthesized, this thesis proposes a new deep architecture. The model tackles only two depth prediction problems:

    • Recovering depth from two stereoscopic images.

    • Recovering depth from two sequential images.

    1.5 Validation of the results

In order to compare the efficiency of our approach with preexisting methods, its performance is evaluated on two different academic datasets:

• The real, sparse RGB-D dataset KITTI

• The synthetic, dense RGB-D dataset Virtual KITTI

One of the main advantages of these two datasets is that they provide images from car driving sequences, which illustrates how well these methods can be used for self-driving cars.

Chapter 2

    Theoretical Background

This chapter presents the different formulas and notations necessary to understand the deep neural networks introduced later.

    2.1 Definition of the problem

This master's thesis aims to propose a model to predict depth from RGB images. Mathematically speaking, the depth prediction of an image by a given model follows this scheme:

$$F(I, Supp) = Z \quad (2.1)$$

where F represents the function of the model, I the image for which the depth needs to be predicted, Supp the supplementary images that help the prediction, and Z the predicted depth map. The ideal depth prediction model, given a certain error criterion, fulfills the following condition:

$$F_{ideal} = \arg\min_{F} \sum_{i \in S} E\big(F(I_i, Supp_i),\, Z'_i\big) \quad (2.2)$$

with E the error criterion, S the dataset of images on which we want to test the model, and Z' the depth ground truth.

Depending on the data provided, three different situations will be explored. These situations differ in how the supplementary images are constituted:

• In most deep learning approaches, only a single image is used to predict the depth. In this case Supp = {}. This is defined as monocular prediction.

• Most classical approaches use a single supplementary image from a different, calibrated camera. I_l and I_r represent the images provided by the left and the right camera, respectively. In this case, Supp = I_l or I_r, and this is named stereoscopic prediction.

• The last case represents the situation where the supplementary images come from the same camera but at a different time. I_t represents the image of interest, on which we compute the prediction, and I_{t-1}, ..., I_{t-n} the n previous frames. This will be called sequential prediction.

    2.2 3D vision

Throughout this master's thesis, several geometric operations will be performed. In order to be as clear as possible, this section introduces the different mathematical notions that will be used.

2.2.1 Mathematical Notations

A point in a given image will be represented by two values x and y, which represent the two usual axes. For instance, I(x, y) represents the pixel located at coordinates (x, y) in the image I.

The coordinates of a 3D point will be denoted by the capital letters X, Y and Z, where Z represents the depth.

One can convert 3D point coordinates to image coordinates through the usual projection matrix, noted P, by the following formula:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = k\, P \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \quad (2.3)$$

where k represents a scaling factor. The projection matrix P contains all the intrinsic parameters of the camera, such as the focal length. We do not explain the exact composition of the matrix P, as it will not be necessary in the following.

In the case where the camera-space origins and axes of the two cameras are not aligned, the rotation and translation between the two geometrical spaces need to be taken into account. R will represent the rotation matrix and T = (t_x, t_y, t_z)^T the translation vector between the two. In this situation, the conversion from 3D point coordinates to image coordinates contains another term:

$$\begin{pmatrix} x \\ y \\ 1 \end{pmatrix} = k\, P \begin{pmatrix} R_{3\times 3} & T \\ 0\;\;0\;\;0 & 1 \end{pmatrix} \begin{pmatrix} X \\ Y \\ Z \\ 1 \end{pmatrix} \quad (2.4)$$

where k still represents a scaling factor.
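
As a concrete illustration of Equations 2.3 and 2.4, the following sketch projects 3D points onto the image plane with NumPy. The function name project_points and the 3×4 shape assumed for P are illustrative choices, not notation from this chapter.

```python
import numpy as np

def project_points(P, points_3d, R=None, T=None):
    """Project 3D points onto the image plane (Equations 2.3 and 2.4).

    P: 3x4 projection matrix holding the camera intrinsics.
    points_3d: (N, 3) array of X, Y, Z coordinates.
    R, T: optional 3x3 rotation and length-3 translation between the
          two camera frames (the extra term of Equation 2.4).
    """
    n = points_3d.shape[0]
    homog = np.hstack([points_3d, np.ones((n, 1))])  # (N, 4) homogeneous

    if R is not None and T is not None:
        # Build the 4x4 rigid transform [R | T; 0 0 0 1] of Equation 2.4.
        M = np.eye(4)
        M[:3, :3] = R
        M[:3, 3] = T
        homog = homog @ M.T

    proj = homog @ P.T                     # (N, 3), still scaled by k
    return proj[:, :2] / proj[:, 2:3]      # divide out the scaling factor k
```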

2.2.2 Stereoscopic vision

The stereoscopic case, that is to say when the images come from different horizontal positions, presents a specific notion. This notion is called disparity and is defined by the following implicit equation:

$$I_l(x, y) = I_r(x - \rho(x, y),\, y) \quad (2.5)$$

where ρ represents the disparity function. The main advantage of using the disparity in our case is the formula that links the disparity with the depth. Indeed, one can deduce from the spatial transformation that:

$$\rho(x, y) = \frac{f\, b}{Z(x, y)} \quad (2.6)$$

where f is the focal length of both cameras (one of the intrinsic parameters) and b the distance between the two cameras. Equations 2.5 and 2.6 show that we have an indirect way to solve the depth prediction problem by first estimating the disparity.
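
A minimal sketch of this indirect route, assuming the disparity map is given as a NumPy or PyTorch array; the small epsilon guard against zero disparity is an addition of ours, not part of Equation 2.6.

```python
def disparity_to_depth(disparity, f, b, eps=1e-6):
    """Invert Equation 2.6: Z(x, y) = f * b / rho(x, y).

    disparity: estimated disparity map rho.
    f: focal length shared by both cameras.
    b: baseline, i.e. the distance between the two cameras.
    """
    return f * b / (disparity + eps)  # eps avoids division by zero
```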

2.2.3 Temporal monocular vision

A similar concept exists in the sequential situation, known as optical flow. The optical flow links two images from the same camera at two different time stamps. As for the disparity, the optical flow is defined by an implicit equation:

$$I_{t+1}(x, y) = I_t(x + V_x(x, y),\, y + V_y(x, y)) \quad (2.7)$$

where V_x and V_y represent the optical flow along the two axes.

The first difference between these two concepts is dimensional: the calibration of the cameras in the stereoscopic case restricts Equation 2.5 to a single axis, instead of the two dimensions of Equation 2.7. This is why it is easier to estimate the disparity than the optical flow. The second difference is that there exists no direct formula linking optical flow and depth.

The study of the optical flow can seem pointless, but as both images come from the same camera, given the displacement of the camera (that is to say the rotation R and translation T), it is possible to link the optical flow to the 3D coordinates of the points, which depend on the depth.

    2.3 Deep Learning

This section aims to provide a quick overview of how a deep neural network is built.

2.3.1 Composition of a deep convolutional neural network

    Microscopic approach

A convolutional neural network can be decomposed as a succession of three different kinds of operations. This stack of operations composes what is called a layer of the network. Mathematically speaking, if F represents the function of the network and f_i the function of each layer, the following equation describes the behavior of the network:

$$F(X) = f_n \circ \cdots \circ f_1(X) \quad (2.8)$$

where n represents the total number of layers of the network and X its input. Most often, the input is a 3-dimensional tensor representing an image. The first dimension is called the depth or channels; the two others are the height and width of the input image. The 2-dimensional matrix associated with each channel value is called a feature map.

The width and the height of these feature maps are most of the time reduced along the network. This reduction can be performed by different operations of the network (convolution, max pooling). The max pooling operation consists in keeping the maximum value in each 2×2 square of the feature map, which results in a reduction by a factor of 2.

The first generic operation of a layer is the convolution. It aims to filter the input to extract the useful information. The explicit formula of the convolution is the following:

$$(X * C(k, s))(z, x, y) = \sum_{n=1}^{N} \sum_{i=1}^{k} \sum_{j=1}^{k} X(n,\, s\,x,\, s\,y)\; C_z(n,\, s\,x - i,\, s\,y - j) \quad (2.9)$$

The convolution has different parameters. The first one is the kernel size k, which can be seen as the size of the receptive field. The second one, s, is called the stride. Usually s = 1, but when s = 2 the output width and height are divided by 2. The depth of the output is the last parameter that can be set.

The second operation of the layer is a batch normalization, introduced by Ioffe and Szegedy [9]. This operation aims to normalize and center the coefficients of the input tensor. It has two parameters: the batch mean, called m, and the standard deviation of the batch, called σ. The operation is the classic normalization:

$$BN(X) = \frac{X - m}{\sigma} \quad (2.10)$$

where BN represents the batch normalization function. This operation is not compulsory, but it is present most of the time to increase either the performance or the convergence speed of the network.

The last operation of the layer is a non-linear operation. It can be performed by different functions, such as:

• ReLU: f(x) = max(0, x)

• Sigmoid: f(x) = (1 + e^{-x})^{-1}

• Hyperbolic tangent: f(x) = tanh(x)

The ReLU function is the most used, due to its efficiency relative to its computational cost.
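
Put together, one such generic layer can be sketched in PyTorch as follows; the helper name conv_bn_relu and the padding choice are ours, assuming 2D feature maps:

```python
import torch.nn as nn

def conv_bn_relu(in_channels, out_channels, kernel_size, stride=1):
    """One generic layer as described above: convolution (Eq. 2.9),
    batch normalization (Eq. 2.10), then a ReLU non-linearity."""
    return nn.Sequential(
        nn.Conv2d(in_channels, out_channels, kernel_size,
                  stride=stride, padding=kernel_size // 2, bias=False),
        nn.BatchNorm2d(out_channels),
        nn.ReLU(inplace=True),
    )
```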

    Macroscopic structure

Most often the structure of a deep neural network is very similar, regardless of its goal. The first part of the network usually aims to encode the input images. This encoding results in spatially small features; their width and height can be, for example, 32 times smaller than those of the original image. However, we usually see a high number of feature maps at this level of encoding, containing useful information about the input with a reduced total number of dimensions.

Then, depending on the function the network aims to fulfill, the network performs a decoding procedure. This decoding decreases the number of channels and transforms the features into the desired form of output. For classification, for instance, it results in a vector of class probabilities. In the depth prediction case, it results in an image representing the depth map.

    Training procedure

In order to achieve its goal, a network needs to be trained for a specific task on a dataset. Initially, all the parameters of the network (the coefficients of the convolutions and batch normalizations) are initialized randomly. Then the network is trained by minimizing a loss function. This loss function can be formulated as:

$$Loss = \sum_{(X, Z') \in S} Error\big(F(X),\, Z'\big) \quad (2.11)$$

where X represents the input image tensor, Z' the ground truth, and S the dataset. Usually the L1 or L2 norm is chosen for the Error function, but more complex functions can also be used.

The minimization of this function is performed by stochastic gradient descent. All the experiments conducted during this thesis were performed using the Adam optimizer introduced by Kingma and Ba [10], which is one of the most efficient stochastic gradient methods for training neural networks. The gradient is computed by backpropagation, first introduced for neural networks by LeCun et al. [13].
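
A minimal sketch of this training procedure in PyTorch, assuming a supervised setting with an L2 error; the function name and the data loader yielding (X, Z') pairs are illustrative assumptions:

```python
import torch

def train(model, loader, epochs, lr=1e-4):
    """Minimal training loop: the L2 loss of Equation 2.11 minimized
    with the Adam optimizer [10]; gradients come from backpropagation [13]."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for _ in range(epochs):
        for image, depth_gt in loader:   # (X, Z') pairs from the dataset S
            optimizer.zero_grad()
            loss = criterion(model(image), depth_gt)
            loss.backward()              # backpropagation
            optimizer.step()
```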

2.3.2 Experimental implementation

One of the main drawbacks of deep neural network methods is their computational cost. Specific libraries such as Torch, TensorFlow, Caffe and PyTorch have been developed during the last five years. All the experiments performed during this thesis were implemented with PyTorch. This Python library, created by Paszke et al. [16], was released in early 2017. It implements all the operations discussed in the previous section efficiently on GPUs, to speed up training.

Chapter 3

    Related Work

    3.1 ResNet

Before reviewing the different preexisting deep neural networks for depth prediction, it is essential to briefly explain one of the main recent architectures that revolutionized the image classification challenge. It is named ResNet and was first introduced by He et al. [8]. This network is essential for us because all the networks explained in this chapter reuse its architecture in their encoding part.

The main contribution of this network is a new module with a skip connection: an additional branch performing the identity function and skipping some convolutions. The feature maps resulting from the skip branch are then added to the feature maps of the main branch.

This module exists in two forms: a skip module (Figure 3.1) and a skip projection module (Figure 3.2). The only difference is the presence or absence of a convolution in the skip connection. This convolution is present only to make sure that, before the addition, the features have the same size.

Figure 3.1: Illustration of ResNet Skip, one of the residual modules introduced by He et al. [8]. This module does not change the number of channels or the resolution of the features.

Figure 3.2: Illustration of ResNet Proj, one of the residual modules introduced by He et al. [8]. This module is used to increase the number of features and decrease the resolution.

The basic idea behind this module is to ease the convergence of the network. Indeed, due to the large number of layers, the gradient tends to vanish during backpropagation. The skip connection limits this effect by providing "shortcut" paths during backpropagation, and thus allows training deeper networks, further increasing the achievable performance.

Nowadays, most state-of-the-art networks, whatever their target function, encode the input with ResNet modules before applying task-specific modules.

Different versions of this network exist depending on the number of layers; ResNet50, for instance, refers to the ResNet with 50 layers.
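
The following PyTorch sketch shows the idea of the skip module of Figure 3.1, with a simplified two-convolution body; the actual ResNet50 modules use a three-convolution bottleneck, so this is only an illustration of the skip connection itself:

```python
import torch.nn as nn

class SkipModule(nn.Module):
    """Sketch of the ResNet skip module (Figure 3.1): the identity
    branch is added to the output of the convolutions, so the number
    of channels and the resolution are unchanged."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + x)  # skip connection
```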

    3.2 Prediction from a single image

Most of the recent deep learning approaches to the problem of depth prediction use only a single image, instead of the two images used by classical approaches.

3.2.1 Eigen Network

The first network that tackled the problem of depth prediction was published by Eigen, Puhrsch, and Fergus [2]. Figure 3.3 details the composition of this network. It can be decomposed into two smaller networks: the first one, in blue in Figure 3.3, predicts a coarse depth map; the second one, in orange, refines the coarse result.

Figure 3.3: Architecture of the Eigen network with all the different operations performed.

Several differences can be observed between the two. The first is the presence, in the coarse network, of two fully connected layers. These layers perform operations that link every coefficient of the features: local specificity is lost, but global structure is captured. The second network performs only several convolutions with a small kernel, to refine the information locally.

This network was the first major contribution to monocular depth prediction using deep neural networks.

3.2.2 Laina Network

The state-of-the-art network for monocular depth prediction has been implemented by Laina et al. [12]. Figure 3.4 illustrates its microscopic composition. It presents a classical encoder-decoder structure, as explained in the previous section.

The encoding operation is performed by ResNet50. The decoding operation is specific to this network. The basic idea is to reduce the number of features during the upsampling. To do so, Laina et al. [12] introduce a new module called up-projection. This module first performs an unpooling, in order to increase the width and height of the features, and then convolutions that decode the information by reducing the number of features. The unpooling operation consists in filling the values of a feature map into a feature map twice as large, initialized with zeros; the coefficients fill only every other column and every other row.

Figure 3.5 illustrates the microscopic structure of this module. One can notice the presence of a skip connection identical to the one in the residual module of ResNet; it was designed to achieve the same goal.

Figure 3.4: Architecture of the Laina network with all the different operations performed.

Figure 3.5: Illustration of the up-projection module introduced by Laina et al. [12]. The number of channels is decreased at the end of this module and the resolution of the features is multiplied by 2.

This network architecture is quite simple in the sense that it takes a single image as input and outputs its depth map. It achieves state-of-the-art results on several academic datasets, such as the NYU depth dataset [15], and justifies the encoder-decoder approach for depth prediction. All the following networks use a similar architecture.
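
The unpooling step inside the up-projection module can be sketched as follows; this is our reading of the zero-filling operation described above, with the surrounding convolutions and skip branch omitted:

```python
import torch

def unpool(x):
    """Zero-filling unpooling: each value of the feature map is written
    into a map twice as large, on every other row and column; the
    remaining entries stay zero."""
    n, c, h, w = x.shape
    out = torch.zeros(n, c, 2 * h, 2 * w, device=x.device, dtype=x.dtype)
    out[:, :, ::2, ::2] = x
    return out
```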

    3.3 Stereoscopic depth inference

Very recently, some deep neural networks have integrated the stereoscopic information. Two publications tackled this problem.

Figure 3.6: Illustration of the macro architecture of the network introduced by Kuznietsov, Stückler, and Leibe [11].

3.3.1 Kuznietsov network

Kuznietsov, Stückler, and Leibe [11] published an article in which they introduce a network achieving state-of-the-art performance on the KITTI dataset. Its structure is similar to the Laina network; the only difference is in the loss function.

Indeed, Kuznietsov, Stückler, and Leibe [11] use the stereoscopic image of the pair, that is to say the image from the other calibrated camera, to add a term to the error. According to Equation 2.6, if the depth is known, one can retrieve the disparity and thereby reconstruct the other image of the pair with Equation 2.5. The loss function they introduce is the following:

$$Loss = \|F(I_l) - Z'\| + \|I_r(x - \rho(x, y),\, y) - I_l\| \quad (3.1)$$

where F represents the network function, Z' the ground truth, and ρ the disparity computed from the depth prediction. This loss function forces the model not only to predict depth values close to the ground truth, but also depth that leads to a consistent stereo reconstruction.

It is interesting to notice that one can keep only the second term of the loss function, ||I_r(x - ρ(x, y), y) - I_l||. By doing so, the ground truth Z' is no longer necessary to train the model; the model can thus be trained with a dataset containing only stereoscopic images. This is called unsupervised learning.

Figure 3.7: Illustration of the training procedure of the networks introduced by Godard, Mac Aodha, and Brostow [6].
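
A sketch of the stereo reconstruction used in the second term of Equation 3.1, assuming a disparity map expressed in pixels; the function name and the use of grid_sample for the warping are implementation choices of ours:

```python
import torch

def reconstruct_left(right, disparity):
    """Sketch of the reconstruction I_r(x - rho(x, y), y) of Equation 3.1.

    right: right image of shape (N, C, H, W).
    disparity: disparity map rho of shape (N, H, W), in pixels.
    """
    n, _, h, w = right.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xs = xs.unsqueeze(0).float() - disparity        # shift x by rho(x, y)
    ys = ys.unsqueeze(0).float().expand_as(xs)
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([2 * xs / (w - 1) - 1, 2 * ys / (h - 1) - 1], dim=-1)
    return torch.nn.functional.grid_sample(right, grid, align_corners=True)
```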

3.3.2 Godard Network

Another network, published by Godard, Mac Aodha, and Brostow [6], uses stereoscopic images in an unsupervised way to train for depth prediction.

This network has a very similar structure: an encoding part performed by ResNet50 modules and then a specific decoding. The training procedure is slightly different because it takes into account the predictions from both the left and the right image, to make the depth prediction more consistent. This network is trained without depth ground truth. Figure 3.7 illustrates how the training procedure works. First the network outputs two different disparities, d_l and d_r, which are used to reconstruct the right image from the left image and vice versa. The loss function is composed of three different terms. The first one is an error on the two stereo reconstructions, defined by this equation:

$$Loss_{reconstruction} = \|I_r(x - d_r(x, y),\, y) - I_l\| + \|I_l(x - d_l(x, y),\, y) - I_r\| \quad (3.2)$$

The second term is a consistency term on the disparities:

$$Loss_{disparities} = \|d_l(x, y) - d_r(x + d_l(x, y),\, y)\| \quad (3.3)$$

The loss function also contains a smoothness cost, which aims to align the gradients of the depth prediction with the gradients of the image I:

$$Loss_{smoothness} = \|\partial_x d\|\, e^{-|\partial_x I|} + \|\partial_y d\|\, e^{-|\partial_y I|} \quad (3.4)$$

The total loss function is the sum of the three terms, for both the left and right disparities.

This network is nowadays the most efficient network for unsupervised depth prediction on the KITTI dataset.
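
As an illustration, the smoothness term of Equation 3.4 could be implemented as follows, assuming a depth of shape (N, 1, H, W) and an image of shape (N, 3, H, W); averaging the image gradients over the color channels is an assumption of ours:

```python
import torch

def smoothness_loss(depth, image):
    """Sketch of the edge-aware smoothness term of Equation 3.4: depth
    gradients are penalized less where the image itself has strong edges."""
    dx_d = (depth[:, :, :, 1:] - depth[:, :, :, :-1]).abs()
    dy_d = (depth[:, :, 1:, :] - depth[:, :, :-1, :]).abs()
    dx_i = (image[:, :, :, 1:] - image[:, :, :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[:, :, 1:, :] - image[:, :, :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()
```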

    3.4 Sequential inference

Following the same idea, very recent networks have tried to use sequential information to increase their performance.

3.4.1 Vijayanarasimhan Network

Vijayanarasimhan et al. [18] published a network this year that uses sequential information to improve its predictions. Figure 3.8 presents its architecture, which is quite different from the previous networks. Indeed, a first network predicts the depth from a single image. A second network takes a sequential pair of images as input and predicts an object mask and the motion of the camera and of each object between the two frames. The idea of this network is to compute the optical flow from the depth and motion estimates, in order to add a reconstruction error to the loss function, as was already the case in the stereoscopic situation with the disparity.

To do so, one can compute the 3D coordinates of each pixel, given the camera parameters, using Equation 2.3. From those 3D points, one can use Equation 2.4 to project the points onto image t+1. The rotation and translation matrices are obtained by composing the object and camera motion predictions. Once the projection onto image t+1 is performed, one can compute the optical flow by comparing the displacement of each pixel between the two frames.

Figure 3.8: Architecture of the network introduced by Vijayanarasimhan et al. [18].

The loss function contains a term based on the reconstruction error using the optical flow. In this case it takes the following form:

$$Loss = \|F(I_t) - Z'\| + \|I_{t+1}(x + V_x(x, y),\, y + V_y(x, y)) - I_t\| \quad (3.5)$$

where V corresponds to the estimated optical flow, Z' the ground truth, and I_t the image at time t. As with both previous networks, this network can be trained in an unsupervised way, namely without depth ground truth, if the first term of the loss function is removed.

Another network, introduced by Zhou et al. [19] this year, presents a similar architecture and training procedure. The only difference is the absence of the object mask and object motions. This network requires that the only motion between two different instants be the camera motion, unlike the Vijayanarasimhan network, which also incorporates object motions.

    3.5 Extension to similar problems

All the networks from the previous sections have one point in common: they input only a single image to predict depth. To the best of our knowledge, there is no state-of-the-art network using a pair of images as input. This is why this thesis proposes a novel architecture to do so.

Figure 3.9: Architecture of FlowNet, introduced by Dosovitskiy et al. [1].

However, we found some networks that take a pair of images as input but are not designed for depth prediction.

3.5.1 FlowNet

For instance, Dosovitskiy et al. [1] introduced a network to predict optical flow from a sequential pair. This network presents two specificities in its architecture, illustrated in Figure 3.9. The first is the presence of a correlation module: both images are first encoded independently and are then correlated (yellow operations in Figure 3.9). This correlation operation is explained further in the next chapter; roughly, it can be understood as a scalar product between features along the depth dimension. The correlation features are then encoded deeper and injected into a refinement module, illustrated in Figure 3.10. The second specificity is that the network outputs four optical flow maps at different resolutions. Each optical flow map, once predicted, is re-injected into the network to help the decoding.

    3.6 Synthesis

This section synthesizes the core elements of the previous networks, in order to build the most efficient network possible.

Figure 3.10: Architecture of the refinement module introduced by Dosovitskiy et al. [1].

• All the recent and most efficient networks presented in the previous sections are based on an encoder-decoder architecture. The encoding is performed by layers coming from ResNet50. In order to perform best, the network proposed by this thesis reuses this scheme.

• Several comparisons can be made. Both the Eigen network and FlowNet use a global-to-fine approach, even though they do not fulfill the same function. This is similar to preexisting disparity estimation methods.

• The performance of the prediction can be increased if temporal or stereoscopic information is available. This information is integrated via the loss function most of the time. On the other hand, FlowNet directly inputs the temporal information and performs a specific operation: a correlation. As the goal of this thesis is to build a model that inputs a pair of images, the correlation module should be part of the network design.

Chapter 4

    Approach

This chapter presents a novel network for two-image inference: MSDOS-Net (Multi-Scale Depth Optimization Strategy Network). This network aims to predict depth from a pair of images (sequential or stereoscopic). The design of this network is inspired by the previous state-of-the-art networks. MSDOS-Net introduces three new contributions to address the drawbacks of preexisting networks: a pyramidal structure inspired by classical disparity estimation, a new decoding module called EnF-DED, and a new training strategy.

    4.1 MSDOS-Net Architecture

The overall MSDOS-Net architecture is presented in Figure 4.1. The model can be decomposed into three separate macroscopic modules, detailed in the sections hereafter.

The first module transposes a classical pyramidal correlation approach into several encoders operating on different image resolutions. Encoded images are first concatenated, then correlated with the second image's features, using the correlation layer first introduced in FlowNet by Dosovitskiy et al. [1]. This module is explained in the following section. The resulting features are concatenated with part of the encoded images, so that some monocular information moves forward in the network.

In order to enforce the robustness of the correlation, the same operation is performed on feature maps picked up at different resolutions in our encoding pyramid (shown in red and green in Figure 4.1). Each output of this Multi-Scale Coarse-to-Fine (MSCF) module is processed by a Depth Encoder-Decoder (DED) component, which predicts a depth map for the corresponding level of detail. Their inference is performed sequentially, in a coarse-to-fine manner: the outputs of the first DED module initialize the second one, and so on.

Figure 4.1: MSDOS-Net overall architecture: a coarse-to-fine depth map prediction from a pyramidal left-right encoding. Correlations are performed at multiple resolutions (in blue, green and red in the figure) and integrated successively in the corresponding Expand and Fuse modules.

Between each pair of DEDs lies the last key component of our architecture: an up-sampling module that doubles the size of the prior depth map and features, then concatenates these two with both the corresponding correlation result and a down-sampled instance of the reference input image (typically the left one in our stereoscopic study). This module is referred to as the Expand and Fuse (EnF) module.

A last EnF-DED-like sequence refines the depth map to half the resolution of the input images.

For clarity of explanation, the pyramidal decomposition is limited to three levels in this thesis, but the proposed approach could easily be adapted to any input resolution.

4.1.1 Multi-Scale Coarse-to-Fine (MSCF) module

Stereoscopic or sequential inputs are first down-sampled following a pyramidal decomposition scheme, each level being half the resolution of the previous one.

The resulting images are encoded down to the desired feature map size. This encoding part is partially composed of modules coming from ResNet50 pre-trained on the ImageNet dataset, to ease convergence. The encoding process is very similar to the preexisting networks detailed in the previous section. Table 4.1 shows the specific architecture for each input resolution. One can notice that deeper encodings are performed at higher resolutions, so as to obtain the same final dimensions.

Considering the left and right inputs separately, feature maps of equal resolution are picked from the down-sampled encoding branches and concatenated. The number of these samples is limited in the following way: every encoder contributes to the coarsest level (in blue in Figure 4.1), while two out of three provide mid-size features and only one the finest output.

Left and right aggregated features are then matched at each resolution. A correlation layer repeats the principle detailed in FlowNet [1]: a module that performs multiplicative patch comparisons by convolving left and right data. Thus, it has no trainable weights. Dosovitskiy et al. [1] define the correlation of two patches of size Ω, centered at x_1 and x_2 in the feature maps f_1 and f_2, as follows:

$$C(x_1, x_2) = \sum_{o \in \Omega} f_1(x_1 + o) \cdot f_2(x_2 + o) \quad (4.1)$$

Figure 4.2: Detail of the correlation layer, which consists in the inner product of the left and right encoded images. The number of outputs is fixed to 473 features, regardless of the size of the correlation; the extra features come from a single convolution of the left image.

This operation is repeated all around x_2, in a neighborhood expected to contain the effective matching displacement. The prospected area should depend on both the considered application and the down-sampling level. Indeed, working on stereoscopic pairs implies prior knowledge of the disparity direction, while temporal overlap does not provide such a constraint.

The output of the correlation layer is thus sized by the explored neighborhood. For instance, computing the correlation in a 7×7 pixel area results in 49 correlation features. In the multi-scale framework, the number of these outputs is fixed to 473 features, regardless of the sizes of the correlations; the extra features come from a single convolution of the encoded images (see Figure 4.2). Table 4.2 summarizes the effective correlation parameters used to train on Virtual KITTI for temporal inter-frame consistency.
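
A naive sketch of this correlation layer for 1×1 patches, where Equation 4.1 reduces to a dot product over the channel dimension; efficient implementations exist, but this loop version shows the principle, and the function name and padding scheme are ours:

```python
import torch
import torch.nn.functional as F

def correlation(f1, f2, max_disp):
    """Sketch of the correlation layer of Equation 4.1 for 1x1 patches:
    inner products between f1 and shifted versions of f2, one output
    channel per tested displacement. A (2*max_disp+1)^2 neighborhood
    yields e.g. 49 channels for a 7x7 area, as in Table 4.2."""
    n, c, h, w = f1.shape
    f2_pad = F.pad(f2, [max_disp] * 4)  # zero-pad the last two dimensions
    outputs = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = f2_pad[:, :, dy:dy + h, dx:dx + w]
            outputs.append((f1 * shifted).sum(dim=1))  # dot over channels
    return torch.stack(outputs, dim=1)  # (N, (2*max_disp+1)**2, H, W)
```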

Image 1:1             Image 1:2             Image 1:4
Conv7,2(3, 64)
Conv5,2(64, 64)
High-res corr.
Conv5,2(64, 64)       Conv7,2(3, 64)
Resnet1(64, 256)      Conv5,2(64, 64)
Mid-res corr.         Mid-res corr.
Resnet2(256, 512)     Conv5,2(64, 64)       Conv7,2(3, 64)
                      Resnet1(64, 256)      Conv5,2(64, 64)
Low-res corr.         Low-res corr.         Low-res corr.

Table 4.1: Architecture of the networks for the different input resolutions, where Convk,s(channels_in, channels_out) represents the convolution of kernel size k and stride s which takes channels_in channels and returns channels_out. All the convolutions are followed by a batch normalization step and a ReLU non-linearity. Resnet_i represents the i-th layer of ResNet50, which is composed of 4 global layers.

Input res   Corr. size   Corr. feat.   Mono. feat.
Low-res     (7, 7)       49            424
Mid-res     (11, 11)     121           352
Full-res    (21, 21)     441           32

Table 4.2: Size of every correlation and output feature details w.r.t. the input resolution.

4.1.2 Coarse-to-Fine Inference

The basic idea that led the design of this network is multi-scale inference.

As explained above, the successively down-sampled inputs and their correlations at different levels of encoding realize the multi-scale encoder. On the other hand, the refinement module, i.e. the decoding part of our network, also integrates a multi-scale strategy with a coarse-to-fine prediction, inspired by Dosovitskiy et al. [1].

After a first encoding-decoding step, the network outputs a coarse depth map at low resolution, which is then refined with the help of the next correlation result, and so on. In the end, four different depth maps are predicted.

The results of the next two correlations are successively injected into the refinement, concatenated with the prior depth map and the reference input image down-sampled to the right resolution. As all this information comes from different sources, a short encoder network has been added to our refinement model after each incorporation of new information. This module aims to homogenize the information, in order to increase the quality of the prediction.

In practice, the correlated data at every resolution are further encoded, decoded and up-scaled, by alternating DED and EnF modules.

The first part of a DED, a ResNet projection shown in Figure 3.2, aims to increase the number of features and lower the resolution. Two ResNet skips (Figure 3.1) are then added to increase the depth. The DED decoding follows the original up-projection scheme proposed by Laina et al. [12] (Figure 3.5).

While the DED does not change the overall resolution of the feature maps, it adds a dedicated output branch that transforms features into a depth map.

The Expand and Fuse (EnF) module initializes every higher-resolution prediction. To do so, it combines three functional parts. First, an up-scaling layer re-sizes the prior depth map by a factor of two. Another up-projection, from Laina et al. [12], multiplies the resolution of the input feature maps by two. Lastly, a fusion stage concatenates the up-scaled depth map and up-projected features with both the corresponding MSCF outputs and the previously down-sampled reference input image.

The overall sequence is as follows. From the lowest correlation layer, a first DED module provides a coarse depth map. Then, three additional EnF-DED couples refine this prediction up to half the input image resolution. It should be noted that the last EnF module does not aggregate any correlation data.

Figure 4.3: Detail of the Expand and Fuse (EnF) module.

Table 4.3 summarizes the different operations performed in the refinement.

Module type     Channels in   Channels out   Scale
Up-projection   1024          512            x2
Up-projection   512           256            x4
Concatenation   256           516            x4
Resnet Proj     516           1032           x2
Resnet Skip     1032          1032           x2
Resnet Skip     1032          1032           x2
Up-projection   1032          516            x4
Up-projection   516           128            x8
Concatenation   128           196            x8
Resnet Proj     196           392            x4
Resnet Skip     392           392            x4
Resnet Skip     392           392            x4
Up-projection   392           196            x8
Up-projection   196           64             x16
Concatenation   64            68             x16
Resnet Proj     68            136            x8
Resnet Skip     136           136            x8
Resnet Skip     136           136            x8
Up-projection   136           68             x16
Convolution     68            1              x16

Table 4.3: Summary of the different operations in the refinement and the evolution of the number of channels.

4.2 Multi-Scale Training Approach

For the training procedure, this thesis proposes a new approach, which consists in training the network to learn each output resolution sequentially, starting with the lowest one, instead of training all the predictions together or having only one full-resolution target. This approach is inspired by the pyramidal technique commonly used in classical image processing (contrast enhancement, for instance). It first trains the network to predict the global structure of the depth map, and then gradually refines it thanks to the images, intermediate depth map predictions and correlations injected into the refinement.

To process this multi-scale training, this thesis introduces an evolutive loss depending on the resolution of the output we want to train:

$$L_i = \sum_{k=1}^{i} \frac{1}{2^{2(i-k)}} \left\| Z_k - Z'_k \right\|_2^2 \quad (4.2)$$

where Z'_k is the ground truth depth, Z_k our prediction, and i the resolution: i = 1 represents the lowest resolution and i = 4 the highest. The different coefficients have been computed in order to put the same weight in the error on every pixel of every map. The training phase begins with L_1 for 2 epochs, then L_2 for 2 epochs, L_3 for 5 epochs, and finally L_4.
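
A sketch of this evolutive loss, following our reading of the reconstructed Equation 4.2; the lists are assumed to be ordered from the coarsest output (index 0) to the finest (index 3):

```python
import torch.nn.functional as F

def multi_scale_loss(predictions, ground_truths, i):
    """Sketch of the evolutive loss L_i of Equation 4.2: only the i
    coarsest outputs contribute, the coarser ones being down-weighted."""
    loss = 0.0
    for k in range(i):  # list index k corresponds to resolution k + 1
        weight = 1.0 / 2 ** (2 * (i - 1 - k))
        loss = loss + weight * F.mse_loss(predictions[k], ground_truths[k],
                                          reduction="sum")
    return loss
```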

Chapter 5

    Experimentation and results

The training approach and network architecture have been evaluated on two different datasets: the raw sequences of KITTI, introduced by Geiger et al. [4], and the synthetic sequences of Virtual KITTI, designed by Gaidon et al. [3]. KITTI, one of the most used datasets for depth prediction, is composed of several stereo RGB sequences from different environments such as suburbs, cities and highways. Along with them, the dataset provides 3D laser measurements from a LIDAR, which result in sparse depth maps for each image. The training, validation and test split is the one proposed by Eigen, Puhrsch, and Fergus [2].

On the other hand, Virtual KITTI provides perfect, dense depth maps for each image of its synthetic monocular sequences. One sequence has been isolated to build the test set; the four others constitute the training set.

Experimenting with the model on these two different datasets explores several settings, such as training on sparse and dense labels and inferring from stereo and temporal inputs, in order to prove the efficiency of this network in different contexts.

    5.1 Implementation Details

All the residual layers of the encoding part of the network are initialized with pre-trained ResNet-50. The decoding part is initialized according to the commonly used Xavier initialization of Glorot and Bengio [5]. The optimization was performed with the Adam optimizer, with parameters 0.9 and 0.99. The training used an Nvidia GTX 1080 with 8 GB of RAM. The learning rate was fixed to 0.0001 and exponentially decayed after 15 epochs.

For KITTI, crops of size 480×192 were used for training, and crops of size 320×256 for Virtual KITTI. For testing, the images were not re-sized for either dataset, and are thus inferred at full resolution. Every training run was realized according to the proposed multi-scale training approach.

    5.2 Evaluation Metrics

The proposed model was evaluated on each dataset according to the following criteria:

RMSE:
$$\sqrt{\frac{1}{T} \sum_{i=1}^{T} \left\|Z'(x_i) - Z(x_i)\right\|_2^2} \quad (5.1)$$

ARD:
$$\frac{1}{T} \sum_{i=1}^{T} \frac{\left\|Z'(x_i) - Z(x_i)\right\|_1}{Z(x_i)} \quad (5.2)$$

SRD:
$$\frac{1}{T} \sum_{i=1}^{T} \frac{\left\|Z'(x_i) - Z(x_i)\right\|_2^2}{Z(x_i)} \quad (5.3)$$

Accuracy:
$$\frac{\left|\left\{\, i \in \{1, \ldots, T\} \;\middle|\; \max\!\left(\frac{Z'(x_i)}{Z(x_i)}, \frac{Z(x_i)}{Z'(x_i)}\right) = \delta < thr \,\right\}\right|}{T} \quad (5.4)$$

where T represents the total number of pixels where the ground truth is available, that is, all the pixels of the image if the ground truth is dense, or only a small subset if it is sparse. Z' represents the prediction made by our model and Z the correct depth. ARD stands for absolute relative difference and SRD for squared relative difference. All depths have been capped to 80 m.
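
These four criteria can be sketched with NumPy as follows, assuming pred and gt are flattened arrays of the T valid pixels, already capped to 80 m; here pred plays the role of Z' and gt of Z:

```python
import numpy as np

def evaluate(pred, gt, thr=1.25):
    """Sketch of the metrics of Equations 5.1-5.4 over the T pixels
    where ground truth is available."""
    diff = pred - gt
    rmse = np.sqrt(np.mean(diff ** 2))          # Eq. 5.1
    ard = np.mean(np.abs(diff) / gt)            # Eq. 5.2
    srd = np.mean(diff ** 2 / gt)               # Eq. 5.3
    delta = np.maximum(pred / gt, gt / pred)
    acc = np.mean(delta < thr)                  # Eq. 5.4
    return rmse, ard, srd, acc
```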

    5.3 Virtual KITTI

To evaluate the efficiency of the model on perfect, dense depth maps, it was first trained on the Virtual KITTI dataset. As stereo images are not provided, two consecutive frames of the same sequence were used to feed the network. In order to provide a comparison, the state-of-the-art network for monocular depth prediction [12] was also trained on this specific dataset.

Figure 5.1: Illustration of the multi-scale approach on Virtual KITTI. From top to bottom: one of the input images, the ground truth, and the four predictions from the lowest resolution to the highest.

Figure 5.2: Illustration of our best model on Virtual KITTI. The top-left image is one of the RGB images of the sequential pair. The bottom-left image is the prediction performed by the state-of-the-art network. The top-right image is the dense ground truth, capped to 80 m. The bottom-right image is the MSDOS-Net prediction.

Approach            RMSE     ARD     SRD
Laina et al. [12]   14.110   0.627   8.036
Ours                8.612    0.459   4.153

Table 5.1: Quantitative results of our method and the state-of-the-art method on the test set of Virtual KITTI. The ground truth depths have been limited to 80 m to be comparable with the KITTI dataset.

Approach                                              RMSE    ARD     SRD
Eigen, Puhrsch, and Fergus [2] fine                   7.156   0.215   -
Liu et al. [14]                                       6.986   0.217   1.841
Godard, Mac Aodha, and Brostow [6]                    5.849   0.141   1.369
Kuznietsov, Stückler, and Leibe [11] supervised only  4.815   0.122   0.763
Kuznietsov, Stückler, and Leibe [11] semi-supervised  4.621   0.113   0.741
MSDOS stereo input                                    3.102   0.098   0.520
MSDOS stereo input + Virtual KITTI                    3.050   0.091   0.472
MSDOS sequential input                                4.612   0.157   1.082

Approach                                              δ<1.25  δ<1.25²  δ<1.25³
Eigen, Puhrsch, and Fergus [2] fine                   0.692   0.899    0.967
Liu et al. [14]                                       0.647   0.882    0.961
Godard, Mac Aodha, and Brostow [6]                    0.818   0.929    0.966
Kuznietsov, Stückler, and Leibe [11] supervised only  0.845   0.957    0.987
Kuznietsov, Stückler, and Leibe [11] semi-supervised  0.862   0.960    0.986
MSDOS stereo input                                    0.914   0.969    0.987
MSDOS stereo input + Virtual KITTI                    0.921   0.968    0.987
MSDOS sequential input                                0.813   0.932    0.972

Table 5.2: Quantitative results of our method and previous state-of-the-art methods on the test set of KITTI, according to the split of Eigen, Puhrsch, and Fergus [2].

Table 5.1 summarizes the scores obtained on the test set of Virtual KITTI. Since the dataset contains no stereo pairs, and for a fair comparison, the model was trained and tested only with sequential information. Predictions on the test set with both models are illustrated in Figure 5.2. The results show an improvement of almost 40% between the two networks. Visually, the depth predictions of the proposed model are much sharper than those inferred by the state-of-the-art network [12]: thin details, such as the ramifications of the trees, appear in the proposed model's predictions and cannot be seen in the others.

    5.4 KITTI

MSDOS-Net has also been evaluated on the KITTI dataset, where only sparse ground truth is available. This evaluation can be performed in two different ways, as KITTI provides both stereo pairs and sequential information.

5.4.1 Comparison with the State-of-the-art

Most state-of-the-art deep neural networks for depth prediction use a single image for inference and use its stereo counterpart in a semi-supervised loss to achieve more consistent predictions. In order to compare our performance with these networks, MSDOS-Net was first trained with stereo pairs. To do so, two different initialization strategies were tried: the same strategy as described for training on Virtual KITTI, and another one using the coefficients of the model pre-trained on Virtual KITTI.

Table 5.2 summarizes the performance obtained by the MSDOS-Net model and state-of-the-art methods for depth prediction. The thesis model pre-trained on Virtual KITTI achieves a new state-of-the-art score on every evaluation metric for depth prediction.

Figure 5.3 illustrates the predictions of the MSDOS model on the test set. To visualize the depth ground truth, the sparse ground truth has been densified by triangulation and linear interpolation; a bilateral filter is then applied to increase the sharpness. One interesting thing to notice in the results is that the predictions from MSDOS-Net seem to fit the images better than the reconstruction from the ground truth, although only the sparse points were used for training.

5.4.2 Generalization to temporal sequences

In order to evaluate the performance of the MSDOS model in the case where the input images come from the same camera, the model was also trained on KITTI with sequential inputs. The scores obtained in this situation can be found in Table 5.2.

Once again, the MSDOS architecture achieves good results, even if they are weaker than those obtained in the stereoscopic case. This can be explained by the fact that a stereo pair contains more 3D information than a sequential one.

Figure 5.3: Illustration of our best model on KITTI, that is to say pre-trained on Virtual KITTI with stereo inputs. The top-left image is one of the RGB images of a stereo pair. The bottom-left image is the sparse depth ground truth. The top-right image is the dense ground truth, obtained by a Delaunay triangulation of the sparse distribution and linear interpolation, followed by a bilateral filter to increase the sharpness. The bottom-right image is the MSDOS-Net prediction.

Chapter 6

    Conclusion

The huge amount of data available nowadays and the development of new depth sensors make it possible to design algorithms that predict depth with great accuracy. However, most of them only use images from a single camera, which makes the problem ill-posed and consequently hard to solve.

Using several images from the same camera at different time frames, or a pair of images from stereo cameras (most embedded vision systems provide stereo cameras), should improve the accuracy of the prediction. This is why the core of this thesis is to propose a new method for depth prediction from several inputs and to compare it with single-image algorithms.

To do so, this thesis introduces a novel architecture, MSDOS-Net, which achieves new state-of-the-art results for multi-ocular depth prediction on two famous datasets for self-driving cars: KITTI and Virtual KITTI. The results show an improvement in both accuracy and visual quality.

To achieve such results, several modules of its architecture were inspired by state-of-the-art networks (the correlation module, the coarse-to-fine scheme, the multi-resolution prediction), and several new elements, such as the pyramidal architecture, the decoding module and the training procedure, were introduced.

However, since hardly any other multi-input depth prediction systems exist yet, it is hard to estimate the performance gap between such systems and single-input ones.

These contributions can easily be adapted to other tasks, such as semantic segmentation or optical flow estimation. Hence, we think that one of the main extensions of this thesis would be to adapt a similar architecture to other deep learning challenges.

Another possible extension could be to explore the potential of the MSDOS model in unsupervised settings, as is the case with several recent deep neural networks for depth prediction.

Bibliography

[1] Alexey Dosovitskiy et al. "FlowNet: Learning Optical Flow with Convolutional Networks". In: 2015 IEEE International Conference on Computer Vision, ICCV 2015, Santiago, Chile, December 7-13, 2015. 2015, pp. 2758–2766. DOI: 10.1109/ICCV.2015.316.

[2] David Eigen, Christian Puhrsch, and Rob Fergus. "Depth Map Prediction from a Single Image using a Multi-Scale Deep Network". In: Advances in Neural Information Processing Systems 27: Annual Conference on Neural Information Processing Systems 2014, December 8-13 2014, Montreal, Quebec, Canada. 2014, pp. 2366–2374.

[3] A. Gaidon et al. "Virtual Worlds as Proxy for Multi-Object Tracking Analysis". In: CVPR. 2016.

[4] Andreas Geiger et al. "Vision meets Robotics: The KITTI Dataset". In: International Journal of Robotics Research (IJRR) (2013).

[5] Xavier Glorot and Yoshua Bengio. "Understanding the difficulty of training deep feedforward neural networks". In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. Ed. by Yee Whye Teh and Mike Titterington. Vol. 9. Proceedings of Machine Learning Research. Chia Laguna Resort, Sardinia, Italy: PMLR, 13–15 May 2010, pp. 249–256.

[6] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. "Unsupervised Monocular Depth Estimation with Left-Right Consistency". In: CVPR. 2017.

[7] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2003.

[8] Kaiming He et al. "Deep Residual Learning for Image Recognition". In: arXiv preprint arXiv:1512.03385 (2015).

[9] Sergey Ioffe and Christian Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: International Conference on Machine Learning. 2015, pp. 448–456.

[10] Diederik Kingma and Jimmy Ba. "Adam: A Method for Stochastic Optimization". In: arXiv preprint arXiv:1412.6980 (2014).

[11] Yevhen Kuznietsov, Jörg Stückler, and Bastian Leibe. "Semi-Supervised Deep Learning for Monocular Depth Map Prediction". In: CoRR abs/1702.02706 (2017).

[12] Iro Laina et al. "Deeper Depth Prediction with Fully Convolutional Residual Networks". In: 3D Vision (3DV), 2016 Fourth International Conference on. IEEE, 2016, pp. 239–248.

[13] Yann LeCun et al. "Backpropagation Applied to Handwritten Zip Code Recognition". In: Neural Computation 1.4 (1989), pp. 541–551.

[14] Fayao Liu et al. "Learning Depth from Single Monocular Images Using Deep Convolutional Neural Fields". In: IEEE Trans. Pattern Anal. Mach. Intell. 38.10 (2016), pp. 2024–2039. DOI: 10.1109/TPAMI.2015.2505283.

[15] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. "Indoor Segmentation and Support Inference from RGBD Images". In: ECCV. 2012.

[16] Adam Paszke et al. PyTorch. 2017.

[17] Daniel Scharstein and Richard Szeliski. "A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms". In: International Journal of Computer Vision 47.1-3 (2002), pp. 7–42.

[18] Sudheendra Vijayanarasimhan et al. "SfM-Net: Learning of Structure and Motion from Video". In: CoRR abs/1704.07804 (2017).

[19] Tinghui Zhou et al. "Unsupervised Learning of Depth and Ego-Motion from Video". In: arXiv preprint arXiv:1704.07813 (2017).

TRITA-EECS-EX-2018:698

    www.kth.se