
Self-Supervised Depth Learning Improves Semantic Segmentation

Huaizu Jiang, Erik Learned-Miller
Univ. of Massachusetts, Amherst

Amherst, MA 01003
[hzjiang,elm]@cs.umass.edu

Gustav Larsson, Michael Maire, Greg Shakhnarovich
Toyota Technological Institute

Chicago, IL 60637
[email protected],[mmaire,greg]@ttic.edu

1. Introduction

How does a newborn agent learn about the world? When an animal (or robot) moves, its visual system is exposed to a shower of information. Usually, the speed with which something moves in the image is inversely proportional to its depth.¹ As an agent continues to experience visual stimuli under its own motion, it is natural for it to form associations between the appearance of objects and their relative motion in the image. For example, an agent may learn that objects that look like mountains typically don’t move in the image (or change appearance much) as the agent moves. Objects like nearby cars and people, however, appear to move rapidly in the image as the agent changes position relative to them. This continuous pairing of images with motion acts as a kind of “automatic” supervision that could eventually allow an agent both to understand the depth of objects and to group pixels into objects by this predicted depth. That is, by moving through the world, an agent may gather training data sufficient for learning to understand static scenes.

In this work, we explore the relationship between the appearance of objects and their relative depth in the image. To estimate object depths without human supervision, we use traditional techniques for estimating depth from motion, applied to simple videos taken from cars and trains. Since these depth estimation techniques produce depth only up to an unknown scale factor, we output estimates of relative depth that group pixels into 10 bins of progressively greater depth. Once we have depth estimates for these video images, we train a deep network to predict the depth of each pixel from a single image, i.e., to predict the depth without the benefit of motion. One might expect such a network to learn that an image patch that looks like a house and spans 20 pixels of an image is about 100 meters away, while a pedestrian that spans 100 image pixels is perhaps 10 meters away. Figure 1 illustrates this prediction task and shows example results obtained using a standard convolutional neural network (CNN) in this setting. For example, in the leftmost image of Fig. 1, an otherwise unremarkable utility pole is clearly highlighted by its depth profile, which stands out from the background. To excel at relative-depth estimation, the CNN will benefit by learning to recognize such structures.

¹Strictly speaking, this statement is true only after one has compensated for camera rotation, individual object motion, and image position. We address these issues in the paper.

Figure 1. Sample frames from collected videos and their corresponding depth maps, where brightness encodes relative depth. From top to bottom: input image, relative depth images recovered using [2], and predicted depth bins using our trained model. There is often a black blob around the center of the image, a singularity in depth estimation caused by the focus of expansion.

The goal of our work, however, is not to show improvements to depth prediction from a single image. Rather, it is to show that pre-training a network to do depth prediction is a powerful surrogate (proxy) task for learning visual representations. In particular, we show that a network pre-trained for depth prediction is a powerful starting point from which to train a deep network for semantic segmentation. This regime allows us to obtain significant performance gains on semantic segmentation benchmarks including KITTI [9, 8], CamVid [4, 3], and CityScapes [5], compared to training a segmentation model from scratch. In fact, our performance on these benchmarks comes very close to that of equivalent architectures pre-trained with ImageNet [6], a massive labeled dataset. Also, we compare our proposed self-supervised model with previous ones, including learning to inpaint [18], doing automatic colorization [13, 21, 14, 22], predicting sound from video frames [16], and a set of motion-based methods [20, 1, 17, 15].

             #train  #val.  #test  resolution    #classes
KITTI           100     NA     45  370 × 1226          11
CamVid          367    101    233  720 × 960           11
CityScapes     2975    500   1525  1024 × 2048         19

Table 1. Summary of the datasets used in our experiments.

2. Learning from Relative Depth

Self-supervised Depth. To generate approximate relative depth images from which our depth-predicting network could be trained, we collected 413 unlabeled videos from YouTube, taken from the cabs of trains moving through European cities, transit trains around Chicago, and cars driving in US cities. The stability of the camera in these videos makes them relatively easy for our depth estimation code. Depth is estimated by using [2] to segment the image into translational motion fields, and then using perspective projection equations [10] to convert these into depth maps (up to scale). In rural scenes, common structures include mountains, trees, and lamps. In urban scenes, there are abundant man-made structures and sometimes pedestrians and cars. We analyze a frame pair only if it has moderate motion (neither too slow nor too fast). To eliminate near-duplicate frames, two consecutive depth maps must be at least 10 frames apart. We gathered 2.3M video frames with a typical resolution of 360 × 640 (Fig. 1).
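
As a loose illustration of the flow-to-depth step (not the actual method of [2, 10]), the sketch below assumes purely translational camera motion with a known focus of expansion (FOE); under that assumption the flow magnitude at a pixel is proportional to its distance from the FOE divided by its depth, so depth can be recovered up to scale. All function and variable names here are placeholders of ours.

import numpy as np

def relative_depth_from_flow(flow, foe, eps=1e-6):
    """Relative depth (up to an unknown scale) from a translational flow field.

    flow: H x W x 2 array of image motion vectors (dx, dy).
    foe:  (x, y) focus of expansion in pixel coordinates.
    """
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    dist_to_foe = np.hypot(xs - foe[0], ys - foe[1])
    flow_mag = np.hypot(flow[..., 0], flow[..., 1])
    # Estimates are unreliable near the FOE, where both the distance and the
    # flow magnitude vanish -- the singularity noted in the Fig. 1 caption.
    return dist_to_foe / (flow_mag + eps)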

Predicting depth from a single image. Given the RGB image I, our goal is to train a CNN to predict its depth map. As we need to make pixel-wise predictions, we make the following changes to the standard AlexNet architecture. First, we use the convolutional versions of the fc6 and fc7 layers. For the fc6 layer, we add a pad value of 3. Second, we replace the last classification layer with another convolution layer with a kernel size of 3 and a pad size of 1. A 2× bilinear upsampling layer is also added to produce the depth predictions. Instead of regressing the depth values, we quantize the depth values of an image into 10 bins. If a pixel has a bin index of 5, its depth value is greater than that of 50 percent of the other pixels.
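
A minimal PyTorch/NumPy sketch of the modified head and the percentile binning just described (our reconstruction, not the authors' released code): it assumes AlexNet's conv5 output has 256 channels and that fc6/fc7 become 6×6 and 1×1 convolutions with 4096 channels; dropout is omitted and all names are illustrative.

import numpy as np
import torch.nn as nn

# Fully convolutional AlexNet-style head predicting 10 depth bins per pixel
# (conv1-conv5 of AlexNet are kept unchanged and omitted here).
depth_head = nn.Sequential(
    nn.Conv2d(256, 4096, kernel_size=6, padding=3),   # fc6 as a convolution, pad 3
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1),              # fc7 as a 1x1 convolution
    nn.ReLU(inplace=True),
    nn.Conv2d(4096, 10, kernel_size=3, padding=1),     # 10-way depth-bin classifier
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
)

def quantize_depth(depth, num_bins=10):
    # Per-image percentile (decile) edges: a label of 5 means the pixel is
    # deeper than roughly 50 percent of the pixels in the same image.
    edges = np.percentile(depth, np.linspace(0, 100, num_bins + 1)[1:-1])
    return np.digitize(depth, edges)   # H x W array of labels in {0, ..., 9}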

We use 2.1M samples for training and 0.2M for validation. The input to the CNN is a random 352 × 352 crop of the full image. A random horizontal flip is also performed. We use a negative log likelihood loss for each pixel. The network is trained for 10 epochs using the SGD optimizer with momentum of 0.9 and weight decay of 0.0005. The initial learning rate is 0.001 and is decreased by a factor of 5 at the 6th epoch and by another factor of 5 at the 8th epoch.
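
These optimization settings map onto a short PyTorch sketch (illustrative only; the model and data loader are placeholders, the bin labels are assumed to come from quantize_depth above, and dividing the learning rate by 5 corresponds to gamma = 0.2):

import torch
import torch.nn as nn

def pretrain_depth(model, train_loader, num_epochs=10):
    """Sketch of the depth pre-training loop described in the text.

    train_loader is assumed to yield (images, bins) pairs, where bins are
    N x H x W integer depth-bin labels in {0, ..., 9}.
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                                momentum=0.9, weight_decay=0.0005)
    # Divide the learning rate by 5 at epochs 6 and 8.
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[6, 8], gamma=0.2)
    criterion = nn.CrossEntropyLoss()   # per-pixel negative log likelihood over 10 bins

    for epoch in range(num_epochs):
        for images, bins in train_loader:   # 352 x 352 random crops, random flips
            logits = model(images)          # N x 10 x H x W bin scores
            loss = criterion(logits, bins)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()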

3. Experiments

We consider three datasets commonly used for evaluating semantic segmentation. Their main characteristics are summarized in Table 1.

method                 supervision    CityScapes  KITTI  CamVid
supervised             ImageNet lab.        48.1   46.2    57.4
scratch                -                    40.7   39.6    44.0
tracking [20]          motion               41.9   42.1    50.5
moving [1]             ego-motion           41.3   40.9    49.7
watch-move [17]        motion seg.          41.5   40.8    51.7
frame-order [15]       motion               41.5   39.7    49.6
context [18]           appearance           39.7    3.0    37.8
object-centric [7]     appearance           39.6   39.1    48.0
colorization [13, 14]  chroma               42.9   35.8    53.2
cross-channel [22]     misc.                36.8   40.8    46.3
audio [16]             video sound          39.6   40.7    51.5
Ours                   depth                43.4   42.9    52.1

Table 2. Comparisons of self-supervised models using the FCN32s model based on the AlexNet architecture.

The first two datasets are much too small to provide sufficient data for “from scratch” training of a deep model; CityScapes is larger, but we show below that all three datasets benefit from pre-training. We use the curated annotations of the CamVid dataset released by [12]. As a classical CNN-based model for semantic segmentation, we report results of the Fully Convolutional Network (FCN) [19] with an upsampling factor of 32 (FCN32s). Following [19], we use the mean IoU (intersection over union) score over all classes as the evaluation metric.
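
For reference, mean IoU can be computed from a per-class confusion matrix as in the standard sketch below (not code from [19]; names and the void-label convention are our assumptions). In practice the confusion matrix is accumulated over all test images before the per-class IoUs are averaged.

import numpy as np

def confusion_matrix(pred, label, num_classes):
    # pred, label: flattened integer class maps; pixels labeled >= num_classes
    # (void / ignore) are excluded from the counts.
    mask = label < num_classes
    idx = num_classes * label[mask] + pred[mask]
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def mean_iou(conf):
    # conf[i, j] counts pixels of true class i predicted as class j.
    inter = np.diag(conf).astype(np.float64)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    return np.mean(inter / np.maximum(union, 1))   # guard against empty classes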

During training, the inputs for KITTI are random crops of 352 × 352, and each variant is trained for 20K iterations with a batch size of 16. For CamVid, the inputs are random crops of 704 × 704, and each variant is trained for 40K iterations with a batch size of 4. For CityScapes, to ease the computational burden, we use half-resolution input images; the inputs are random crops of 512 × 512, and each variant is trained for 80K iterations with a batch size of 6. In addition to random crops, random horizontal flips are also performed.
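
The per-dataset fine-tuning settings above can be collected into a small configuration sketch (our summary of the text; the dictionary itself is not from the paper):

# Fine-tuning settings per dataset, as described in the text.
finetune_config = {
    'KITTI':      {'crop': 352, 'iterations': 20000, 'batch_size': 16},
    'CamVid':     {'crop': 704, 'iterations': 40000, 'batch_size': 4},
    'CityScapes': {'crop': 512, 'iterations': 80000, 'batch_size': 6,
                   'half_resolution': True},
}
# Random crops and random horizontal flips are applied in all cases.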

We compare our results to those obtained with other self-supervision strategies, summarized in Table 2. All results are obtained with the AlexNet architecture.² We obtain new state-of-the-art results on CityScapes and KITTI segmentation among methods that use self-supervised pre-training. Our model outperforms all other self-supervised models that use motion cues (the first four self-supervised models in Table 2). On CamVid, our results are competitive with the best alternative, pre-training on image colorization. Moreover, our pre-trained model performs significantly better than the model learned from scratch on all three datasets, validating the effectiveness of our pre-training.

²We were unable to get meaningful results with [11].

References

[1] P. Agrawal, J. Carreira, and J. Malik. Learning to see by moving. In ICCV, pages 37–45, 2015.

[2] P. Bideau and E. Learned-Miller. It’s moving! A probabilistic model for causal motion segmentation in moving camera videos. In ECCV, 2016.

[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 2008.

[4] G. J. Brostow, J. Shotton, J. Fauqueur, and R. Cipolla. Segmentation and recognition using structure from motion point clouds. In ECCV, pages 44–57, 2008.

[5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.

[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.

[7] R. Gao, D. Jayaraman, and K. Grauman. Object-centric representation learning from unlabeled videos. In ACCV, pages 248–263, 2016.

[8] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.

[9] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In CVPR, 2012.

[10] B. K. P. Horn. Robot Vision. MIT Press, Cambridge, MA, USA, 1986.

[11] D. Jayaraman and K. Grauman. Learning image representations tied to ego-motion. In ICCV, pages 1413–1421, 2015.

[12] A. Kundu, V. Vineet, and V. Koltun. Feature space optimization for semantic video segmentation. In CVPR, pages 3168–3175, 2016.

[13] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. In ECCV, 2016.

[14] G. Larsson, M. Maire, and G. Shakhnarovich. Colorization as a proxy task for visual understanding. In CVPR, 2017.

[15] I. Misra, C. L. Zitnick, and M. Hebert. Unsupervised learning using sequential verification for action recognition. 2016.

[16] A. Owens, J. Wu, J. H. McDermott, W. T. Freeman, and A. Torralba. Ambient sound provides supervision for visual learning. In ECCV, 2016.

[17] D. Pathak, R. B. Girshick, P. Dollar, T. Darrell, and B. Hariharan. Learning features by watching objects move. In CVPR, 2017.

[18] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros. Context encoders: Feature learning by inpainting. In CVPR, 2016.

[19] E. Shelhamer, J. Long, and T. Darrell. Fully convolutional networks for semantic segmentation. TPAMI.

[20] X. Wang and A. Gupta. Unsupervised learning of visual representations using videos. In ICCV, 2015.

[21] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In ECCV, 2016.

[22] R. Zhang, P. Isola, and A. A. Efros. Split-brain autoencoders: Unsupervised learning by cross-channel prediction. In CVPR, 2017.
