
InstanceMotSeg: Real-time Instance Motion Segmentation for Autonomous Driving

Eslam Mohamed∗1, Mahmoud Ewaisha∗1, Mennatullah Siam2, Hazem Rashed1, Senthil Yogamani1 and Ahmad El-Sallab1

∗Equal contribution 1Valeo 2University of Alberta

Abstract

Moving object segmentation is a crucial task for autonomous vehicles as it can be used to segment objects in a class agnostic manner based on their motion cues. It enables the detection of objects unseen during training (e.g., a moose or a construction truck) generically based on their motion. Although pixel-wise motion segmentation has been studied in the literature, it is not dealt with at the instance level, which would help separate connected segments of moving objects, leading to better trajectory planning. In this paper, we propose a motion-based instance segmentation task and create a new annotated dataset based on KITTI, which will be released publicly. We make use of the YOLACT model to solve the instance motion segmentation task by feeding in optical flow and image as input and producing instance motion masks as output. We extend it to a multi-task model that learns semantic and motion instance segmentation in a computationally efficient manner. Our model is based on sharing a prototype generation network between the two tasks and learning separate prototype coefficients per task. To obtain real-time performance, we study different efficient encoders and obtain ∼39 fps on a Titan Xp GPU using MobileNetV2, with an improvement of 10% mAP relative to the baseline. A video demonstration of our work is available at https://youtu.be/CWGZibugD9g.

1. INTRODUCTION

Motion cues play a significant role in automated driving. Most modern automated driving systems leverage HD maps to perceive the static infrastructure [33], so moving objects are the more critical elements to detect accurately. Due to the dynamic and interactive nature of moving objects, it is necessary to understand the motion model of each surrounding object in order to predict its future state and plan the ego-vehicle trajectory. Historically, the initial prototypes of automated driving relied on depth and motion estimation without any appearance based detection [13, 24, 39].

Figure 1: Overview of the proposed network architecture

Therefore, motion segmentation is an important perception task for autonomous driving [44, 41, 36, 38]. It can also be used as an auxiliary task for end-to-end systems. Detection of moving objects can be seen as an alternate way to detect objects, thus providing the redundancy needed for a safe and robust system. In this case we call it class agnostic segmentation, as it is solely dependent on motion cues regardless of semantics, which scales better to unknown objects. Since it is practically infeasible to collect data for every possible object category, this class agnostic segmentation can detect such unseen objects (e.g., rare animals like a moose or kangaroo, or rare vehicles like construction trucks).

Semantic segmentation and object detection are mature algorithms due to the large scale datasets available for these tasks [42]. In comparison, motion segmentation has limited datasets [50, 44], and they are primarily focused on vehicles. Additionally, there are fundamental challenges in motion estimation which are not studied in detail: (1) A camera on a moving vehicle makes it challenging to decouple ego-motion from the motion of other moving objects. (2) Motion parallax and ambiguity in degenerate cases where objects are moving parallel to the ego-vehicle.


(3) Limited computational power available in embedded platforms, where an efficient CNN model for all tasks has to be designed [3]. Pure geometric approaches were not able to fully solve the motion segmentation problem, and most recent work [50, 41, 38] focuses on learning based approaches.

Semantic segmentation and motion segmentation do not separate instances of objects; a segment could comprise several objects. In order to plan an optimal trajectory, it is necessary to extract individual objects. In the case of semantics, this is handled by the instance segmentation task, whose methods can be broadly categorized as one-stage or two-stage [16, 26, 2]. One-stage methods such as YOLACT [2] are computationally more efficient than two-stage ones. YOLACT [2] formulates instance segmentation as learning prototypes which form a basis of a vector space, and then learns the linear combination weights of these prototypes per instance. In this paper, we build upon this idea of using prototypes and adapt it for class agnostic instance segmentation using motion cues. We propose to jointly learn semantic and class agnostic (motion) instance segmentation using shared backbone and protonet (prototype generation branch) weights with shallow prediction heads for each task. The prediction heads are responsible for predicting bounding boxes, classes and prototype coefficients. To the best of our knowledge, this is the first attempt to perform joint semantic and class agnostic instance segmentation, which is a step toward fail-safe systems. Figure 1 shows an overview of our method. Compared to previous motion segmentation approaches like SMSNet [50] and MODNet [44], we do not learn completely separate decoders for the two tasks and do not perform pixel-wise segmentation, which lacks the ability to separate instances. Thus we are able to design a computationally efficient solution that performs semantic and motion instance segmentation jointly. To summarize, the contributions of this work include:

• A novel instance motion segmentation network that can segment instances even when objects occlude each other.

• A real-time multi-task learning model for semantic and class agnostic instance segmentation using motion. Our method relies on learning different prototype coefficients per task based on YOLACT [2].

• A comparative study of different network architectures to combine appearance and motion, and an analysis of various backbones to find the optimal accuracy vs. speed trade-off.

• We release an improved version of the KITTIMoSeg dataset [38]. The new dataset has 12,919 images with improved annotations. Annotations include tracking information for both static and dynamic objects.

2. Related Work

2.1. Motion Segmentation

Motion analysis approaches can be roughly categorized into geometry, tracking or learning based methods. Geometry based approaches are typically pixel-based, constrained with some global regularization, and they typically do not exploit global context. They have been extensively studied in [49, 25, 32, 53, 1]. Wehrwein et al. [53] relied on a homography based method, but it is too simplistic to handle complex motion in automotive scenes. Bideau et al. [1] utilized optical flow and relied on perspective geometry and feature matching to perform motion segmentation. Tracking based methods [27, 4, 34] rely on computing object trajectories and can perform clustering to obtain the final motion segmentation. These methods are computationally expensive.

Most of the current progress is focused on learning based methods [48, 47, 50, 43]. Tokmakov et al. initially explored an optical flow based single stream fully convolutional network [48] and then extended it to use both motion and appearance by learning a bidirectional recurrent model with memory [47]. Vertens et al. [50] proposed a method based on FlowNet [11, 20] to jointly learn semantic segmentation and motion segmentation. The model was trained on their CityMotion dataset, based on Cityscapes [8]. Concurrently, Siam et al. [44] proposed a motion and appearance based multitask learning framework called MODNet to perform object detection and motion segmentation. MODNet was trained on KITTIMoSeg, which they released based on the KITTI dataset [14]. It was further extended with a ShuffleNet encoder for computational efficiency [43]. Rashed et al. released a larger scale KITTIMoSeg and experimented with LiDAR flow as another means to improve motion segmentation [38]. A similar approach to [44, 37] that can generate weakly supervised motion segmentation annotations was used on the WoodScape fisheye dataset [56].

2.2. Image Instance Segmentation

Instance segmentation methods are mainly categorized as one-stage [9, 26, 2] and two-stage methods [16, 29, 19]. Two-stage methods such as He et al. [16] rely on a separate stage for region proposal estimation, followed by extracting features for regions of interest (ROIs) to predict the final segmentation mask. Although two-stage methods are more accurate, they are less efficient than one-stage methods. One-stage methods such as YOLACT [2] provide a computationally efficient way to perform instance segmentation. The main idea of that work is to learn prototypes which are then linearly combined using learned coefficients to construct instance masks.


Figure 2: Proposed multi-task network architecture. Color and optical flow are fed as inputs to the network. Mask coefficients are learnt for the two tasks separately using two prediction heads, namely semantic instance segmentation and motion instance segmentation. The coefficients of both prediction heads are combined with the prototypes, generating semantic and motion instance masks. Our model has a real-time performance of 34 fps using a MobileNetV2 encoder.

2.3. Video Object Segmentation (VOS)

Video object segmentation has been extensively studied in the literature on the DAVIS benchmark [35]. It can be categorized into semi-supervised and unsupervised approaches. Semi-supervised methods utilize a mask initialization to track the objects of interest through the video sequence [52, 5, 22]. Unsupervised approaches do not require any initialization mask and segment the primary object in the sequence based on appearance and motion saliency [23, 47, 21, 48]. Most of the initial work in video object segmentation did not predict at the instance level. The first attempt to perform instance level video object segmentation was proposed by Hu et al. [18], relying on recurrent neural networks.

A new instance level VOS task that utilizes object categories along with predicting instance masks was proposed as video instance segmentation (VIS) by Yang et al. [55]. They mainly proposed a new dataset called YouTube-VIS, and their method relied on Mask R-CNN [16] with an additional branch for tracking based on cosine similarity and a memory module. A similar dataset for multiple object tracking and segmentation (MOTS) was released by Voigtlaender et al. [51] with a smaller number of categories, but focused on the autonomous driving setting based on the KITTI dataset [14].

However, unlike previous work in video object segmentation, our focus is on two aspects: (1) Considering motion segmentation as a class agnostic segmentation that can be learned jointly with semantic instance segmentation.

(2) Building a computationally efficient multitask model [45, 7] that performs both class agnostic and semantic instance segmentation through sharing the prototype generation network and learning separate coefficients per task. General unsupervised object level video object segmentation focuses on the most salient object through a video sequence and does not separate different instances. The VOS setting is not best suited for automated driving; for example, moving objects at a larger distance do not appear salient in the scene. Thus we mainly focus on KITTIMoSeg [44, 38] and its extension with instance masks and semantic labels throughout our experiments. Nonetheless, we do perform certain experiments on the DAVIS'17 dataset, which provides instance-level masks but does not solely focus on segmenting moving objects.

3. Proposed Method

In this section, we discuss the dataset preparation and the proposed architecture for our experiments.

3.1. KITTI Instance-MoSeg Dataset

In order to segment moving objects, our algorithm requires three parts: (1) Appearance information, which helps the algorithm understand the scene semantics. (2) Motion information, which is crucial for classifying the objects as moving or static.


(3) Motion instance mask ground truth, so that we can penalize errors to improve classification accuracy. Siam et al. [44] provided KITTIMoSeg with motion masks and bounding box annotations for only 1,300 images of the KITTI dataset sequences. Valada et al. [50] provided annotations for 255 frames on KITTI and 3,475 frames on the Cityscapes [8] dataset. Rashed et al. [38] extended KITTIMoSeg with weak annotations for 12,919 images, which we utilize in our method as it is significantly larger than other public datasets available for motion segmentation. Although this dataset is currently the largest motion segmentation dataset in terms of size, it is weakly annotated and has some shortcomings. One is that annotations are provided only for the vehicle class. We extend the dataset with semantic instance segmentation masks for five classes: car, pedestrian, bicycle, truck and bus. Moreover, we improve the motion segmentation masks by estimating motion in the 3D world coordinate system instead of the Velodyne coordinate system, which significantly improves annotation accuracy over the method in [38], whose annotations were erroneous due to the motion of the Velodyne sensor itself. Unlike [38], objects are processed at the object level rather than the pixel level, which provides an easier interface during data loading. Additionally, each object mask is associated with an instance ID that is consistent for each object across the whole sequence. Tracking annotations are provided for both static and dynamic objects. Our contribution significantly improves annotation accuracy and adds more information to the dataset, making it suitable for tracking applications as well. We feed the network with motion information using two approaches and provide a comparative study of their impact on motion instance segmentation: (1) An implicit method in which the network accepts a stack of images (t, t+1) and learns to model the motion implicitly, as referred to in [36]. (2) An explicit method where we use an optical flow map highlighting pixel motion. In this approach, we make use of the FlowNet 2.0 [20] model to compute optical flow.
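To make the two input variants concrete, the following is a minimal sketch (function and variable names are illustrative, not from the released code) of how an implicit (stacked RGB) or explicit (optical flow) motion input could be assembled; `flow_fn` stands in for whatever flow estimator is available, e.g. FlowNet 2.0 or a hardware flow accelerator, and the flow is colour-coded into a 3-channel image, one common way to feed flow to an RGB backbone.

```python
import numpy as np
import cv2


def build_motion_input(frame_t, frame_t1, mode="explicit", flow_fn=None):
    """Prepare the motion cue fed to the network alongside the RGB image.

    mode="implicit": stack the two consecutive RGB frames (t, t+1).
    mode="explicit": colour-code a dense optical flow field computed by
    `flow_fn` (a placeholder for an estimator such as FlowNet 2.0).
    """
    if mode == "implicit":
        # Channel-wise stack of frames t and t+1 -> (H, W, 6) input.
        return np.concatenate([frame_t, frame_t1], axis=-1)

    flow = flow_fn(frame_t, frame_t1)                 # (H, W, 2) flow vectors
    mag, ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    hsv = np.zeros_like(frame_t)                      # (H, W, 3) uint8
    hsv[..., 0] = ang * 180 / np.pi / 2               # hue encodes direction
    hsv[..., 1] = 255                                 # full saturation
    hsv[..., 2] = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)       # 3-channel flow image
```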

3.2. Instance-Level Class Agnostic Segmentation

Our baseline model from YOLACT [2] has a feature extractor network, a feature pyramid network [28] and a protonet module that learns to generate prototypes. We experiment with different feature extractor networks as shown in Section 4. The feature pyramid network improves accuracy by combining low resolution feature maps carrying rich semantic information with high resolution feature maps through a top-down path with lateral connections. The highest resolution output is used as input to protonet, which predicts prototypes at one fourth of the original size. In parallel to protonet, a prediction head is learned that outputs bounding boxes, classes and coefficients that are linearly combined with the prototypes to predict the instance masks.
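As a minimal sketch of this prototype/coefficient combination (tensor names and shapes are illustrative and not the exact YOLACT implementation), the instance masks can be assembled as follows:

```python
import torch


def assemble_instance_masks(prototypes, coefficients):
    """Combine protonet prototypes with per-instance coefficients.

    prototypes:   (H/4, W/4, k) tensor produced by the protonet.
    coefficients: (n, k) tensor, one k-vector per detected instance,
                  predicted by the head alongside boxes and classes.
    Returns (n, H/4, W/4) soft instance masks in [0, 1].
    """
    # Linear combination of the k basis prototypes, then a sigmoid,
    # following the YOLACT formulation.
    masks = torch.einsum("hwk,nk->nhw", prototypes, coefficients)
    return torch.sigmoid(masks)
```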

We train this model for the class agnostic (motion) instance segmentation task to serve as a baseline for comparative purposes. The model is originally trained for semantic instance segmentation, discarding images with no mask annotations. In the class agnostic (motion) instance segmentation dataset we use, on the other hand, approximately 7k out of the 12k frames contain no moving objects. Utilizing such frames provides many negative samples of static cars, which helps the model understand the appearance of static vehicles as well and thus reduces overfitting. To exploit them, the loss function has been updated to reduce the confidence score of the predicted object in case of a false positive. The mask segmentation loss is formulated as binary cross entropy, while the bounding box regression uses a smooth L1 loss similar to SSD [30]. We refer to this baseline model as RGB-Only, since the input modality is only appearance information.
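A simplified sketch of this loss behaviour is shown below; the matching of predictions to ground truth is omitted, the tensor names are illustrative, and confidences are assumed to be post-sigmoid values.

```python
import torch
import torch.nn.functional as F


def motion_instance_loss(pred_masks, pred_boxes, pred_conf, gt_masks, gt_boxes):
    """Illustrative loss for the class agnostic (motion) baseline.

    - Binary cross entropy on the assembled instance masks.
    - Smooth L1 on box regression, as in SSD.
    - On frames with no moving object, the confidence of any predicted
      instance is pushed towards zero, penalising false positives.
    """
    if gt_masks.numel() == 0:
        # Negative frame: only suppress confidences of false positives.
        return F.binary_cross_entropy(pred_conf, torch.zeros_like(pred_conf))

    mask_loss = F.binary_cross_entropy(pred_masks, gt_masks)
    box_loss = F.smooth_l1_loss(pred_boxes, gt_boxes)
    conf_loss = F.binary_cross_entropy(pred_conf, torch.ones_like(pred_conf))
    return mask_loss + box_loss + conf_loss
```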

Our main objective is to predict instance masks for moving objects. Hence, a motion representation, in our case optical flow, should be fed into the network in addition to appearance. As motion cues are important in automated driving, the standard automotive embedded platforms including Nvidia Xavier, TI TDA4x and Renesas V3H have a hardware accelerator for computing dense optical flow, which can be leveraged without requiring additional processing. A feature fusion method is adopted in our experiments, in which we construct two separate feature extractors to process appearance and motion separately. The fusion is performed at the feature level, which has provided significant improvement in [37, 12, 38, 15]. We refer to this feature-fusion model as RGB+OF. The same architecture is used to evaluate the impact of implicit motion modelling, in which two sequential RGB images (t, t+1) are input to the two-stream feature extractor. This model is referred to as RGB+RGB.
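A sketch of such a two-stream feature-fusion encoder is given below, assuming torchvision-style backbones whose `.features` module outputs 1280-channel maps (as MobileNetV2 does); the 1x1 fusion layer and channel counts are illustrative rather than the exact architecture used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2


class TwoStreamFusionEncoder(nn.Module):
    """Two separate backbones for appearance (RGB) and motion (optical flow
    image or second RGB frame); their feature maps are concatenated and
    reduced with a 1x1 convolution before being passed on to the FPN."""

    def __init__(self, feat_channels=1280, out_channels=256):
        super().__init__()
        self.rgb_stream = mobilenet_v2(weights="IMAGENET1K_V1").features
        self.motion_stream = mobilenet_v2(weights="IMAGENET1K_V1").features
        self.fuse = nn.Conv2d(2 * feat_channels, out_channels, kernel_size=1)

    def forward(self, rgb, motion):
        f_rgb = self.rgb_stream(rgb)            # (B, 1280, H/32, W/32)
        f_motion = self.motion_stream(motion)   # (B, 1280, H/32, W/32)
        return self.fuse(torch.cat([f_rgb, f_motion], dim=1))
```

In the actual model, multiple backbone stages feed the feature pyramid network; this sketch fuses only the final stage for brevity.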

3.3. Multi-task Learning of Semantics and Class Agnostic Instances

We further propose a joint model to learn semantic and class agnostic instance segmentation. The motivation is to provide a more generic representation of the surrounding environment, including both static and moving instances, while both tasks are learnt jointly. Separate prediction heads Ds and Dm are used for the semantic and motion instance segmentation tasks respectively, as shown in Figure 2. During training, we alternate between k steps of training Ds using KITTIMoSeg Extended with semantic labels and k steps of training Dm using KITTIMoSeg motion labels or the generic DAVIS dataset. Another approach to learning both semantic and motion instance segmentation would be to pose it as a multi-label classification problem. However, to allow for the flexibility of


training the model on different datasets that do not necessarily have joint annotations, and to allow training motion instance segmentation as a class agnostic task, we follow our proposed approach: the separate motion instance segmentation head is responsible for segmenting generally moving objects even if they are unknown to the semantic head. The feature extraction network, feature pyramid network and protonet are shared between the two tasks to ensure the computational efficiency of our proposed multi-task learning system. Thus the learned prototypes are the same for both tasks, but the learned coefficients to construct masks and the learned box regression are different.
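The alternating schedule can be summarised by the following training-loop sketch; the loader, model and loss interfaces are illustrative, and `k` is the number of consecutive steps per head.

```python
from itertools import cycle


def train_multitask(model, optimizer, semantic_loader, motion_loader,
                    k=100, total_iters=10000):
    """Alternate between k steps on the semantic head Ds and k steps on the
    motion head Dm. The backbone, FPN and protonet are shared; only the
    selected head's coefficients/boxes/classes are supervised each step."""
    sem_batches, mot_batches = cycle(semantic_loader), cycle(motion_loader)
    for it in range(total_iters):
        use_semantic = (it // k) % 2 == 0
        batch = next(sem_batches) if use_semantic else next(mot_batches)
        head = "semantic" if use_semantic else "motion"

        loss = model(batch["inputs"], targets=batch["targets"], head=head)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```

With this scheme, each gradient step updates the shared backbone, FPN and protonet together with only the head selected for that step.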

4. Experimental Results

In this section, we provide the details of our experimental setup and results on KITTIMoSeg [38] and KITTI semantics. Our model is able to provide semantic and motion instance segmentation in a computationally efficient manner. It provides a 9 fps speedup over state-of-the-art motion segmentation methods while additionally providing instance level details.

4.1. Experimental Setup

We train our model using SGD with momentum, with an initial learning rate of 10−4, momentum of 0.9 and a weight decay of 5 × 10−4. A learning rate schedule is used in which the rate is divided by 10 at iterations 280k and 600k. We train with batch size 4 for 150 epochs using ImageNet [10] pretrained weights. The rest of the hyper-parameters and the data augmentation technique are similar to YOLACT [2].
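A sketch of this optimiser and schedule in PyTorch is shown below; the scheduler is assumed to be stepped once per training iteration so that the milestones correspond to iteration counts.

```python
import torch


def build_optimizer(model):
    """SGD with momentum 0.9, initial LR 1e-4, weight decay 5e-4,
    and the LR divided by 10 at iterations 280k and 600k."""
    optimizer = torch.optim.SGD(
        model.parameters(), lr=1e-4, momentum=0.9, weight_decay=5e-4)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[280_000, 600_000], gamma=0.1)
    return optimizer, scheduler
```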

For the datasets, we use both KITTIMoSeg [38] and DAVIS'17 [6]. KITTIMoSeg is further extended with instance motion masks and semantic labels as explained in Section 3.1. We plan to release the augmented labels for KITTIMoSeg to benefit researchers working on motion instance segmentation for autonomous driving. Finally, we report frame rate and time in milliseconds for all models, including the models proposed by [41, 44], running on a Titan Xp GPU at an image resolution of 550 × 550.

4.2. Experimental Results

In this work, we provide two ablation studies in Table 1. (A) A study on the impact of different input modalities to the model, in which we evaluate the importance of using optical flow against a stack of images where the network learns to model motion implicitly. (B) A study on the impact of various backbones to estimate the optimal performance in terms of the accuracy vs. speed trade-off. For this purpose, we explore ResNet101 [17], ResNet50 [17], EfficientNet [46], ShuffleNetV2 [31] and MobileNetV2 [40] as backbone architectures.

Table 1 shows quantitative results evaluated on the KITTIMoSeg dataset using various network architectures. A significant improvement of 11% in mean average precision is observed with feature fusion, which confirms the conclusions of [44, 54, 37]. Qualitative evaluation is illustrated in Figure 3, showing a visual comparison between various network architectures, where the RGB-only experiment is used as a baseline. Due to the limitations of embedded devices used in autonomous driving, we focus our experiments on achieving the best balance between accuracy and speed across network architectures. MobileNetV2 outperforms the other backbones in speed while maintaining the second best mean average precision. The best performing backbone is ShuffleNetV2, with an insignificant improvement of 0.3% over MobileNetV2 but fewer frames per second. Thus, MobileNetV2 provides the best speed-accuracy trade-off.

Since this is the first attempt at semantic and motion instance segmentation in the motion segmentation literature for autonomous driving, we are not able to directly compare in terms of accuracy (mean average precision). However, the closest works in the motion segmentation literature are RTMotSeg [43] and MODNet [44], which we compare against in terms of frames per second. Our MobileNetV2 model delivers a more efficient solution at ∼39 fps, while RTMotSeg [43] runs at ∼30 fps and MODNet [44] at ∼7 fps. Additionally, our model provides an improved output with instance segmentation instead of pixel-wise segmentation. The final multitask semantic and motion instance segmentation model runs at ∼34 fps and 18 fps using MobileNetV2 and ResNet101 backbones respectively. We also compare the motion head of our multitask model against RTMotSeg [43] in terms of mIoU by compositing the produced motion instance masks into a single map with pixel-wise motion labels. Our model achieves an mIoU of 78.9%, outperforming RTMotSeg which achieves 74.5%.
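The comparison against pixel-wise methods is made by collapsing the predicted instance masks into a single motion map; below is a sketch of that compositing together with the per-frame IoU of the moving class (array names and the mask threshold are illustrative).

```python
import numpy as np


def composited_motion_iou(instance_masks, gt_motion_mask, threshold=0.5):
    """Collapse predicted motion instance masks into one binary motion map
    and compute its IoU against a pixel-wise ground-truth motion mask.

    instance_masks: (N, H, W) soft masks; gt_motion_mask: (H, W) in {0, 1}.
    """
    if len(instance_masks) == 0:
        pred = np.zeros(gt_motion_mask.shape, dtype=bool)
    else:
        pred = np.max(instance_masks, axis=0) > threshold
    gt = gt_motion_mask.astype(bool)

    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # neither prediction nor ground truth has moving pixels
    return np.logical_and(pred, gt).sum() / union
```

The reported mIoU would additionally average over classes and frames; this sketch covers only the moving-class IoU of a single frame.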

Figure 3 shows our qualitative results on the KITTI dataset. (a) shows the ground truth of the moving instances using the weak annotations in [38]. (b) demonstrates the inability of the baseline network to distinguish between moving and static objects using only color images, without feeding the network any motion signal. (c) and (d) demonstrate our results using different backbones, where MobileNetV2 shows the best accuracy vs. speed trade-off, as shown in Table 1.

Finally, Table 2 illustrates our results for the multi-task semantic and class agnostic instance segmentation. We compare against the baseline models trained only for one task, either semantic or class agnostic instance segmentation, on KITTIMoSeg Extended. Generally the multitask model performs on par with the class agnostic single task baseline, but with the added benefit of jointly predicting both semantic and class agnostic masks.


Table 1: Quantitative comparison between different network architectures for class agnostic (motion) instance segmentation.

Model | Backbone | FPS | Params (M) | Time (ms) | Mask AP / AP50 / AP75 | Box AP / AP50 / AP75

[A] Comparison of Different Models
RGB-Only | ResNet101 | 34.71 | 49.6 | 47.9 | 28.38 / 40.88 / 33.93 | 30.12 / 41.67 / 36.65
RGB+RGB | ResNet101 | 20.3 | 95.9 | 32.05 | 31.2 / 32.05 / 35.55 | 31.2 / 48.94 / 36
RGB+OF | ResNet101 | 20.3 | 95.9 | 49.2 | 39.26 / 60.02 / 45.76 | 41.24 / 60.59 / 49.28

[B] Comparison of Different Backbones
RGB+OF | WS-ResNet101 | 20.54 | 49.6 | 48.6 | 43.1 / 62.3 / 48.5 | 43.6 / 61.86 / 53.26
RGB+OF | ResNet50 | 29.74 | 57.9 | 33.6 | 39.74 / 58.26 / 46.36 | 40.64 / 59.03 / 49.07
RGB+OF | EfficientNet | 35.06 | 21.2 | 28.5 | 40.41 / 59.56 / 43.05 | 38.57 / 59.38 / 44.43
RGB+OF | MobileNetV2 | 39.18 | 12.9 | 25.5 | 43.55 / 68.34 / 48.59 | 43.93 / 67.73 / 51.56
RGB+OF | ShuffleNetV2 | 38 | 11.1 | 26.3 | 43.9 / 68.43 / 48.49 | 44.9 / 68.26 / 51.77

Table 2: Quantitative comparison between the multi-task joint semantic and class agnostic instance segmentation model and the baseline single-task models Baseline-S (semantic instance segmentation only) and Baseline-M (class agnostic (motion) instance segmentation only).

Model | Backbone | Motion Dataset | Semantic Mask AP / AP50 / AP75 | Semantic Box AP / AP50 / AP75 | Class Agnostic Mask AP / AP50 / AP75 | Class Agnostic Box AP / AP50 / AP75

Baseline-M | ResNet101 | KITTIMoSeg | - / - / - | - / - / - | 39.26 / 60.02 / 45.76 | 41.24 / 60.59 / 49.28
Baseline-S | ResNet101 | KITTIMoSeg | 28.7 / 50.3 / 27.4 | 32.5 / 61.6 / 30.4 | - / - / - | - / - / -
Multi-Task | ResNet101 | KITTIMoSeg | 22.5 / 39.5 / 22.2 | 27.7 / 54.5 / 25.7 | 44.3 / 66.7 / 50.4 | 44.8 / 66.8 / 55.1
Baseline-M | MobileNetV2 | KITTIMoSeg | - / - / - | - / - / - | 43.55 / 68.34 / 48.59 | 43.93 / 67.73 / 51.56
Baseline-S | MobileNetV2 | KITTIMoSeg | 25.9 / 47.0 / 25.0 | 30.0 / 58.3 / 53.0 | - / - / - | - / - / -
Multi-Task | MobileNetV2 | KITTIMoSeg | 21.9 / 36.5 / 22.6 | 26.5 / 53.6 / 24.2 | 45.3 / 69.0 / 57.4 | 44.3 / 69.4 / 52.4

Table 3: Quantitative evaluation of class agnostic and semantic instance segmentation on KITTIMoSeg Extended and DAVIS'17-val, reported as AP50.

Backbone | Motion Dataset | Class Agnostic Mask | Class Agnostic Box | Semantic Mask | Semantic Box

ResNet101 | DAVIS | 23.5 | 45.4 | 42.9 | 56.2
MobileNetV2 | DAVIS | 28.2 | 45.3 | 42.9 | 55.6

Although we notice a degradation in the semantic instance segmentation performance from the baseline model to the multitask model, we attribute this to one main reason: the KITTIMoSeg motion masks cover only vehicle classes and not pedestrians or cyclists, while the extended dataset with semantic labels includes these extra classes. Thus the joint training increases class imbalance, which can be remedied by extending the KITTIMoSeg Extended dataset to include other classes as moving objects.

Nonetheless, the scope of the current work is to show the power of jointly learning class agnostic and semantic instance segmentation as an initial step toward handling unknown objects, which was demonstrated in the previous experiments. Figure 4 further shows the multi-task model inferring semantic labels on KITTI while still being able to segment unknown moving objects that exist in the DAVIS dataset. The multi-task model has the advantage of sharing the encoder between the tasks, which reduces the computational cost

drastically, as the encoder is the most computationally intensive part. Our multi-task model runs at 34 fps, outperforming MODNet which runs at 7 fps. In order to assess the power of class agnostic segmentation on more general moving objects outside the labels within KITTIMoSeg, we report results from alternately training the model between DAVIS'17 and KITTIMoSeg in Table 3.

4.3. Prototype and Coefficients Analysis

In order to better understand the model output, we perform an analysis on the common prototypes and coefficients learned for both motion and semantic instance segmentation. Figure 5 shows the learned basis with a total of 32 prototypes in (c) for two different frames. The output basis is organized in a 6 × 5 grid which corresponds to the 6 × 5 grid of coefficients. We denote the different grid cells by c^s_{i,j}, c^m_{i,j} and p_{i,j} for semantic coefficients, motion coefficients and prototypes respectively, where i indicates the row index from top to bottom and j the column index from left to right. Both semantic and motion coefficients are shown in (a, b); they can be negative or positive, which helps mask out or add certain objects to the final mask. The predicted final semantic and motion instance masks for the corresponding two frames, shown in the last row in (d, e), are constructed as a linear combination of these basis prototypes weighted by the coefficients. The output shows meaningful learned basis prototypes that help in constructing the final masks.



Figure 3: Qualitative results using our model on the KITTI dataset. (a) Ground truth, (b) RGB-only, (c) Feature-fusion EfficientNet, (d) Feature-fusion MobileNet.

Figure 4: Left: Multitask Model Semantic Instance Masks. Right: Multitask Model Class Agnostic Instance Masks.



Figure 5: Prototypes and coefficients analysis. The first two rows show prototypes and coefficients for two frames; the last row shows the corresponding motion and semantic instance mask outputs. (a, b) 6 × 5 grids of motion and semantic coefficients; notice that coefficients can be positive or negative. (c) 6 × 5 prototype grid corresponding to the coefficients of the 32 basis prototypes. (d, e) Motion and semantic instance segmentation masks.

Interestingly, the comparison between motion and semantic coefficients can explain what the model is learning.

The first row in Figure 5 shows that p_{5,4} highlights the static car in the front and p_{4,6} highlights all the cars in the scene. Both c^m_{4,6} and c^s_{4,6} show large positive values, as this prototype captures all regions of interest in the scene. However, c^m_{5,4} shows a lower, negative value than c^s_{5,4}, as it tries to mask out the static car for the motion mask while the prototype is still used for the semantic one. The second row shows a similar behaviour when we look at prototype p_{4,6}: it highlights all regions of interest, while prototype p_{4,3} includes the static cars in the back and to the left of the scene. c^s_{4,3} has a positive value, unlike c^m_{4,3}, which should mask out the static objects. However, our model fails to capture the bicycle instance in this frame. Another interesting observation is that some of the prototypes are not significantly useful in highlighting any object. This indicates that we could use a smaller number of prototypes to improve efficiency further. Our problem setting has a maximum of 8 classes in the semantic instance segmentation, which is much simpler compared to the 80-class setting in MS-COCO, where YOLACT [2] proposed using 32 prototypes.


5. Conclusions

In this paper, we developed the first instance motion segmentation model on the KITTIMoSeg dataset. By using an efficient MobileNetV2 encoder, we were able to obtain 39 fps on a Titan Xp GPU. We also extended it to a multi-task learning model which jointly performs instance semantic segmentation and instance motion segmentation and achieves 34 fps. A novel design for multi-task learning, based on sharing the encoder and the prototype generation network while learning separate coefficients for constructing the final masks, is proposed. We obtain promising results, but the research area of instance segmentation in general, and instance motion segmentation in particular, is relatively less mature. KITTIMoSeg contains only vehicles, and a dataset with more diverse moving objects such as pedestrians, animals, etc. is necessary for the development of a mature solution.

ACKNOWLEDGEMENTS

We would like to thank our colleagues Letizia Mariotti and Lucie Yahiaoui for reviewing the paper and providing feedback.

References

[1] Pia Bideau and Erik Learned-Miller. It's moving! A probabilistic model for causal motion segmentation in moving camera videos. In European Conference on Computer Vision, pages 433–449. Springer, 2016.
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 9157–9166, 2019.
[3] Alexandre Briot, Prashanth Viswanath, and Senthil Yogamani. Analysis of efficient CNN design techniques for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 663–672, 2018.
[4] Thomas Brox and Jitendra Malik. Object segmentation by long term analysis of point trajectories. In European Conference on Computer Vision, pages 282–295. Springer, 2010.
[5] Sergi Caelles, Kevis-Kokitsi Maninis, Jordi Pont-Tuset, Laura Leal-Taixe, Daniel Cremers, and Luc Van Gool. One-shot video object segmentation. arXiv preprint arXiv:1611.05198, 2016.
[6] Sergi Caelles, Jordi Pont-Tuset, Federico Perazzi, Alberto Montes, Kevis-Kokitsi Maninis, and Luc Van Gool. The 2019 DAVIS challenge on VOS: Unsupervised multi-object segmentation. arXiv:1905.00737, 2019.
[7] Sumanth Chennupati, Ganesh Sistu, Senthil Yogamani, and Samir A. Rawashdeh. MultiNet++: Multi-stream feature aggregation and geometric loss strategy for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3213–3223, 2016.
[9] Jifeng Dai, Kaiming He, Yi Li, Shaoqing Ren, and Jian Sun. Instance-sensitive fully convolutional networks. In European Conference on Computer Vision, pages 534–549. Springer, 2016.
[10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
[11] Alexey Dosovitskiy, Philipp Fischer, Eddy Ilg, Philip Hausser, Caner Hazirbas, Vladimir Golkov, Patrick van der Smagt, Daniel Cremers, and Thomas Brox. FlowNet: Learning optical flow with convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pages 2758–2766, 2015.
[12] K. El Madawi, H. Rashed, A. El Sallab, O. Nasr, H. Kamel, and S. Yogamani. RGB and LiDAR fusion based 3D semantic segmentation for autonomous driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 7–12, Oct 2019.
[13] Uwe Franke, Dariu Gavrila, Steffen Gorzig, Frank Lindner, F. Puetzold, and Christian Wohler. Autonomous driving goes downtown. IEEE Intelligent Systems and Their Applications, 13(6):40–48, 1998.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), 2013.
[15] Caner Hazirbas, Lingni Ma, Csaba Domokos, and Daniel Cremers. FuseNet: Incorporating depth into semantic segmentation via fusion-based CNN architecture. In Asian Conference on Computer Vision, pages 213–228. Springer, 2016.
[16] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969, 2017.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[18] Yuan-Ting Hu, Jia-Bin Huang, and Alexander Schwing. MaskRNN: Instance level video object segmentation. In Advances in Neural Information Processing Systems, pages 325–334, 2017.
[19] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6409–6418, 2019.
[20] Eddy Ilg, Nikolaus Mayer, Tonmoy Saikia, Margret Keuper, Alexey Dosovitskiy, and Thomas Brox. FlowNet 2.0: Evolution of optical flow estimation with deep networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1647–1655, 2017.
[21] Suyog Dutt Jain, Bo Xiong, and Kristen Grauman. FusionSeg: Learning to combine motion and appearance for fully automatic segmentation of generic objects in videos. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2117–2126. IEEE, 2017.

[22] Anna Khoreva, Federico Perazzi, Rodrigo Benenson, Bernt Schiele, and Alexander Sorkine-Hornung. Learning video object segmentation from static images. arXiv preprint arXiv:1612.02646, 2016.
[23] Yeong Jun Koh and Chang-Su Kim. Primary object segmentation in videos based on region augmentation and reduction.
[24] Varun Ravi Kumar, Stefan Milz, Christian Witt, Martin Simon, Karl Amende, Johannes Petzold, Senthil Yogamani, and Timo Pech. Monocular fisheye camera depth estimation using sparse LiDAR supervision. In 2018 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2853–2858. IEEE, 2018.
[25] Abhijit Kundu, K. Madhava Krishna, and Jayanthi Sivaswamy. Moving object detection by multi-view geometric techniques from a single camera mounted robot. In 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems, pages 4306–4312. IEEE, 2009.
[26] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2359–2367, 2017.
[27] Tsung-Han Lin and Chieh-Chih Wang. Deep learning of spatio-temporal features with geometric-based moving point detection for motion segmentation. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 3058–3065. IEEE, 2014.
[28] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125, 2017.
[29] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8759–8768, 2018.
[30] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In European Conference on Computer Vision, pages 21–37. Springer, 2016.
[31] Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), pages 116–131, 2018.
[32] Moritz Menze and Andreas Geiger. Object scene flow for autonomous vehicles. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3061–3070, 2015.
[33] Stefan Milz, Georg Arbeiter, Christian Witt, Bassam Abdallah, and Senthil Yogamani. Visual SLAM for automated driving: Exploring the applications of deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 247–257, 2018.
[34] Peter Ochs and Thomas Brox. Object segmentation in video: A hierarchical variational approach for turning point trajectories into dense regions. In 2011 International Conference on Computer Vision, pages 1583–1590. IEEE, 2011.
[35] F. Perazzi, J. Pont-Tuset, B. McWilliams, L. Van Gool, M. Gross, and A. Sorkine-Hornung. A benchmark dataset and evaluation methodology for video object segmentation. In Computer Vision and Pattern Recognition, 2016.
[36] Mohamed Ramzy, Hazem Rashed, Ahmad El Sallab, and Senthil Yogamani. RST-MODNet: Real-time spatio-temporal moving object detection for autonomous driving. arXiv preprint arXiv:1912.00438, 2019.
[37] Hazem Rashed, Ahmad El Sallab, Senthil Yogamani, and Mohamed ElHelw. Motion and depth augmented semantic segmentation for autonomous navigation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019.
[38] Hazem Rashed, Mohamed Ramzy, Victor Vaquero, Ahmad El Sallab, Ganesh Sistu, and Senthil Yogamani. FuseMODNet: Real-time camera and LiDAR based moving object detection for robust low-light autonomous driving. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Oct 2019.
[39] B. Ravi Kiran, Luis Roldao, Benat Irastorza, Renzo Verastegui, Sebastian Suss, Senthil Yogamani, Victor Talpaert, Alexandre Lepoutre, and Guillaume Trehard. Real-time dynamic object detection for autonomous driving using prior 3D-maps. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
[40] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4510–4520, 2018.
[41] Mennatullah Siam, Sara Elkerdawy, Mostafa Gamal, Moemen Abdel-Razek, Martin Jagersand, and Hong Zhang. Real-time segmentation with appearance, motion and geometry. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 5793–5800. IEEE, 2018.
[42] Mennatullah Siam, Sara Elkerdawy, Martin Jagersand, and Senthil Yogamani. Deep semantic segmentation for automated driving: Taxonomy, roadmap and challenges. In 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC), pages 1–8. IEEE, 2017.
[43] Mennatullah Siam, Mostafa Gamal, Moemen Abdel-Razek, Senthil Yogamani, and Martin Jagersand. RTSeg: Real-time semantic segmentation comparative study. In 2018 25th IEEE International Conference on Image Processing (ICIP), pages 1603–1607. IEEE, 2018.
[44] Mennatullah Siam, Heba Mahgoub, Mohamed Zahran, Senthil Yogamani, Martin Jagersand, and Ahmad El-Sallab. MODNet: Motion and appearance based moving object detection network for autonomous driving. In Proceedings of the 21st International Conference on Intelligent Transportation Systems (ITSC), pages 2859–2864, 2018.
[45] Ganesh Sistu, Isabelle Leang, Sumanth Chennupati, Senthil Yogamani, Ciaran Hughes, Stefan Milz, and Samir Rawashdeh. NeurAll: Towards a unified visual perception model for automated driving. In 2019 IEEE Intelligent Transportation Systems Conference (ITSC), pages 796–803. IEEE, 2019.

[46] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[47] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning video object segmentation with visual memory. In Proceedings of the IEEE International Conference on Computer Vision, pages 4481–4490, 2017.
[48] Pavel Tokmakov, Karteek Alahari, and Cordelia Schmid. Learning motion patterns in videos. arXiv preprint arXiv:1612.07217, 2016.
[49] Philip H. S. Torr. Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 356(1740):1321–1340, 1998.
[50] Johan Vertens, Abhinav Valada, and Wolfram Burgard. SMSnet: Semantic motion segmentation using deep convolutional neural networks. In Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), Vancouver, Canada, 2017.
[51] Paul Voigtlaender, Michael Krause, Aljosa Osep, Jonathon Luiten, Berin Balachandar Gnana Sekar, Andreas Geiger, and Bastian Leibe. MOTS: Multi-object tracking and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7942–7951, 2019.
[52] Paul Voigtlaender and Bastian Leibe. Online adaptation of convolutional neural networks for video object segmentation. arXiv preprint arXiv:1706.09364, 2017.
[53] Scott Wehrwein and Richard Szeliski. Video segmentation with background motion models. In BMVC, 2017.
[54] Marie Yahiaoui, Hazem Rashed, Letizia Mariotti, Ganesh Sistu, Ian Clancy, Lucie Yahiaoui, Varun Ravi Kumar, and Senthil Yogamani. FisheyeMODNet: Moving object detection on surround-view cameras for autonomous driving. arXiv preprint arXiv:1908.11789, 2019.
[55] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. CoRR, abs/1905.04804, 2019.
[56] Senthil Yogamani, Ciaran Hughes, Jonathan Horgan, Ganesh Sistu, Padraig Varley, Derek O'Dea, Michal Uricar, Stefan Milz, Martin Simon, Karl Amende, et al. WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. In Proceedings of the IEEE International Conference on Computer Vision, pages 9308–9318, 2019.
