arXiv:2105.01928v3 [cs.CV] 23 May 2021

Instances as Queries

Yuxin Fang1*, Shusheng Yang1,2*, Xinggang Wang1†, Yu Li2, Chen Fang3, Ying Shan2, Bin Feng1, Wenyu Liu1

1 School of EIC, Huazhong University of Science & Technology   2 Applied Research Center (ARC), Tencent PCG   3 Tencent

Abstract

Recently, query based object detection frameworks achieve comparable performance with previous state-of-the-art object detectors. However, how to fully leverage such frameworks to perform instance segmentation remains an open problem. In this paper, we present QueryInst (Instances as Queries), a query based instance segmentation method driven by parallel supervision on dynamic mask heads. The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, as well as the one-to-one correspondence between mask RoI features and object queries in the same stage. This approach eliminates the explicit multi-stage mask head connection and the proposal distribution inconsistency issues inherent in non-query based multi-stage instance segmentation methods. We conduct extensive experiments on three challenging benchmarks, i.e., COCO, Cityscapes, and YouTube-VIS, to evaluate the effectiveness of QueryInst on instance segmentation and video instance segmentation (VIS) tasks. Specifically, using a ResNet-101-FPN backbone, QueryInst obtains 48.1 box AP and 42.8 mask AP on COCO test-dev, which is 2 points higher than HTC in terms of both box AP and mask AP, while running 2.4 times faster. For video instance segmentation, QueryInst achieves the best performance among all online VIS approaches and strikes a decent speed-accuracy trade-off. Code is available at https://github.com/hustvl/QueryInst.

1. Introduction

Instance segmentation is a fundamental yet challenging computer vision task that requires an algorithm to assign a pixel-level mask with a category label to each instance of interest in an image.

*Equal contributions. This work was done while Shusheng Yang was interning at Applied Research Center (ARC), Tencent PCG.

†Corresponding author, E-mail: [email protected].

[Figure 1 plot: COCO Mask AP (y-axis, 35-50) vs. Frames Per Second (x-axis, 0-10) for QueryInst, Cascade Mask R-CNN, HTC, Mask R-CNN, SOLOv2 and CondInst.]

Figure 1: AP vs. FPS on COCO test-dev. QueryInst outperforms current state-of-the-art methods in terms of both accuracy and speed. The speed is measured using a single Titan Xp GPU.

Prevalent state-of-the-art instance segmentation methods are based on high performing object detectors and follow a multi-stage paradigm. Among them, the Mask R-CNN family [21, 24, 30, 5, 9, 42] is the most successful one, where the regions-of-interest (RoI) for instance segmentation are extracted via a region-wise pooling operation (e.g., RoIPool [22, 18] or RoIAlign [21]) based on the box-level localization information from the region proposal network (RPN) [39], or the previous stage bounding-box prediction [4, 5]. The final instance mask is obtained by feeding the RoI feature into the mask head, which is a small fully convolutional network (FCN) [33].

Recently, DETR [7] was proposed to reformulate object detection as a query based direct set prediction problem, whose input is merely 100 learned object queries. Follow-up works [62, 42, 43, 17, 58, 14] in object detection improve this query based approach and achieve comparable performance with state-of-the-art detectors such as Cascade R-CNN [4]. These results show that query based instance-level perception is a very promising research direction. Thus, enabling query based detection frameworks to perform instance segmentation is highly desirable.


However, we find that it is inefficient to directly integrate the previous successful practices of Cascade Mask R-CNN [5] and HTC [9], which are state-of-the-art mask generation solutions in the non-query based paradigm, into query based detectors for instance mask generation. Therefore, an instance segmentation method tailored for the query based end-to-end framework is urgently needed.

To bridge this gap, we propose QueryInst (Instances as Queries), a query based end-to-end instance segmentation method driven by parallel supervision on dynamic mask heads [25, 44, 42]. The key insight of QueryInst is to leverage the intrinsic one-to-one correspondence in object queries across different stages, and the one-to-one correspondence between mask RoI features and object queries in the same stage. Specifically, we set up dynamic mask heads in parallel with each other, which transform each mask RoI feature adaptively according to the corresponding query and are simultaneously trained in all stages. The mask gradient not only flows back to the backbone feature extractor, but also to the object query, which is intrinsically one-to-one interlinked across different stages. The queries implicitly carry the multi-stage mask information, which is read by RoI features in dynamic mask heads for final mask generation. There is no explicit connection between mask heads or mask features of different stages. Moreover, the queries are shared between the object detection and instance segmentation sub-networks in each stage, enabling cross-task communication so that one task can take advantage of the information from the other. We demonstrate that this shared query design can fully leverage the synergy between object detection and instance segmentation. When training is completed, we throw away all the dynamic mask heads in the intermediate stages and only use the final stage predictions for inference. Under such a scheme, QueryInst surpasses the state-of-the-art HTC in terms of AP while running much faster. Concretely, our main contributions are summarized as follows:

• We attempt to solve instance segmentation from a new perspective that uses parallel dynamic mask heads in the query based end-to-end detection framework. This novel solution enables such a new framework to outperform well-established and highly-optimized non-query based multi-stage schemes such as Cascade Mask R-CNN and HTC in terms of both accuracy and speed (see Fig. 1). Specifically, using a ResNet-101-FPN backbone [23, 27], QueryInst obtains 48.1 AP^box and 42.8 AP^mask on COCO test-dev, which is 2 points higher than HTC in terms of both box AP and mask AP, while running 2.4× faster. Without bells and whistles, our best model achieves 50.4 AP^box and 46.6 AP^mask on COCO test-dev.

• We set up a task-joint paradigm for query based object detection and instance segmentation by leveraging the shared query and multi-head self-attention design. This paradigm establishes a kind of communication and synergy between the detection and segmentation tasks, which encourages the two tasks to benefit from each other. We demonstrate that our architecture design can also significantly improve object detection performance.

• We extend QueryInst to the video instance segmentation (VIS) task [57] by simply adding a vanilla track head. Experiments on the YouTube-VIS dataset [57] indicate that with the same tracking approach, our method outperforms MaskTrack R-CNN [57] and SipMask-VIS [6] by a large margin. QueryInst-VIS can even outperform well-designed VIS approaches such as STEm-Seg [1] and VisTR [53].

2. Related Work

Query Based Methods. Recently, query based methods have emerged to tackle set-prediction problems. Concretely, DETR [7] first introduces query based methods with a transformer architecture to object detection. Deformable DETR [62], UP-DETR [14], ACT [58] and TSP [43] improve the performance on top of DETR. The recently proposed Sparse R-CNN [42] builds a query based set-prediction framework upon R-CNN [19, 18, 39] based detectors. For segmentation, VisTR [53] introduces a query based sequence matching and segmentation method to video instance segmentation, building a fully end-to-end framework for instance segmentation in video. MaX-DeepLab [49] presents the first box-free end-to-end panoptic segmentation model with a global memory as external query. TrackFormer [35] and TransTrack [41] build query based multiple object trackers upon DETR and Deformable DETR, respectively, and attain comparable results to non-query based methods. AS-Net [11] introduces a query based set-prediction pipeline to human object interaction and obtains promising results. Although query based set-prediction methods are widely used in many computer vision tasks, few efforts have been made to build a successful query based instance segmentation framework. We aim to achieve this goal in this paper.

Object Detection. Object detection is a fundamental computer vision task which aims to detect visual objects with bounding boxes. With the proposal of R-CNN [19], Fast R-CNN [18] and Faster R-CNN [39], anchor based methods [4, 38, 34, 28, 31] dominated object detection for a long period. CenterNet [60] and FCOS [45] establish anchor-free detectors with competitive detection performance. Recently, with the proposed DETR [7], query based set-prediction methods have attracted much attention.


Deformable DETR [62] introduces deformable convolution [61] to the DETR framework, achieving better performance with faster training convergence. UP-DETR [14] extends DETR to unsupervised scenarios. ACT [58] and TSP [43] introduce an adaptive clustering module and a new bipartite matching method to DETR, respectively. Sparse R-CNN [42] builds a query based detector on top of the R-CNN architecture, while OneNet [40] and DeFCN [50] are end-to-end detectors built upon the one-stage FCOS [45]. In this work, we present a query based instance segmentation method on top of the query based Sparse R-CNN detector.

Instance Segmentation. Instance segmentation is a fundamental yet challenging computer vision task that requires an algorithm to assign a pixel-level mask with a category label to each instance of interest in an image. Mask R-CNN [21] introduces a fully convolutional mask head to the Faster R-CNN [18] detector. Cascade Mask R-CNN [5] simply combines Cascade R-CNN [4] with Mask R-CNN. HTC [9] presents interleaved execution and mask information flow and achieves state-of-the-art performance. In addition to R-CNN based methods, YOLACT [3, 2], SipMask [6], CondInst [46] and SOLO [51, 52] build one-stage instance segmentation frameworks on top of one-stage detectors, achieving comparable results with favorable inference speed. Following the R-CNN based methods, we present a query based instance segmentation framework.

3. Instances as Queries

We propose QueryInst (Instances as Queries), a query based end-to-end instance segmentation method. QueryInst consists of a query based object detector and six dynamic mask heads driven by parallel supervision. Our key insight is to leverage the intrinsic one-to-one correspondence in queries across different stages. This correspondence exists in all query based frameworks [47, 15, 37, 8, 7] regardless of the specific instantiations and applications. The overall architecture of QueryInst is illustrated in Fig. 2 (c).

3.1. Query based Object Detector

QueryInst can be built on any multi-stage query based object detector [7, 62, 42]. We choose Sparse R-CNN [42] as our default instantiation, which has six query stages. The object detection pipeline is depicted in Fig. 2 (a) and can be formulated as follows:

    x^box_t ← P^box(x^FPN, b_{t-1}),
    q*_{t-1} ← MSA_t(q_{t-1}),
    x^box*_t, q_t ← DynConv^box_t(x^box_t, q*_{t-1}),
    b_t ← B_t(x^box*_t),        (1)

where q ∈ R^{N×d} denotes the object query, and N and d denote the length (number) and dimension of the query q, respectively.

(a) Sparse R-CNN. (b) Sparse R-CNN with vanilla mask head. (c) QueryInst with dynamic mask head.

Figure 2: Overview of QueryInst. The red arrows indicate mask branches. Please note that QueryInst consists of 6 stages in parallel, i.e., t = {1, 2, 3, 4, 5, 6}. The figure only shows 2 stages.

At stage t, a pooling operator P^box extracts the current stage bounding box features x^box_t from the FPN [27] features x^FPN under the guidance of the previous stage bounding box predictions b_{t-1}. Meanwhile, a multi-head self-attention module MSA_t is applied to the input query q_{t-1} to get the transformed query q*_{t-1}. Then, a box dynamic convolution module DynConv^box_t takes x^box_t and q*_{t-1} as inputs and enhances x^box_t by reading q*_{t-1}, while generating q_t for the next stage. Finally, the enhanced bounding box features x^box*_t are fed into the box prediction branch B_t for the current bounding box prediction b_t.
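To make the pipeline in Eq. (1) concrete, the following is a minimal, self-contained PyTorch sketch of one detection stage. It assumes batch size 1, a single FPN level, and illustrative layer sizes (d = 256, a 64-dimensional dynamic-conv bottleneck, 7×7 RoI pooling); the actual implementation follows Sparse R-CNN [42] and differs in detail, e.g., in the query update and box parameterization.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class DetectionStage(nn.Module):
    """One query stage of Eq. (1); batch size 1 and one FPN level for clarity."""
    def __init__(self, d=256, d_dyn=64, pool=7):
        super().__init__()
        self.msa = nn.MultiheadAttention(d, num_heads=8, batch_first=True)  # MSA_t
        self.to_params = nn.Linear(d, 2 * d * d_dyn)   # q* -> two dynamic 1x1 kernels
        self.fc = nn.Linear(pool * pool * d, d)        # flatten the enhanced RoI feature
        self.box_head = nn.Linear(d, 4)                # B_t: box regression
        self.d, self.d_dyn, self.pool = d, d_dyn, pool

    def forward(self, x_fpn, boxes, q):
        # x_fpn: (1, d, H, W); boxes: (N, 4) xyxy from stage t-1; q: (N, d)
        n = q.size(0)
        x_box = roi_align(x_fpn, [boxes], output_size=self.pool)  # P^box: (N, d, 7, 7)
        q_star = self.msa(q[None], q[None], q[None])[0][0]        # q*_{t-1}: (N, d)
        # DynConv^box_t: per-query 1x1 kernels applied as batched matmuls
        k1, k2 = self.to_params(q_star).split(self.d * self.d_dyn, dim=-1)
        k1 = k1.view(n, self.d, self.d_dyn)
        k2 = k2.view(n, self.d_dyn, self.d)
        feat = x_box.flatten(2).transpose(1, 2)                   # (N, 49, d)
        x_star = torch.relu(feat @ k1) @ k2                       # x^box*_t features
        q_next = q_star + x_star.mean(dim=1)                      # q_t (simplified update)
        b_next = self.box_head(self.fc(x_star.flatten(1)))        # b_t
        return b_next, q_next
```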

3.2. Mask Head Architecture

3.2.1 Vanilla Mask Head

For instance mask prediction, we first adopt the widely used vanilla mask head architecture design of Mask R-CNN [21] as our instance segmentation baseline. The model architecture is depicted in Fig. 2 (b).


Figure 3: Illustration of DynConv^mask_t at stage t. x^mask*_t is enhanced by two consecutive conv-layers, whose kernel parameters are produced by q*_{t-1}.

Based on the object detection pipeline described in Sec. 3.1, the mask generation process can be expressed as follows:

    x^mask_t ← P^mask(x^FPN, b_t),
    m_t ← M_t(x^mask_t),        (2)

where b_t is the bounding box prediction from the object detector. P^mask denotes a region-wise pooling operator for mask RoI feature extraction. M_t indicates the mask FCN head, consisting of a stack of four consecutive conv-layers, one deconv-layer and one 1×1 conv-layer for mask generation [21]. m_t is the current stage mask prediction.
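For reference, below is a sketch of such a vanilla mask FCN head M_t; the channel counts and the 80-class output are the usual Mask R-CNN defaults and are assumptions here, not values stated in this paper.

```python
import torch.nn as nn

def vanilla_mask_head(d=256, num_classes=80):
    """Mask R-CNN-style mask FCN head M_t, per the description above."""
    layers = []
    for _ in range(4):                                  # four consecutive 3x3 conv-layers
        layers += [nn.Conv2d(d, d, 3, padding=1), nn.ReLU(inplace=True)]
    layers += [
        nn.ConvTranspose2d(d, d, 2, stride=2),          # deconv-layer: 2x upsampling
        nn.ReLU(inplace=True),
        nn.Conv2d(d, num_classes, 1),                   # 1x1 conv-layer -> per-class masks
    ]
    return nn.Sequential(*layers)

# m_t = vanilla_mask_head()(x_mask_t), with x_mask_t = P^mask(x_fpn, b_t)
```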

Overall, this vanilla design is an analogy of Cascade Mask R-CNN [5] in a query based framework. However, we find that this design is not as effective as the original Cascade Mask R-CNN. Moreover, establishing an explicit mask flow following HTC [9] on top of this design (Fig. 2 (b)) can only bring moderate improvements at the cost of large drops in both training and inference speed. Part of the reason may be that the number of queries in our framework is much smaller than the number of proposals in Cascade Mask R-CNN and HTC, resulting in limited availability of training samples.

3.2.2 Dynamic Mask Head

Our goal is to design a mask prediction head tailor-made for query based instance segmentation frameworks. To this end, we propose to leverage dynamic mask heads driven by parallel supervision to replace the vanilla design in Sec. 3.2.1. The dynamic mask head at stage t consists of a dynamic mask convolution module DynConv^mask_t (see Fig. 3) [42] followed by a vanilla mask head M_t [21]. The mask generation pipeline is reformulated as follows:

    x^mask_t ← P^mask(x^FPN, b_t),
    x^mask*_t ← DynConv^mask_t(x^mask_t, q*_{t-1}),
    m_t ← M_t(x^mask*_t).        (3)

It is noteworthy that the only difference between the proposed dynamic mask head and the vanilla mask head is the existence of DynConv^mask_t. We demonstrate that DynConv^mask_t enables (1) per-mask information flow in queries driven by parallel mask branch supervision, and (2) communication and synergy for joint detection and instance segmentation, in the following two sub-sections, respectively. The effectiveness of these two properties is verified in our experiments.
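A minimal sketch of DynConv^mask_t in Eq. (3) is given below: the query generates the kernel parameters of two consecutive 1×1 conv-layers that are applied to its own mask RoI feature (Fig. 3). The 64-dimensional bottleneck follows the Sparse R-CNN dynamic-conv style and is an assumption, not the released configuration.

```python
import torch
import torch.nn as nn

class DynConvMask(nn.Module):
    """DynConv^mask_t: query-conditioned enhancement of mask RoI features."""
    def __init__(self, d=256, d_dyn=64):
        super().__init__()
        self.to_params = nn.Linear(d, 2 * d * d_dyn)   # q* -> kernels of two 1x1 convs
        self.d, self.d_dyn = d, d_dyn

    def forward(self, x_mask, q_star):
        # x_mask: (N, d, S, S) mask RoI features; q_star: (N, d) transformed queries
        n, _, s, _ = x_mask.shape
        p1, p2 = self.to_params(q_star).split(self.d * self.d_dyn, dim=-1)
        k1 = p1.view(n, self.d, self.d_dyn)            # first dynamic 1x1 conv
        k2 = p2.view(n, self.d_dyn, self.d)            # second dynamic 1x1 conv
        feat = x_mask.flatten(2).transpose(1, 2)       # (N, S*S, d)
        feat = torch.relu(feat @ k1) @ k2              # two consecutive dynamic convs
        return feat.transpose(1, 2).view(n, self.d, s, s)   # x^mask*_t

# Eq. (3): m_t = vanilla_mask_head()(DynConvMask()(x_mask_t, q_star))
```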

3.3. Per-mask Information Flow with Parallel Supervision

In query based models such as [7, 62, 42], the model learns a different specialization for each query slot [7], i.e., q_t[s] is the transformed and refined version of the previous stage q_{t-1}[s] in the same s-th slot. Moreover, x^mask_t[s] corresponds to and is refined by q_t[s] [42]. Therefore, there is a one-to-one correspondence across queries of different stages inherent in these frameworks, as well as a one-to-one correspondence between mask RoI features and object queries in the same stage.

QueryInst is driven by parallel supervision on dynamic mask heads, which fully leverages the intrinsic one-to-one correspondence in object queries across different stages. Specifically, we set up dynamic mask heads in parallel with each other, which transform each mask RoI feature x^mask_t adaptively in DynConv^mask_t according to the corresponding query q*_{t-1}, and are simultaneously trained in all stages. Inside DynConv^mask_t, the query acts as a memory that is read by the mask RoI features x^mask_t in the forward pass and written by x^mask_t in the backward pass.

During training, the per-mask information (i.e., the mask gradient) not only flows back to the mask RoI features x^mask_t, but also to the object query q*_{t-1}, which is intrinsically one-to-one interlinked across different stages. Therefore, the per-mask information flow is naturally established by leveraging the inherent properties of query based frameworks, with no additional connection needed. After training is completed, the information for mask prediction is stored in the queries.

During inference, we throw away the dynamic mask heads in the 5 intermediate stages and only use the final stage predictions. The queries implicitly carry the multi-stage information for mask prediction, which is read by the mask RoI features x^mask_t in the dynamic mask convolution DynConv^mask_t at the last stage for final mask generation. Without DynConv^mask_t, the link between the mask RoI features and the query is lost, and the mask heads in different stages are isolated. Even though parallel supervision is applied to all mask heads, the information related to mask generation cannot flow into the queries. In this condition, QueryInst degenerates to Cascade Mask R-CNN with a fixed number (i.e., N) of proposals across all stages.
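The following sketch shows how this parallel supervision could be wired up: every stage's dynamic mask head is trained against the ground truth simultaneously, while inference runs the cascade and keeps only the final stage masks. The dice_loss helper, the stage interface, and the unit loss weights are assumptions for illustration, not the released training code.

```python
import torch
from torchvision.ops import roi_align

def pool_mask(x_fpn, boxes, size=14):
    # P^mask: RoIAlign guided by the current boxes b_t (single FPN level assumed)
    return roi_align(x_fpn, [boxes], output_size=size)

def training_step(stages, mask_heads, x_fpn, q0, boxes0, gt_masks, dice_loss):
    """Parallel supervision: every stage's dynamic mask head contributes a loss."""
    q, boxes, total_loss = q0, boxes0, 0.0
    for stage, mask_head in zip(stages, mask_heads):
        boxes, q, q_star = stage(x_fpn, boxes, q)       # detection stage, Eq. (1)
        masks = mask_head(pool_mask(x_fpn, boxes), q_star)   # DynConv^mask + M_t, Eq. (3)
        total_loss = total_loss + dice_loss(masks, gt_masks)  # gradient writes into q
    return total_loss

@torch.no_grad()
def inference(stages, mask_heads, x_fpn, q0, boxes0):
    """Only the final stage's mask head is kept at test time."""
    q, boxes = q0, boxes0
    for stage in stages:                                # intermediate mask heads skipped
        boxes, q, q_star = stage(x_fpn, boxes, q)
    return boxes, mask_heads[-1](pool_mask(x_fpn, boxes), q_star)
```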

3.4. Shared Query and MSA for Joint Detection and Segmentation

At stage t, a multi-head self-attention module MSA_t is applied to the query q_{t-1}. MSA_t projects the query q_{t-1} into a high dimensional embedding space, and its output q*_{t-1} is read by the dynamic box convolution DynConv^box_t and the dynamic mask convolution DynConv^mask_t, respectively, to enhance the task-specific features x^box_t and x^mask_t.

Throughout this process, the query and MSA are shared between the detection and instance segmentation tasks. Both detection and segmentation information flow back into the query through the MSA. This task-joint paradigm establishes a kind of communication and synergy between the detection and segmentation tasks, which encourages these two tasks to benefit from each other. The query learns a better instance-level representation under the guidance of two highly correlated tasks. We observe a performance decrease when using separate queries or MSA in our experiments.
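A sketch of this shared design (configuration (d) in Fig. 5) is shown below, reusing the DynConvMask sketch above for both branches: a single MSA output q* feeds both task-specific dynamic convolutions, so the gradients of both tasks meet in the same query. The module sizes are assumptions.

```python
import torch.nn as nn

class SharedQueryMSA(nn.Module):
    """Shared query and shared MSA: both task branches read the same q*."""
    def __init__(self, d=256):
        super().__init__()
        self.msa = nn.MultiheadAttention(d, num_heads=8, batch_first=True)  # one MSA_t
        self.dyn_box = DynConvMask(d)    # DynConv^box_t (same dynamic-conv form)
        self.dyn_mask = DynConvMask(d)   # DynConv^mask_t

    def forward(self, x_box, x_mask, q):
        q_star = self.msa(q[None], q[None], q[None])[0][0]   # q*_{t-1}, shared
        x_box_star = self.dyn_box(x_box, q_star)     # detection branch reads q*
        x_mask_star = self.dyn_mask(x_mask, q_star)  # segmentation reads the same q*
        return x_box_star, x_mask_star, q_star       # gradients of both tasks reach q
```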

3.5. Comparisons with Cascade Mask R-CNN and HTC

Cascade Mask R-CNN [5] presents a multi-stage architecture that resamples with higher intersection over union (IoU) thresholds at the later stages and progressively refines the training distributions of region proposals [4, 5]. This resampling operation guarantees the availability of a large number of accurately localized proposals for the final stage. Despite outperforming its non-cascade counterparts, the effectiveness of Cascade Mask R-CNN mainly stems from the progressively refined proposal recall, whereas the mask heads across different stages are isolated, and the input feature for the mask head always comes from the same FPN feature regardless of the stage.

To mitigate the aforementioned drawback of Cascade Mask R-CNN, HTC [9] improves it by introducing direct and explicit connections across the mask heads at different stages. The current stage mask features are combined with the accumulated mask features from all previous stages. Equivalently, the mask head of the final stage is 3× deeper than that of the first stage, so the final stage mask prediction can benefit from deeper features. Establishing a direct mask information flow as in HTC can alleviate the issue in Cascade Mask R-CNN to some extent. However, this explicit connection across mask heads at different R-CNN stages results in inefficient training and inference.

There are some potential issues inherent in the aforementioned non-query based instance segmentation paradigms. For Cascade Mask R-CNN and HTC, the quality of proposals in different stages is refined only in the statistical sense [4, 5, 9]. For each stage, the number and distribution of training samples are quite different, and there is no explicit and intrinsic correspondence for each individual proposal across different stages [59]. Moreover, there is also a mismatch between the training and inference sample distributions [48]. Therefore, introducing direct connections at the architecture level is necessary for the mask heads in different stages to explicitly learn the correspondence [9].

Our method does not directly solve the aforementioned issues, but bypasses them. For QueryInst, the connections across stages are naturally established by the one-to-one correspondence inherent in queries. This approach eliminates the explicit multi-stage mask head connection and the proposal distribution inconsistency issues. We show that the proposed new paradigm can surpass Cascade Mask R-CNN and HTC in terms of both accuracy and speed.

3.6. QueryInst-VIS for Video Instance Segmentation

Video instance segmentation (VIS) [57] is a task highly relevant to still-image instance segmentation that aims at detecting, classifying, segmenting and tracking visual instances over video frames. We demonstrate that QueryInst can be easily extended to VIS with minimal modifications by simply adding the vanilla track head of the MaskTrack R-CNN baseline [57]. The proposed model, coined QueryInst-VIS, can perform video instance segmentation in an online manner while operating in real-time. The overall training and inference pipelines are kept the same as MaskTrack R-CNN. We evaluate QueryInst-VIS on the challenging YouTube-VIS [57] benchmark to demonstrate its effectiveness.
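As a rough illustration, a vanilla track head in the MaskTrack R-CNN style embeds each instance's RoI feature with two fc layers and matches it against previous-frame instances by similarity, as in the sketch below. All sizes and the matching rule are assumptions; MaskTrack R-CNN additionally combines the similarity with detection-score and IoU cues, which are omitted here.

```python
import torch
import torch.nn as nn

class TrackHead(nn.Module):
    """Embeds instance RoI features and matches them to tracked instances."""
    def __init__(self, in_dim=256 * 7 * 7, embed_dim=256):
        super().__init__()
        self.embed = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, embed_dim))

    def forward(self, roi_feats, prev_embeds):
        # roi_feats: (N, 256, 7, 7) current-frame instances
        # prev_embeds: (M, embed_dim) memory of previously tracked instances
        cur = self.embed(roi_feats.flatten(1))      # (N, embed_dim)
        logits = cur @ prev_embeds.t()              # (N, M) match scores
        return logits.softmax(dim=1), cur           # assign by argmax or start new ID
```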

4. Experiments

4.1. Datasets

COCO. Most of our experiments are conducted on the challenging COCO dataset [29]. Following common practice, we use the COCO train2017 split (115k images) for training and the val2017 split (5k images) as validation for our ablation study. We report our main results on the test-dev split (20k images).

Cityscapes. Cityscapes [12] is an ego-centric street-scene dataset with 8 categories, 2975 training images, and 500 validation images for instance segmentation. The images have higher resolution (1024 × 2048 pixels) compared with COCO, and have more pixel-accurate ground-truth.

YouTube-VIS. In addition to static-image instance segmentation, we demonstrate the effectiveness of QueryInst on video instance segmentation. YouTube-VIS [57] is a challenging dataset for the video instance segmentation task, which has a 40-category label set, 4,883 unique video instances and 131k high-quality manual annotations. There are 2,238 training videos, 302 validation videos, and 343 test videos.

4.2. Implementation Details

Training Setup. Our implementation is based on MMDetection [10] and Detectron2 [54]. Following [42], the default training schedule is 36 epochs and the initial learning rate is set to 2.5 × 10^-5, divided by 10 at the 27th and 33rd epochs, respectively.


| Method | Backbone | Aug. | Epochs | AP^box | AP | AP50 | AP75 | APS | APM | APL | FPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN [21] | ResNet-50-FPN | 640∼800 | 36 | 41.3 | 37.5 | 59.3 | 40.2 | 21.1 | 39.6 | 48.3 | 14.0 |
| CondInst w/ sem. [46] | ResNet-50-FPN | 640∼800 | 36 | – | 38.6 | 60.2 | 41.4 | 20.6 | 41.0 | 51.1 | 14.1 |
| SOLOv2 [51] | ResNet-50-FPN | 640∼800 | 36 | 40.4 | 38.8 | 59.9 | 41.7 | 16.5 | 41.7 | 56.2 | 13.8 |
| QueryInst (5 Stage, 100 Queries) | ResNet-50-FPN | 640∼800 | 36 | 44.5 | 39.9 | 62.2 | 43.0 | 22.9 | 41.7 | 51.9 | 13.5 |
| Cascade Mask R-CNN [5] | ResNet-50-FPN | 640∼800 | 36 | 44.5 | 38.6 | 60.0 | 41.7 | 21.7 | 40.8 | 49.6 | 10.4 |
| HTC [9] | ResNet-50-FPN | 640∼800 | 36 | 44.9 | 39.7 | 61.4 | 43.1 | 22.6 | 42.2 | 50.6 | 3.1 |
| QueryInst (100 Queries) | ResNet-50-FPN | 640∼800 | 36 | 44.8 | 40.1 | 62.3 | 43.4 | 23.3 | 42.1 | 52.0 | 10.5 |
| QueryInst (300 Queries) | ResNet-50-FPN | 640∼800 | 36 | 45.6 | 40.6 | 63.0 | 44.0 | 23.4 | 42.5 | 52.8 | 7.0 |
| Cascade Mask R-CNN | ResNet-101-FPN | 640∼800 | 36 | 45.7 | 39.8 | 61.6 | 43.0 | 22.4 | 42.2 | 50.8 | 8.7 |
| HTC | ResNet-101-FPN | 640∼800 | 36 | 46.2 | 40.7 | 62.7 | 44.2 | 23.1 | 43.4 | 52.7 | 2.5 |
| QueryInst (300 Queries) | ResNet-101-FPN | 640∼800 | 36 | 47.0 | 41.7 | 64.4 | 45.3 | 24.2 | 43.9 | 53.9 | 6.1 |
| Cascade Mask R-CNN | ResNet-101-FPN | 480∼800 w/ crop | 36 | 46.2 | 40.0 | 61.7 | 43.5 | 22.5 | 42.5 | 51.2 | 8.7 |
| HTC | ResNet-101-FPN | 480∼800 w/ crop | 36 | 46.3 | 40.8 | 62.6 | 44.3 | 23.0 | 43.5 | 52.6 | 2.5 |
| Sparse R-CNN (300 Queries) | ResNet-101-FPN | 480∼800 w/ crop | 36 | 46.3 | – | – | – | – | – | – | 6.9 |
| QueryInst (300 Queries) | ResNet-101-FPN | 480∼800 w/ crop | 36 | 48.1 | 42.8 | 65.6 | 46.7 | 24.6 | 45.0 | 55.5 | 6.1 |
| QueryInst (300 Queries) | ResNeXt-101-FPN w/ DCN | 480∼800 w/ crop | 36 | 50.4 | 44.6 | 68.1 | 48.7 | 26.6 | 46.9 | 57.7 | 3.1 |
| QueryInst (300 Queries) @ val | Swin-L | 400∼1200 w/ crop | 50 | 56.1 | 48.9 | 74.0 | 53.9 | 30.8 | 52.6 | 68.3 | 3.3> |
| QueryInst (300 Queries) | Swin-L | 400∼1200 w/ crop | 50 | 56.1 | 49.1 | 74.2 | 53.8 | 31.5 | 51.8 | 63.2 | 3.3> |

Table 1: Main results on COCO test-dev. The numbers under "Aug." indicate the scale range of the shorter side of inputs with a stride of 32. AP^box denotes box AP; AP without a superscript denotes mask AP. The best results are in bold for each configuration. Superscript ">" indicates FPS measured on a single RTX 2080Ti GPU with batch size 1.

We adopt the AdamW optimizer with 1 × 10^-4 weight decay. Hyper-parameters, configurations, and the label assignment procedure follow the settings in [7, 62, 42]. In total, the R-CNN head of QueryInst contains 6 stages in parallel, as in [42]. The mask head is trained by minimizing the dice loss [36]. Unless otherwise specified, we adopt the QueryInst model trained with 100 queries and a ResNet-50-FPN [23, 27] backbone in our ablation studies.

Inference. Given an input image, QueryInst directly outputs the top 100 bounding box predictions with their scores and corresponding instance masks without further post-processing. For inference, we use the final stage masks as the predictions and ignore all the parallel DynConv^mask at the intermediate stages. The reported inference speed is measured using a single Titan Xp GPU with inputs resized to have their shorter side at 800 and their longer side less than or equal to 1333.
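A sketch of the optimizer and schedule described above, plus the dice loss of [36] used for the mask head, is shown below in standard PyTorch; `model` is a placeholder and the exact loss weighting is not specified here.

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)  # placeholder for the full QueryInst model

# AdamW, lr 2.5e-5, weight decay 1e-4; lr divided by 10 at epochs 27 and 33
optimizer = torch.optim.AdamW(model.parameters(), lr=2.5e-5, weight_decay=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[27, 33], gamma=0.1)

def dice_loss(pred_logits, target, eps=1e-5):
    """Soft dice loss [36] over per-instance mask logits of shape (N, H, W)."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1).float()
    inter = (pred * target).sum(1)
    union = pred.pow(2).sum(1) + target.pow(2).sum(1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()
```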

4.3. Main Results

Comparisons on COCO Instance Segmentation. The comparison of QueryInst with state-of-the-art instance segmentation methods on COCO test-dev is listed in Tab. 1. We have tested different backbones and data augmentations. CondInst [46] (with an auxiliary semantic branch) and SOLOv2 [52] are the latest state-of-the-art instance segmentation approaches based on dynamic convolutions. A 5-stage QueryInst trained with 100 queries outperforms them with over 1.1 mask AP gain at similar inference speed. QueryInst trained with 100 queries can also surpass Cascade Mask R-CNN [5] by 1.5 mask AP while running at the same FPS. For fair comparisons with HTC [9], we train HTC using the 36-epoch training schedule and multi-scale data augmentations following the standard settings in [21, 54], yielding ∼1 higher mask AP than the original results reported in [9]. Under the same experimental conditions, QueryInst outperforms the state-of-the-art HTC in terms of both accuracy and speed. Moreover, QueryInst outperforms HTC in terms of AP at different IoU thresholds (AP50 and AP75) as well as AP at different scales (APS, APM and APL), regardless of the experimental configuration. We also find that, compared with Cascade Mask R-CNN and HTC, the query based QueryInst can benefit more from the stronger data augmentation used in [7, 62, 42]¹. Specifically, using a ResNet-101-FPN [27] backbone and stronger multi-scale data augmentation with random crop, QueryInst surpasses HTC by 2.0 mask AP and 1.8 box AP while running 2.4× faster.

¹We experimentally study the effects of different training schedules and data augmentations on the Mask R-CNN family in the Appendix.


| Method | Backbone | APval | AP | AP50 | person | rider | car | truck | bus | train | mcycle | bicycle |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mask R-CNN [21] | ResNet-50 | 36.4 | 32.0 | 58.1 | 34.8 | 27.0 | 49.1 | 30.1 | 40.9 | 30.9 | 24.1 | 18.7 |
| BShapeNet+ [26] | ResNet-50 | – | 32.9 | 58.8 | 36.6 | 24.8 | 50.4 | 33.7 | 41.0 | 33.7 | 25.4 | 17.8 |
| UPSNet [56] | ResNet-50 | 37.8 | 33.0 | 59.7 | 35.9 | 27.4 | 51.9 | 31.8 | 43.1 | 31.4 | 23.8 | 19.1 |
| CondInst [46] | ResNet-50 | 37.5 | 33.2 | 57.2 | 35.1 | 27.7 | 54.5 | 29.5 | 42.3 | 33.8 | 23.9 | 18.9 |
| CondInst [46] w/ sem. | DCN-101-BiFPN | 39.3 | 33.9 | 58.2 | 35.6 | 28.1 | 55.0 | 32.1 | 44.2 | 33.6 | 24.5 | 18.6 |
| QueryInst | ResNet-50 | 39.4 | 34.4 | 59.6 | 40.4 | 30.7 | 56.8 | 29.1 | 40.5 | 30.8 | 26.0 | 21.1 |

Table 2: Instance segmentation results on the Cityscapes val (APval column) and test (remaining columns) splits. The best results are in bold.

| Method | Backbone | AP | AP50 | AP75 | AR1 | AR10 | FPS |
|---|---|---|---|---|---|---|---|
| MaskTrack R-CNN [57] | ResNet-50 | 30.3 | 51.1 | 32.6 | 31.0 | 35.5 | 22.1 |
| SipMask-VIS [6] | ResNet-50 | 32.5 | 53.0 | 33.3 | 33.5 | 38.9 | 30.9 |
| SipMask-VIS∗ | ResNet-50 | 33.7 | 54.1 | 35.8 | 35.4 | 40.1 | 30.9 |
| STEm-Seg [1] | ResNet-50 | 30.6 | 50.7 | 33.5 | 31.6 | 37.1 | 4.4 |
| STEm-Seg | ResNet-101 | 34.6 | 55.8 | 37.9 | 34.4 | 41.6 | 2.1 |
| CompFeat [16] | ResNet-50 | 35.3 | 56.0 | 38.6 | 33.1 | 40.3 | – |
| VisTR [53] | ResNet-50 | 34.4 | 55.7 | 36.5 | 33.5 | 38.9 | 30.0 |
| VisTR | ResNet-101 | 35.3 | 57.0 | 36.2 | 34.3 | 40.4 | 27.7 |
| QueryInst-VIS | ResNet-50 | 34.6 | 55.8 | 36.5 | 35.4 | 42.4 | 32.3 |
| QueryInst-VIS∗ | ResNet-50 | 36.2 | 56.7 | 39.7 | 36.1 | 42.9 | 32.3 |

Table 3: Comparisons with state-of-the-art video instance segmentation methods on the YouTube-VIS val set. Superscript "∗" indicates using multi-scale data augmentation during training. The best results are in bold.

Further, QueryInst with a deformable ResNeXt-101-FPN backbone [55, 13, 61] achieves 44.6 mask AP and 50.4 box AP without bells and whistles.

We demonstrate that the instance segmentation performance of QueryInst does not simply come from the accurate bounding boxes provided by the Sparse R-CNN [42] object detector. On the contrary, QueryInst can largely improve the detection performance. The best result of Sparse R-CNN (ResNet-101-FPN, 300 queries, 480∼800 w/ crop, 36 epochs) reported in [10] is 46.3 box AP. Under the same experimental setting, QueryInst achieves 48.1 box AP, which outperforms Sparse R-CNN by 1.8 box AP. We also show in the ablation study that QueryInst can outperform Cascade Mask R-CNN and HTC based on a weaker query based detector.

We also apply QueryInst to the recent state-of-the-art Swin Transformer [32] backbone without further modifications, and we find the proposed model is quite capable of adapting to Swin-L. Without bells and whistles, QueryInst achieves state-of-the-art performance in instance segmentation² as well as object detection³.

²https://paperswithcode.com/sota/instance-segmentation-on-coco

³https://paperswithcode.com/sota/object-detection-on-coco

For the first time, we demonstrate that an end-to-end query based framework driven by parallel supervision is competitive with well-established and highly-optimized methods in instance-level recognition tasks.

Comparisons on Cityscapes Instance Segmentation. We also conduct experiments on the Cityscapes dataset to demonstrate the generalization of QueryInst. Following the standard settings in [46, 21], all models are first pre-trained on the COCO train2017 split and then finetuned on Cityscapes using fine annotations for 24k iterations with batch size 8 (1 image per GPU). The initial learning rate is linearly scaled to 1.25 × 10^-5 and is reduced by a factor of 10 at step 18k.

The results are shown in Tab. 2. QueryInst achieves 39.4 AP on the val split and 34.4 AP on the test split, surpassing several strong baselines. Notably, compared to the dynamic convolution based method CondInst [46], QueryInst with a ResNet-50 backbone outperforms CondInst with both a ResNet-101-DCN-BiFPN backbone and a semantic branch. Overall, QueryInst achieves leading results on the Cityscapes dataset without bells and whistles.

Video Instance Segmentation Results on YouTube-VIS. Tab. 3 shows the video instance segmentation results on the YouTube-VIS val set. Following the standard setting in [57], we first pre-train the instance segmentation model on COCO train2017, then we finetune the corresponding VIS model on the YouTube-VIS train set for 12 epochs.


| Type | Cascade Mask Head [5] | HTC Mask Flow [9] | DynConv^mask | Fig. | AP^box | ∆^box | AP^mask | ∆^mask | FPS | ∆FPS |
|---|---|---|---|---|---|---|---|---|---|---|
| Non-query based | ✓ | | | | 44.3 | | 38.5 | | 10.4 | |
| Non-query based | | ✓ | | | 44.4 | +0.1 | 39.3 | +0.8 | 3.1 | −7.3 |
| Query based | ✓ | | | Fig. 2 (b) | 43.8 | | 37.9 | | 11.1 | |
| Query based | | ✓ | | | 43.8 | +0.0 | 38.9 | +1.0 | 6.0 | −5.1 |
| Query based | | | ✓ | Fig. 2 (c) | 44.5 | +0.7 | 39.8 | +1.9 | 10.5 | −0.6 |
| Query based | | ✓ | ✓ | | 44.4 | +0.6 | 40.0 | +2.1 | 5.4 | −5.7 |

Table 4: Impacts of different mask head architectures on different frameworks. The setting in blue is our default instantiation.

| Parallel | DynConv^mask | Fig. | AP^box | AP^mask | FPS |
|---|---|---|---|---|---|
| | | Fig. 2 (b) | 43.5 | 37.4 | 11.1 |
| ✓ | | Fig. 2 (b) | 43.8 | 37.9 | 11.1 |
| | ✓ | Fig. 2 (c) | 43.8 | 38.8 | 10.5 |
| ✓ | ✓ | Fig. 2 (c) | 44.5 | 39.8 | 10.5 |

Table 5: Impacts of parallel supervision and DynConv^mask.

| Shared MSA | Shared Query | AP^box | ∆^box | AP^mask | ∆^mask |
|---|---|---|---|---|---|
| | | 43.4 | | 38.1 | |
| ✓ | | 43.9 | +0.5 | 38.3 | +0.2 |
| | ✓ | 44.1 | +0.7 | 39.5 | +1.4 |
| ✓ | ✓ | 44.5 | +1.1 | 39.8 | +1.7 |

Table 6: Impacts of using shared query and MSA.

The maximum number of instances in one frame of the YouTube-VIS dataset is 10, so we set the number of queries to 10 in QueryInst-VIS. This setting enables the model to operate in real-time (> 30 FPS).

As mentioned in Sec. 3.6, QueryInst-VIS adopts the vanilla tracking method of MaskTrack R-CNN [57] and SipMask-VIS [6], yet it obtains a 4.3 AP improvement over MaskTrack R-CNN and a 2.1 AP improvement over SipMask-VIS. Moreover, QueryInst-VIS can outperform many well-established and highly-optimized VIS approaches, such as STEm-Seg, CompFeat and VisTR, in terms of both accuracy and speed.

4.4. Ablation Study

Study of Parallel Supervision and DynConv^mask. We show that applying parallel mask head supervision and DynConv^mask are both indispensable for good performance. As shown in Tab. 5, using parallel supervision on the vanilla mask head cannot bring a large improvement, because the mask heads in different stages are isolated and no cross-stage per-mask information flow is established (Sec. 3.2.1).

Figure 4: Effects of DynConv^mask. The first row shows mask features x^mask directly extracted from the FPN. The second row shows mask features x^mask* enhanced by queries in DynConv^mask. The last row shows the ground-truth instance masks. The results show that mask features enhanced by queries yield more genuine and accurate details and carry more information about the instances.

Using DynConv^mask without parallel supervision on each stage can only bring a moderate improvement, for mask gradients injected from the final stage cannot fully drive the per-mask information flow across queries in all stages. When DynConv^mask in all stages is driven by parallel supervision simultaneously, QueryInst achieves a significant improvement in accuracy with only a small drop in inference speed. The reason is that during inference, we throw away all the parallel DynConv^mask in the intermediate stages and only use the final stage mask predictions. The per-mask information is written and preserved in queries during training, and only needs to be read out at the final stage during inference.

Study of Query and MSA. Tab. 6 studies the impact of using shared query and MSA. As expected from Sec. 3.4, using shared query and MSA simultaneously establishes a kind of communication and synergy between the detection and segmentation tasks, which encourages the two tasks to benefit from each other and achieves the highest box AP and mask AP. Moreover, this configuration consumes the smallest parameter and computation budget. Therefore, we choose shared query and MSA as the default instantiation of QueryInst.

Study of Different Mask Heads. Tab. 4 studies the impact of different mask head architectures on query and non-query based frameworks.


(a) Separate query, separate MSA. (b) Separate query, shared MSA. (c) Shared query, separate MSA. (d) Shared query, shared MSA.

Figure 5: Illustration of 4 different query and MSA configurations. We use (d) as the default instantiation of QueryInst.

| Method | Sched. | Training Time | AP |
|---|---|---|---|
| HTC [9] | 3× | ∼41 hours | 39.7 |
| QueryInst (100 Queries) | 3× | ∼35 hours | 40.1 |
| QueryInst (300 Queries) | 3× | ∼38 hours | 40.6 |

Table 7: Training time comparisons between QueryInst and HTC. All models use ResNet-50-FPN [23, 27] as the backbone and are trained with the 3× schedule (∼36 epochs) on 8 NVIDIA V100 GPUs (2 images per GPU). QueryInst achieves better performance while requiring less training time.

All stages are simultaneously trained. For the non-query based framework, the 1st row gives the results of Cascade Mask R-CNN [5] and the 2nd row gives HTC [9]. We have the following 3 major observations.

First, we find that directly integrating the cascade mask head [5] and the HTC mask flow [9] into the query based model is not as effective as in their original frameworks. When the cascade mask head is applied (3rd row), the query based model is 0.5 AP^box and 0.6 AP^mask lower than the original Cascade Mask R-CNN (1st row). When the HTC mask flow is applied (4th row), the query based model is 0.6 AP^box and 0.4 AP^mask lower than the original HTC (2nd row). These results demonstrate that previous successful empirical practice from non-query based multi-stage models is possibly inadequate for query based models (Sec. 3.2.1).

Second, when the proposed parallel DynConv^mask is applied to the query based model, QueryInst (5th row) outperforms the baseline (3rd row) by 0.7 AP^box and 1.9 AP^mask, while maintaining a high FPS. Moreover, QueryInst also beats the original HTC (2nd row) in terms of both AP^box and AP^mask while running about 3× faster. Fig. 4 demonstrates the effects of DynConv^mask qualitatively.

Last, we also find that for query based approaches, the HTC mask flow cannot bring further improvement on top of the parallel DynConv^mask architecture (6th row).

| Metric | Aug. | 3× | 4× | 5× | 6× |
|---|---|---|---|---|---|
| AP^box | 640∼800 | 42.9 | 42.5 | 42.4 | 42.7 |
| AP^box | 480∼800, w/ crop | 42.7 | – | – | 43.9 |
| AP^mask | 640∼800 | 38.6 | 38.2 | 38.1 | 38.3 |
| AP^mask | 480∼800, w/ crop | 38.6 | – | – | 39.4 |

Table 8: Object detection and instance segmentation performance of Mask R-CNN with a ResNet-101-FPN backbone using training schedules from 3× (270k iterations) to 6× (540k iterations) and different data augmentations on COCO val. The numbers under "Aug." indicate the scale range of the shorter side of inputs with a stride of 32.

This indicates that the proposed parallel DynConv^mask enables adequate mask information flow propagating across queries in different stages for high quality mask generation. Therefore, establishing an explicit mask feature flow as in HTC is redundant and harmful to model efficiency. In consideration of the speed-accuracy trade-off, we choose Fig. 2 (c) as the default instantiation of QueryInst.

5. Conclusion

In this paper, we propose an efficient query based end-to-end instance segmentation framework, QueryInst, driven by parallel supervision on dynamic mask heads. To our knowledge, QueryInst is the first query based instance segmentation method that outperforms previous state-of-the-art non-query based instance segmentation approaches. Extensive studies show that parallel mask supervision brings a large performance improvement without any drop in inference speed, while the dynamic mask head with both shared query and MSA naturally joins the two sub-tasks of detection and segmentation. We hope this work can strengthen the understanding of query based frameworks and facilitate future research.


Figure 6: Object detection and instance segmentation qualitative results on COCO val split.

Appendix

Illustration of 4 Different Query and MSA Configurations

We study the impact of using shared query and shared MSA in the paper. Fig. 5 gives an illustration for this ablation study: Fig. 5 from left to right corresponds to the configurations in Tab. 6 from top to bottom. Using the shared query and shared MSA configuration, QueryInst achieves the best performance in terms of both box AP and mask AP with the least additional overhead.

Training Time of QueryInst

Here, we compare the training time of QueryInst with the state-of-the-art instance segmentation method HTC [9]. As shown in Tab. 7, under the same experimental configuration, QueryInst outperforms HTC while using less training time.

Effects of Different Training Schedules and Data Augmentations on the Mask R-CNN Family

In Tab. 1, we find that for Cascade Mask R-CNN and HTC, stronger data augmentation cannot bring significant improvements under the 3× training schedule. Here we give a detailed experimental study in Tab. 8.

We choose Mask R-CNN with a ResNet-101-FPN backbone as a representative. Using modest data augmentation (640∼800), the 3× schedule works well for the model to converge to near optimum, and the performance begins to degenerate under longer training schedules. These findings are consistent with [20].

Using stronger data augmentation (480∼800, w/ crop) cannot bring further improvements under the 3× schedule, but can boost the performance as the training time becomes longer.

Compared with the Mask R-CNN family (Mask R-CNN, Cascade Mask R-CNN & HTC), the proposed QueryInst can benefit from stronger data augmentation even under the 3× schedule.


Figure 7: Object detection and instance segmentation qualitative results on Cityscapes test split.

We will conduct further study of these properties in the future.

Qualitative Results on COCO

We provide some qualitative results on the COCO [29] val split in Fig. 6.

Qualitative Results on Cityscapes

Fig. 7 gives some qualitative results on the Cityscapes [12] test split.

Additional Visualization of Dynamic Mask Feature

In Fig. 8, we provide more visualization results to study the effects of DynConv^mask. The first row shows mask features x^mask directly extracted from the FPN. The second row shows mask features x^mask* enhanced by queries in DynConv^mask. The last row shows the ground-truth instance masks.


Figure 8: More visualization results on the study of DynConvmask.


References

[1] Ali Athar, S. Mahadevan, Aljosa Osep, L. Leal-Taixe, and B. Leibe. STEm-Seg: Spatio-temporal embeddings for instance segmentation in videos. In ECCV, 2020.
[2] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT++: Better real-time instance segmentation. arXiv preprint, 2019.
[3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In ICCV, 2019.
[4] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: Delving into high quality object detection. In CVPR, 2018.
[5] Zhaowei Cai and Nuno Vasconcelos. Cascade R-CNN: High quality object detection and instance segmentation. TPAMI, 2019.
[6] Jiale Cao, Rao Muhammad Anwer, Hisham Cholakkal, Fahad Shahbaz Khan, Yanwei Pang, and Ling Shao. SipMask: Spatial information preservation for fast image and video instance segmentation. In ECCV, 2020.
[7] Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to-end object detection with transformers. In ECCV, 2020.
[8] William Chan, Chitwan Saharia, Geoffrey Hinton, Mohammad Norouzi, and Navdeep Jaitly. Imputer: Sequence modelling via imputation and dynamic programming. In ICML, 2020.
[9] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[10] Kai Chen, Jiaqi Wang, Jiangmiao Pang, Yuhang Cao, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jiarui Xu, Zheng Zhang, Dazhi Cheng, Chenchen Zhu, Tianheng Cheng, Qijie Zhao, Buyu Li, Xin Lu, Rui Zhu, Yue Wu, Jifeng Dai, Jingdong Wang, Jianping Shi, Wanli Ouyang, Chen Change Loy, and Dahua Lin. MMDetection: Open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155, 2019.
[11] Mingfei Chen, Yue Liao, Si Liu, Zhiyuan Chen, Fei Wang, and Chen Qian. Reformulating HOI detection as adaptive set prediction. arXiv preprint arXiv:2103.05983, 2021.
[12] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, M. Enzweiler, Rodrigo Benenson, Uwe Franke, S. Roth, and B. Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[13] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. In ICCV, 2017.
[14] Zhigang Dai, Bolun Cai, Yugeng Lin, and Junying Chen. UP-DETR: Unsupervised pre-training for object detection with transformers. arXiv preprint arXiv:2011.09094, 2020.
[15] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
[16] Yang Fu, Linjie Yang, Ding Liu, Thomas S. Huang, and Humphrey Shi. CompFeat: Comprehensive feature aggregation for video instance segmentation. arXiv preprint arXiv:2012.03400, 2020.
[17] Peng Gao, Minghang Zheng, Xiaogang Wang, Jifeng Dai, and Hongsheng Li. Fast convergence of DETR with spatially modulated co-attention. arXiv preprint arXiv:2101.07448, 2021.
[18] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[19] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR, 2014.
[20] Kaiming He, Ross Girshick, and Piotr Dollar. Rethinking ImageNet pre-training. In ICCV, 2019.
[21] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross B. Girshick. Mask R-CNN. In ICCV, 2017.
[22] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. TPAMI, 2015.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[24] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In CVPR, 2019.
[25] Xu Jia, Bert De Brabandere, Tinne Tuytelaars, and Luc Van Gool. Dynamic filter networks. In NeurIPS, 2016.
[26] Ha Young Kim and Ba Rom Kang. Instance segmentation and object detection with bounding shape masks. arXiv preprint arXiv:1810.10327, 2018.
[27] Tsung-Yi Lin, Piotr Dollar, Ross B. Girshick, Kaiming He, Bharath Hariharan, and Serge J. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[28] Tsung-Yi Lin, Priya Goyal, Ross B. Girshick, Kaiming He, and Piotr Dollar. Focal loss for dense object detection. In ICCV, 2017.
[29] Tsung-Yi Lin, Michael Maire, Serge J. Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[30] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[31] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott E. Reed, Cheng-Yang Fu, and Alexander C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[32] Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin Transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030, 2021.
[33] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
[34] Xin Lu, Buyu Li, Yuxin Yue, Quanquan Li, and J. Yan. Grid R-CNN. In CVPR, 2019.
[35] Tim Meinhardt, Alexander Kirillov, Laura Leal-Taixe, and Christoph Feichtenhofer. TrackFormer: Multi-object tracking with transformers. arXiv preprint arXiv:2101.02702, 2021.
[36] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In 3DV, 2016.
[37] Aaron Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George Driessche, Edward Lockhart, Luis Cobo, Florian Stimberg, et al. Parallel WaveNet: Fast high-fidelity speech synthesis. In ICML, 2018.
[38] Jiangmiao Pang, Kai Chen, Jianping Shi, Huajun Feng, Wanli Ouyang, and Dahua Lin. Libra R-CNN: Towards balanced learning for object detection. In CVPR, 2019.
[39] Shaoqing Ren, Kaiming He, Ross B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. TPAMI, 2015.
[40] Peize Sun, Yi Jiang, Enze Xie, Zehuan Yuan, Changhu Wang, and Ping Luo. OneNet: Towards end-to-end one-stage object detection. arXiv preprint arXiv:2012.05780, 2020.
[41] Peize Sun, Yi Jiang, Rufeng Zhang, Enze Xie, Jinkun Cao, Xinting Hu, Tao Kong, Zehuan Yuan, Changhu Wang, and Ping Luo. TransTrack: Multiple-object tracking with transformer. arXiv preprint arXiv:2012.15460, 2020.
[42] Peize Sun, Rufeng Zhang, Yi Jiang, Tao Kong, Chenfeng Xu, Wei Zhan, Masayoshi Tomizuka, Lei Li, Zehuan Yuan, Changhu Wang, et al. Sparse R-CNN: End-to-end object detection with learnable proposals. arXiv preprint arXiv:2011.12450, 2020.
[43] Zhiqing Sun, Shengcao Cao, Yiming Yang, and Kris Kitani. Rethinking transformer-based set prediction for object detection. arXiv preprint arXiv:2011.10881, 2020.
[44] Zhi Tian, Chunhua Shen, and Hao Chen. Conditional convolutions for instance segmentation. In ECCV, 2020.
[45] Zhi Tian, Chunhua Shen, H. Chen, and Tong He. FCOS: A simple and strong anchor-free object detector. TPAMI, 2020.
[46] Zhi Tian, Bowen Zhang, Hao Chen, and Chunhua Shen. Instance and panoptic segmentation using conditional convolutions. arXiv preprint arXiv:2102.03026, 2021.
[47] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. arXiv preprint arXiv:1706.03762, 2017.
[48] Thang Vu, Kang Haeyong, and Chang D. Yoo. SCNet: Training inference sample consistency for instance segmentation. In AAAI, 2021.
[49] Huiyu Wang, Yukun Zhu, Hartwig Adam, Alan Yuille, and Liang-Chieh Chen. MaX-DeepLab: End-to-end panoptic segmentation with mask transformers. arXiv preprint arXiv:2012.00759, 2020.
[50] Jianfeng Wang, Lin Song, Zeming Li, Hongbin Sun, Jian Sun, and Nanning Zheng. End-to-end object detection with fully convolutional network. arXiv preprint arXiv:2012.03544, 2020.
[51] Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. SOLO: Segmenting objects by locations. In ECCV, 2020.
[52] Xinlong Wang, Rufeng Zhang, Tao Kong, Lei Li, and Chunhua Shen. SOLOv2: Dynamic, faster and stronger. arXiv preprint arXiv:2003.10152, 2020.
[53] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. arXiv preprint arXiv:2011.14503, 2020.
[54] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
[55] Saining Xie, Ross B. Girshick, Piotr Dollar, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
[56] Yuwen Xiong, Renjie Liao, Hengshuang Zhao, Rui Hu, Min Bai, Ersin Yumer, and Raquel Urtasun. UPSNet: A unified panoptic segmentation network. In CVPR, 2019.
[57] Linjie Yang, Yuchen Fan, and Ning Xu. Video instance segmentation. In ICCV, 2019.
[58] Minghang Zheng, Peng Gao, Xiaogang Wang, Hongsheng Li, and Hao Dong. End-to-end object detection with adaptive clustering transformer. arXiv preprint arXiv:2011.09315, 2020.
[59] Xingyi Zhou, Vladlen Koltun, and Philipp Krahenbuhl. Probabilistic two-stage detection. arXiv preprint arXiv:2103.07461, 2021.
[60] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[61] Xizhou Zhu, Han Hu, Stephen Lin, and Jifeng Dai. Deformable ConvNets v2: More deformable, better results. In CVPR, 2019.
[62] X. Zhu, Weijie Su, Lewei Lu, Bin Li, X. Wang, and Jifeng Dai. Deformable DETR: Deformable transformers for end-to-end object detection. arXiv preprint arXiv:2010.04159, 2020.