
GPS: GROUP PEOPLE SEGMENTATION WITH DETAILED PART INFERENCE

Yue Liao1,2, Si Liu3,*, Tianrui Hui1,2, Chen Gao1,2, Yao Sun1, Hefei Ling4, Bo Li3

1Institute of Information Engineering, Chinese Academy of Sciences; 2School of Cyber Security, University of Chinese Academy of Sciences

3School of Computer Science and Engineering, Beihang University; 4School of Computer Science and Technology, Huazhong University of Science and Technology
{liaoyue,huitianrui,gaochen,sunyao}@iie.ac.cn, {liusi,boli}@buaa.edu.cn, [email protected]

ABSTRACT

Noticeable progress has been witnessed in general object detection, semantic segmentation, and instance segmentation, while parsing a group of people is still a challenging task for human-centric visual understanding due to severe occlusion and various poses. In this paper, we present a new large-scale dataset named "GPS (Group People Segmentation)" to boost academic research and technology development. GPS contains 14,000 elaborately annotated images with 20 fine-grained semantic category labels related to humans, divided into two sub-datasets corresponding to indoor and outdoor scenes involving various poses, occlusions, and backgrounds. We further propose a novel GPSNet for group people segmentation. GPSNet contains a new "Adjusted RoI Align" module that adjusts the position of each detected person and aligns RoI features, so that the network does not need to fit the various positions of each person. A fusion of global and local features is also employed to refine parsing results. Compared with baseline methods, GPSNet achieves the best performance on the GPS Dataset.

Index Terms— Semantic Segmentation, Human Parsing, Detection, Instance Segmentation

1. INTRODUCTION

In recent years, human parsing [1, 2, 3, 4, 5] has been receiving increasing attention owing to its wide applications, such as person re-identification [6] and person attribute prediction [7, 8, 9]. Most existing human parsing methods [5, 1, 10, 11] only focus on segmenting the parts of single persons. However, photos with groups of people are more common in our daily life, and few works address the Group People Segmentation (GPS) task, which needs to parse both instances and human parts at the same time.

The GPS task has been rarely studied due to two main reasons. First, there are few large-scale group people datasets

This work was supported by the Natural Science Foundation of China (Grant U1536203, Grant 61572493, Grant 61876177).

* Corresponding author.

Fig. 1. The GPS Dataset. Rows: Indoor and Outdoor scenes; columns: images, instance segmentation, and part segmentation.

with fine annotations available publicly. Second, the GPS task needs more efficient methods than those used for single person part segmentation and instance segmentation, because more mutual occlusions occur and more details of persons need to be understood.

To boost the study of the GPS task, we collected and annotated a large number of photos that contain groups of people. Figure 1 shows two collected images and their annotations. The proposed dataset is named the GPS Dataset. Because the lighting and backgrounds of images in indoor scenes differ greatly from those in outdoor scenes, the dataset is divided into two sub-datasets, the Indoor Dataset and the Outdoor Dataset.

A straightforward way to address the GPS task is to detect single persons first and then parse the body parts of each instance. But this method suffers from two main drawbacks. First, due to mutual occlusions, the major person in each detected bounding box is usually not easy to determine, since background persons may exist in the same bounding box. This often confuses the neural networks and hence leads to poor segmentation accuracy. Second, the detected bounding boxes only contain local information but lack the global context of the whole image. This usually causes the networks to mistake the ownership of human parts or misclassify their categories.

To alleviate these two drawbacks, we propose a novel convolutional neural network, named GPSNet, which has four main sub-networks: Feature Extraction Network, Adjusted


Single Person Parsing Network, Global Parsing Network, and Fusion Network. The Feature Extraction Network extracts features of the input images. The Adjusted Single Person Parsing Network parses the human parts of each single person, based on the Mask R-CNN framework [12]. RoIs are first computed by a structure similar to an RPN. Then the proposed Adjusted RoI Align module aligns the features of each RoI to a fixed size (e.g., 41 × 41 pixels). The module also adjusts the features so that those mapping to the region of the major person always lie in the middle of the aligned results. Next, an FCN [13] based parser is attached to obtain the segmentation of the human parts. The Global Parsing Network simultaneously segments the whole image for global context. Finally, the Fusion Network fuses the features obtained from the Adjusted Single Person Parsing Network and the Global Parsing Network to predict the final results.
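The data flow through these four sub-networks can be sketched as follows. Every argument after `image` is a hypothetical placeholder standing in for one of the components named above, not the authors' implementation.

```python
def gpsnet_forward(image, backbone, rpn, global_parser,
                   adjusted_roi_align, local_parser, fusion):
    """High-level data flow of GPSNet; all callables are placeholders."""
    feats = backbone(image)                # Feature Extraction Network
    global_map = global_parser(feats)      # Global Parsing Network
    results = []
    for person_roi, face_rois in rpn(feats):
        # Adjusted Single Person Parsing Network: adjust + align the RoI
        # to a fixed size (e.g. 41 x 41), then parse the major person.
        roi_feats = adjusted_roi_align(feats, person_roi, face_rois)
        local_map = local_parser(roi_feats)
        # Fusion Network: combine local and global predictions.
        results.append(fusion(local_map, global_map, person_roi))
    return results
```

This only fixes the wiring between the sub-networks; the individual components are described in Section 4.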

The contributions of this paper are summarized as follows. (1) We collected and annotated a large number of images containing groups of people to build the GPS Dataset. It can serve as a benchmark for the GPS task and will be released soon for academic use. (2) The proposed GPSNet achieves the best performance on the GPS Dataset compared with two baseline methods. A novel Adjusted RoI Align module is also proposed to adjust and align the features of detected persons to improve performance. All components of the introduced method can be learned jointly in an end-to-end manner.

Dataset           #Images  #Categories  Instance
PASCAL-Part [14]    3,533            7         ✗
ATR [1]            17,700           18         ✗
LIP [3]            50,462           20         ✗
MHP [15]            4,980           19         ✓
GPS-Indoor          7,500           20         ✓
GPS-Outdoor         6,500           20         ✓

Table 1. Statistics for publicly available human parsing datasets.

2. RELATED WORKS

Human Parsing Datasets: The statistics of popular publicly available datasets for human parsing are summarized in Table 1. Datasets other than MHP contain only human part annotations without human instance annotations. The MHP dataset annotates both but has a limited number of images. Our GPS Dataset includes 14,000 images divided into indoor and outdoor sub-datasets with elaborately labeled regions and finer boundaries.

Human Parsing Approaches: Liang et al. [1] propose a Contextualized Convolutional Neural Network, which integrates multiple levels of image context into a unified network, to tackle the problem and achieve very impressive results. Xia et al. [16] propose the "Auto-Zoom Net" for human parsing, which adapts to the local scales of objects and parts. Gong et al. [3] propose a self-supervised structure-sensitive learning framework for human parsing, which is capable of explicitly enforcing consistency between the parsing results and the human joint structures. Other works explore how to jointly segment objects and parts using deep learned potentials [17]. Li et al. [15] propose a novel MH-Parser model as a reference method which generates parsing maps and instance masks simultaneously in a bottom-up fashion. Despite their great success, these methods solve the human parsing problem with the assistance of pose information or edge priors between human instances, while our solution combines local and global segmentation cues to refine the final results without using additional annotations.

3. GROUP PEOPLE SEGMENTATION BENCHMARK

The newly collected GPS Dataset contains two sub-datasets, namely the Indoor Dataset and the Outdoor Dataset. All images are taken from real-world scenarios and appear in a wide range of resolutions with challenging poses, various viewpoints, and heavy occlusions. Some examples are shown in Fig. 1.

The Indoor Dataset contains 7,500 images sampled from indoor scenes, such as cafes, living rooms, etc. Each image includes at least two persons, and the average number of persons per image is 5.23. Every foreground person in this dataset is annotated with a segmentation mask using 20 categories of semantic labels. The Outdoor Dataset consists of 6,500 images with an average of 3.54 persons per image. These images contain the same categories as the Indoor Dataset except for the category "tie" (19 in total) and are captured in outdoor scenes such as playgrounds, beaches, etc.

Fig. 2. The number of persons per image.

3.1. Image collection and annotation methodology

With pre-chosen keywords describing specific scenes (e.g., playground, restaurant) and relationships between people (e.g., group, crowd, team), we crawl a large number of images from the Internet and carry out some preliminary preprocessing such as filtering.

After the above step of curating images, the next step is manual annotation executed by professional data annotators, which comprises two distinct tasks. The first task is manually counting the number of foreground persons and duplicating each image into that number of copies. Once the first task is done, the second follows to specify the fine-grained pixel-wise labels for each instance. The semantic segmentation of the instances is annotated in order from left to right, in line with the duplicated images indexed from beginning to end. For each instance, the following semantic categories are defined and annotated: "hat", "hair", "sun glasses", "upper clothes", "skirt", "pants", "dress", "belt", "left shoe", "right shoe", "face", "left leg", "right leg", "left arm", "right arm", "bag", "scarf", "socks", and "tie" (not included in the Outdoor Dataset).

Fig. 3. The number of appearances of each human part.

Each instance receives a complete set of annotations for every relevant category appearing in the current image. When one instance is annotated, all others are treated as background. Consequently, the resulting annotation set for each image contains N person-level segmentation masks, where N is the number of persons in the image.
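Concretely, the per-image annotation set described above can be represented as N part-label maps, one per foreground person. The layout below is a hypothetical sketch of such a structure, using the 19 instance categories listed in Section 3.1 plus background; it is not the dataset's actual file format.

```python
import numpy as np

# Hypothetical label set: background plus the 19 annotated categories
# (the Outdoor Dataset drops "tie", leaving 19 labels in total there).
PART_LABELS = ["background", "hat", "hair", "sun glasses", "upper clothes",
               "skirt", "pants", "dress", "belt", "left shoe", "right shoe",
               "face", "left leg", "right leg", "left arm", "right arm",
               "bag", "scarf", "socks", "tie"]

def instance_masks(person_part_maps):
    """Collapse per-person part-label maps (H x W integer arrays, 0 for
    pixels outside that person) into binary instance masks."""
    return [(m > 0).astype(np.uint8) for m in person_part_maps]
```

In this layout, person-level instance segmentation and part segmentation are two views of the same N label maps.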

3.2. Dataset splits and statistics

The images in the Indoor Dataset and the Outdoor Dataset are split into train, validation, and test sets with a ratio of 3:1:1. Specifically, the numbers of images for training/validation/testing are 4,500/1,500/1,500 and 3,900/1,300/1,300 in the Indoor Dataset and the Outdoor Dataset, respectively.

To analyze each detailed region of the human body more precisely, we define 14 categories of clothing and 6 body parts. Besides common clothes like upper clothes and pants, some infrequent categories such as skirts and jumpsuits are annotated as well. Furthermore, small-scale accessories are taken into consideration, including sunglasses, gloves, ties, and socks.

The statistics of the number of persons per image are shown in Figure 2, and Figure 3 illustrates the number of appearances of each human part. These statistics show that the GPS Dataset offers plentiful and sophisticated multi-human scenarios with high annotation quality, which provides sufficient difficulty for human parsing methods.

4. METHOD

We propose a new convolutional neural network, called GPSNet, for the Group People Segmentation task. It consists of four sub-networks, i.e., the Feature Extraction Network, the Adjusted Single Person Parsing Network, the Global Parsing Network, and the Fusion Network. The Feature Extraction Network and the Global Parsing Network exploit DeepLab v2 to extract features and generate human part parsing results without distinguishing instances. In the following subsections, we describe the details of the Adjusted Single Person Parsing Network and the Fusion Network.

4.1. The Adjusted Single Person Parsing Network

We notice that the person instances cropped by the predicted RoIs are located at various positions within those RoIs, which means the neural network has to be more powerful to parse humans at various positions. This may lower the accuracy of the predictions.

To improve the accuracy, we propose a novel Adjusted Single Person Parsing Network, which adjusts the position of the person in the RoI such that the major person of each RoI always appears at almost the same position in the aligned features. This adjustment not only decreases the difficulty of human part segmentation, but also assists the network in locating the major person. Since faces are much easier to detect, we use faces to adjust the positions of persons.

Figure 5 illustrates the procedure of this adjusted RoI alignment. Specifically, a face detector is trained together with the instance detector, so that besides the person RoIs, we also have face RoIs. The major face RoI in a person RoI is taken to be the one with the largest region. From this face RoI, the center of the face is calculated. Next, the person RoI is padded so that the center of the face RoI moves to a fixed relative position within the person RoI. Finally, the RoI Align proposed in Mask R-CNN [12] is used to pool this adjusted RoI into features of fixed size. FCN layers follow the above RoI alignment and predict the pixel-level categories for the adjusted features. In our experiments, the structure of this parser is the same as that of the Global Parsing Network, but the weights are not shared.
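The padding step can be sketched numerically as follows. This is an illustrative reconstruction, not the authors' code: boxes are assumed to be (x1, y1, x2, y2) in pixels, and the target relative position (0.5, 0.15) — face centered horizontally, near the top — is a guessed value, since the paper only states that the face center moves to a fixed relative position.

```python
import numpy as np

def _pad_axis(lo, hi, c, t):
    # Extend [lo, hi] on one side so that (c - lo) / (hi - lo) == t.
    if (c - lo) / (hi - lo) > t:
        hi = lo + (c - lo) / t           # pad the far side
    else:
        lo = (c - t * hi) / (1.0 - t)    # pad the near side
    return lo, hi

def adjust_person_roi(person_roi, face_rois, target_rel=(0.5, 0.15)):
    """Pad a person RoI so the center of its major face RoI lands at a
    fixed relative position inside the adjusted RoI."""
    px1, py1, px2, py2 = person_roi
    # The major face RoI is the one with the largest area.
    areas = [(x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in face_rois]
    fx1, fy1, fx2, fy2 = face_rois[int(np.argmax(areas))]
    cx, cy = (fx1 + fx2) / 2.0, (fy1 + fy2) / 2.0
    nx1, nx2 = _pad_axis(px1, px2, cx, target_rel[0])
    ny1, ny2 = _pad_axis(py1, py2, cy, target_rel[1])
    return nx1, ny1, nx2, ny2
```

The adjusted RoI is then pooled to the fixed 41 × 41 feature size by standard RoI Align; because the adjustment only enlarges the box, no image content is cropped away.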


Fig. 4. The architecture of the proposed GPSNet framework: the Feature Extraction Network (ConvNet + RPN) produces proposals and global features; the Global Parsing Network performs global parsing; the Adjusted Single Person Parsing Network applies the adjusted RoI align, multi-receptive-field local parsing, and an adjust-back step; and the Fusion Network concatenates the global-to-local and local parsing results and fuses them into the final parsing.

Fig. 5. The adjusted RoI align: the face RoI within the person RoI guides padding of the person RoI before RoI Align.

4.2. The Fusion Network

The Fusion Network fuses the localized single person parsing result and the global parsing result. The localized prediction is expected to be more accurate in inferring instances, while the global prediction is expected to be more accurate on detailed parts. The fusion of these two results should therefore lead to better performance.

Before fusing the results, two operations have to be performed. First, the output of the single person parsing network corresponds to the adjusted person RoI, so it must be adjusted back to the real person RoI. Second, a standard RoI alignment is applied to the global parsing results to obtain the corresponding cropped result. After that, the prediction results from both the local and global parsers are concatenated and passed through two successive convolution layers. Finally, the fused result is obtained.
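A minimal numerical sketch of the fusion step, assuming C-channel score maps already aligned to the same RoI grid; the 1 × 1 kernels and ReLU are assumptions, since the paper does not specify the configuration of the two convolution layers:

```python
import numpy as np

def fuse_parsing(local_scores, global_scores, w1, b1, w2, b2):
    """Concatenate local and global (C, H, W) score maps along channels
    and apply two successive 1x1 convolutions (pure channel mixing)."""
    x = np.concatenate([local_scores, global_scores], axis=0)   # (2C, H, W)
    x = np.einsum('oc,chw->ohw', w1, x) + b1[:, None, None]     # conv 1
    x = np.maximum(x, 0.0)                                      # ReLU
    return np.einsum('oc,chw->ohw', w2, x) + b2[:, None, None]  # conv 2
```

Because the convolutions see both predictions at every pixel, the fused output can favor the local map where it is reliable (instance ownership) and the global map elsewhere (part detail).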

5. EXPERIMENTS

5.1. Experimental setting

Dataset: The dataset used in the experiments is the proposed GPS Dataset. We trained our network on both the Indoor Dataset and the Outdoor Dataset.
Evaluation metrics: Motivated by the standard AP metric for evaluating instance-level segmentation, we propose a new evaluation metric, called AP_t, for the GPS Dataset. To calculate the value of AP_t for a given threshold t, we first match the predicted instances with the ground truths. The pixel-level IoU is computed between a ground-truth instance and all predicted ones, with the mean taken over all human part categories. The predicted instance that has the highest mean IoU with the ground truth is regarded as the matched instance. An instance prediction is considered correct if its mean IoU with its matched ground truth is above the threshold t.

We also use the average AP metric to evaluate overall performance: the average AP is the mean of AP_t as t ranges from 0.1 to 0.9 in steps of 0.1.
Baselines: Since there are few relevant studies on the GPS task, we use two conventional pipelines as our baselines. The segmentation backbones of both baselines are based on the DeepLab v2 method, the same as in our network.

1. Detection + parsing (Det.+Par.): This baseline first trains a Faster R-CNN network using the bounding boxes derived from the segmentation annotations. Next, single persons are cropped using the predicted bounding boxes and used to train a single person parser. Finally, the predicted results for single persons are regarded as instances for evaluation.

2. Instance segmentation + global parsing (Ins.+Glo.): This baseline first parses the global images without considering instances. Then, a Mask R-CNN network is trained to predict the human instances. Finally, the corresponding region of an instance on the global parsing result is regarded as the part segmentation of this instance.

Implementation details: All our experiments are conducted in Caffe. The initial learning rate is 0.001, and the "step" policy is used to decrease the learning rate after 50,000 iterations. The batch size during training is 2. We use the DeepLab v2 framework as our global parser and share the output feature maps of conv5_3, whose spatial dimensions are 1/8 of those of the input images, for dense prediction. We use a Faster R-CNN to generate the RoIs and implement a new layer to perform the adjustment and alignment of the RoIs. On the GPS Dataset, the output size after alignment is set to 41 × 41. Experimental results show that this size is dense enough for part segmentation while not costing too much memory.
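The "step" learning rate policy mentioned above follows Caffe's rule lr = base_lr · gamma^floor(iter / stepsize). A sketch with gamma = 0.1, a common Caffe default that the paper does not state explicitly:

```python
def step_lr(base_lr, iteration, stepsize=50000, gamma=0.1):
    # Caffe "step" policy: the rate is multiplied by gamma every
    # `stepsize` iterations.
    return base_lr * gamma ** (iteration // stepsize)
```

With base_lr = 0.001 and stepsize = 50,000, the rate drops to 0.0001 at iteration 50,000, matching the schedule described above.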

5.2. Comparisons with baselines

We compare the proposed method with the two baseline methods on both the Indoor Dataset and the Outdoor Dataset. Quantitative comparisons are shown in Tables 2 and 3. In both tables, "Det.+Par." and "Ins.+Glo." are the two baseline methods. "Loc." refers to the ablated version of GPSNet that contains only the Feature Extraction Network and the Single Person Parsing Network without adjustments, while "Ad. Loc." refers to the "Loc." version with adjustments. Similarly, "Fuse" denotes GPSNet without adjustments, while "Ad. Fuse" is exactly GPSNet.

Methods    AP0.5  AP0.6  AP0.7  AP0.8  avg. AP
Det.+Par.  36.99  32.05  24.16   9.42    39.74
Ins.+Glo.  39.99  32.84  25.62  11.23    41.90
Loc.       41.05  32.79  23.25  10.84    41.65
Ad. Loc.   42.16  33.29  23.61  10.91    42.58
Fuse       44.11  36.19  27.78  13.85    44.21
Ad. Fuse   45.86  37.24  27.65  14.31    45.97

Table 2. Comparison with baselines on the Indoor Dataset (%).

Methods    AP0.5  AP0.6  AP0.7  AP0.8  avg. AP
Det.+Par.  70.91  62.13  55.21  17.25    63.27
Ins.+Glo.  72.01  63.11  56.03  18.23    65.03
Loc.       71.43  62.48  53.33  17.44    63.97
Ad. Loc.   73.44  63.22  55.87  18.14    65.28
Fuse       75.33  64.84  57.68  19.22    66.74
Ad. Fuse   77.56  66.24  58.43  19.14    68.05

Table 3. Comparison with baselines on the Outdoor Dataset (%).

GPSNet performs much better than the two baselines on both datasets, for three reasons. First, GPSNet is trained end-to-end, so the sub-networks improve each other during training. Second, the adjusted RoI alignment moves the persons to fixed positions, so the single person parser has more capacity to predict the human parts accurately. This can be seen from the comparisons of "Loc." vs. "Ad. Loc." and "Fuse" vs. "Ad. Fuse". Third, the fusion of global and local parsing results also improves the final results, as shown by the comparisons of "Loc." vs. "Fuse" and "Ad. Loc." vs. "Ad. Fuse".

Fig. 6. Component analysis of the adjusted RoI alignment: detected person, adjusted person, ground truth, result without adjustment, and result with adjustment.

Fig. 7. Component analysis of the fusion of local and global results: detected person, ground truth, and local, global, and fusion results.

5.3. Component Analysis

There are two important components in GPSNet: the adjusted RoI alignment and the fusion of local and global results. Tables 2 and 3 show the improvements from these two components through quantitative comparisons, and Figures 6 and 7 present some qualitative demonstrations.

In Figure 6, to better illustrate the effect of the adjustments, we apply the adjustment to the detected images. After the adjustments, the faces of the detected persons move to the upper centers of the boxes. The predictions for the adjusted persons are hence better, since the network does not need to fit the various positions of the persons.

Figure 7 shows the improvements brought by the fusion. The local and global results are each imperfect, but the fused results combine the advantages of both predictions and are much better.

5.4. Qualitative Results

Several qualitative comparison results are shown in Figure 8. GPSNet clearly outperforms the two baselines in these cases. The "Det.+Par." method is usually able to find all instances, but its detailed part parsing is not very good due to occlusions. The "Ins.+Glo." method often misses instances, which also lowers its accuracy.


Fig. 8. Qualitative results on the Indoor Dataset and the Outdoor Dataset. Columns: image, Det.+Par., Ins.+Glo., GPSNet, and ground truth.

6. CONCLUSIONS

To boost the study of the Group People Segmentation task, we collected and annotated a large dataset, named the GPS Dataset. We also proposed a new framework, called GPSNet, to address the GPS task. Experiments show that this new network achieves better performance. The GPS Dataset as well as the code of GPSNet will be released for academic research soon.

7. REFERENCES

[1] Xiaodan Liang, Chunyan Xu, Xiaohui Shen, Jianchao Yang, Si Liu, Jinhui Tang, Liang Lin, and Shuicheng Yan, "Human parsing with contextualized convolutional neural network," in ICCV, 2015.

[2] Jian Zhao, Jianshu Li, Yu Cheng, Li Zhou, Terence Sim, Shuicheng Yan, and Jiashi Feng, "Understanding humans in crowded scenes: Deep nested adversarial learning and a new benchmark for multi-human parsing," arXiv, 2018.

[3] Ke Gong, Xiaodan Liang, Dongyu Zhang, Xiaohui Shen, and Liang Lin, "Look into person: Self-supervised structure-sensitive learning and a new benchmark for human parsing," in CVPR, 2017.

[4] Si Liu, Xiaodan Liang, Luoqi Liu, Xiaohui Shen, Jianchao Yang, Changsheng Xu, Liang Lin, Xiaochun Cao, and Shuicheng Yan, "Matching-cnn meets knn: Quasi-parametric human parsing," in CVPR, 2015.

[5] Xiaodan Liang, Si Liu, Xiaohui Shen, Jianchao Yang, Luoqi Liu, Liang Lin, and Shuicheng Yan, "Deep human parsing with active template regression," TPAMI, 2015.

[6] Rui Zhao, Wanli Ouyang, and Xiaogang Wang, "Learning mid-level filters for person re-identification," in CVPR, 2014.

[7] Si Liu, Zheng Song, Guangcan Liu, Shuicheng Yan, Changsheng Xu, and Hanqing Lu, "Street-to-shop: Cross-scenario clothing retrieval via parts alignment and auxiliary set," in CVPR, 2012.

[8] Hanwang Zhang, Zheng-Jun Zha, Yang Yang, Shuicheng Yan, Yue Gao, and Tat-Seng Chua, "Attribute-augmented semantic hierarchy: towards bridging semantic gap and intention gap in image retrieval," in MM, 2013.

[9] Si Liu, Jiashi Feng, Zheng Song, Tianzhu Zhang, Hanqing Lu, Changsheng Xu, and Shuicheng Yan, "Hi, magic closet, tell me what to wear!," in MM, 2012.

[10] Kota Yamaguchi, M. Hadi Kiapour, and Tamara L. Berg, "Paper doll parsing: Retrieving similar styles to parse clothing items," in ICCV, 2013.

[11] Kota Yamaguchi, M. Hadi Kiapour, Luis E. Ortiz, and Tamara L. Berg, "Parsing clothing in fashion photographs," in CVPR, 2012.

[12] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick, "Mask r-cnn," in ICCV, 2017.

[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015.

[14] Xianjie Chen, Roozbeh Mottaghi, Xiaobai Liu, Sanja Fidler, Raquel Urtasun, and Alan Yuille, "Detect what you can: Detecting and representing objects using holistic models and body parts," in CVPR, 2014.

[15] Jianshu Li, Jian Zhao, Yunchao Wei, Congyan Lang, Yidong Li, Terence Sim, Shuicheng Yan, and Jiashi Feng, "Multiple-human parsing in the wild," arXiv, 2017.

[16] Fangting Xia, Peng Wang, Liang-Chieh Chen, and Alan L. Yuille, "Zoom better to see clearer: Human and object parsing with hierarchical auto-zoom net," in ECCV, 2016.

[17] Peng Wang, Xiaohui Shen, Zhe Lin, Scott Cohen, Brian Price, and Alan L. Yuille, "Joint object and part segmentation using deep learned potentials," in ICCV, 2015.