
LOMO3D DESCRIPTOR FOR VIDEO-BASED PERSON RE-IDENTIFICATION

Sutong Zheng⋆    Xiaoyu Li†    Zhuqing Jiang⋆    Xiaoqiang Guo†

⋆ Beijing University of Posts and Telecommunications, Beijing, China
{zhengst, jiangzhuqing}@bupt.edu.cn

† Academy of Broadcasting Science, Beijing, China
{lixiaoyu, guoxiaoqiang}@abs.ac.cn

ABSTRACT

Person re-identification methods aim to match identical pedestrians across non-overlapping camera views. In this task, a robust and distinctive feature is of great value, especially for complex scenes affected by occlusion, illumination variance, and background clutter. Existing descriptors focus on the spatial information of images, so their capacity is limited on video-based data. Video-based person re-identification offers more potential information but is also more challenging. To this end, this paper proposes a feature representation named LOMO3D that extracts both spatial and temporal information. This feature focuses on the maximal horizontal occurrence of local features in both the spatial and temporal dimensions. Experiments on the iLIDS-VID and PRID2011 datasets affirm the strong representation capability of this new descriptor compared with state-of-the-art hand-crafted features.

Index Terms— person re-identification, LOMO, temporal features, maximum occurrence

1. INTRODUCTION

The growing number of surveillance cameras in recent years has brought rapid growth in surveillance data. Specific information needs to be extracted from these data for various tasks, such as face recognition, object detection, object tracking, and person re-identification (Re-ID). The task of Re-ID is to match identical persons across non-overlapping camera views. There are onerous challenges in this task, as shown in Fig. 1. First of all, the target scene can be complex because of occlusion, illumination variance, background clutter, and the cameras' color cast. Secondly, factors such as similarity between people in gait, dressing style, and shape blur the distinction between different persons. Other factors, like variance in people's pose and viewing angle, enlarge the intra-class variation. Furthermore, most video frames are low in resolution, which makes it unworkable to use techniques based on details, such as face recognition.

Thanks to the National Science Foundation of China (61671077, 61671264) for funding.

Fig. 1. Person re-identification challenges (left to right): background clutter, occlusion, illumination variation, similar dressing style, different pose, low resolution.

Existing algorithms for this problem can be roughly classified into two types according to their strategies. Most earlier approaches extract features from given images for a distinctive description of persons and choose metric learning methods to find a proper feature space in which the distance between identical samples is small and the distance between non-identical samples is large [1, 2, 3, 4, 5, 6]. A great deal of research has been devoted to optimizing these two branches, and the resulting algorithms achieve high matching rates. On the other hand, deeply-learned models have become popular in computer vision in recent years. The second type of approach is based on deep neural networks (DNNs) [7, 8, 9], which have shown strong capability in detection, tracking, and segmentation tasks. By constructing a Siamese network, a DNN-based algorithm can achieve feature extraction and metric learning simultaneously: the parameters of the two branches are shared, and the distance between samples is output from the top layer, as the sketch below illustrates.
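As a rough illustration of this Siamese setup, here is a minimal PyTorch sketch; the backbone, embedding size, and class name are our own placeholders, not the architectures of [7, 8, 9].

```python
import torch
import torch.nn as nn

class SiameseReID(nn.Module):
    """Minimal Siamese sketch: one shared encoder embeds both inputs,
    and the distance between embeddings is output from the top layer."""

    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Placeholder backbone; real Re-ID networks [7, 8, 9] are deeper.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, embed_dim),
        )

    def forward(self, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
        # Shared parameters: the same encoder processes both camera views.
        feat_a = self.encoder(img_a)
        feat_b = self.encoder(img_b)
        return torch.norm(feat_a - feat_b, dim=1)  # small for same identity
```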

According to the data's organization form, Re-ID algorithms can also be classified into single-shot and multi-shot tasks. In a single-shot task, there is just one image for each person from each camera. In a multi-shot task, data for pedestrians is provided in the form of image sequences. Early research focused



Fig. 2. Extension of the SILTP descriptor to the time domain. (a) The original SILTP pattern, calculated from neighboring pixels in the same frame. (b) The SILTP3D pattern, calculated from neighboring pixels not only in the same frame but also in the previous and the succeeding frame.

on single-shot tasks, and features were mainly taken from single images. Nevertheless, real surveillance data comes in the form of videos or image sequences, which tend to contain more information than single images.

In this paper, we extend the LOMO descriptor [4] to the time domain by extracting a temporal Scale Invariant Local Ternary Pattern (SILTP) descriptor [10] and introducing cuboid blocks in the histogram calculation. Experiments on public datasets affirm the strong representation capability of the proposed descriptor compared with state-of-the-art hand-crafted features. The rest of this paper is organized as follows. In Section 2 we discuss work related to this paper. In Section 3 the proposed approach is described in detail. Section 4 contains details of the experiments as well as an evaluation of the results. Finally, conclusions are drawn in Section 5.

2. RELATED WORK

In recent years, new achievements have been made in Re-ID. A number of studies work on new descriptors for person appearance [11, 12, 5, 13]. Feature representation techniques are commonly used in many computer vision tasks. According to the content described, low-level features fall into several categories: color features [14, 15], texture features, shape features, and so on. The color histogram is the most commonly used way to express color features, with the advantage of being unaffected by rotation and translation of the image. Commonly used texture features include Local Binary Pattern (LBP) [16], Local Ternary Pattern (LTP) [17], and SILTP [10]. In general, there are two kinds of shape features: contour features and regional features. The contour of an image is mainly aimed at the outer boundary of the object, while the regional feature relates to the whole shape region. These coarse descriptors can only describe a single aspect, so they are often used together to achieve better performance.

Besides the basic descriptors, some more advanced features have been proposed in the Re-ID area. The Symmetry-Driven Accumulation of Local Features (SDALF) [11] is an appearance-based descriptor that models the image from the global color content, the spatial relationship of colors, and recurrent highly-structured patches of different human body parts. Lisanti et al. [12] proposed weighted histograms of overlapping stripes (WHOS), which approximately segment foreground from background by introducing weights related to the position of feature points in a graph. The Ensemble of Localized Features (ELF6) [5] contains 8 color channels and 21 texture channels, which are extracted from 6 horizontal stripes. Matsukawa et al. [13] proposed a hierarchical Gaussian model based on a hierarchical distribution of pixel features and patch features.

3. OUR APPROACH

3.1. LOMO Descriptor Revisit

Liao et al. [4] proposed the Local Maximal Occurrence (LOMO) feature, which represents the horizontal maximal occurrence of local features. First, Retinex preprocessing [18] is applied to make images closer to the real perception of human eyes. Then the picture is resized to 128×48 and divided into overlapping sliding square blocks of size 10×10 with the overlap rate set to 0.5. HSV histograms [14] and SILTP histograms [10] are extracted from each block at three scales. Finally, the color descriptor and texture descriptor are integrated by maximizing the histograms over blocks at the same horizontal location, as in the sketch below.
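To make this concrete, the following numpy sketch shows the horizontal maximal-occurrence step for the color channel only, under simplifying assumptions of our own (pre-quantized HSV input; Retinex, the SILTP channel, and the three scales are omitted).

```python
import numpy as np

def lomo_color_descriptor(img_hsv, block=10, step=5, bins=8):
    """Sketch of LOMO's spatial pooling for the color channel: joint HSV
    histograms over overlapping 10x10 blocks, then a per-bin max over all
    blocks sharing the same horizontal row.  `img_hsv` is assumed to be an
    H x W x 3 integer array already quantized to [0, bins)."""
    H, W, _ = img_hsv.shape
    rows = []
    for y in range(0, H - block + 1, step):
        row_hists = []
        for x in range(0, W - block + 1, step):
            patch = img_hsv[y:y + block, x:x + block].astype(int)
            # Flatten (h, s, v) triples into a single joint-bin index.
            idx = (patch[..., 0] * bins + patch[..., 1]) * bins + patch[..., 2]
            row_hists.append(np.bincount(idx.ravel(), minlength=bins ** 3))
        # Maximal occurrence: keep, per bin, the max over the row's blocks.
        rows.append(np.max(row_hists, axis=0))
    return np.concatenate(rows)
```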



Fig. 3. Flowchart of LOMO3D extraction.

3.2. LOMO3D Descriptor

While the original LOMO feature concerns horizontal maximal occurrence only in the spatial domain, we extend this principle into the time domain so as to extract extra information from the continuity of video frames. We also use the HSV histogram for color representation. To describe texture, we propose a modified SILTP descriptor called SILTP3D that calculates feature values not only spatially but also temporally, as shown in Fig. 2. Given any pixel location $(x_c, y_c)$, SILTP3D encodes it as

\[
\mathrm{SILTP}^{\tau}_{N,R}(x_c, y_c) = \bigoplus_{k=0}^{N-1} s_{\tau}(I_c, I_k), \tag{1}
\]

where $I_c$ is the gray intensity value of the center pixel; $I_k$ are those of its $N$ neighborhood pixels, equally spaced on circles of radius $R$ in the current frame and in the previous and following frames at distance $R$ from the current frame; $\bigoplus$ denotes the concatenation operator over binary strings; $\tau$ is a scale factor indicating the comparison range; and $s_{\tau}$ is a piecewise function defined as

\[
s_{\tau}(I_c, I_k) =
\begin{cases}
01, & \text{if } I_k > (1+\tau)\, I_c,\\
10, & \text{if } I_k < (1-\tau)\, I_c,\\
00, & \text{otherwise.}
\end{cases} \tag{2}
\]

In the case of Fig. 2, the range of representation is extended from 8 pixels to 26 pixels.
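A minimal sketch of Eqs. (1) and (2) for a single pixel is given below. The exact neighbor layout in the adjacent frames is our reading of Fig. 2, namely the full 3×3×3 neighborhood minus the center, which yields the 26 pixels mentioned above; border pixels are ignored.

```python
import numpy as np

def siltp3d_code(frames, t, y, x, tau=0.3, r=1):
    """Encode pixel (y, x) of frame t against its spatio-temporal neighbors
    in frames t-1, t, and t+1, following Eqs. (1)-(2).  The neighbor layout
    (the full 3x3x3 cube minus the center, i.e. 26 pixels for r=1) is our
    reading of Fig. 2.  Assumes 1 <= t <= D-2 and interior (y, x)."""
    ic = float(frames[t, y, x])
    hi, lo = (1.0 + tau) * ic, (1.0 - tau) * ic
    code = 0
    for dt in (-1, 0, 1):
        for dy in (-r, 0, r):
            for dx in (-r, 0, r):
                if dt == dy == dx == 0:
                    continue  # skip the center pixel itself
                ik = float(frames[t + dt, y + dy, x + dx])
                # Eq. (2): 01 if brighter, 10 if darker, 00 otherwise.
                bits = 0b01 if ik > hi else (0b10 if ik < lo else 0b00)
                code = (code << 2) | bits  # Eq. (1): concatenate bit pairs
    return code
```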

With the Retinex image sequences, we extract HSV and SILTP histograms from a series of cuboids formed by blocks at the same location in consecutive frames, instead of from a single block, as shown in Fig. 3. Given an image sequence with $D$ frames, we define $d_c$ as the depth, $w_c$ as the width, and $h_c$ as the height of the cuboid used for histogram extraction. That is, each cuboid contains $h_c \times w_c \times d_c$ pixels. Considering the overlap rate $\gamma$, the step sizes in the vertical, horizontal, and time directions are $\gamma h_c$, $\gamma w_c$, and $\gamma d_c$, respectively. With an image size of $H \times W$, we have $\frac{H-h_c}{\gamma h_c}+1$, $\frac{W-w_c}{\gamma w_c}+1$, and $\frac{D-d_c}{\gamma d_c}+1$ cuboids in the three directions. Then the maximal occurrence is calculated in each horizontal stripe. That is, $(\frac{W-w_c}{\gamma w_c}+1) \times (\frac{D-d_c}{\gamma d_c}+1)$ cuboids are integrated in the maximization step. We define $HIST_i^j$ as the histogram of the $i$th feature type extracted from the $j$th cuboid in a horizontal cuboid set. We have

\[
HIST_i^j = \{hist_{i1}^j, hist_{i2}^j, \ldots, hist_{im}^j\}, \tag{3}
\]

where $hist_{ik}^j$ $(k = 1, 2, \ldots, m)$ is the value of the $k$th bin of the $i$th feature type and $m$ is the number of bins. Then the maximization operates as

\[
HIST_i = \{hist_{i1}, hist_{i2}, \ldots, hist_{im}\}, \tag{4}
\]

where

\[
hist_{ik} = \max\left(hist_{ik}^1,\; hist_{ik}^2,\; \ldots,\; hist_{ik}^{(\frac{W-w_c}{\gamma w_c}+1) \times (\frac{D-d_c}{\gamma d_c}+1)}\right). \tag{5}
\]

In this paper, the parameters are initialized as follows: $D = 20$, $d_c = 8$, $h_c = 10$, $w_c = 10$, $\gamma = 0.5$. For the original image, $H = 128$, $W = 64$. We extract an $8 \times 8 \times 8$-bin joint HSV histogram, as well as $SILTP^{0.3}_{6,3}$ and $SILTP^{0.3}_{6,5}$. So we get $24 \times 11 \times 4 = 1056$ cuboids in total from an image sequence, and $11 \times 4 = 44$ cuboids at each horizontal location. To get multi-scale information, we produce two more scales by applying $2 \times 2$ pooling to each image. Descriptors extracted from the different scales are concatenated. Our final descriptor has $(8 \times 8 \text{ color bins} + 3^6 \text{ SILTP bins}) \times (24 + 11 + 5 \text{ horizontal groups}) = 30120$ dimensions.
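The following numpy sketch puts Eqs. (3)-(5) together for one feature type. The generic `hist_fn` and the pre-quantized input sequence are our placeholders for the paper's HSV and SILTP3D histograms; with the parameters above it produces the 24 × 11 × 4 cuboid grid described in the text.

```python
import numpy as np

def lomo3d_pool(seq, hc=10, wc=10, dc=8, gamma=0.5, bins=512, hist_fn=None):
    """Slide an hc x wc x dc cuboid over a (D, H, W) sequence with step
    gamma * size in each direction, histogram each cuboid, and apply the
    per-bin max of Eq. (5) over every cuboid in the same horizontal stripe.
    `hist_fn` maps a cuboid of quantized values to a `bins`-bin histogram;
    it stands in for the paper's HSV and SILTP3D histograms."""
    D, H, W = seq.shape
    sy, sx, st = int(gamma * hc), int(gamma * wc), int(gamma * dc)
    if hist_fn is None:
        hist_fn = lambda c: np.bincount(c.ravel(), minlength=bins)
    rows = []
    for y in range(0, H - hc + 1, sy):
        stripe = [hist_fn(seq[t:t + dc, y:y + hc, x:x + wc])
                  for x in range(0, W - wc + 1, sx)
                  for t in range(0, D - dc + 1, st)]
        rows.append(np.max(stripe, axis=0))  # Eq. (5): per-bin maximum
    return np.concatenate(rows)             # one vector per sequence
```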

3.3. Metric Method

In the metric stage, we apply Cross-view Quadratic Discriminant Analysis (XQDA) [4] to learn the distance metric between different images. XQDA extends the Bayesian face and KISSME [2] algorithms to cross-view metric learning. The XQDA metric achieves good performance when processing high-dimensional features.
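The full derivation of XQDA is given in [4] and not reproduced here. As an illustration of the underlying idea, the sketch below implements the simpler KISSME [2] metric that XQDA extends; the ridge term is our own addition for numerical stability, and XQDA's cross-view subspace learning is omitted.

```python
import numpy as np

def kissme_metric(diffs_same, diffs_diff, eps=1e-3):
    """Learn M = inv(Cov_intra) - inv(Cov_extra) from difference vectors
    of same-identity and different-identity pairs (rows of the inputs).
    XQDA [4] additionally learns a cross-view subspace W; omitted here."""
    cov_i = diffs_same.T @ diffs_same / len(diffs_same)
    cov_e = diffs_diff.T @ diffs_diff / len(diffs_diff)
    ridge = eps * np.eye(cov_i.shape[0])  # keeps the inverses stable
    return np.linalg.inv(cov_i + ridge) - np.linalg.inv(cov_e + ridge)

def metric_distance(x, z, M):
    """Squared distance between descriptors x and z under the metric M."""
    d = x - z
    return float(d @ M @ d)
```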

4. EXPERIMENTS

4.1. Experimental setup

We tested the proposed descriptor on two prevailing benchmark datasets for person Re-ID: iLIDS-VID [19] and PRID2011 [20]. The iLIDS-VID dataset contains 300 distinct pedestrians from two disjoint camera views. We only use the image-sequence-based version, with sequence lengths of 23 to 192 frames. This dataset is very challenging due to cluttered background, illumination variations, different poses, people's dressing similarities, and random occlusions. The PRID2011 dataset contains 385 persons in camera A and 749 persons in camera B. We selected 178 persons with more than 23 images in both cameras from the first 200 persons who appear in both camera views. In this work, we randomly selected half of each dataset (150 identities for iLIDS-VID, 89 for PRID2011) as the training set, and the rest was reserved for testing. Results are measured by Cumulated Matching Characteristics (CMC), as shown in Fig. 4; a sketch of the CMC computation follows.
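For reference, here is a minimal sketch of how a CMC curve is computed from a probe-gallery distance matrix; this is the standard formulation, not code from the paper.

```python
import numpy as np

def cmc_curve(dist, probe_ids, gallery_ids, max_rank=20):
    """CMC(r): fraction of probes whose correct identity appears within
    the top-r gallery entries when sorted by ascending distance.
    `dist` is an (n_probe, n_gallery) distance matrix."""
    gallery_ids = np.asarray(gallery_ids)
    hits = np.zeros(max_rank)
    for i, pid in enumerate(probe_ids):
        order = np.argsort(dist[i])                      # closest first
        ranks = np.flatnonzero(gallery_ids[order] == pid)
        if ranks.size and ranks[0] < max_rank:
            hits[ranks[0]:] += 1                         # counts all r >= rank
    return hits / len(probe_ids)
```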



Fig. 4. CMC comparison on iLIDS-VID and PRID2011 datasets.

4.2. Experimental evaluation

In Table 1, we list the performance of our approach and compare it with state-of-the-art multi-shot approaches on the iLIDS-VID and PRID2011 datasets, using top-1, 5, 10, and 20 ranking accuracies. Except for the rank-1 accuracy on the iLIDS-VID dataset, our approach outperforms the others by a considerable margin, especially on the PRID2011 dataset. That might be attributable to the better definition and lighter occlusion of PRID2011 images. By observing Fig. 4, we have the following findings:

• The proposed descriptor performs better on PRID2011, whose images are clearer than those of iLIDS-VID. That may be due to the descriptor's limited robustness.

• There is a positive correlation between performance and sequence length. Longer sequences tend to carry richer information.

• The SILTP descriptor performs notably better than the HSV histogram descriptor on iLIDS-VID, while the latter works well on PRID2011. That may be a result of the background clutter in iLIDS-VID images.

• Compared to other descriptors, LOMO3D has an obvious advantage in the top ranks.

5. CONCLUSION

In this paper, we proposed a temporal LOMO descriptor named LOMO3D, constructed from a color descriptor and a texture descriptor. Features are extracted at three scales for

Table 1. Comparisons on iLIDS-VID and PRID2011.

iLIDS-VID                      r=1   r=5   r=10  r=20
DVDL [21]                      25.9  48.2  57.3  68.9
CS-FAST3D+RMLLC [22]           28.4  54.7  66.7  78.1
MS-Colour&LBP+DVR [19]         34.5  56.7  67.5  77.5
Salience+DVR [19]              30.9  54.4  65.1  77.1
Color+LFDA [23]                28.0  55.3  70.6  88.0
AFDA [24]                      37.5  62.7  73.0  81.8
eSDC+MS-SDALF+DVR [25]         41.3  63.5  72.5  83.1
Proposed                       38.1  67.0  78.0  87.7

PRID2011                       r=1   r=5   r=10  r=20
DVDL [21]                      40.6  69.7  77.8  85.6
CS-FAST3D+RMLLC [22]           31.2  60.3  76.4  88.6
MS-Colour&LBP+DVR [19]         37.6  63.9  75.3  89.4
Salience+DVR [19]              41.7  64.5  77.5  88.8
Color+LFDA [23]                43.0  73.1  82.9  90.3
AFDA [24]                      43.0  72.7  84.6  91.9
eSDC+MS-SDALF+DVR [25]         48.3  74.9  87.3  94.4
Proposed                       78.9  94.6  96.7  99.0

better representation. Experimental results indicate that our approach achieves superior performance compared with state-of-the-art descriptors and multi-shot algorithms. As future work, the robustness of the LOMO3D descriptor can be improved.



6. REFERENCES

[1] K. Q. Weinberger and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," Journal of Machine Learning Research, vol. 10, no. 1, pp. 207–244, 2009.

[2] M. Kostinger, M. Hirzer, P. Wohlhart, P. M. Roth, and H. Bischof, "Large scale metric learning from equivalence constraints," in CVPR, 2012, pp. 2288–2295.

[3] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proceedings of the Twenty-Fourth International Conference on Machine Learning, 2007, pp. 209–216.

[4] S. Liao, Y. Hu, X. Zhu, and S. Z. Li, "Person re-identification by local maximal occurrence representation and metric learning," in CVPR, 2015, pp. 2197–2206.

[5] W. S. Zheng, S. Gong, and T. Xiang, "Reidentification by relative distance comparison," TPAMI, vol. 35, no. 3, pp. 653–668, 2013.

[6] F. Xiong, M. Gou, O. Camps, and M. Sznaier, Person Re-Identification Using Kernel-Based Metric Learning Methods, Springer International Publishing, 2014.

[7] N. McLaughlin, J. M. d. Rincon, and P. Miller, "Recurrent convolutional network for video-based person re-identification," in CVPR, 2016, pp. 1325–1334.

[8] S. Ding, L. Lin, G. Wang, and H. Chao, "Deep feature learning with relative distance comparison for person re-identification," Pattern Recognition, vol. 48, no. 10, pp. 2993–3003, 2015.

[9] D. Yi, Z. Lei, S. Liao, and S. Z. Li, "Deep metric learning for person re-identification," in ICPR, 2014, pp. 34–39.

[10] S. Liao, G. Zhao, and V. Kellokumpu, "Modeling pixel process with scale invariant local patterns for background subtraction in complex scenes," in CVPR, 2010, pp. 1301–1306.

[11] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani, "Person re-identification by symmetry-driven accumulation of local features," in CVPR, 2010, pp. 2360–2367.

[12] G. Lisanti, I. Masi, and A. D. Bagdanov, "Person re-identification by iterative re-weighted sparse ranking," TPAMI, vol. 37, no. 8, pp. 1629–1642, 2015.

[13] T. Matsukawa, T. Okabe, E. Suzuki, and Y. Sato, "Hierarchical gaussian descriptor for person re-identification," in CVPR, 2016, pp. 1363–1372.

[14] J. Smith and S. Chang, "VisualSEEk: a fully automated content-based image query system," in ACM International Conference on Multimedia, 1996, pp. 87–98.

[15] J. van de Weijer, C. Schmid, J. Verbeek, and D. Larlus, "Learning color names for real-world applications," TIP, vol. 18, no. 7, pp. 1512–1523, 2009.

[16] T. Ojala, M. Pietikainen, and D. Harwood, "Performance evaluation of texture measures with classification based on Kullback discrimination of distributions," in ICPR, 1994, vol. 1, pp. 582–585.

[17] X. Tan and B. Triggs, "Enhanced local texture feature sets for face recognition under difficult lighting conditions," in IEEE International Workshop on Analysis and Modeling of Faces and Gestures, 2007.

[18] D. J. Jobson, Z.-U. Rahman, and G. A. Woodell, "A multi-scale retinex for bridging the gap between color images and the human observation of scenes," IEEE Transactions on Image Processing, vol. 6, no. 7, pp. 965–976, 1997.

[19] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by video ranking," in ECCV, 2014, pp. 688–703.

[20] M. Hirzer, C. Beleznai, P. M. Roth, and H. Bischof, Person Re-identification by Descriptive and Discriminative Classification, Springer Berlin Heidelberg, 2011.

[21] S. Karanam, Y. Li, and R. Radke, "Person re-identification with discriminatively trained viewpoint invariant dictionaries," in ICCV, 2015, pp. 4516–4524.

[22] Z. Liu, J. Chen, and Y. Wang, "A fast adaptive spatio-temporal 3d feature for video-based person re-identification," in ICIP, 2016, pp. 4294–4298.

[23] S. Pedagadi, J. Orwell, S. Velastin, and B. Boghossian, "Local fisher discriminant analysis for pedestrian re-identification," in CVPR, 2013, pp. 3318–3325.

[24] Y. Li, Z. Wu, S. Karanam, and R. Radke, "Multi-shot human re-identification using adaptive fisher discriminant analysis," in BMVC, 2015.

[25] T. Wang, S. Gong, X. Zhu, and S. Wang, "Person re-identification by discriminative selection in video ranking," TPAMI, vol. 38, no. 12, pp. 2501–2514, 2016.
