Group Membership Prediction - Northeastern UniversityGroup Membership Prediction Ziming Zhang,...

Group Membership Prediction

Ziming Zhang Yuting Chen and Venkatesh SaligramaDepartment of Electrical amp Computer Engineering Boston University

zzhang14 yutingch srvbuedu

Abstract

The group membership prediction (GMP) problem in-volves predicting whether or not a collection of instancesshare a certain semantic property For instance in kinshipverification given a collection of images the goal is to pre-dict whether or not they share a familial relationship In thiscontext we propose a novel probability model and introducelatent view-specific and view-shared random variables tojointly account for the view-specific appearance and cross-view similarities among data instances Our model positsthat data from each view is independent conditioned onthe shared variables This postulate leads to a parametricprobability model that decomposes group membership like-lihood into a tensor product of data-independent parame-ters and data-dependent factors We propose learning thedata-independent parameters in a discriminative way withbilinear classifiers and test our prediction algorithm onchallenging visual recognition tasks such as multi-cameraperson re-identification and kinship verification On mostbenchmark datasets our method can significantly outper-form the current state-of-the-art

1 IntroductionVisual similarity plays an important role in visual recog-

nition in object detection and scene understanding [11 17]A visual similarity function returns a score of how likelytwo instances (eg images and videos) share similar seman-tic concepts (eg persons cars etc) With this perspectivewe propose the Group Membership Prediction (GMP) prob-lem where the goal is to determine how likely a collectionof distinct items share the same semantic property Fig 1depicts the idea of the GMP problem for two visual recogni-tion tasks ie person re-identification and kinship verifica-tion In person re-identification (Re-ID) we are given a col-lection of images of persons captured from multiple views(cameras) and the goal is to detect whether or not they be-long to the same person In applications such as kinshipdetection the underlying semantic property is more gen-eral and the goal is to predict whether or not a collection of

(a) Person re-identification (b) Kinship verification

Figure 1 Illustration of group membership prediction (GMP) in the visualrecognition tasks of (a) person re-identification and (b) kinship verificationHere we would like to predict (a) whether the four pedetrain images aretaken from the same person and (b) whether the face images are from thesame family These images are borrowed from (a) VIPeR dataset [13] and(b) Family101 dataset [8] respectively

images share a familial relationship GMP poses significantchallenges on account of large variations in data includinglighting conditions poses and camera views

We introduce a novel parametric probability model forpredicting group membership Our key insight is that al-though the visual appearances can significantly vary theyshare a set of latent variables common to all views As de-picted in Fig 2 we can hypothesize ldquobody partsrdquo as sharedlatent variables for all the pedestrian images while for kin-ship verification ldquofacial landmarksrdquo could be considered asthe shared latent variables Our model postulates that con-ditioned on the location of each shared latent variable (bodypart or facial landmark) the visual appearance at that loca-tion is conditionally independent for different views Thisproperty leads to a natural way of measuring image similar-ities through comparison of visual similarities of the sameshared latent variables across different views

This postulate leads us to a joint parametric probabil-ity model that consists of view-specific and view-sharedrandom variables View-specific variables account for vi-sual characteristics within a view while view-shared vari-ables account for the integration of multi-view informationThe group membership likelihood factorizes into a tensorproduct consisting of data-independent and data-dependentfactors We learn the data-independent parameters (ieweights) discriminatively using bilinear classifiers Finallywe marginalize these data tensors over all the dimensions

1

(a) Body parts (b) Facial landmarks

Figure 2 Illustration of (a) body parts (eg head torso legs) for Re-IDand (b) facial landmarks (eg eyes nose mouth) for kinship verificationNote that in these aligned images these body parts or facial landmarksapproximately coincide in terms of spatial locations

with the learned weights as the group membership scoresOur experimental results on multi-camera person Re-ID andkinship verification demonstrate the good prediction perfor-mance and computational efficiency of our method

11 Related Work

GMP problem is closely related to multi-view learning(MVL) Indeed our perspective of shared variables hasbeen used before in the context of MVL [12 29 30] Nev-ertheless the goal of MVL specifically in visual recognitionis different from ours Namely the objective of MVL is toleverage multiple sources (eg texts images videos etc)of data corresponding to the same underlying object (egpersons events etc) to improve recognition performance[3 14 21 30] On the other hand our goal is to predictgroup membership among the multiple sources

Person Re-ID essentially is a GMP problem where eachcamera view can be taken as one of the instances In the lit-erature however most of existing works consider this prob-lem as an independent two-view classification task mainlyfocusing on cleverly designing local features [10 20 2631 36] or learning better metrics [15 16 18 19 25 37]Recently Figueira et al [12] proposed a semi-supervisedlearning method to fuse multi-view features for Re-ID sothat the features agree on the classification results Daset al [5] considered the group membership prediction inRe-ID by maximizing the summation of pairwise similar-ity scores using binary integer programming during testingUnlike [5] we formulate the group membership problemas a learning problem rather than a post-processing step toimprove the matching rate

Kinship verification is indeed another GMP problemwhere each family role (eg father mother son daugh-ter etc) can be considered as an instance Similar to per-son Re-ID existing works mainly focus on learning betterfeatures [6 9] and better distance metrics [23] for pair-wise classification [22] Recently Qin et al [28] pro-posed a bilinear model to handle so-called tri-subject kin-ship verification problems Fang et al [8] proposed a sparsegroup lasso based feature selection method to determinewhether a query person is from a specific family Unlike

[8 28] our method targets at a more general and challeng-ing problem which can be used to predict an arbitrary num-ber of images with a fixed structure of family roles such asfather-son father-mother-daughter grandfather-father-son-grandson etc

2 Our Method

21 Problem Setting

Let (Xm ym)m=1middotmiddotmiddot M be a group ofM persons fromdifferent views where forallmXm denotes the mth person andym denotes its label (eg identity or family) Let foralln =1 middot middot middot Nmxmn isin Xm be the nth image for the per-son with Nm images in total The goal of our method is topredict the following probability as group membership

p(y1 = middot middot middot = yM |X1 middot middot middot XM ) (1)

Note that our problem setting is naturally applicable tothe multiple instance cases For example during learningwe allow multiple images to be associated with a person (ieXm = xmn) in person Re-ID and kinship verification asin the CUHK Campus [35] and Family101 [8] datasets

While we have motivated our approach in the contextof shared latent variables (body parts or facial landmarks)this information is unavailable during the training or test-ing phases Furthermore estimating locations of body partsand facial landmarks is known to be extremely challenging[2 38] Fortunately in the context of the applications andproblems that we are concerned with the images are ap-proximately aligned In these images foreground objectsare centralized and well cropped Currently most bench-mark datasets are composed of such approximately alignedimages namely the same body parts or facial landmarksappear roughly at similar locations In such cases pixellocations provide good approximation of where body partsand facial landmarks are and we utilize this property to by-pass the detection challenge while accounting for spatialmisalignments with spatial kernels Note that the issue ofvisual ambiguity of the shared variables still remains in ourproblem

22 Parametric GMP Model

We introduce two latent variables to model the relation-ship between the class labels ym and data samples XmThe graphical representation of our parametric probabil-ity model is shown in Fig 3(a) where forallm zm denotesthe view-specific latent variable for view m h denotes theview-shared latent variable and Nm denotes the number ofimages from viewm Based on this model we can factorize

(a) Parametric probability model (b) Pairwise decomposition

Figure 3 (a) Graphical representation of our parametric probability modelfor GMP (b) Pairwise decomposition of our model in (a)

our group membership score as follows


=sum

z1middotmiddotmiddot zM h

py|z

Mprodm=1

p(zm|Xm h)p(h)

=sum


py|z middot p(h)Mprodm=1

[1

Nm

Nmsumn=1

p(zm|xmn h)

]

where py|z = p(y1 = middot middot middot = yM |z1 middot middot middot zM )

Model interpretation To show the intuition of our para-metric probability model we consider the person Re-ID ex-ample in Fig 2(a) in more detail In the Re-ID problemthe view-specific latent variables zm can be thought of asvisual appearances of body parts of different persons andthe view-shared latent variable h can be considered as thesebody parts which are shared among all the persons

Then using Bayes rule we can expand Eq 2 Inparticular for the two-view Re-ID problem we see thatthe group membership score of the image pair (x1 y1)and (x2 y2) as p(y1 = y2|x1x2) =

sumz1z2h

p(y1 =y2|z1 z2)p(x1|z1 h)p(x2|z2 h)p(h) Since visual appear-ances in z1 (or z2) are posited to be independent given im-age x1 (or x2) and the parts h we can predict whetheror not y1 is equal to y2 (ie p(y1 = y2|x1x2)) bymarginalizing the similarities of corresponding visual fea-tures of each individual part in both images (ie p(x1|z1 h)and p(x2|z2 h)) with some data-independent weights (iep(y1 = y2|z1 z2) and p(h)) Similarly for the kinship ex-ample in Fig 2(b) we can infer the group membership scoreby marginalizing the corresponding landmark similarities

We take these data-independent weights as the model pa-rameters for prediction which are learned discriminatively

23 Discriminative Learning of Model Parameters

231 Co-occurrence Tensor Representation

As discussed in Section 21 images are approximatelyaligned in the related applications Specifically in personRe-ID benchmarks the head is always located at the top ofimages torso in the middle and legs at the bottom Thistypical structure has been exploited in designing discrimi-native features [10] Therefore with approximately alignedimages we can bypass the problem of shared variable detec-tion and directly utilize pixel locations as surrogates for lo-cations of body parts or facial landmarks Note that we canstill allow small spatial misalignments by designing kernelsto account for spatial distortions

Recently Zhang et al [32] proposed an interesting fea-ture representation to handle visual ambiguity and spatialdistortion in images for person re-id The basic idea in theirmethod is to capture visual ambiguity using visual wordsand match them at similar locations using distance trans-form to handle spatial distortion This results in a visualword co-occurrence matrix for a pair of images

Inspired by [32] we propose a visual word co-occurrence tensor representation using p(zm|xmn h) frommultiple views to represent the group of data samples Theirproposed Gaussian kernel [32] is computationally cumber-some Instead we design a truncated exponential functionas the spatial kernel κ with an arbitrary distance functioninside to improve flexibility and computational efficiency

Let πzm isin Π(zmxmn) be a pixel location where thecorresponding pixel in image xmn is encoded using visualword zm and πh be the pixel location with index h Thenwe define p(zm|xmn h) in Eq 2 as follows

p(zm|xmn h)∆= max

πzmisinΠ(zmxmn)κ (πzm πhσm) (3)

=

exp

minusminπzm

d(πzm πh)

σm

if d(πzm πh) le α

0 otherwise

where d(middot middot) denotes a distance function σm ge 0 de-notes a predefined window size parameter for view mand α ge 0 is a predefined spatial scale parameter Thenif we take view-specific and view-shared latent variablesas the dimensions in the tensor to represent the group ofdata the entry at index (z1 middot middot middot zm h) can be calculated asprodMm=1

[1Nm

sumNm

n=1 p(zm|xmn h)]

232 General Learning Formulation

Here we introduce additional notations to simplify our ex-position Rather than directly representing a group of datasamples X1middotmiddotmiddot M = X1 middot middot middot XM as a tensor we con-vert it into a matrix φ(X1middotmiddotmiddot M ) isin R

prodMm=1 |zm|times|h| with di-

mensionsprodMm=1 |zm| and |h| respectively where forallm |zm|

and |h| denote the numbers of visual words for view m andpixel locations in images Further we denote wz

∆= p(y1 =

middot middot middot = yM |z1 middot middot middot zM ) isin RprodM

m=1 |zm| and wh∆= p(h) isin

R|h| as our model parameters in the form of vectors Thenour group membership score in Eq 2 can be rewritten as adecision function f as follows

f(X1middotmiddotmiddot M ) = wTz φ(X1middotmiddotmiddot M )wh (4)

where (middot)T denotes the matrix transpose operator Iff(X1middotmiddotmiddot M ) ge 0 we expect that all the members in thegroup have the same class label (and do not otherwise)

Let (X (k)1middotmiddotmiddot M y

(k)1middotmiddotmiddot M )k=1middotmiddotmiddot N be a set of N training

data groups from M views where forallk y(k)1middotmiddotmiddot M = 1 if all

the class labels in group k are the same (andminus1 otherwise)Due to the specific form in Eq 4 we propose learning bilin-ear classifiers (ie wz and wh) for GMP inspired by [27]which used bilinear classifiers in a different context (binaryclassification)

minwzwh

λ1

2wz22+

λ2

2wh22+

Nsumk=1

`(y

(k)1middotmiddotmiddot M f(X

(k)1middotmiddotmiddot M )

)

(5)

where `(middot middot) denotes the loss function (eg hinge loss)λ1 ge 0 λ2 ge 0 are predefined regularization parametersand middot 2 denotes the `2-norm of a vector

Note that here we relax the probability constraint on wz

and wh to real numbers so that Eq 5 can be efficientlysolved using alternating optimization In each iteration wefix one parameter (ie wz or wh) and use a standard supportvector machine (SVM) solver to find the other parameter sothat the objective value decreases monotonically thus guar-anteeing a local optimal solution

233 Pairwise Decomposition Approximation

With sufficient training data we can train a bilinear classi-fier directly using Eq 5 This training method howeverdoes not scale well with the number of views due to thehigh dimensional tensor representation leading to seriouscomputational and overfitting issues

To overcome these issues we propose an approximatepairwise decomposition method as illustrated in Fig 3(b)to reduce the parameter space This is based on the condi-tional independence assumption in multi-view learning [1]Accordingly we can rewrite our group membership scorein Eq 2 as follows


equivsum

mi 6=mjisin1middotmiddotmiddot M

sumzmi

zmjh

p(y1 = middot middot middot = yM |ymi= ymj

)

p(ymi= ymj

|zmi zmj

)p(zmi|Xmi

h)p(zmj|Xmj

h)p(h)

Algorithm 1 Pairwise decomposition based learningInput φ(Xmimj )forallmi 6=mjisin1M ymiforallmiisin1M

λ1 λ2 λ3 ge 0

Output wmimjmi 6=mjisin1Mwhβ

Initialize β larr 1 wh larr 1 wmimj larr 1repeat

Solve wmimj in Eq 8 (multi-view training) or Eq 9(double-view training) by fixing β and whSolve wh in Eq 8 or Eq 9 by fixing β and wmimj Solve β in Eq 8 or Eq 9 by fixing wmimj and wh

until Convergereturn wmimjmi 6=mjisin1Mwhβ

where p(y1 = middot middot middot = yM |ymi= ymj

) indicates howimportantly the pair of views mi and mj contribute toGMP In this way the number of parameters that needto be learned in our method is significantly reduced from(prodM

m=1 |zm|+ |h|)

to(sum

mi 6=mj|zmi||zmj

|+ |h|)

Let φ(Xmimj)

∆= p(zmi

|Xmi h)p(zmj

|Xmj h) isin

R(|zmi||zmj

|)times|h| be the pairwise visual word matrix be-tween views mi and mj where Xmimj

= XmiXmj

Also let wmimj

∆= p(ymi

= ymj|zmi

zmj) isin R|zmi

||zmj|

and β∆= p(y1 = middot middot middot = yM |ymi

= ymj) isin R|zmi

||zmj|

Then based on Eq 6 we can rewrite Eq 4 as follows

f(X1middotmiddotmiddot M ) =sum

mi 6=mj

βmimjwTmimj

φ(Xmimj)wh (7)

where βmimj denotes the entry in β for the view pairTo learn our model parameters in Eq 7 we propose two

learning methods as follows namely multi-view trainingand double-view training

Multi-view training

minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

`(y



) st β ge 0 (8)

Double-view training

minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

summi 6=mj

`(y(k)mimj

f(X (k)mimj

)) st β ge 0 (9)

where forallk y(k)mimj = 1 if in group k the labels of the two

persons ymi= ymj

holds otherwise 0 Here ge denotes

an element-wise ge operator Both training can be done us-ing alternating optimization with a standard SVM solverStill local optima are guaranteed For two-view scenariosboth training methods are essentially identical and scalequadratically with the number of views in general Lin-ear scalability is also possible if we organize all the viewsas cycle graphs Difference in these two training methodscomes from the loss functions where in multi-view training`measures the group (ie multi-view) loss while in double-view training ` measures the pair-view loss Our algorithmis summarized in Alg 1

3 Experiments

We evaluate our method on person Re-ID and kinshipverification tasks along with state-of-the-art methods onbenchmark datasets Standard trainingtesting protocols areused in all experiments For each comparing method we ei-ther cite the original results from the papers (denoted by (middot)lowastin the tables) or calculate from released codes Our resultsare reported as the average over 3 trials

For each experiment we choose the same or similar low-level feature as the other methods (see the details in subsec-tion) for fair comparison We densely sample the images togenerate a low-level local feature per pixel Then we use K-Means to build the visual vocabularies with about 2 times 104

randomly selected features per view Further every localfeature is quantized into one of these visual words basedon Euclidean distance Note that more complicated featureselection methods may be employed to yield better perfor-mance but we do not fine-tune this component for the sakeof computational efficiency and generalization ability

We employ the chessboard distance for Eq 3 and LI-BLINEAR [7] as our SVM solver with hinge loss Werandomly generate about 3 times 104 training samples to learnmodel parameters wrsquos The regularization parameters aredetermined by cross-validation

31 Person Re-identification

For performance measure we adopt the standard Cumu-lative Match Characteristic (CMC) curve which displaysthe recognition rate as a function of rank The recognitionrate at rank-r is the proportion of queries correctly matchedto a corresponding gallery entity at rank-r or better

For tasks with multiple camera views we follow [5] tocompare results under two camera views Consider theresults from multiple views as a high dimensional tensorone dimension per view To predict pairwise matches frommulti-view results (eg identifying matches between cam-era view 1 and view 2 from the predicted results for thejoint of view 1 2 and 3) we can either sum over or findthe maximum over the extra dimensions Cross-validationis used to choose the better way for each dataset

Table 1 Matching rate comparison () on VIPeR and CUHK01Rank r = 1 5 10 15 20 25

VIPeRSCNCD [31] 207 472 606 688 751 791SCNCDfinal [31] 378 685 812 870 904 927LADF [19] 293 610 760 834 881 909Mid-level filters [36] 291 523 659 739 799 843Mid-level+LADF [36] 434 730 849 909 937 955VW-CooC [32] 3070 6298 7595 8101 - -Ours 335 595 728 813 880 896

CUHK01Single-shot LAFTlowast [18] 258 550 667 738 790 830Multi-shot LAFTlowast [18] 314 580 683 740 790 830Mid-level filters [36] 343 551 650 710 749 780VW-CooC [32] 4403 7047 7912 8477 - -Ours 6039 8292 9043 9342 9455 9578

311 Two Camera Views

Person Re-ID between two views is the simplest scenarioWe test our method on the VIPeR [13] and CUHK Cam-pus [35] dataset We extract a 672-dim Color+SIFT1 vectorfrom each 5times5 pixel patch in images as low-level featuresWe follow the experimental setting in [35] for both datasets

Our comparison results are listed in Table 1 As we seeon VIPeR ldquoMid-level+LADFrdquo from [36] is the current bestmethod which utilized more discriminative mid-level filtersas features and a powerful classifier and ldquoSCNCDfinalrdquofrom [31] is the second which utilized only foreground fea-tures Our results are comparable to both of them How-ever our method always outperforms their original meth-ods significantly when either the powerful classifier or theforeground information is not involved On CUHK01our method performs the best At rank-1 it outperforms[32 36] by 1636 and 2609 respectively Comparedwith [32] the improvement mainly comes from the multi-ple instance setting of our method

The CMC curve comparison on VIPeR and CUHK01 isshown in Fig 4 As we see our curve is very similar to thatof LADF This is mainly because LADF is a second-order(ie quadratic) decision function based on metric learningwhich shares some commonality with our classifiers

We also demonstrate the impacts of different numbersof pixel locations (ie view-shared space) and visual words(ie view-specific space) on the performance using VIPeRin Fig 5 We sample the pixel locations step by from 1 to 5pixels along x and y-axis in images (larger number leadingto fewer samples) while using different numbers of visualwords Visual words capture the variations in appearanceand with more visual words more similar patterns can bedifferentiated (eg pink and red) Matching between pixellocations gives us the statistic information of visual wordsand more samples make the statistics more robust Togetherthey work for good performance

1We downloaded the code from httpsgithubcomRobert0812salience_match

(a) (b)Figure 4 CMC curve comparison on (a) VIPeR and (b) CUHK01 respectively Notice that except our results the rest are copied from [36]

Figure 6 CMC curve comparison on WARD Note except for our results the other results are cited from [5]

Figure 5 Demonstration of the impacts of different numbers of pixel lo-cations and visual words on the performance using VIPeR Warmer colordemotes higher accuracy This figure is best viewed in color

312 Three Camera Views

Now we consider three camera views and test our methodon the WARD dataset [24] Following [5] we denote thecamera views as view 1 2 and 3 However for pairwiseview matching [5] did not mention which view as probeor gallery Here we define the view with a smallerlargernumber of data to be the galleryprobe set We randomlyselect 35 people for training and the rest for testing

We first resize each image to the same 128 times 64 pixelsand take every 2 times 2 pixel patch in the HSV color space togenerate our low-level features by concatenating 3times2times2 =12 entries into a vector The reason for choosing this featureis because in [5] the features were built in the HSV colorspace as well Different from [5] we take the whole imageto generate features without foreground segmentation

Table 2 AUC comparison () on WARD based on Fig 6

View pair 1-2 1-3 2-3 AveFT 933 910 949 931NCR on ICT 904 848 911 887NCR on FT 954 919 956 943Ours Multi-view 944 921 981 949Ours Double-view 927 910 975 938

The results are shown in Fig 6 As we see our methodperforms similar or better than NCR [5] and the curves ofboth the multi-view training and double-view training forour method behave very similarly We list the area undercurve (AUC) scores in Table 2 Our method is better thanNCR on FT by 06 on average from 943 to 949

313 Four Camera Views

Next we consider four camera views and test our methodon the Re-identification Across indoor-outdoor Dataset(RAiD) [5] with two indoor views camera 1 and 2 and twooutdoor views camera 3 and 4 Still we take the views withsmallerlarger numbers as galleriesprobes We follow [5]and utilize the same HSV low-level feature as we did inSection 312

Our comparison results are shown in Fig 7 As we seeour method again performs equally well or better than NCRWe list the AUC score comparison results in Table 3 Stillour method is better than NCR on FT by 16 on averagefrom 947 to 963

Figure 7 CMC curve comparison on RAiD Notice that expect our results the rest results are cited from [5]

Table 3 AUC comparison () on RAiD based on Fig 7View pair 1-2 1-3 1-4 2-3 2-4 3-4 AveFT 966 843 888 900 939 935 912NCR on ICT 985 906 921 910 944 941 934NCR on FT 981 904 931 945 965 959 947Ours Multi-view 982 930 971 941 966 905 949Ours Double-view 993 908 983 930 980 988 963

For both indoor-indoor and outdoor-outdoor cases ourmethod consistently works best which may indicate that thevisual word co-occurrence patterns are more discriminativeif the lighting condition is similar

32 Kinship Verification amp Identification

As before we utilize the HSV 12-dim low-level fea-tures In the experiments we denote father mother sonand daughter as F M S and D respectively Following[23] we measure the verification performance with the ver-ification rate defined by the number of correctly classifiedface pairs divided by the total number of face pairs in thetest set For identification CMC curves are also used Weonly use double-view training in this task since the informa-tion captured by parent-offspring pairs are more important

Kinship verification between two views (one parent andone offspring) is the conventional setting where we testour method on two datasets ie KinFaceW-I [23] andKinFaceW-II [23] The former consists of 156 FS 134FD 116 MS and 127 MD pairs while the latter contains250 pairs of each kin relation The main difference be-tween the two datasets is that each pair of face images inKinFaceW-II comes from the same photo while the image

pairs in KinFace-I come from different photos We followthe same protocol as that in [23 6 28] and use a 5-fold crossvalidation with balanced positive and negative pairs on thedefault trainingtesting split Results are listed in Table 4

On KinFaceW-II our method significantly outperformsthe competitors but on KinFaceW-I ours is slightly worseOur reasoning is that our current visual word representa-tion using simple K-Means does not account for significantvisual ambiguity in appearance when imaging factors (eglighting conditions illumination etc) change substantiallyThis leads to large intra-cluster variations in visual wordsthat our method does not currently handle well To furtherinvestigate the different performances on both datasets weuse a smaller training set randomly sampled on KinFaceW-II such that it has the same size as KinFaceW-I while keep-ing the same test set and record the results as ldquoreducedtraining setrdquo The results become slightly worse than theoriginal training set while still outperform other methodsThese relatively good results along with the worse resultson KinFaceW-I demonstrate that the size of training data isindeed important but less important than the data sources

Next we use TSKinFace dataset [28] for three-view kin-ship verification (ie father mother offspring) which con-tains 513 FM-S and 502 FM-D groups Following [28] wecarry out a 5-fold cross validation with balanced positiveand negative samples and list the results in Table 5 As wesee our method performs consistently better than [28]

Finally we employ the Family 101 dataset [8] to inves-tigate kinship identification namely identifying the cor-

Table 4 Verification rate comparison () on KinFaceWFS FD MS MD Mean

KinFaceW-IDehghan et al [6] 764 725 719 773 745Lu et al [23] 725 665 662 720 699Qin et al [28] 768 768 746 780 766Ours 635 650 638 756 670

KinFaceW-IIDehghan et al [6] 839 767 834 848 822Lu et al [23] 769 743 774 776 765Qin et al [28] 846 770 844 854 829Ours 854 818 866 900 860Ours (reduced training set) 844 782 846 878 838

Table 5 Verification rate comparison () on TSKinFaceFS FD MS MD FM-S FM-D

Dehghan et al [6] 799 742 785 763 819 796Fang et al [8] 691 668 687 679 716 698Lu et al [23] 748 700 722 713 770 714Qin et al [28] 830 805 828 811 864 844Ours 885 870 879 878 906 890

Table 6 AUC comparison () on Family 101FS FD MS MD Mean

Dehghan et al [6] 888 913 943 964 927Ours 903 946 960 970 945

Figure 8 CMC curve comparison with [6] on Family 101

rect parentchild among a set of candidates given onechildparent image This dataset contains 14816 images thatform 206 nuclear families belonging to 101 unique familytrees Following [6] we adopt 101 nuclear families and use50 families for training and 51 families for testing For eachof the four kin relations we train a model and use the modelto match offspring images to all possible parent images TheCMC curves2 are shown in Fig 8 and Table 6 lists the AreaUnder Curve (AUC) measure of the CMC curves

33 Storage amp Computational Time

Storage (St for short) and computational time duringtesting are two critical issues in real-world applications Inour method we only need to store a feature matrix for eachentity based on Eq 3 which is used to calculate similari-ties between different entities The computational time canbe roughly divided into two parts (1) computing featurematrices T1 and (2) predicting group membership T2 Wedo not consider the time for generating low-level featuressince different implementations vary significantly

2We use the authorrsquos code (httpenriquegortizcompublicationsFamResemblancezip) to produce the results

Table 7 Average storage and computational time for our mdthod

St (Kb) T1 (ms) T2 (ms)VIPeR 1107 529 06WARD 1137 997 15RAiD 1665 687 05

We record the storage and computational time using 300visual words for both probe and gallery sets on VIPeR (twoviews) WARD (three views) and RAiD (four views) Therest of the parameters are the same as described in Section31 As we see the storage per data sample and computa-tional time are linearly proportional to the size of imagesand number of visual words Our implementation is basedon unoptimized MATLAB code3 Numbers are listed in Ta-ble 7 including the time for saving and loading featuresOur experiments were all run on a multi-thread CPU (XeonE5-2696 v2) with a GPU (GTX TITAN) The method runsefficiently with very low demand for storage

4 ConclusionIn this paper we propose a general parametric proba-

bility model for the group membership prediction (GMP)problem We introduce the notions of view-specific andview-shared latent variables to capture visual informationand commonality for each view Using these two variableswe can factorize the group membership score into a tensorproduct and thus propose a new visual word co-occurrencetensor feature to represent groups of data samples In ourparametric probability model we can handle the multipleinstance cases as well Further we propose discriminativelylearning a bilinear classifier for GMP with the decisionfunction as the marginalization over all latent variables Ourexperiments on multi-camera person re-id and kinship ver-ification tasks demonstrate the good predictive ability andcomputational efficiency of our method As future workwe would like to explore other applications for our methodsuch as activity retrieval [4] and develop new approachessuch as zero-shot recognition [34] and structured learning[33] for our problem

AcknowledgementWe thank the anonymous reviewers for their very use-

ful comments This material is based upon work supportedin part by the US Department of Homeland Security Sci-ence and Technology Directorate Office of University Pro-grams under Grant Award 2013-ST-061-ED0001 by ONRGrant 50202168 and US AF contract FA8650-14-C-1728The views and conclusions contained in this document arethose of the authors and should not be interpreted as nec-essarily representing the social policies either expressed orimplied of the US DHS ONR or AF

3Our code is available at httpszimingzhangwordpresscomsource-code

References[1] A Blum and T Mitchell Combining labeled and unlabeled

data with co-training In COLT pages 92ndash100 1998 4[2] L Bourdev and J Malik Poselets Body part detectors

trained using 3d human pose annotations In CVPR pages1365ndash1372 2009 2

[3] Z Cai L Wang X Peng and Y Qiao Multi-view su-per vector for action recognition In CVPR pages 596ndash6032014 2

[4] G D Castanon Y Chen Z Zhang and V Saligrama Ef-ficient activity retrieval through semantic graph queries InACM Multimedia 2015 8

[5] A Das A Chakraborty and A K Roy-Chowdhury Consis-tent re-identification in a camera network In ECCV pages330ndash345 2014 2 5 6 7

[6] A Dehghan E G Ortiz R Villegas and M Shah Who doi look like determining parent-offspring resemblance viagated autoencoders In CVPR pages 1757ndash1764 2014 2 78

[7] R-E Fan K-W Chang C-J Hsieh X-R Wang and C-JLin LIBLINEAR A library for large linear classificationJMLR 91871ndash1874 2008 5

[8] R Fang A C Gallagher T Chen and A C Loui Kinshipclassification by modeling facial feature heredity In ICIPpages 2983ndash2987 2013 1 2 7 8

[9] R Fang K D Tang N Snavely and T Chen Towardscomputational models of kinship verification In ICIP pages1577ndash1580 2010 2

[10] M Farenzena L Bazzani A Perina V Murino andM Cristani Person re-identification by symmetry-driven ac-cumulation of local features In CVPR pages 2360ndash23672010 2 3

[11] P F Felzenszwalb R B Girshick D McAllester and D Ra-manan Object detection with discriminatively trained partbased models PAMI 32(9)1627ndash1645 2010 1

[12] D Figueira L Bazzani H Q Minh M CristaniA Bernardino and V Murino Semi-supervised multi-feature learning for person re-identification In AVSS pages111ndash116 2013 2

[13] D Gray S Brennan and H Tao Evaluating appearancemodels for recognition reacquisition and tracking In PETSpages 41ndash47 2007 1 5

[14] M M Kalayeh H Idrees and M Shah Nmf-knn Imageannotation using weighted multi-view non-negative matrixfactorization In CVPR 2014 2

[15] S Khamis C-H Kuo V K Singh V Shet and L SDavis Joint learning for attribute-consistent person re-identification In ECCV Workshop on Visual Surveillanceand Re-Identification pages 134ndash146 2014 2

[16] M Kostinger M Hirzer P Wohlhart P Roth andH Bischof Large scale metric learning from equivalenceconstraints In CVPR pages 2288ndash2295 2012 2

[17] L-J Li R Socher and F-F Li Towards total scene under-standing Classification annotation and segmentation in anautomatic framework In CVPR pages 2036ndash2043 2009 1

[18] W Li and X Wang Locally aligned feature transformsacross views In CVPR pages 3594ndash3601 2013 2 5

[19] Z Li S Chang F Liang T S Huang L Cao and J RSmith Learning locally-adaptive decision functions for per-son verification In CVPR pages 3610ndash3617 2013 2 5

[20] C Liu S Gong C C Loy and X Lin Person re-identification what features are important In ECCV pages391ndash401 2012 2

[21] J Liu Y Jiang Z Li Z-H Zhou and H Lu Partiallyshared latent factor learning with multiview data NNLS2014 2

[22] J Lu J Hu V E Liong X Zhou A Bottino I U IslamT F Vieira X Qin X Tan Y Keller et al The fg 2015kinship verification in the wild evaluation In FG 2015 2

[23] J Lu X Zhou Y-P Tan Y Shang and J Zhou Neighbor-hood repulsed metric learning for kinship verification PAMI36(2)331ndash345 2014 2 7 8

[24] N Martinel and C Micheloni Re-identify people in widearea camera network In CVPR Workshops pages 31ndash362012 6

[25] A Mignon and F Jurie PCCA a new approach for distancelearning from sparse pairwise constraints In CVPR pages2666ndash2672 2012 2

[26] S Pedagadi J Orwell S Velastin and B Boghossian Localfisher discriminant analysis for pedestrian re-identificationIn CVPR pages 3318ndash3325 2013 2

[27] H Pirsiavash D Ramanan and C Fowlkes Bilinear clas-sifiers for visual recognition In NIPS pages 1482ndash14902009 4

[28] X Qin X Tan and S Chen Tri-subject kinship verificationUnderstanding the core of a family arXiv150102555 20152 7 8

[29] Y Song L-P Morency and R Davis Multi-view latent vari-able discriminative models for action recognition In CVPRpages 2120ndash2127 2012 2

[30] C Xu D Tao and C Xu A survey on multi-view learningarXiv13045634 2013 2

[31] Y Yang J Yang J Yan S Liao D Yi and S Z Li Salientcolor names for person re-identification In ECCV pages536ndash551 2014 2 5

[32] Z Zhang Y Chen and V Saligrama A novel visual wordco-occurrence model for person re-identification In ECCVWorkshop on Visual Surveillance and Re-Identificationpages 122ndash133 2014 3 5

[33] Z Zhang and V Saligrama PRISM Person re-identificationvia structured matching arXiv preprint arXiv140644442014 8

[34] Z Zhang and V Saligrama Zero-shot learning via semanticsimilarity embedding In ICCV 2015 8

[35] R Zhao W Ouyang and X Wang Person re-identificationby salience matching In ICCV pages 2528ndash2535 2013 25

[36] R Zhao W Ouyang and X Wang Learning mid-level fil-ters for person re-identification In CVPR pages 144ndash1512014 2 5 6

[37] W-S Zheng S Gong and T Xiang Re-identification byrelative distance comparison PAMI 35(3)653ndash668 2013 2

[38] X Zhu and D Ramanan Face detection pose estimationand landmark localization in the wild In CVPR pages 2879ndash2886 2012 2

(a) Body parts (b) Facial landmarks

Figure 2 Illustration of (a) body parts (eg head torso legs) for Re-IDand (b) facial landmarks (eg eyes nose mouth) for kinship verificationNote that in these aligned images these body parts or facial landmarksapproximately coincide in terms of spatial locations

with the learned weights as the group membership scoresOur experimental results on multi-camera person Re-ID andkinship verification demonstrate the good prediction perfor-mance and computational efficiency of our method

11 Related Work

GMP problem is closely related to multi-view learning(MVL) Indeed our perspective of shared variables hasbeen used before in the context of MVL [12 29 30] Nev-ertheless the goal of MVL specifically in visual recognitionis different from ours Namely the objective of MVL is toleverage multiple sources (eg texts images videos etc)of data corresponding to the same underlying object (egpersons events etc) to improve recognition performance[3 14 21 30] On the other hand our goal is to predictgroup membership among the multiple sources

Person Re-ID essentially is a GMP problem where eachcamera view can be taken as one of the instances In the lit-erature however most of existing works consider this prob-lem as an independent two-view classification task mainlyfocusing on cleverly designing local features [10 20 2631 36] or learning better metrics [15 16 18 19 25 37]Recently Figueira et al [12] proposed a semi-supervisedlearning method to fuse multi-view features for Re-ID sothat the features agree on the classification results Daset al [5] considered the group membership prediction inRe-ID by maximizing the summation of pairwise similar-ity scores using binary integer programming during testingUnlike [5] we formulate the group membership problemas a learning problem rather than a post-processing step toimprove the matching rate

Kinship verification is indeed another GMP problemwhere each family role (eg father mother son daugh-ter etc) can be considered as an instance Similar to per-son Re-ID existing works mainly focus on learning betterfeatures [6 9] and better distance metrics [23] for pair-wise classification [22] Recently Qin et al [28] pro-posed a bilinear model to handle so-called tri-subject kin-ship verification problems Fang et al [8] proposed a sparsegroup lasso based feature selection method to determinewhether a query person is from a specific family Unlike

[8 28] our method targets at a more general and challeng-ing problem which can be used to predict an arbitrary num-ber of images with a fixed structure of family roles such asfather-son father-mother-daughter grandfather-father-son-grandson etc

2 Our Method

21 Problem Setting

Let (Xm ym)m=1middotmiddotmiddot M be a group ofM persons fromdifferent views where forallmXm denotes the mth person andym denotes its label (eg identity or family) Let foralln =1 middot middot middot Nmxmn isin Xm be the nth image for the per-son with Nm images in total The goal of our method is topredict the following probability as group membership


Note that our problem setting is naturally applicable tothe multiple instance cases For example during learningwe allow multiple images to be associated with a person (ieXm = xmn) in person Re-ID and kinship verification asin the CUHK Campus [35] and Family101 [8] datasets

While we have motivated our approach in the contextof shared latent variables (body parts or facial landmarks)this information is unavailable during the training or test-ing phases Furthermore estimating locations of body partsand facial landmarks is known to be extremely challenging[2 38] Fortunately in the context of the applications andproblems that we are concerned with the images are ap-proximately aligned In these images foreground objectsare centralized and well cropped Currently most bench-mark datasets are composed of such approximately alignedimages namely the same body parts or facial landmarksappear roughly at similar locations In such cases pixellocations provide good approximation of where body partsand facial landmarks are and we utilize this property to by-pass the detection challenge while accounting for spatialmisalignments with spatial kernels Note that the issue ofvisual ambiguity of the shared variables still remains in ourproblem

22 Parametric GMP Model

We introduce two latent variables to model the relation-ship between the class labels ym and data samples XmThe graphical representation of our parametric probabil-ity model is shown in Fig 3(a) where forallm zm denotesthe view-specific latent variable for view m h denotes theview-shared latent variable and Nm denotes the number ofimages from viewm Based on this model we can factorize





=sum


py|z

Mprodm=1

p(zm|Xm h)p(h)

=sum



[1

Nm

Nmsumn=1

p(zm|xmn h)

]




sumz1z2h









p(zm|xmn h)∆= max


=

exp

minusminπzm

d(πzm πh)

σm


0 otherwise


[1Nm

sumNm

n=1 p(zm|xmn h)]






∆= p(y1 =










minwzwh

λ1

2wz22+

λ2

2wh22+

Nsumk=1

`(y



)

(5)








equivsum


sumzmi

zmjh


)

p(ymi= ymj

|zmi zmj

)p(zmi|Xmi

h)p(zmj|Xmj

h)p(h)


λ1 λ2 λ3 ge 0







m=1 |zm|+ |h|)

to(sum

mi 6=mj|zmi||zmj

|+ |h|)

Let φ(Xmimj)

∆= p(zmi

|Xmi h)p(zmj

|Xmj h) isin

R(|zmi||zmj


= XmiXmj

Also let wmimj

∆= p(ymi

= ymj|zmi

zmj) isin R|zmi

||zmj|


= ymj) isin R|zmi

||zmj|



mi 6=mj

βmimjwTmimj

φ(Xmimj)wh (7)



Multi-view training

minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

`(y



) st β ge 0 (8)


minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

summi 6=mj

`(y(k)mimj

f(X (k)mimj

)) st β ge 0 (9)


persons ymi= ymj



3 Experiments






































































































=sum


py|z

Mprodm=1

p(zm|Xm h)p(h)

=sum



[1

Nm

Nmsumn=1

p(zm|xmn h)

]




sumz1z2h









p(zm|xmn h)∆= max


=

exp

minusminπzm

d(πzm πh)

σm


0 otherwise


[1Nm

sumNm

n=1 p(zm|xmn h)]






∆= p(y1 =










minwzwh

λ1

2wz22+

λ2

2wh22+

Nsumk=1

`(y



)

(5)








equivsum


sumzmi

zmjh


)

p(ymi= ymj

|zmi zmj

)p(zmi|Xmi

h)p(zmj|Xmj

h)p(h)


λ1 λ2 λ3 ge 0







m=1 |zm|+ |h|)

to(sum

mi 6=mj|zmi||zmj

|+ |h|)

Let φ(Xmimj)

∆= p(zmi

|Xmi h)p(zmj

|Xmj h) isin

R(|zmi||zmj


= XmiXmj

Also let wmimj

∆= p(ymi

= ymj|zmi

zmj) isin R|zmi

||zmj|


= ymj) isin R|zmi

||zmj|



mi 6=mj

βmimjwTmimj

φ(Xmimj)wh (7)



Multi-view training

minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

`(y



) st β ge 0 (8)


minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

summi 6=mj

`(y(k)mimj

f(X (k)mimj

)) st β ge 0 (9)


persons ymi= ymj



3 Experiments



































































































∆= p(y1 =










minwzwh

λ1

2wz22+

λ2

2wh22+

Nsumk=1

`(y



)

(5)








equivsum


sumzmi

zmjh


)

p(ymi= ymj

|zmi zmj

)p(zmi|Xmi

h)p(zmj|Xmj

h)p(h)


λ1 λ2 λ3 ge 0







m=1 |zm|+ |h|)

to(sum

mi 6=mj|zmi||zmj

|+ |h|)

Let φ(Xmimj)

∆= p(zmi

|Xmi h)p(zmj

|Xmj h) isin

R(|zmi||zmj


= XmiXmj

Also let wmimj

∆= p(ymi

= ymj|zmi

zmj) isin R|zmi

||zmj|


= ymj) isin R|zmi

||zmj|



mi 6=mj

βmimjwTmimj

φ(Xmimj)wh (7)



Multi-view training

minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

`(y



) st β ge 0 (8)


minwmimj

whβ

λ1

2

summi 6=mj

wmimj22 +

λ2

2wh22 +

λ3

2β22

+

Nsumk=1

summi 6=mj

`(y(k)mimj

f(X (k)mimj

)) st β ge 0 (9)


persons ymi= ymj



3 Experiments



































































































3 Experiments


































































































Group Membership Prediction - Northeastern UniversityGroup Membership Prediction Ziming Zhang,...

Documents

Transcript of Group Membership Prediction - Northeastern UniversityGroup Membership Prediction Ziming Zhang,...