
Dalle Molle Institute for Perceptual Artificial Intelligence • P.O. Box 592 • Martigny • Valais • Switzerland
phone +41 27 721 77 11
fax +41 27 721 77 12
e-mail secretariat@idiap.ch
internet http://www.idiap.ch

On Spectral Methods and the Structuring of Home Videos

Jean-Marc Odobez *   Daniel Gatica-Perez *   Mael Guillemot *

IDIAP-RR 02-55

published in International Conference on Image and Video Retrieval (CIVR) 2003.

* IDIAP, Martigny, Switzerland


IDIAP Research Report 02-55

On Spectral Methods and the Structuring of Home Videos

Jean-Marc Odobez   Daniel Gatica-Perez   Mael Guillemot

published in International Conference on Image and Video Retrieval (CIVR) 2003.

Abstract. Accessing and organizing home videos present technical challenges due to their unrestricted content and lack of storyline. In this paper, we propose a spectral method to group video shots into scenes based on their visual similarity and temporal relations. Spectral methods exploit the eigenvector decomposition of a pair-wise similarity matrix and can be effective in capturing perceptual organization features. In particular, we investigate the problem of automatic model selection, which is currently an open research issue for spectral methods. We first analyze the behaviour of the algorithm with respect to variations in the number of clusters, and then propose measures to assess the validity of a grouping result. The methodology is used to group scenes from a six-hour home video database, and is assessed with respect to a ground-truth generated by multiple humans. The results indicate the validity of the proposed approach, both compared to existing techniques and to the human ground-truth.


1 Introduction

The organization and editing of personal memories contained in home videos constitute a technical challenge due to the lack of efficient tools. The development of browsing and retrieval techniques for home video would open doors to video albuming and other multimedia applications [8], [5]. Unrestricted content and the absence of storyline are the main characteristics of consumer video. Home videos are composed of a set of scenes, each composed of one or a few visually consistent video shots, randomly recorded along time. Such features make consumer video unsuitable for analysis approaches based on storyline models, and had discouraged research on home video analysis until recently, as it was generally assumed that home videos lack any structure [8, 5]. However, recent studies have revealed that the behaviour of home filmmakers induces a certain structure [6, 4], as people implicitly follow certain rules of attention focusing and recording. The structure induced by these filming trends is often semantically meaningful. In particular, the scene structure of home video can be disclosed from such rules [4].

At the same time, there is an increasing interest in computer vision and machine learning towards spectral clustering methods [16, 18, 7, 9], which aim at partitioning a graph based on the eigenvectors of its pair-wise similarity matrix. Although these methods have provided some of the best known results for image segmentation and data clustering, several relevant issues remain unsolved. One of them is model selection. Several of the current techniques partition a graph into two sets, and are recursively applied to find K clusters [16, 7]. However, it has been experimentally observed that using more eigenvectors and directly computing a K-way partitioning provides better results [1]. In [18], Weiss performed a comparative analysis of four spectral methods employed in computer vision. His analysis led other authors to propose a new algorithm [10] that uses K eigenvectors simultaneously and combines the advantages of two other algorithms [15, 16], demonstrating theoretically why the algorithm works under some conditions. However, in most of these references, the automatic determination of the number of classes has not been fully addressed.

In this paper, we propose a methodology to discover the cluster structure in home videos using spectral algorithms. Our paper has two contributions. In the first place, we present a novel analysis related to the problem of model selection in spectral clustering. We first extend the analysis of the performance of the algorithm of [10] to the case when the number of clusters is not the "correct" one. Then, we study some measures to assess the quality of a partition, and discuss the balance between the number of clusters and the clustering quality. In particular, we discuss the use of the eigengap, a measure often used in matrix perturbation and spectral graph theories [7, 10], and referred to as a potential tool for clustering evaluation [10, 9], but for which we are not aware of any experimental studies showing its usefulness in practice. In the second place, we show that the application of spectral clustering to home video structuring results in a powerful method, despite the use of simple features of visual similarity and temporal relations.
The methodology shows good performance with respect to cluster detection and individual shot-cluster assignment, both compared to existing techniques and to humans performing the same task, when evaluated on a six-hour home video database for which a third-party ground-truth generated by multiple subjects is available.

The rest of the paper is organized as follows. Section 2 describes in detail the spectral clustering algorithm, presenting an analysis of algorithmic performance with respect to model selection, and discussing the use of various measures to assess the validity of a grouping result. Section 3 describes the application of the methodology to the structuring of home videos. Section 4 describes the database and the performance measures, and presents results of our approach compared to existing techniques as well as to human performance. Section 5 provides some concluding remarks.

2 The spectral clustering algorithm

In this section we briefly describe the spectral algorithm we employ (proposed in [10] and inspired by [16, 15]). The algorithm is then analyzed for both the ideal and the general case. The choice of the number of clusters is discussed, and several measures for assessing clustering quality are presented.



Figure 1: Clustering example: (a) initial points; (b) the affinity matrix; (c) the rows of Y (in R^3) when K=3 and the eigensystem is solved with eig from matlab; (d) the Q matrix; (e) the clustering result; (f) the rows of Y, but with the eigensystem solved with the eigs function.

2.1 The algorithm

Let us define a graph G by (S, A), where S denotes the set of nodes, and A is the affinity matrix encoding the values associated with the edges of the graph. A is built from the pair-wise similarity defined between any two nodes in the set S. We ensure that A_ii = 0 for all i in S. The affinity A_ij is often defined as:

A_{ij} = \exp\left( -\frac{d^2(i,j)}{2\sigma^2} \right),   (1)

where d(i,j) denotes a distance measure between two nodes, and σ is a scale parameter. Following the notation in [10], the algorithm consists of the following steps:

1. Define D(A) to be the degree matrix of A (i.e. a diagonal matrix such that D_ii = Σ_j A_ij), and construct L(A) by:

L(A) = (D(A))^{-1/2} \, A \, (D(A))^{-1/2}.   (2)

2. Find {x_1, x_2, ..., x_K}, the K largest eigenvectors of L (chosen to be mutually orthogonal in the case of repeated eigenvalues), and form the matrix X = [x_1 x_2 ... x_K] by stacking the eigenvectors in columns.

3. Form the matrix Y from X by renormalizing each row to have unit length. The row Y_i is the new feature associated with node i.

4. Treating each row of Y as a point in R^K, cluster them into K clusters via K-means.

5. Finally, assign to each node of the set S the cluster number corresponding to its row.
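For reference, a minimal Python sketch of steps 1-5 is given below, assuming NumPy, SciPy and scikit-learn's KMeans as stand-ins for the authors' implementation; it uses the default k-means++ initialization rather than the orthogonality-based centroid initialization described at the end of this section, and the function name and parameters are illustrative only.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans

def spectral_clustering(points, K, sigma=1.0):
    """Sketch of the K-way spectral algorithm of [10] (steps 1-5)."""
    # Affinity matrix (Eq. 1), with a zero diagonal as required.
    d2 = cdist(points, points, metric="sqeuclidean")
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(A, 0.0)

    # Step 1: normalized affinity L = D^{-1/2} A D^{-1/2} (Eq. 2).
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    L = A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

    # Step 2: the K largest eigenvectors of L, stacked in columns.
    eigvals, eigvecs = np.linalg.eigh(L)        # ascending eigenvalue order
    X = eigvecs[:, -K:]

    # Step 3: renormalize each row of X to unit length.
    Y = X / np.linalg.norm(X, axis=1, keepdims=True)

    # Steps 4-5: K-means on the rows of Y; each node inherits its row's label.
    labels = KMeans(n_clusters=K, n_init=10).fit_predict(Y)
    return labels, Y
```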



Figure 2: Same data as in Figure 1, with K=2: (a,b,c) when the eigensystem is solved with eig in matlab; (d,e,f) when using eigs. (a)(d) show the rows of Y (in R^2); (b)(e) show the Q matrix; (c)(f) show the clustering result.

As will be explained in the following section, when the value of K corresponds to its true value, the rows of Y should cluster in K orthogonal directions. Exploiting this property, the K initial centroids (\bar{Y}^i)_{i=1,...,K} in the fourth step of the algorithm can be selected from the rows of Y by first finding the row of Y for which the N_init neighbours form the tightest cluster, and then recursively selecting the row whose inner product with the existing centroids is the smallest, according to:

\bar{Y}^{\,i+1} = \arg\min_{Y_j} \; \max_{l = 1, \ldots, i} \left( \bar{Y}^{\,l} \cdot Y_j \right),

where Y_j denotes the j-th row of Y.

2.2 Algorithm analysis

Figures 1(e) and 4(b) show examples of clustering results that can be obtained with this algorithm. It was shown in [10] that the above algorithm is able to find the true clusters under the condition that K corresponds to the true number of clusters (whenever such a value exists). In this section we extend this result by analyzing the behaviour for the cases when K is above or below this ideal number. Two cases are considered: the ideal case, when the true clusters are well separated; and the general case, when noise due to inter-cluster similarity exists.

2.2.1 The ideal case

To understand the behaviour of the algorithm, we consider an ideal case in which the different clusters have infinite separation. Without loss of generality, if we additionally suppose that K_ideal = 3, the set of all node indexes is given by S = S_1 ∪ S_2 ∪ S_3, where S_i denotes the i-th cluster, of size n_i. We also assume that the node indexes are ordered according to their cluster. An example obeying these assumptions is illustrated in Fig. 1, where the distance employed to define the affinity between two nodes is the usual euclidean distance between the 2D coordinates, and the affinity is computed by Eq. 1.

In this case, A (resp. L) is a block-diagonal matrix composed of 3 blocks (A^(ii))_{i=1,2,3} (resp. (L^(ii))_{i=1,2,3}), which are the intra-cluster affinity matrices of A (resp. L). It follows that: (i) its eigenvalues and eigenvectors are the union of the eigenvalues and eigenvectors of its blocks L^(ii) (the latter appropriately padded with zeros); (ii) its highest eigenvalue is unity; (iii) unity is a repeated eigenvalue of order 3; (iv) the 4th eigenvalue is strictly less than 1 (assuming A^(ii)_jk > 0, j ≠ k); and (v) the resulting eigenspace of the unity eigenvalue has dimension 3, and thus the eigenvectors provided by a particular decomposition algorithm are not unique. In this case X_3 (where X_K denotes the first K eigenvectors stacked in columns) is of the form:

X_3 = \begin{bmatrix} v_1^{(1)} & 0 & 0 \\ 0 & v_1^{(2)} & 0 \\ 0 & 0 & v_1^{(3)} \end{bmatrix} \cdot R, \quad \text{with } R = \begin{bmatrix} r_1 \\ r_2 \\ r_3 \end{bmatrix},

where R is a 3x3 rotation matrix composed of the three row vectors r_i, and v_l^{(j)} denotes the l-th eigenvector of the matrix L^(jj). Thus, each row of X_3 is of the form v_{1i}^{(j)} r_j, where v_{1i}^{(j)} is a scalar (the i-th component of v_1^{(j)}). Therefore, after renormalizing the rows of X_3 (step 3 of the algorithm), the matrix Y has rows that fulfill Y_i = r_j for all i ∈ S_j. Fig. 1(c) illustrates this result for the data set of Fig. 1(a). Fig. 1(f) shows the result obtained by changing the matlab function that solves the eigensystem. Note that the three vectors are still orthogonal, but have a different configuration. An alternative formulation defines Q = Y Y^T [15]. In the ideal case, we have Q(i,j) = 1 for nodes i and j belonging to the same cluster, and Q(i,j) = 0 otherwise (see Fig. 1(d)).

2.2.2 Variation in the number of clusters in the ideal case

As we are interested in estimating K, let us consider the two cases when K ≠ K_ideal, which have not been studied in [10]:

1. For K < K_ideal, X_K simply corresponds to the first K columns of X_{K_ideal}. After normalization, we get a Y matrix whose entries are (in our example, with K=2):

Y_j = (r_{i1}, r_{i2}) / \| (r_{i1}, r_{i2}) \| =: r'_i, \quad \forall i \text{ and } \forall j \in S_i.

This simply corresponds to projecting (and normalizing) the initial orthogonal vectors r_i into a lower-dimensional space. Note that this may indeed cause some normalization problems when the projections are near the origin. Since (as pointed out above) the vectors r_i can be in any orthogonal configuration, there is no general rule about the configuration of their projections r'_i. As an example, Fig. 2 shows these projections in the case of the data of Fig. 1. Note that, depending on the specific eigensystem solver, the projections and the clustering results can differ.

2. For K > K_ideal, consider for simplicity that K = 4. Thus, X_4 consists of the matrix X_3 with the fourth eigenvector appended as an extra column. As mentioned above, this eigenvector originates from one of the L^(ii) matrices; more precisely, λ_4 = max_i λ_2^{(i)}. Assume that this fourth eigenvector is chosen from the first cluster. We have:

X_4 = \begin{bmatrix} v_1^{(1)} r_1 & v_2^{(1)} \\ v_1^{(2)} r_2 & 0 \\ v_1^{(3)} r_3 & 0 \end{bmatrix}.

After row normalization, it is easy to show that the resulting Q = Y Y^T matrix has the following property:

Q(i,j) = 0, \quad \forall i \in S_k, \; \forall j \in S_l, \; k \neq l.

For example, for all (i,j) ∈ S_1 x S_2:

Q(i,j) = Y_i Y_j^T = (v_{1i}^{(1)} r_1, \, v_{2i}^{(1)}) \, (v_{1(j-n_1)}^{(2)} r_2, \, 0)^T = 0,

meaning that the true original clusters remain orthogonal to each other. Furthermore, we have:

Q(i,j) = 1, \quad \forall (i,j) \in S_k^2, \; k = 2 \text{ or } 3,

indicating that the second and third clusters remain unchanged. Indeed, only the first cluster is affected, and is divided into two parts. The same kind of reasoning can be applied when moving to higher values of K. To summarize, when K > K_ideal, the resulting clustering corresponds to an over-clustering of the ideal case. This point is illustrated in Figs. 3 and 4.
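To make the block structure above concrete, the following sketch (reusing the spectral_clustering helper from the Section 2.1 sketch) embeds three well-separated synthetic 2-D clusters and inspects Q = Y Y^T for K equal to and above the true number of clusters; the data and the scale value are illustrative and are not those of Figure 1.

```python
import numpy as np

# Three well-separated 2-D clusters (illustrative data, not the points of Fig. 1).
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(c, 0.05, size=(30, 2))
                 for c in [(0, 0), (5, 0), (0, 5)]])

for K in (3, 4):                                    # true K, then one step of over-clustering
    _, Y = spectral_clustering(pts, K, sigma=0.5)   # helper sketched in Section 2.1
    Q = Y @ Y.T
    # With K = 3, Q is (close to) a 0/1 block-diagonal matrix; with K = 4,
    # cross-cluster blocks stay near 0 and only one diagonal block is split.
    print(K, np.round(Q[:5, 30:35], 2))             # a cross-cluster block of Q
```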



Figure 3: Same data as in Fig. 1. (a) The Q matrix when K=4; (b) the corresponding clustering result; (c) the clustering result when K=5.

2.2.3 The general case

In the ideal case, we have seen that the Q matrix should only have 0 and 1 entries when the true K is selected, and that there might be other entry values when K ≠ K_ideal (especially K > K_ideal). Indeed, this can be related to the distortion obtained at the end of the K-means algorithm:

MSE = \frac{1}{n} \sum_{i=1}^{K} \sum_{j \in \text{cluster}_i} \| Y_j - \bar{Y}^{\,i} \|,   (3)

where the \bar{Y}^{\,i} represent the centroids at the end of the K-means. In the ideal case, and when K = K_ideal, the distortion should be 0. Furthermore, for the real case, it was shown in [10] that the distortion (computed with respect to the ideal cluster centers r_i) is bounded by some value that depends on the entries of the affinity matrix (related to the clusters' density, the intra-cluster connectivity, etc.). Given the correct K value, the authors in [10] use the distortion as a measure to select the clustering result from a set of results obtained by varying the scale parameter σ in the affinity matrix computation (Eq. (1)). However, the actual value of the bound in their experiments is not specified, and there is no indication of how this bound would behave for varying values of K. Note in particular that the distortion measure is computed in spaces of different dimension (the Y_j lie in R^K), so distortion values may not be easily compared across K.

2.3 Automatic model selection

The selection of the "correct" number of clusters is a difficult task. We have seen in the previous section that the analysis of the MSE measure for different K is not trivial. For this reason, we considered other criteria, stemming from matrix perturbation and spectral graph theories, to perform model selection.

We have adopted the following strategy. The spectral clustering algorithm is employed to provide candidate solutions (one per value of K), and the selection is performed based on the criteria discussed in the following sections.
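As a small illustration of Eq. (3), the helper below computes the distortion from the embedded rows Y, their K-means labels, and the final centroids (e.g. scikit-learn's cluster_centers_); it keeps the unsquared norm used in the equation, and the names are ours, not the authors'.

```python
import numpy as np

def kmeans_distortion(Y, labels, centroids):
    """Distortion of Eq. (3): average distance of each row of Y to its centroid."""
    n = Y.shape[0]
    return sum(np.linalg.norm(Y[i] - centroids[labels[i]]) for i in range(n)) / n
```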



Figure 4: Another example. The clustering result with (a) K=2, (b) K=3, (c) K=4, (d) K=5.


Figure 5: Influence of a scale change. (a) Eigengap of the individual clusters: the red curve at the top corresponds to the "ball-like" cluster; the blue curve in the middle corresponds to the vertical-line cluster; the green curve at the bottom corresponds to the half-circle cluster. (b) Eigengap measure δ_K for K=1 (the magenta curve with diamonds), K=2 (the green curve with x's), K=3 (the cyan curve with triangles), K=4 (the blue curve with +'s) and K=5 (the red curve with o's). (c) Relative cut rcut_K. Same labeling as in (b).

2.3.1 The eigengap

The eigengap is an important measure in the analysis of spectral methods [7, 9, 10]. Please refer to [2] for basic definitions. The eigengap of a matrix A is defined by δ(A) = 1 - λ_2/λ_1, where λ_1 and λ_2 are the two largest eigenvalues of A [7]. In practice, the eigengap is often used to assess the stability of the first eigenvector¹ of a matrix, and it can be shown to be related to the Cheeger constant [2], a measure of the tightness of clusters. To clarify this relation, let us define the cut value of the partitioning (I, Ī) of a graph characterized by its affinity matrix A by

\mathrm{Cut}_A(I, \bar{I}) = \sum_{i \in I} \sum_{j \notin I} A_{ij}.

We also define the volume of the subset I by \mathrm{Vol}_A(I) = \sum_{i \in I} \sum_{j \in I} A_{ij}. Furthermore, the conductance φ of the partitioning (I, Ī) is defined as

\phi_A(I) = \frac{\mathrm{Cut}_A(I, \bar{I})}{\min(\mathrm{Vol}_A(I), \mathrm{Vol}_A(\bar{I}))}.

The Cheeger constant is defined as the minimum of the conductance over all the possible bipartitionings of the graph,

h_G(A) = \min_I \phi_A(I).

¹ Or the first k eigenvectors, in cases where we have a k-repeated largest eigenvalue.
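A direct transcription of the cut, volume, and conductance definitions is sketched below for a single bipartition given by a boolean mask; the Cheeger constant itself would require minimizing this quantity over all bipartitions, which is not attempted here, and the function name is ours.

```python
import numpy as np

def conductance(A, mask):
    """Conductance phi_A(I) of the bipartition (I, I-bar) given by a boolean node mask."""
    mask = np.asarray(mask, dtype=bool)
    cut = A[np.ix_(mask, ~mask)].sum()          # Cut_A(I, I-bar)
    vol_I = A[np.ix_(mask, mask)].sum()         # Vol_A(I), as defined in the text
    vol_Ibar = A[np.ix_(~mask, ~mask)].sum()    # Vol_A(I-bar)
    return cut / min(vol_I, vol_Ibar)
    # The Cheeger constant h_G(A) is the minimum of this value over all bipartitions.
```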


It can be shown that the Cheeger constant is bounded from below by half the eigengap [10, 7]:

h_G(A) \geq \frac{1}{2} \, \delta(A).   (4)

The conductance indicates how well (I, Ī) partitions the set of nodes into two subsets, and the minimum over I corresponds to the best partition. Therefore, if there exists a partition for which (i) the weights A_ij of the graph edges across the partition are small, and (ii) each of the regions in the partition has enough volume, then the Cheeger constant will be small. Starting from K = 1, we would like to select the simplest clustering model (i.e., the smallest K) for which the extracted clusters are tight enough (hard to split into two subsets). This is equivalent to requesting that the Cheeger constant be large enough for each cluster or, due to Eq. (4), to requesting that the eigengap be large for all clusters. Our first criterion is thus defined by

\delta_K = \min_{i \in 1 \ldots K} \delta\!\left( L\!\left( A_K^{(ii)} \right) \right),   (5)

where the A_K^{(ii)} are the submatrices extracted from A according to the model obtained by the spectral algorithm, and L is defined by Eq. (2). The selection algorithm selects the smallest K value for which the eigengap as defined by Eq. (5) exceeds a threshold.

2.3.2 The relative cut

The clustering measure defined by Eq. (5) has a drawback. The algorithm could select a clustering with many clusters of relatively low quality, so that the minimum in Eq. (5) is above the threshold, and reject a clustering of overall good quality that unfortunately has one cluster of very low quality [7]. We thus considered a second criterion that characterizes the overall quality of a clustering. This criterion is defined as the fraction of the total edge weight that is not covered by the clusters,

\mathrm{rcut}_K = \frac{\sum_{k=1}^{K} \sum_{l=1, l \neq k}^{K} \sum_{i \in S_k} \sum_{j \in S_l} A_{ij}}{\sum_i \sum_j A_{ij}}.

The algorithm outputs the largest K for which rcut_K is below a threshold.

2.4 Scale analysis and selection

The selection algorithm described above requires the setting of a threshold, which is itself dependent on the setting of the scale parameters required in the definition of the pairwise affinity between nodes. For instance, let us consider the case of Fig. 4, where the affinity is defined by Eq. (1). Fig. 5(a) plots the eigengap of each of the three clusters considered separately. As expected, as the scale parameter increases, the eigengap increases as well, meaning that clusters become harder to split. Note that this increase is quite regular (there is no 'step' effect in the considered scale range) and that the increase rate is very dependent on the cluster type, i.e., on whether the nodes are concentrated (top red curve) or positioned along a one-dimensional manifold (lower blue and green curves).

Fig. 5(b,c) displays the evolution of our criteria for different K values. The analysis of the δ_K curves exhibits two trends: before σ = 1, the nodes can more or less be split into 4 parts (δ_1, δ_2, δ_3 are near zero and separated from δ_4 and δ_5), whereas above σ = 1, there is no evidence for more than one cluster (δ_1 starts increasing rapidly), or at most two clusters (δ_1 ≈ δ_2). The analysis of the relative cut measure provides similar trends. Note (in Figs. 5(b) and 5(c)) the behaviour when K = 5 around σ = 1, due to bad K-means initialization and numerical instabilities.

The main conclusion that can be drawn from these curves is that the scale value has a direct influence on the measured criteria. This should not come as a surprise, since clustering is inherently a scale-dependent problem.
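A possible implementation of the two criteria and of the eigengap-based selection rule is sketched below, assuming the candidate partitions have already been computed (one per value of K) as described in Section 2.3; the dictionary candidate_labels and the default threshold of 0.15 (the value used later in Section 4.3) are illustrative choices, not the authors' code.

```python
import numpy as np

def normalized(A):
    """L(A) = D^{-1/2} A D^{-1/2} of Eq. (2)."""
    d = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d[:, None] * d[None, :]

def eigengap(M):
    """delta(M) = 1 - lambda_2 / lambda_1, with the two largest eigenvalues of M."""
    lam = np.sort(np.linalg.eigvalsh(M))[::-1]
    return 1.0 - lam[1] / lam[0]

def delta_K(A, labels):
    """Eq. (5): smallest per-cluster eigengap of L(A^(ii)) over the current partition.

    Assumes every cluster contains at least two nodes with nonzero affinity.
    """
    return min(eigengap(normalized(A[np.ix_(labels == k, labels == k)]))
               for k in np.unique(labels))

def relative_cut(A, labels):
    """rcut_K: fraction of the total edge weight lying between different clusters."""
    same = labels[:, None] == labels[None, :]
    return A[~same].sum() / A.sum()

def select_K_eigengap(A, candidate_labels, threshold=0.15):
    """Smallest K whose partition passes the eigengap criterion of Section 2.3.1.

    candidate_labels is a hypothetical dict {K: label array from the spectral algorithm}.
    (The relative-cut rule would instead keep the largest K with rcut below a threshold.)
    """
    for K in sorted(candidate_labels):
        if delta_K(A, candidate_labels[K]) >= threshold:
            return K
    return max(candidate_labels)        # fallback if no candidate passes
```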


3 Spectral structuring of home videos

The structure of home video bears similarity to the structure of consumer still pictures [11]: videos contain series of ordered and temporally adjacent shots that can be organized in groups that convey semantic meaning, usually related to distinct scenes. Visual similarity and temporal ordering are indeed the two main criteria that allow people to identify clusters in video collections when nothing else is known about the content (unlike the filmmaker, who knows details of context). A clustering algorithm that integrates visual similarity and temporal adjacency in a joint model is therefore a sensible choice. Previous formulations in the literature are based on the same idea [17], [12]. We explore the use of spectral clustering, as described in detail in the following sections.

3.1 Shot representation and feature extraction

Home video shots usually contain more than one appearance, due to hand-held camera motion. Consequently, more than one key-frame might be necessary to represent the appearance variation within a shot. In this paper, a shot is represented by a small fixed number of key-frames, N_kf = 5. However, we are aware that the number and quality of key-frames could have an impact on clustering performance. Shots are further represented by standard visual features [4]. The i-th key-frame f_i of a video is characterized by a colour histogram h_i in the RGB space (uniformly quantized to 8 x 8 x 8 bins).

3.2 Similarity computation

The pair-wise affinity matrix A is built directly from the set of all key-frames in a video (indexed as a whole, but keeping the key-frame-to-shot correspondence) by defining

A_{ij} = \exp\!\left( - \left( \frac{d_v^2(f_i, f_j)}{2\sigma_v^2} + \frac{d_t^2(f_i, f_j)}{2\sigma_t^2} \right) \right),   (6)

where A_ij is the affinity between key-frames f_i and f_j, d_v and d_t are measures of visual and temporal similarity, and σ_v^2 and σ_t^2 are visual and temporal scale parameters.

Visual similarity is computed by the metric based on the Bhattacharyya coefficient, which has proven to be robust for comparing colour distributions [3],

d_v(f_i, f_j) = \left( 1 - \rho_{BT}(h_i, h_j) \right)^{1/2},   (7)

where ρ_BT denotes the Bhattacharyya coefficient, defined by \rho_{BT} = \sum_k (h_{ik} h_{jk})^{1/2}, the sum running over all bins in the histograms.

Temporal similarity exploits the fact that shots that are distant along the temporal axis are less likely to belong to the same scene, and is defined by

d_t(f_i, f_j) = \frac{\big| \, |f_j| - |f_i| \, \big|}{|v|},   (8)

where |f_i| denotes the absolute frame number of f_i in the video, and |v| denotes the entire video clip duration (in frames). Similar features have been used by other authors in different formulations [17]. Note that the range of both d_v and d_t is [0, 1].

Taking into account the discussion in Subsection 2.4, we set the scale parameters σ_v and σ_t in the following way. Building upon a previous study of home videos in [4], we fixed the σ_v value to 0.25, which represents a good threshold for separating the distributions of intra- and inter-cluster similarities in home videos. Similarly, it was shown in [4] that almost 70% of home video scenes are composed of four or fewer shots. Thus, the σ_t value was set to the average temporal separation between four shots in a given video.
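The affinity construction of Eqs. (6)-(8) can be sketched as follows, assuming normalized 8x8x8 RGB histograms and absolute frame numbers for the key-frames; the default σ_v = 0.25 follows the setting above, while σ_t must be set per video (average temporal separation between four shots) and is left here as a placeholder argument.

```python
import numpy as np

def keyframe_affinity(histograms, frame_numbers, video_length,
                      sigma_v=0.25, sigma_t=1.0):
    """Affinity of Eq. (6) between all key-frames of one video.

    histograms:    (N, 512) array of normalized 8x8x8 RGB histograms.
    frame_numbers: (N,) absolute frame number of each key-frame.
    video_length:  total clip duration in frames.
    sigma_t:       placeholder; set per video as described in Section 3.2.
    """
    h = np.asarray(histograms, dtype=float)
    # Bhattacharyya coefficient and visual distance (Eq. 7).
    rho = np.sqrt(h[:, None, :] * h[None, :, :]).sum(axis=2)
    d_v = np.sqrt(np.clip(1.0 - rho, 0.0, None))
    # Temporal distance (Eq. 8), normalized by the clip duration.
    t = np.asarray(frame_numbers, dtype=float)
    d_t = np.abs(t[:, None] - t[None, :]) / float(video_length)
    # Joint affinity (Eq. 6), with a zero diagonal as required by the algorithm.
    A = np.exp(-(d_v ** 2 / (2 * sigma_v ** 2) + d_t ** 2 / (2 * sigma_t ** 2)))
    np.fill_diagonal(A, 0.0)
    return A
```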


3.3 Shot assignment after spectral clustering

The spectral method is applied as discussed in Section 2. A cluster number is then assigned to each shot using a simple majority rule on the cluster labels of its key-frames. In the case of a tie, the cluster is randomly selected from the possible candidates.

4 Experiments

In this section we first describe the dataset and the performance measures that we use for evaluation. We then compare the best result obtained by the spectral method with the performance of humans as well as with a probabilistic hierarchical clustering method [4]. The third subsection is devoted to the comparison between the different criteria used for the selection of K.

4.1 Data set and ground-truth

The data set consists of 20 MPEG-1 home videos, digitized from VHS tapes provided by seven different people, each with an approximate duration of 20 minutes. The videos depict vacations, school parties, weddings, and children playing in indoor/outdoor scenarios. A ground-truth (GT) at the shot level was semi-automatically generated, resulting in a total of 430 shots. The number of shots per video varies considerably (between 4 and 62 shots).

There are two typical options for defining the GT at the scene level. In the first-party approach, the GT is generated by the video creator [11]. This method incorporates specific context knowledge about the content (e.g., family links and location relationships) that cannot be extracted by current automatic means. In contrast, a third-party GT is defined by a subject not familiar with the content. In this case, there still exists human context understanding, but limited to what is displayed to the subject. This "blind context" makes third-party GTs a fairer benchmark strategy for automatic algorithms [13].

In this paper, we use a third-party GT based on multiple-subject judgement, which takes into account the fact that different people might generate different results. Scenes for each video were identified by approximately twenty subjects using a GUI that displayed a key-frame-based video summary (no actual videos were displayed). Only a very general statement of the clustering task was given to the subjects at the beginning of the process, and no initial solution was provided. The final GT set consists of about 400 human segmentations.

4.2 Performance measures

The performance measures that we consider are (i) the number of clusters selected by the algorithm and (ii) the shots in error (SIE). For the number of clusters, we report the value we obtain and compare it with the numbers provided by humans. For shots in error, let us denote by GT^i = {GT^i_j, j ∈ 1,...,N_i} the set of human GTs for the video V_i, and by C^i the solution of an algorithm for the same video. The SIE between the clustering result C^i and a ground-truth GT^i_j is defined as the number of shots whose cluster label in C^i does not match the label in the GT. This figure is computed between C^i and each GT^i_j, and the GTs are ranked according to this measure. We then keep three measures: the minimum, the median and the maximum value of the SIE, denoted SIE^i_min, SIE^i_med and SIE^i_max respectively. The minimum value SIE_min gives an indication of how far an automatic clustering is from the nearest segmentation provided by a human. The median value can be considered as a fair measure of how well the algorithm performs, taking into account the majority of the human GTs and excluding the largest errors. These large errors may come from outliers and are taken into account by SIE^i_max, which gives an idea of the spread of the measures.

For the overall performance measure, we computed the average over all videos of the SIE measures, expressed as a percentage of shots in error with respect to the number of shots in each video. Note that this normalization is necessary because the number of clusters (and shots) varies considerably from one video to another.
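A sketch of the SIE computation is given below. Note that the paper does not specify how the arbitrary cluster labels of C^i and of a GT are put in correspondence; the sketch assumes a best one-to-one label matching obtained with the Hungarian algorithm on the label co-occurrence counts, which is an assumption on our part, and the function names are illustrative.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def shots_in_error(pred, gt):
    """Shots whose cluster label disagrees with one ground-truth partition.

    Assumption (not specified in the paper): labels of the two partitions are
    aligned with a best one-to-one matching on their co-occurrence counts.
    """
    pred, gt = np.asarray(pred), np.asarray(gt)
    p_ids, g_ids = np.unique(pred), np.unique(gt)
    overlap = np.array([[np.sum((pred == p) & (gt == g)) for g in g_ids]
                        for p in p_ids])
    rows, cols = linear_sum_assignment(-overlap)    # maximize matched shots
    return len(pred) - overlap[rows, cols].sum()

def sie_stats(pred, gts):
    """SIE_min, SIE_med, SIE_max of a clustering against the set of human GTs."""
    errors = [shots_in_error(pred, gt) for gt in gts]
    return min(errors), float(np.median(errors)), max(errors)
```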


          H       PHC     SM
SIE_min   0.078   0.156   0.116
SIE_med   0.275   0.362   0.271
SIE_max   0.535   0.532   0.539

Table 1: Average of the percentage of shots in error for humans (H), the probabilistic hierarchical clustering algorithm (PHC), and the spectral method (SM).

4.3 Results

The best result with our method was obtained using the eigengap criterion and a threshold δ_K = 0.15. We compared it with the probabilistic hierarchical clustering method (PHC) described in [4], as well as with the performance of humans. The latter was obtained in the following way: for each video, the minimum, median and maximum shots in error were computed for each human GT against all the others. These values were then averaged over all subjects, and the resulting averages are plotted in Fig. 6 for each video. Finally, we computed the average over all videos to get the overall performance.

Table 1 summarizes the results. We can first notice from the minimum and maximum values that the spread of performances is very high, given the performance measure. Secondly, the spectral method performs better than PHC, as can be seen from the median and minimum values, and approximately as well as the humans.

Fig. 6 displays the results obtained for each video. First, in Fig. 6(a), we show the number of detected clusters (the red circles) as predicted by the algorithm and compare it to the mean of the number of clusters in the ground-truth. The spread of the cluster numbers in the ground-truth is represented by the blue bar (plus or minus one standard deviation). Note that the videos have been ordered according to their number of shots. The detected cluster numbers are in good accordance with the GT, though slightly underestimated. Fig. 6(b) displays the values of the shots-in-error measures in comparison to the average human performance. The circles depict the measures obtained with our method and the crosses denote human performance. The colour represents the different measures (minimum in red, median in blue, and maximum in green). The median performance of our algorithm is better than the average human in 8 cases and worse in 6 cases. Notice that in 25% of the cases, our algorithm provides a segmentation that also exists in the ground-truth.

Two examples of the generated clusters are shown in Fig. 7. Each cluster is displayed as a row of shots, which in turn are represented by one keyframe each. Qualitatively, the method provides sensible results.

4.4 Comparison of the different criteria

Fig. 8 shows the results obtained using the two criteria. The selection with the eigengap criterion slightly outperforms the results obtained with the relative cut. We can also notice that the results are quite consistent over a relatively large range of thresholds (and, in any case, better than the probabilistic hierarchical clustering algorithm). We also used the MSE (cf. Eq. (3)) as a criterion, but could not obtain good results with it.

5 Conclusion

In this paper we have described a method for clustering video shots using a spectral method. In particular, we investigated the automatic selection of the number of clusters, which is currently an open research issue for spectral methods. We have shown in our experiments that the eigengap measure can indeed be used to estimate this number. The algorithm was applied to a six-hour home video database, and the results compare favorably to existing techniques as well as to human performance.



Figure 6: (a) Determination of the number of clusters. (b) Percentage of shots in error. The blue bar indicates the spread of the human performances of the SIE_med value.

Figure 7: Example of shot clustering. (a) Video 16; (b) Video 8. Only one keyframe of each shot is displayed.



Figure 8: Variation of the average percentage of shots in error (average of the median in red, of the minimum in blue) for the different criteria as a function of their threshold: (a) the eigengap threshold (ranging from 0.1 to 0.3); (b) the relative cut threshold (ranging from 0.04 to 0.08).

6 Acknowledgments

The authors thank the Eastman Kodak Company for providing the Home Video Database, and Napat Triroj (University of Washington) for providing the multiple-subject third-party scene ground-truth. This work was carried out in the framework of the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM)2.

References

[1] C. Alpert, A. Kahng, and S.Z. Yao, "Spectral partitioning: The more eigenvectors, the better," Discrete Applied Math, no. 90, pp. 3-26, 1999.

[2] F.R.K. Chung, Spectral Graph Theory, American Mathematical Society, 1997.

[3] D. Comaniciu, V. Ramesh, and P. Meer, "Real-Time Tracking of Non-Rigid Objects using Mean Shift," in Proc. IEEE CVPR, Hilton Head Island, S.C., June 2000.

[4] D. Gatica-Perez, M.-T. Sun, and A. Loui, "Consumer Video Structuring by Probabilistic Merging of Video Segments," in Proc. IEEE Int. Conf. on Multimedia and Expo, Tokyo, Aug. 2001.

[5] G. Iyengar and A. Lippman, "Content-based browsing and edition of unstructured video," in Proc. IEEE Int. Conf. on Multimedia and Expo, New York City, Aug. 2000.

[6] J.R. Kender and B.L. Yeo, "On the Structure and Analysis of Home Videos," in Proc. Asian Conf. on Computer Vision, Taipei, Jan. 2000.

[7] R. Kannan, S. Vempala, and A. Vetta, "On clusterings - good, bad and spectral," in Proc. 41st Symposium on the Foundations of Computer Science (FOCS), 2000.

[8] R. Lienhart, "Abstracting Home Video Automatically," in Proc. ACM Multimedia Conf., Orlando, Oct. 1999, pp. 37-41.

[9] M. Meila and J. Shi, "A random walks view of spectral segmentation," in Proc. AISTATS, Florida, 2001.

[10] A. Ng, M.I. Jordan, and Y. Weiss, "On spectral clustering: analysis and an algorithm," in Proc. NIPS, Vancouver, Dec. 2001.

[11] J. Platt, "AutoAlbum: Clustering Digital Photographs using Probabilistic Model Merging," in Proc. IEEE Workshop on Content-Based Access to Image and Video Libraries, Hilton Head Island, S.C., 2000.


[12] Y. Rui and T. Huang, "A Unified Framework for Video Browsing and Retrieval," in A. Bovik, Ed., Image and Video Processing Handbook, Academic Press, 2000, pp. 705-715.

[13] A. Savakis, S. Etz, and A. Loui, "Evaluation of image appeal in consumer photography," in Proc. SPIE/IS&T Conf. on Human Vision and Electronic Imaging V, San Jose, CA, Jan. 2000.

[14] B. Scholkopf, A. Smola, and K.-R. Muller, "Nonlinear component analysis as a kernel eigenvalue problem," Neural Computation, no. 10, pp. 1299-1319, 1998.

[15] G.L. Scott and H.C. Longuet-Higgins, "Feature grouping by relocalisation of eigenvectors of the proximity matrix," in Proc. British Machine Vision Conf., 1990, pp. 103-108.

[16] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888-905, 2000.

[17] M. Yeung, B.L. Yeo, and B. Liu, "Segmentation of Video by Clustering and Graph Analysis," Computer Vision and Image Understanding, vol. 71, no. 1, pp. 94-109, July 1998.

[18] Y. Weiss, "Segmentation using eigenvectors: a unifying view," in Proc. IEEE ICCV, 1999.