
Photo Stream Alignment for Collaborative Photo Collection and Sharing in Social Media

Jianchao Yang
University of Illinois at Urbana-Champaign, Urbana, Illinois, USA
[email protected]

Jiebo Luo and Jie Yu
Kodak Research Labs, Rochester, New York, USA
[email protected]

Thomas Huang
University of Illinois at Urbana-Champaign, Urbana, IL, USA
[email protected]

ABSTRACT

With the popularity of digital cameras and camera phones, it is common for different people, who may or may not know each other, to attend the same event and take pictures and videos from different spatial or personal perspectives. Within the realm of social media, it is desirable to enable these people to share their pictures and videos in order to enrich memories and facilitate social networking. However, it is cumbersome to manually manage these photos from different cameras, whose clock settings are often not calibrated. In this paper, we propose an automatic algorithm to accurately align different photo streams or sequences from different photographers for the same event in chronological order on a common timeline, while respecting the time constraints within each photo stream. Given the preferred similarity measure (e.g., visual, temporal, and spatial similarities), our algorithm performs photo stream alignment via matching on a sparse representation graph that forces the data connections to be sparse in an explicit fashion. We evaluate our algorithm on real-world personal online albums for thirty-six events and demonstrate its efficacy in automatically facilitating collaborative photo collection and sharing.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Retrieval Models; I.4 [Image Processing and Computer Vision]: Feature Measurement, Image Representation

General Terms

Algorithms, Experimentation

Keywords

Sparse representation graph, kernel sparse representation, photo stream alignment, photo sharing, graph matching, collaborative media collection.

1. INTRODUCTION

Today, millions and millions of users worldwide capture images and videos to record various events in their lives.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
WSM'11, November 30, 2011, Scottsdale, Arizona, USA.
Copyright 2011 ACM 978-1-4503-0989-9/11/11 ...$10.00.

Figure 1: Collaborative photo collection and sharing is common in social media websites such as Facebook and Picasa.

Such image data capturing serves two important purposes: first, for the participants themselves to relive the events of their lives at later points in time, and second, for them to share these events with other friends and family who were not present at the events but are still interested in knowing how the event (e.g., the vacation, the wedding, or the trip) went. When it comes to sharing, it is also common for people who have been to the same event to share pictures taken there by different people with different viewpoints, timing, or subjects. The importance of sharing is indeed underscored by the billions of image uploads per month on social media sites like Facebook, Flickr, Picasa, and so on.

In this study, we consider a very common scenario for many photo-worthy events: for example, you and several friends took a trip to Yellowstone National Park, and each of you brought your own camera to record the trip. In the end, you collectively ended up with several photo albums, each created by a different camera and composed of hundreds or even thousands of photos. One natural problem arises: how do you share these photo albums or collections among the friends in an effective and organized manner? Such a scenario occurs often for events that involve many people, such as trips, excursions, sports activities, concerts and shows, graduations, weddings, and picnics. Many photo sharing sites now provide functions or apps to facilitate photo sharing.


Figure 2: Overview of our collaborative photo collection and sharing system.

As shown in Figure 1, for example, Picasa now allows people one shares photos with to contribute to one's album, while a Facebook app called Friends Photos searches one's network to present an overview of friends' photo albums. However, these functions do not automatically align photos of the same event from different contributors.

Currently, people must view the individual photo collections separately, and few are willing to invest time to augment their own collections with photos taken by others. It is clearly not a solution to simply merge all the photos from different albums into one super collection. First, because different albums use different photo naming conventions, putting those photos together will result in either ordering the photos into disjoint groups bearing no semantic meaning or, worse, disorder due to naming conflicts.

Second, it is unlikely that one can merge these photos based on their timestamps. Within each photo collection from one camera, the photos can be arranged in chronological order based on their timestamps, forming a photo stream. However, since people rarely bother to calibrate their camera clocks before taking photos, the timestamps from different cameras are typically out of sync and not reliable for aligning the photos in different streams. Typically, the camera clocks can be offset by minutes, hours, or even days when people travel through different time zones. In fact, this is true for all of the real-world photo collections gathered for the experiments in this work, none of which were captured with such experiments in mind.

Therefore, it is desirable to develop an automatic algorithm that helps different people, who may or may not know each other but attended the same event and took photos from different spatial or personal perspectives, to share their pictures and videos in an effective way, especially for online albums, in order to enrich memories and promote social networking.

In recent years, the explosion of consumer digital photos has drawn growing research interest in the organization and sharing of community photo collections, due to the popularity of web-based user-centric multimedia social networks, such as Facebook, Flickr, Picasa, and YouTube. While many efforts have been devoted to photo organization [12], annotation [6], [8], [11], [18], [19], summarization [5], [14], browsing [9], [15], and search [10], [13], little has been done to relate media collections that are about the same event for effective sharing. As a practical need we encounter every day, intelligent photo collection and sharing is on many people's wish lists.

This paper attempts to address the problem raised in the beginning and aims to develop an automatic algorithm to facilitate the sharing of multiple photo albums captured for the same event. Figure 2 illustrates the diagram of our system, where multiple albums are aligned in the chronological order of the event to create a master stream, which captures the integrity of the whole event for sharing among different users. Our algorithm relies on the kernel sparse representation graph, constructed by explicitly sparsifying the photo connections using ℓ1-norm minimization, which generalizes the ℓ1-graph [3], [4] to the kernel space for broader applicability. For consumer albums, most photos from different cameras are visually uncorrelated, but some of them overlap in content and thus form the basis for aligning different photo streams. Therefore, the visual correlation links between different photo streams are sparse. As we will see, by explicitly accounting for this sparseness in the graph construction via ℓ1-norm minimization, our matching algorithm is more robust than conventional methods.

The remainder of the paper is organized as follows. Section 2 describes the kernel sparse representation graph, given the preferred photo similarity measures. Tailored to our problem, Section 3 introduces a specific sparse bipartite graph for robust alignment of multiple photo streams. In Section 4 we report the experimental results on a total of 36 real-world photo datasets, each of which involves two or more cameras and was collected from the Picasa Web Album. Finally, Section 5 concludes the paper with future work.

2. KERNEL SPARSE GRAPH CONSTRUCTION

In many computer vision and machine learning tasks, finding the correct data relationship, typically represented as a graph, is essential to the success of the algorithms. Due to the limitations of existing similarity measures, sparse graphs usually offer certain advantages: they reduce spurious connections between data points and thus tend to exhibit high robustness [20]. Recently, Cheng et al. [3] proposed a new graph construction method via ℓ1-norm minimization, where the graph connections are established based on the sparse representation coefficients of the current datum in terms of the remaining data points. Robust to noise and adaptive in neighbor selection, the ℓ1-graph demonstrates substantial improvements in clustering and subspace learning over conventional graph construction methods, such as kNN and ϵ-ball graphs. However, the method in [3] is limited to applications where the data can be roughly aligned, e.g., faces and digits. In this section, we propose to generalize the concept of the ℓ1-graph for exploring data relationships in a general kernel space, making our new graph applicable to much broader applications.

2.1 Similarity Measure

To construct the graph, we first define the similarity measure for photos. With the associated meta-data of consumer photos, we represent each photo as {x, g}, where x denotes the image itself and g its geo-location. To keep the notation uncluttered, we simply write x instead of the pair in the following presentation. We define the photo similarity as

    S(x_i, x_j) = S_v(x_i, x_j) · S_g(x_i, x_j),    (1)

where S_v and S_g are the visual and geo-location similarities between photos x_i and x_j, respectively. Other information, e.g., photo tags for online albums, can also be incorporated if available.

Visual similarity S_v is the most important cue for our tasks. In this paper, we choose the following three visual features to compute the visual similarity, due to their simplicity and effectiveness:

1. Color Histogram, an evidently important cue for consumer photos;

2. GIST [16], a simple and popular feature to capture the global visual shape;

3. LLC [17], a state-of-the-art appearance feature for image classification.

Page 3: Photo Stream Alignment for Collaborative Photo Collection ...jyang29/papers/WSM2011.pdf · the sharing of multiple photo albums captured for the same event. Figure 2 illustrates the

We concatenate these features with equal weights, normalize them to unit length, and simply use their inner products as our visual similarity.

Photos taken close together in location are likely about the same content. Given the geo-locations g_i and g_j for photos x_i and x_j, a geo-location similarity can be defined, e.g., using a Gaussian kernel:

    S_g(x_i, x_j) = exp(−∥g_i − g_j∥_2^2 / σ).    (2)

Finally, we assume that the similarity measure S defines a valid kernel κ(·,·), with Φ(·) being the implicit feature mapping function, i.e.,

    κ(x_i, x_j) = Φ(x_i)^T Φ(x_j) = S(x_i, x_j).    (3)

In this paper, we mainly rely on the visual similarity (geo-location similarity was not used even when GPS information was recorded; we leave it for future work).
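For concreteness, the combined measure of Eqns. 1 and 2 can be sketched as follows (a minimal Python sketch; the feature vectors are assumed to be pre-extracted and concatenated, and sigma is an assumed bandwidth, not a value from the paper):

```python
import numpy as np

def visual_similarity(f_i, f_j):
    """Inner product of unit-normalized feature vectors. f_i and f_j are the
    pre-extracted, equally weighted concatenations of the color histogram,
    GIST, and LLC features (the extractors themselves are assumed)."""
    return float(f_i @ f_j / (np.linalg.norm(f_i) * np.linalg.norm(f_j)))

def geo_similarity(g_i, g_j, sigma=1.0):
    """Gaussian kernel on geo-locations (Eqn. 2); sigma is an assumed bandwidth."""
    return float(np.exp(-np.sum((np.asarray(g_i) - np.asarray(g_j)) ** 2) / sigma))

def photo_similarity(f_i, f_j, g_i=None, g_j=None, sigma=1.0):
    """Combined similarity of Eqn. 1; the geo term is optional, matching the
    paper's choice of relying mainly on visual similarity."""
    s = visual_similarity(f_i, f_j)
    if g_i is not None and g_j is not None:
        s *= geo_similarity(g_i, g_j, sigma)
    return s
```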

2.2 Graph Construction

The basic question of graph construction is, given one datum x_t, how to connect it with the other data points {x_i}_{i=1}^n based on some given similarity measure. Graph construction based on sparse representation is formulated as

    min_α ∥α∥_0  s.t.  ∥Φ(x_t) − Dα∥_2^2 ≤ ϵ,    (4)

where D = [Φ(x_1), Φ(x_2), ..., Φ(x_n)] serves as the dictionary to represent Φ(x_t). The connection between x_t and each x_i is determined by the solution α*: if α*(i) = 0, there is no edge between them; otherwise, the edge weight is defined as |α*(i)|. Eqn. 4 is a combinatorial NP-hard problem, whose tightest convex relaxation is ℓ1-norm minimization [2],

    min_α ∥α∥_1 + β∥α∥_2^2  s.t.  ∥Φ(x_t) − Dα∥_2^2 ≤ ϵ,    (5)

where we further add a small ℓ2-norm regularization term to stabilize the solution [21].

In many scenarios, we can easily define a similarity measure between data points, whereas the explicit feature mapping Φ(·) may not be available, i.e., we only have the kernel function κ(·,·). Eqn. 5 can be solved implicitly in the kernel space by expanding the constraint [7]:

    min_α ∥α∥_1 + β∥α∥_2^2  s.t.  1 + α^T κ(D, D) α − 2 κ(x_t, D) α ≤ ϵ,    (6)

where κ(D, D) is the n × n matrix with (i, j)-th entry κ(D(:, i), D(:, j)) = S(x_i, x_j), and κ(x_t, D) is the 1 × n vector with k-th entry κ(x_t, D(:, k)) = S(x_t, x_k); the leading constant 1 is simply κ(x_t, x_t) = S(x_t, x_t) = 1 for our normalized similarity. Eqn. 6 can be solved efficiently in the same way as Eqn. 5.
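One practical route, sketched below, is to solve a penalized (Lagrangian) form of Eqn. 6 rather than the constrained form: with the Cholesky factorization κ(D, D) = LL^T, the quadratic part α^T κ(D, D) α − 2 κ(x_t, D) α equals ∥L^T α − L^{-1} k_t∥_2^2 up to a constant, where k_t = κ(x_t, D)^T, so the problem reduces to a standard elastic net. This is a sketch under those assumptions, not the paper's solver; the regularization weights are illustrative values:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def kernel_sparse_code(K, k_t, alpha=0.01, l1_ratio=0.9):
    """Sparse-code x_t against the dictionary in kernel space (penalized
    form of Eqn. 6).
    K:   n x n kernel matrix, K[i, j] = S(x_i, x_j)   (kappa(D, D))
    k_t: length-n vector,    k_t[i] = S(x_t, x_i)     (kappa(x_t, D))"""
    n = K.shape[0]
    L = np.linalg.cholesky(K + 1e-8 * np.eye(n))  # K = L L^T, jittered for safety
    X = L.T                                       # design matrix in ||L^T a - y||^2
    y = np.linalg.solve(L, k_t)                   # y = L^{-1} k_t
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio,
                       fit_intercept=False, max_iter=10000)
    model.fit(X, y)
    return model.coef_                            # sparse; |coef[i]| is an edge weight
```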

Co-event photo collections from different cameras usually cover a wide variety of content with some amount of redundancy: most photos are visually uncorrelated, while some overlap in content because different photographers may have captured correlated content at the same times. Consequently, the visual correlation links between photo streams are sparse relative to all possible edges between photos. By explicitly incorporating this sparsity constraint into the graph construction process, we can adaptively select the most visually correlated photos for each node in the graph and thus discover the most informative links between different photo streams of the same event. Building on this basic kernel sparse graph construction procedure, Section 3.2 proposes a principled approach for aligning photo streams based on a sparse bipartite graph.

3. PHOTO STREAM ALIGNMENT

In this section, we describe our approach for aligning multiple photo streams from different cameras whose time settings are not calibrated. For each pair of photo streams, our alignment algorithm matches the streams on a bipartite graph built with the kernel sparse representation graph discussed in the previous section. A max linkage selection procedure is further introduced to let photo links compete, for robust matching.

3.1 Problem Statement

Suppose we are given two photo streams X_1 = [x^1_1, x^1_2, ..., x^1_m] and X_2 = [x^2_1, x^2_2, ..., x^2_n] about the same event, associated with which are their own camera timestamps T_1 = [t^1_1, t^1_2, ..., t^1_m] and T_2 = [t^2_1, t^2_2, ..., t^2_n], where x^j_i denotes the i-th photo in stream j ∈ {1, 2} and t^j_i its camera timestamp. In most cases, we can assume that the relative time within both T_1 and T_2 is correct, but the relative time shift between T_1 and T_2 is unknown. Our goal is to estimate the correct time shift ∆T between the two time sequences. To make accurate photo stream alignment possible, we make the following assumption:

ASSUMPTION 1. The photo streams to be aligned contain a certain amount of temporal-visual correlations.

By finding such temporal-visual correlations between photos from different streams, sparse though they may be, we can align the two photo streams in chronological order to describe the complete event. Although there is only one parameter ∆T to infer, robust and accurate alignment turns out to be nontrivial for the following reasons:

1. Limited effectiveness of the visual features: semantically similar photos may be distant under the visual similarity measure, and vice versa. For example, in Figure 3, the left two photos have low visual similarity, yet they relate to the same moment and same scene of the event.

2. Photos are not taken deliberately to facilitate alignment: different photographers may capture largely different content. This is very common, since different photographers have different spatial and personal perspectives on the same event.

3. Misleading data may exist: similar scenes may be captured at different times. For example, in Figure 3, the right two photos show the same scene, but they were taken at different times by the two photographers.

As such, consumer photo streams are extremely noisy for accurate alignment, and decisions made on an isolated pair of images can be incorrect without the proper context of the corresponding photo streams. In fact, as discussed in our experiments in Section 4, heuristic approaches are not reliable and often run into contradictions. Therefore, we propose a principled approach for robust and accurate alignment by matching each pair of photo streams on a sparse bipartite graph constructed with kernel sparse representation.

3.2 Sparse Bipartite Graph

Different photo streams for the same event usually share some similar photo content. If we can build a bipartite graph G = (X_1, X_2, E) linking the informative pairs (the distinctive photo pairs from the two streams that share large visual similarities), then, with Assumption 1, we will be able to find the correct ∆T.


Figure 3: The left two photos are visually distant, but semantically they are about the same scene (from the Horse Trail dataset). The right two photos are about the same scene, but were taken at different times (from the Lijiang dataset).

Consumer albums typically contain photos diverse in content and appearance, so the informative pairs are few compared with the album sizes. Therefore, the bipartite graph G, which includes the links of informative pairs between photo streams as its edges, should be sparse, i.e., |E| ≪ |X_1| · |X_2|.

In this case, we can only use the visual information and, when available, GPS information for measuring the photo similarities. Based on the basic technique presented in Section 2, Algorithm 1 shows the procedure for constructing the bipartite graph between two photo streams X_1 and X_2, where E^{12} records the directed bipartite graph edges from X_1 to X_2, and E^{21} records the reverse graph edges. The final affinity matrix simply averages the two directed affinity matrices:

    E_{ij} = (E^{12}_{ij} + E^{21}_{ji}) / 2.    (7)

Averaging the two directed edge weights makes the bipartite graph linkage more distinctive. If E^{12}_{ij} and E^{21}_{ji} are both nonzero, i.e., x^1_i and x^2_j each choose the other as one of their informative neighbors among many others, then x^1_i and x^2_j are strongly connected and more likely to form the informative pair desired for the alignment task.

Algorithm 1 Sparse Bipartite Graph Construction
1:  Input: photo streams X_1 and X_2, kernel function κ.
2:  for each x^1_i ∈ X_1 do
3:      Solve the following optimization:
            α^1_i = argmin_α ∥α∥_1 + β∥α∥_2^2
                    s.t. 1 + α^T κ(X_2, X_2) α − 2 κ(x^1_i, X_2) α ≤ ϵ.
4:      Assign E^{12}_{ij} = |α^1_i(j)| for j = 1, 2, ..., n.
5:  end for
6:  for each x^2_j ∈ X_2 do
7:      Solve the following optimization:
            α^2_j = argmin_α ∥α∥_1 + β∥α∥_2^2
                    s.t. 1 + α^T κ(X_1, X_1) α − 2 κ(x^2_j, X_1) α ≤ ϵ.    (8)
8:      Assign E^{21}_{ji} = |α^2_j(i)| for i = 1, 2, ..., m.
9:  end for
10: Output: sparse bipartite graph affinity matrix E = [E^{12} + (E^{21})^T] / 2.
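Putting the pieces together, Algorithm 1 might look as follows in code (a sketch reusing the hypothetical kernel_sparse_code helper from Section 2.2; K11 and K22 are the within-stream kernel matrices and K12 the cross-stream similarities):

```python
import numpy as np

# Assumes kernel_sparse_code() from the Section 2.2 sketch is in scope.
def sparse_bipartite_graph(K11, K22, K12):
    """Algorithm 1 sketch. K12[i, j] = S(x1_i, x2_j) holds the cross-stream
    similarities; K11 (m x m) and K22 (n x n) are the within-stream kernels.
    Returns the m x n affinity matrix E of Eqn. 7."""
    m, n = K12.shape
    E12 = np.zeros((m, n))           # directed edges X1 -> X2
    E21 = np.zeros((n, m))           # directed edges X2 -> X1
    for i in range(m):               # code photo x1_i over dictionary X2
        E12[i] = np.abs(kernel_sparse_code(K22, K12[i]))
    for j in range(n):               # code photo x2_j over dictionary X1
        E21[j] = np.abs(kernel_sparse_code(K11, K12[:, j]))
    return 0.5 * (E12 + E21.T)       # average the two directed graphs (Eqn. 7)
```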

3.3 Max Linkage Selection for Robust Matching

The above sparse bipartite graph construction is based on the similarity measure only, without respecting the chronological-order constraint within each photo stream. These sparse links provide the candidate photo matches critical for alignment; however, due to the limitations of the photo similarity measures, the candidates may contain spurious matches that preclude precise alignment. We propose a procedure called max linkage selection to prune the candidate matches: if a photo has multiple links to other nodes, we keep only the edge with maximum weight and break the rest.

In this way, the remaining matched pairs are more informative for the alignment task, as verified by our experiments. Note that max linkage selection is not equivalent to simply finding the most similar photo in the first place: 1) finding the most similar photo involves no competing procedure, so one still needs max linkage selection to prune false matches; 2) finding the most similar photo leaves open the problem of assigning edge weights, which is essential for robust matching.

Denote the set of pruned matched pairs as

    M = {(x^1_i, t^1_i; x^2_j, t^2_j) | E_{ij} ≠ 0}.    (9)

The correct time shift ∆T (in seconds) is found by

    ∆T = argmax_{∆t} Σ_{(i,j)∈M} E_{ij} δ(|t^1_i − t^2_j − ∆t| ≤ τ),    (10)

where δ is the indicator function, and τ is a small time-displacement tolerance for reliable matching (chosen as 60 s in our experiments). Once we have ∆T for each pair of photo streams, we can merge multiple streams into a master photo stream in chronological order for sharing among different users.
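A sketch of max linkage selection and the shift estimation of Eqn. 10 follows. Two details here are our assumptions, not spelled out in the paper: the pruning rule is read as keeping an edge only when it is the maximum-weight link for both of its endpoints, and the candidate shifts are taken from the observed pairwise offsets:

```python
import numpy as np

def max_linkage_select(E):
    """One reading of max linkage selection: an edge survives only if it is
    the maximum-weight link for both of its endpoints."""
    pruned = np.zeros_like(E)
    col_best = E.argmax(axis=0)                 # strongest partner per column
    for i, j in enumerate(E.argmax(axis=1)):    # strongest partner per row
        if E[i, j] > 0 and col_best[j] == i:
            pruned[i, j] = E[i, j]
    return pruned

def estimate_time_shift(E, t1, t2, tau=60.0):
    """Eqn. 10: each surviving match votes for its offset t1_i - t2_j; the
    shift collecting the largest weighted vote within tolerance tau wins."""
    ii, jj = np.nonzero(E)
    offsets = t1[ii] - t2[jj]
    weights = E[ii, jj]
    scores = [np.sum(weights * (np.abs(offsets - dt) <= tau)) for dt in offsets]
    best = int(np.argmax(scores))
    return offsets[best], scores[best]
```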

3.4 Multiple Sequence Adjustment

In practice, we usually have more than two photo streams, which can provide complementary visual matching information for alignment. Since pair-wise stream matching does not ensure time consistency as a whole, we need to combine the matching results over multiple stream pairs. Suppose we have s streams in total. For each pair of matched photo streams, we have the matched photo pair set M*_{pq}, 1 ≤ p, q ≤ s, found by Eqn. 10. Let T*_p and T*_q denote the timestamp sequences of the matched photo pair set, and w_{pq} their matching scores. Our goal is, for a chosen reference timestamp sequence T*_{ref}, to infer ∆T_p for each timestamp sequence T*_p, so that multiple photo streams are mapped onto a common time axis. We define the matching error for two sequences as

    ϵ_{pq} = h(w_{pq}^T (T*_p + ∆T_p − T*_q − ∆T_q)),    (11)

where h is the Huber function, used to tolerate matching outliers. The consistent time alignments can thus be found by

    min_{{∆T_l}_{l=1}^s} Σ_{p=1}^s Σ_{q≠p} ϵ_{pq}.    (12)
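Eqn. 12 can be attacked with an off-the-shelf robust least-squares routine. The sketch below uses SciPy's Huber loss as a stand-in for the exact objective and applies the weights per matched pair rather than through the inner product of Eqn. 11; the function name and data layout are assumptions for illustration:

```python
import numpy as np
from scipy.optimize import least_squares

def adjust_time_shifts(pairs, s, ref=0, tau=60.0):
    """Joint refinement of per-stream time shifts (a stand-in for Eqn. 12).
    pairs: list of (p, q, t_p, t_q, w), where t_p and t_q are arrays of
           matched timestamps between streams p and q, and w their weights.
    Returns shifts dT (length s) with dT[ref] pinned to 0."""
    def residuals(x):
        dT = np.insert(x, ref, 0.0)             # fix the reference stream
        r = [np.sqrt(w) * (t_p + dT[p] - t_q - dT[q])
             for p, q, t_p, t_q, w in pairs]
        return np.concatenate(r)
    sol = least_squares(residuals, np.zeros(s - 1), loss='huber', f_scale=tau)
    return np.insert(sol.x, ref, 0.0)
```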

4. EXPERIMENTAL EVALUATION

To evaluate the performance of our algorithm, we collected a total of 36 real-world consumer photo datasets, each corresponding to one event and containing several personal photo albums. The photographers were not aware of this project at the time of creating their photo albums, so the photos were taken without any bias toward later alignment. The number of photos in each dataset ranges from several dozen to several hundred or even over a thousand. The entire collection of datasets is rather diverse: the content ranges from traveling (numerous natural and urban scenes) to social events (weddings, car racing, sports, stage shows, etc.); tens of photographers were involved and tens of different camera models were used. In the following, we present our photo stream alignment results and compare them with several baseline algorithms.

For all the photo datasets we collected, the time settings of the cameras were not calibrated with each other in situ, and therefore we do not have the absolute ground truth for the camera clocks.


Table 1: The photo stream alignment accuracy on the 36 photo datasets with different algorithms.

Alg.   DNN     SIFT    kNN     R-kNN   SRG     R-SRG
Acc.   25/36   25/36   27/36   29/36   32/36   34/36

However, we obtained a ground truth accurate enough to reflect the correct sequential order of the merged photo stream through verification with the first-party photographers. This ground truth is sufficient for evaluating the algorithms, as one only cares about the sequential order of the photos in the event.

4.1 Alignment Results

The key to photo stream alignment is finding the informative photo pairs. One can come up with many possible heuristic approaches for this problem. However, heuristic approaches often run into contradictions and fail to achieve robust and accurate alignment, suggesting that a principled approach is needed. In the following, we describe and compare against the three best-performing heuristic methods among those we tried.

1. Distinctive nearest neighbor search (DNN). For a photo x^1 from the first stream, x^2 in the second stream is its distinctive nearest neighbor if the similarity between x^1 and x^2 is at least r (r > 1) times larger than that between x^1 and any other photo in the second stream; otherwise, there is no match for photo x^1. There are other ways to define DNN, e.g., linking only those nearest neighbors with similarities above some threshold µ. However, we find that our definition of DNN is more robust across datasets, since it introduces a competing procedure instead of relying on a fixed threshold. (A code sketch of this test appears after this list.)

2. SIFT feature matching. Another straightforward way to find the informative pairs is to use near-duplicate detection techniques, such as local SIFT feature matching with RANSAC [1]. However, on one hand, SIFT feature matching tends to miss many visually similar but not-quite-duplicate photos, yielding too few informative photo pairs in some cases. On the other hand, it tends to be misled by strong outliers, e.g., near-duplicate scenes that actually occurred at different times; after all, the photographers do not always walk in lockstep taking pictures. In practice, this approach is also slow.

3. R-kNN graph matching. Instead of the proposed sparse graph, one can use the conventional kNN graph to establish the sparse links, and assign the edge weights with the calculated similarities between the photo nodes. To reject spurious links, one can also apply the max linkage selection procedure as in our algorithm for robust matching, referred to as R-kNN.
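For reference, the DNN ratio test of item 1 above admits a very short implementation (a sketch; the ratio r = 1.2 is an arbitrary illustrative value, not the paper's setting):

```python
import numpy as np

def distinctive_nn(S, r=1.2):
    """DNN baseline sketch: photo i in stream 1 is matched to its nearest
    neighbor j in stream 2 only if S[i, j] beats the runner-up similarity
    by the factor r (r > 1 is an assumed, tunable ratio)."""
    matches = []
    for i in range(S.shape[0]):
        order = np.argsort(S[i])[::-1]          # neighbors, most similar first
        best, second = order[0], order[1]
        if S[i, best] >= r * S[i, second]:
            matches.append((i, int(best), float(S[i, best])))
    return matches
```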

We evaluate the alignment results by checking whether the merged super stream is in the same sequential order as the verified ground truth: if so, we count it as correct; otherwise, we count it as a failure, no matter how large the actual alignment error is. In Table 1, we list the alignment accuracies of the different algorithms. By using the proposed max linkage selection procedure, "R-kNN" performs better than kNN graph matching thanks to spurious linkage rejection. Directly using the sparse representation graph, "SRG" already outperforms all the heuristic methods, and it is further improved by the competing procedure of max linkage selection (referred to as "R-SRG").

Overall, our algorithm achieves excellent alignment results, merging different photo streams in chronological order on most of the datasets, in a fashion comparable with human observers.

Figure 4: Example photos from the "soccer" dataset, which hardly shows informative visual-time correlations.

Figure 5: Alignment example by our proposed algorithm on the Lijiang Trip dataset. Photos from different cameras are indicated by different border colors.

In very few other datasets (2 out of 36), such as the "soccer" dataset, where Assumption 1 is violated, our algorithm fails, as do unrelated human observers. In the "soccer" dataset, the photographers merely sat around the same location, taking photos that were visually very similar but at different times. Figure 4 shows some example photos from two streams (indicated by different border colors) in this dataset, where the photos are visually similar across different times. Alignment on this dataset is also very challenging for a human: only from a single pair of photos, and only with careful examination of their semantic content (i.e., the positions and moving directions of the soccer players), can a very observant human roughly align these two streams.

Figure 5 shows an alignment example for the "Lijiang" trip photo dataset by our algorithm. There is one particular difficulty with alignment for this dataset: visually similar photos do not always occur at the same time, which causes a problem for SIFT feature matching. The SIFT feature matching method strongly links the photo pairs connected by yellow dotted lines in the figure, and eventually produces an incorrect alignment. In contrast, by utilizing more graph links from other photos, our method ultimately identifies the correct time shift. Figure 6 shows another alignment example for the "Wedding" event, which took place in a courthouse. Many of those photos are visually very similar (the same people against the same backgrounds). All the baseline algorithms fail in this case, since many of the matched pairs they find are false links. By adaptively selecting the most relevant photos via the sparsity constraint, followed by the competing max linkage selection procedure, our algorithm effectively rejects those spurious links and correctly identifies the true time shift.

Figure 7 shows the curve of matching score versus time shift on three of the datasets. For the first two cases, our algorithm successfully locates the accurate time shift ∆T by picking the sharp peak from the curve. However, for the third, the "soccer" dataset, the algorithm could not locate a clear peak; compared with the previous two cases, the curve has high entropy and multiple peaks, which are strong indications of poor matching.

Finally, we note that the proposed algorithm is also computationally as efficient as the heuristic methods DNN and kNN (R-kNN), and much faster than SIFT matching.


Figure 6: Alignment examples by our proposed algorithm on the Wedding dataset. Photos from different cameras are indicated by different border colors.


Figure 7: The matching score vs. time shift. Left: Grand Canyon; middle: Lijiang day 4; right: Soccer.


5. CONCLUSIONS AND FUTURE WORK

In this paper, we address the practical problem of photo alignment for collaborative photo collection and sharing in social media. Since people have similar photo-taking interests and viewpoints, there are photos with overlapping visual content when several cameras (photographers) capture the same event. Based on such visual information overlap, we are able to align multiple photo streams along a common chronological timeline of the event, employing a sparse bipartite graph to find the informative photo pairs and a max linkage selection competing procedure to prune false links. Compared with several baseline algorithms, our alignment algorithm achieves satisfactory results that are comparable to human performance. The proposed framework also lends itself to many other applications, such as geo-tag or user-tag transfer between the aligned photo streams, and photo summarization over the master stream for the event, which we will investigate in future work.

Acknowledgement

This work is supported in part by Eastman Kodak Research, and by the U.S. Army Research Laboratory and U.S. Army Research Office under grant number W911NF-09-1-0383.

6. REFERENCES

[1] M. Brown and D. G. Lowe. Automatic panoramic image stitching using invariant features. International Journal of Computer Vision, 74:59–73, 2007.
[2] E. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52:489–509, Feb. 2006.
[3] B. Cheng, J. Yang, S. Yan, Y. Fu, and T. Huang. Learning with ℓ1-graph for image analysis. IEEE Transactions on Image Processing (TIP), 19(4):858–866, 2010.
[4] H. Cheng, Z. Liu, and J. Yang. Sparsity induced similarity measure for label propagation. In IEEE International Conference on Computer Vision, 2009.
[5] W. T. Chu and C.-H. Lin. Automatic summarization of travel photos using near-duplicate detection and feature filtering. In Proceedings of the ACM International Conference on Multimedia, 2009.
[6] S. Gammeter, L. Bossard, T. Quack, and L. V. Gool. I know what you did last summer: object-level auto-annotation of holiday snaps. In IEEE International Conference on Computer Vision, pages 614–621, 2009.
[7] S. Gao, I. W.-H. Tsang, and L.-T. Chia. Kernel sparse representation for image classification and face recognition. In European Conference on Computer Vision (ECCV), 2010.
[8] M. Guillaumin, T. Mensink, J. Verbeek, and C. Schmid. TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In IEEE International Conference on Computer Vision, 2009.
[9] D. Huynh, S. Drucker, P. Baudisch, and C. Wong. Time quilt: scaling up zoomable photo browsers for large, unstructured photo collections. In SIGCHI Conference on Human Factors in Computing Systems, pages 1937–1940, 2005.
[10] D. Kirk, A. Sellen, C. Rother, and K. Wood. Understanding photowork. In SIGCHI Conference on Human Factors in Computing Systems, 2006.
[11] L.-J. Li, R. Socher, and L. Fei-Fei. Towards total scene understanding: Classification, annotation and segmentation in an automatic framework. In IEEE International Conference on Computer Vision, pages 2036–2043, 2009.
[12] A. C. Loui and A. Savakis. Automated event clustering and quality screening of consumer pictures for digital albuming. IEEE Transactions on Multimedia, 2003.
[13] T. Quack, B. Leibe, and L. V. Gool. World-scale mining of objects and events from community photo collections. In International Conference on Content-based Image and Video Retrieval, 2008.
[14] I. Simon, N. Snavely, and S. M. Seitz. Scene summarization for online image collections. In IEEE 11th International Conference on Computer Vision, pages 1–8, 2007.
[15] G. Strong and M. Gong. Organizing and browsing photos using different feature vectors and their evaluations. In ACM International Conference on Image and Video Retrieval, 2009.
[16] A. Torralba, K. P. Murphy, W. T. Freeman, and M. A. Rubin. Context-based vision system for place and object recognition. In Proceedings of International Conference on Computer Vision, 2003.
[17] J. Wang, J. Yang, K. Yu, F. Lv, T. Huang, and Y. Gong. Locality-constrained linear coding for image classification. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010.
[18] X.-J. Wang, L. Zhang, M. Liu, Y. Li, and W.-Y. Ma. ARISTA: image search to annotation on billions of web photos. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2987–2994, 2010.
[19] S. Zhang, J. Huang, Y. Huang, Y. Yu, H. Li, and D. N. Metaxas. Automatic image annotation using group sparsity. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3312–3319, 2010.
[20] X. Zhu. Semi-supervised learning literature survey. Technical report, University of Wisconsin-Madison, 2008.
[21] H. Zou and T. Hastie. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67:301–320, 2005.