
ContextSeer: Context Search and Recommendation at Query Time for Shared Consumer Photos

Yi-Hsuan Yang, Po-Tun Wu, Ching-Wei Lee, Kuan-Hung Lin, Winston H. Hsu, Homer Chen
National Taiwan University, Taipei, Taiwan

{affige, frankwbd, smallbighead}@gmail.com, [email protected], [email protected], [email protected]

ABSTRACT The advent of media-sharing sites like Flickr has drastically increased the volume of community-contributed multimedia resources on the web. However, due to their sheer size, these collections are increasingly difficult to understand, search and navigate. To tackle these issues, a novel search system, ContextSeer, is developed to improve search quality (by reranking) and recommend supplementary information (i.e., search-related tags and canonical images) by leveraging rich context cues, including the visual content, high-level concept scores, and time and location metadata. First, we propose an ordinal reranking algorithm to enhance the semantic coherence of text-based search results by mining contextual patterns in an unsupervised fashion. A novel feature selection method, wc-tf-idf, is also developed to select informative context cues. Second, to represent the diversity of search results, we propose an efficient algorithm, cannoG, to select multiple canonical images without clustering. Finally, ContextSeer enhances the search experience by further recommending relevant tags. Besides being effective and unsupervised, the proposed methods are efficient and can be completed at query time, which is vital for practical online applications. To evaluate ContextSeer, we have collected 0.5 million consumer photos from Flickr and manually annotated a number of queries by pooling to form a new benchmark, Flickr550. Ordinal reranking achieves significant performance gains on both the Flickr550 and TRECVID search benchmarks. Through a subjective test, cannoG demonstrates its representativeness and effectiveness in recommending multiple canonical images.

Categories and Subject Descriptors H.3.3 [Information Search and Retrieval]: Retrieval models

General Terms Algorithms, Performance, Experimentation

Keywords Search, rerank, recommending, context, tag, metadata, canonical image, visual word, shared consumer photo

1. INTRODUCTION The explosive growth of digital videos/photos, the prevalence of capture devices, and the phenomenal success of WWW search have attracted increasing interest in new solutions for image/video search and navigation. This interest has gradually extended to the novel domain of community-contributed multimedia with the growing practice of online public media sharing. Billions of images shared on websites such as Flickr1 have a profound social impact and pose new challenges for the design of efficient indexing, search and visualization methods. Current image or video search approaches are mostly restricted to text-based solutions, which process keyword queries against text tokens associated with the media, such as speech transcripts, captions, file names, etc. For shared consumer photos, text clues come from tags or descriptions that are added by users via light-weight annotation tools. The associated tags may contain abundant information, yet their quality is not guaranteed. In most photo-sharing websites, tags and other forms of text are freely entered and are not associated with any type of ontology or categorization. Tags are therefore often inaccurate, wrong or ambiguous [1]. In particular, due to the complex motivations behind tag usage [2], tags do not necessarily describe the content of the image [3]. Therefore, one primary task of a search system is to retrieve accurate images despite the noisy tags.

1 http://flickr.com

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MM’08, October 26–31, 2008, Vancouver, British Columbia, Canada. Copyright 2008 ACM 978-1-60558-303-7/08/10...$5.00.

Figure 1: Context cues in consumer photos. Visual features and (high-level) concept scores can be obtained by feature extractors [22] and concept detectors [5]; location and time metadata will be increasingly available from location-aware camera-phones and digital cameras [6].



The other attribute of consumer photos is their diversity in content, context, and aesthetic aspects. To paraphrase Susan Sontag [32], “everything exists to end up in a photograph.” Even photos of the same object can be visually very different, as they capture different aspects of the object. Therefore, to represent this diversity and enhance the search experience, a retrieval system should recommend additional information such as query-related tags [25] or canonical images [16]. To address these issues, we propose to utilize the rich context cues2 of shared consumer photos to improve the search result and to recommend relevant tags and canonical images. Generally, the context cues of consumer photos include low-level visual features, (high-level) concept scores, and time and location metadata3, as illustrated in Fig. 1. Though the text-based search model is not perfect, certain context cues do exist in the retrieved images. Fig. 2 gives an illustrative example of the text-based search result for the query “eiffel tower.” While Fig. 2(a) correctly contains the query object in sight, the remaining ones fail to do so because of noisy text descriptions (e.g., Fig. 2(b) is actually a picture of Tokyo Tower, which is often compared to the Eiffel Tower) or noisy tags (e.g., Figs. 2(c)–(e) are images taken from, underneath, or near the Eiffel Tower; naturally, they are tagged with the keyword “eiffel tower” by the photo owners). However, by investigating the visual content, a search system might be able to learn that the information need (or target semantics) of the user is related to tower-like objects and exclude Figs. 2(c)–2(e) from the search result. Likewise, Fig. 2(b) can be detected as a false positive by examining the location metadata (i.e., geo tags). Time metadata can also be important for activity-related queries, such as “christmas eve” or “oktoberfest.” In other words, by mining the co-occurrence of context cues, or contextual patterns, we can uncover rich information about what the user is looking for. The use of context cues such as visual content and concept scores has been studied and shown to improve upon text-based video search systems, but such multi-modal approaches mostly require either extra training data [19] or multiple query example images [22], which could be difficult for users to prepare. To utilize context cues in an unsupervised fashion and maintain the text-based search paradigm preferred by most users [12], the reranking framework has recently been proposed [7]–[11]. Approximating the initial search result of a baseline model as the pseudo target semantics, reranking mines the contextual patterns directly from the initial result. The learnt contextual patterns are then leveraged to refine (by reordering) the search result such that relevant images are ranked higher, i.e., assigned higher relevance scores. This corresponds to the intuition that users usually put more emphasis on the relevance of top-ranked images.

2 The meanings of context are usually application-dependent [4].

Here, we refer to context as the attributes describing who, where, when, what, etc., that are shared by documents and form recurrent patterns.

3 Visual features and concept scores can be obtained by feature extractors or pre-trained concept detectors [5]. Location metadata, or geo tags, will be increasingly available, primarily from location-aware camera-phones and digital cameras, and initially from user input [6]. Time stamps are readily associated with photos taken by digital capture devices.

Reranking has been explored in prior work for searching broadcast news videos [7]–[11] and shown to offer 10%–30% relative performance gains over text-based models. Therefore, it is promising to apply reranking to the consumer photo domain, whose content is more unorganized and diverse, yet closer to our everyday life. Additionally, previous work typically relies on mining contextual cues from visual features (e.g., color and texture) for reranking [7], [8], which may enhance the visual coherence of the top-ranked images yet at the same time lose diversity. To mitigate this problem and achieve semantic coherence, the exploitation of other context cues such as time and location metadata is also studied in this work. Besides effectiveness, time efficiency is also critical for the success of an online retrieval system. To achieve satisfactory query-time performance (i.e., after issuing the query, the result is obtained instantly so that the user is unaware of the additional computation), we propose a novel reranking method, ordinal reranking, which formulates reranking as a ranking problem and employs a linear neural network model [21] to solve it. A novel context cue selection method, weighted c-tf-idf (wc-tf-idf), is also proposed to automatically select informative context cues and improve the performance of reranking. Experimental results (cf. Section 7) show that ordinal reranking is remarkably efficient and takes less than one second to rerank a query, which is faster than existing reranking methods by orders of magnitude. Moreover, when applied to broadcast news videos, ordinal reranking achieves the best performance and offers up to 40% relative performance gain. On the other hand, to select canonical images, most existing methods involve the computation of pairwise similarity or the employment of clustering algorithms, which would be overwhelmingly time-consuming for online applications [24]. We take a fundamentally different approach and greedily select the images that contain the most informative context cues (i.e., visual words [29]) as measured by wc-tf-idf, which can be completed in roughly 0.01 sec. Relevant tag recommendation is also efficient, as relevant tags are mined directly from the tags associated with the retrieved images using wc-tf-idf at query time. To investigate the use of context cues to enhance search quality and recommend query-related tags or canonical images, a novel system, ContextSeer, is developed. As depicted in Fig. 3, the flow diagram of ContextSeer follows the text-based search paradigm and only requires a user to key in (arbitrary) text queries. The context cues of the image database are extracted in advance and utilized in both the reranking and recommending methods. ContextSeer is totally unsupervised and requires no extra training data or off-line learning processes. For evaluation, over 0.5 million photos and the associated metadata are collected from Flickr to form our database. We further define a number of queries covering several types of information need (such as landmarks, activities, objects, etc.) and manually annotate a subset of our database using a pooling method [13].

Figure 2: Example results of a text-based search model for the query “eiffel tower.” Because the associated tags or text descriptions are noisy, (b)–(e) are retrieved inaccurately. The incorporation of context cues can mitigate this problem.



We then perform extensive experiments to evaluate the efficiency and effectiveness of reranking, and a subjective test to evaluate the proposed canonical image selection algorithm. In summary, the primary contributions of the paper include:

The proposed reranking and canonical image selection algorithms are efficient and can be finished at query time.

To enhance the semantic coherence of a text-based model, a reranking algorithm is employed to mine the contextual patterns between target semantics and context cues including image content, concept scores, location and time (Section 2). A novel context cue selection measurement, wc-tf-idf, is also proposed (Section 3).

To further facilitate context-related queries and navigations, we propose wc-tf-idf to recommend relevant tags from those user-contributed ones (Section 4), and generate multiple canonical images to visualize search results (Section 5).

To our best knowledge, ContextSeer represents one of the first attempts to search over large-scale (0.5 million) shared consumer photos. The dataset, Flickr550, is made publicly available as a benchmark (Section 6).

Extensive experiments are conducted to evaluate the accuracy and effectiveness of ContextSeer (Section 7).

2. RERANKING To maintain the text-based search paradigm while improving the search result, the reranking framework is proposed to automatically rerank the initial text search results based on auxiliary information4, i.e., contextual cues, from the retrieved objects in the initial search results [7]. Approximating the initial result as the pseudo ground truth of the target semantics, reranking algorithms mine the contextual patterns directly from the initial search result and then rerank it. Rooted in pseudo-relevance feedback for text search [19], several reranking algorithms have been proposed in the literature. The IB reranking method [7] finds the optimal clustering of images that preserves the maximal mutual information between the initial relevance scores and extracted features based on the information bottleneck principle. The context reranking method [8] mines contextual cues by a biased random walk over a context graph, where video stories are nodes and the edges between them are weighted by multimodal similarities of the extracted features.

4 Google’s PageRank [33] is one of the representative works that utilize auxiliary information (i.e., hyperlinks) to gauge the quality of text-based retrieved web pages; the authors in [8] also briefly related the video reranking method to PageRank.

The classification-based reranking method [9] takes the higher-ranked and lower-ranked images of a baseline system as pseudo-positive and pseudo-negative examples to train a discriminative model (i.e., support vector machines, SVM), and regards the normalized classification score of each image as its reranked score. Despite the promising relative performance gains that can be obtained, these approaches neglect the fact that reranking is itself a ranking problem and make no use of the underlying ordinal information of the initial list. In addition, the classification-based reranking method suffers from the ad-hoc nature of the mechanism for determining the threshold for noisy binary labels. More recently, we proposed a novel ordinal reranking method [10] that incorporates learning-to-rank algorithms such as RankSVM [20] and ListNet [21] into the reranking framework. Since the objective of learning-to-rank algorithms is to minimize errors in object ranking, ordinal reranking is by nature more effective at mining ordering information and free of the ad-hoc thresholding problem. When evaluated on the TRECVID 2005 video search benchmark, ordinal reranking outperforms existing methods by a large margin [10]. Moreover, thanks to the linear kernel of ListNet, ordinal reranking is remarkably efficient; it takes less than one second to rerank the result of a query. Below we first introduce the learning-to-rank task and the ListNet algorithm, and then describe the ordinal reranking algorithm.

2.1 Learning to Rank and ListNet Any system that presents results to a user, ordered by a utility function that the user cares about, performs ranking, in contrast to a classification problem, which aims to determine class labels. A common example is the ranking of search results by a search engine (e.g., Google). A ranking algorithm assigns a relevance score to each object and ranks the objects by those scores. The ranking order represents the relevance of the objects with respect to the query. In the literature, the task of training a ranking model that can precisely predict the relevance scores of test data is commonly referred to as learning to rank. For learning to rank, a query is associated with a list of training data D = (d1, d2, …, dN), where N denotes the number of training instances and dj denotes the j-th object, and a list of manually annotated relevance scores Y = (y1, y2, …, yN), where yj ∈ [0, 1] denotes the relevance score of dj with respect to the query. Furthermore, for each object dj a feature vector Xj = (Xj1, Xj2, …, XjM) is extracted, where M is the dimension of the feature space. The purpose of learning to rank is to train a ranking algorithm f that can accurately predict the relevance scores of test data by leveraging the co-occurrence patterns among X and Y. For the training set D we obtain a list of predicted relevance scores Z = (z1, z2, …, zN) = (f(X1), f(X2), …, f(XN)).

Figure 3: Flow diagram of ContextSeer. The contextual patterns are mined from the initial search result of a text-based model and then utilized to refine (by reranking) the initial result. Search-related tags and multiple canonical images are further recommended to the users.



The objective of learning is formalized as minimizing the total loss L(Y, Z) with respect to the training data, where L is a loss function for ranking. Conventional ranking algorithms such as RankSVM [20] formulate the learning task as classification of object pairs into two categories (correctly ranked and incorrectly ranked) and define L as the number of wrongly classified pairs. These approaches are thus time-consuming, as they consider every possible pair in the data and run at a complexity of O(N²). ListNet [21] overcomes these shortcomings by using score lists directly as learning instances and minimizing the listwise loss between the initial list and the reranked list. In this way, the optimization is conducted directly on the list, and the computational cost is reduced to O(N), making online reranking applications possible. Our previous studies [10] show that ListNet is surprisingly efficient and even outperforms conventional pairwise approaches such as RankSVM. To define the listwise loss function, ListNet transforms both the (pseudo-) ground-truth scores and the predicted scores into probability distributions by sum-to-one normalization, and uses cross-entropy to measure the distance (listwise loss) between the two distributions. Let P(yj) denote the normalized score of dj; the loss function is then defined as:

$$ L(Y, Z) = -\sum_{j=1}^{N} P(y_j)\,\log\big(P(z_j)\big). \qquad (1) $$

To minimize Eq. (1), ListNet employs a linear neural network model to assign a weight to each feature and forms the predicted ranking score by a linear weighted sum:

$$ z_j = f(X_j) = \langle W, X_j \rangle, \qquad (2) $$

where ⟨·,·⟩ denotes the inner product and W = (w1, w2, …, wM) is a weighting vector over the feature dimensions. We can then derive the gradient of the loss in Eq. (1) with respect to each weight as:

$$ \Delta w_m = \frac{\partial L(Y, Z)}{\partial w_m} = \sum_{j=1}^{N} \big(P(z_j) - P(y_j)\big)\, X_{jm}, \qquad (3) $$

where wm denotes the weight of the m-th feature. The above formula is then used in gradient descent. Initially, each weight is set to zero and then updated by

$$ w_m \leftarrow w_m - \eta \times \Delta w_m \qquad (4) $$

at a learning rate η. The learning process terminates when the change in W is less than a convergence threshold δ. The values of η and δ are determined empirically, and a parameter sensitivity test over them is presented in Section 7.1.3.
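The update loop of Eqs. (1)–(4) is compact enough to sketch in a few lines. The following is a minimal, illustrative implementation rather than the authors' code: function names and defaults are ours, the pseudo ground-truth scores are normalized to sum to one as described above, and the predicted scores are passed through a softmax (the usual ListNet choice, consistent with the gradient form of Eq. (3)).

```python
import numpy as np

def listnet_train(X, y, eta=0.005, delta=1e-4, max_iter=1000):
    """Linear ListNet sketch per Eqs. (1)-(4).
    X: (N, M) feature matrix, y: (N,) pseudo ground-truth relevance scores."""
    N, M = X.shape
    w = np.zeros(M)                       # weights start at zero
    p_y = y / y.sum()                     # sum-to-one normalization of target scores
    for _ in range(max_iter):
        z = X @ w                         # Eq. (2): linear scoring
        ez = np.exp(z - z.max())          # softmax over predicted scores
        p_z = ez / ez.sum()
        grad = X.T @ (p_z - p_y)          # Eq. (3)
        w_new = w - eta * grad            # Eq. (4)
        if np.abs(w_new - w).max() < delta:
            return w_new                  # change in W below the convergence threshold
        w = w_new
    return w
```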

2.2 Learning to Rank vs. Reranking Reranking and learning to rank differ in a number of aspects. First, while learning to rank requires a great amount of supervision, reranking is unsupervised and approximates the initial results as the pseudo ground truth Y. Second, for learning to rank the ranking algorithm f is trained in advance to predict the relevance scores of arbitrary queries, while for reranking f is trained at query time to compute the reranked relevance scores for each individual query. However, since both aim to minimize errors in object ranking, it is reasonable to incorporate learning-to-rank algorithms into the reranking task as long as the pseudo ground truth can be used to train a ranking algorithm. The proposed ordinal reranking achieves this by employing the cross-validation technique [10], as described below.

2.3 Ordinal Reranking The inputs to ordinal reranking are a list of objects D and the corresponding relevance scores Y assigned by a baseline model. We assume that the features (context cues) of each image are computed in advance. For the N objects in D, the corresponding M-dimensional features form an N × M feature matrix X. The major steps of ordinal reranking are as follows: 1. Context cue selection: Select informative context cues via some feature selection method (cf. Section 3) to reduce the feature dimension to M'.

2. Employment of ranking algorithms: Randomly partition the data set into F folds D = {D(1), D(2), …, D(F)}. Hold out one fold D(i) as test set and train the ranking algorithm f(i) using the remaining data. Predict the relevance scores Z(i) of the test set D(i). Repeat until each fold is held out for testing once. The predicted scores of different folds are then combined to form the new list of scores Z = {Z(1), Z(2), …, Z(F)}.

3. Rank aggregation: After normalization, the initial score yj and new score zj (for object dj) are fused to produce a merged score sj by taking the weighted average as follows:

$$ s_j = (1 - \alpha)\, y_j + \alpha\, z_j, \qquad (5) $$

where α∈ [0, 1] denotes a fusion weight on the initial and reranked scores; α=1 means totally reranked.

4. Rerank: Sort the fused scores to output a new ranked list for the target semantics.

To accommodate the supervised ranking algorithms to the unsupervised setting of reranking, we employ the F-fold cross-validation technique as in [9] to partition the dataset and train F ranking algorithms, each with a different fold of data held out as the test set. Though the new scores Z are not assigned by a single unified ranking algorithm, the nature of cross-validation ensures that F−2 folds of data are shared in the training of any two different ranking algorithms. Experimental results reported in [9] also show the effectiveness of applying cross-validation. Eq. (5) is used for rank aggregation of the initial relevance scores and the newly predicted scores. Such a linear fusion model, though simple, has been shown adequate to fuse visual and text modalities in video search and concept detection [22]. The fusion weight α controls the degree of reranking and may influence the reranked result. Therefore, we also test various fusion weights experimentally in Section 7.1.3.
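The cross-validated scoring and fusion steps above can be sketched as follows. This is a hedged illustration rather than the authors' implementation: `train_ranker` is an assumed callback that returns a scoring function (e.g., a wrapper around the ListNet sketch in Section 2.1), and the min-max normalization before fusion is our assumption, since the paper only states that the scores are normalized.

```python
import numpy as np

def ordinal_rerank(X, y, train_ranker, n_folds=5, alpha=0.5, seed=0):
    """Ordinal reranking sketch (Section 2.3): F-fold cross-validated scoring,
    then linear fusion of initial and predicted scores (Eq. 5)."""
    rng = np.random.default_rng(seed)
    N = len(y)
    folds = np.array_split(rng.permutation(N), n_folds)
    z = np.zeros(N)
    for held_out in folds:
        train_idx = np.setdiff1d(np.arange(N), held_out)
        scorer = train_ranker(X[train_idx], y[train_idx])    # step 2: train on remaining folds
        z[held_out] = scorer(X[held_out])                    # predict held-out scores
    y_n = (y - y.min()) / (y.max() - y.min() + 1e-12)        # step 3: normalize both lists
    z_n = (z - z.min()) / (z.max() - z.min() + 1e-12)
    s = (1 - alpha) * y_n + alpha * z_n                      # Eq. (5)
    return np.argsort(-s), s                                 # step 4: sort to rerank
```

For instance, `train_ranker` could be `lambda Xtr, ytr: (lambda Xt: Xt @ listnet_train(Xtr, ytr))`, reusing the ListNet sketch above.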

3. CONTEXT CUE SELECTION As reported in [9] and [15], the performance of a search model can degrade significantly as the feature dimension grows arbitrarily. To automatically select informative context cues for reranking in an unsupervised fashion, we develop a novel measurement, wc-tf-idf, by improving the concept tf-idf5 (c-tf-idf) proposed in [14]. Though c-tf-idf was originally proposed for selecting concepts, it is highly general and can be applied to other real-valued context cues. As its name implies, c-tf-idf is adapted from the best-known term-informativeness measurement, tf-idf [15], by viewing images as documents and (real-valued) context cues as term frequencies.

5 “tf-idf” stands for term-frequency inverse-document-frequency.



The authors in [14] construct the document-term occurrence table from the initial search list and compute the c-tf-idf of context cue c for a query q as:

$$ \text{c-tf-idf}(c, q) = \text{freq}(c, q)\,\log\!\left(\frac{T}{\text{freq}(c)}\right), \quad c \in C, \qquad (6) $$

where $\text{freq}(c, q) = \sum_{j=1}^{N} X_{jc}$ and $\text{freq}(c) = \sum_{j=1}^{T} X_{jc}$ are the occurrence frequencies of c in the initial list and in the whole corpus, respectively, $X_{jc}$ is the value of context cue c in dj, T is the size of the corpus (typically T ≫ N), and C is the set of adopted context cues. The intuition is that the relevance of a context cue increases in proportion to how frequently it appears in the returned list of a query, but is offset by its frequency in the entire corpus so as to filter out common context cues. In this way, c-tf-idf offers a good combination of popularity (tf) and specificity (idf) [15]. c-tf-idf is unsupervised and efficient, which meets the requirements of reranking quite well. However, one major drawback of c-tf-idf is that it regards every object as equally relevant to the target semantics and thus totally neglects the underlying ordinal information among the objects. Intuitively, the context cues of lower-ranked objects should be less important than those of higher-ranked ones. To address this issue, we improve c-tf-idf by weighting the context cues of each object according to the associated relevance score, as

$$ \text{wc-tf-idf}(c, q) = \sum_{j=1}^{N} y_j X_{jc}\,\log\!\left(\frac{T}{\sum_{j=1}^{T} X_{jc}}\right). \qquad (7) $$

In this way, we preserve the merits of c-tf-idf but put more emphasis on the higher-ranked objects. Our evaluation shows that wc-tf-idf is indeed more effective than c-tf-idf (cf. Section 7.1.1).
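A minimal sketch of Eqs. (6)–(7) is given below. It assumes the whole corpus is available as a T × C matrix of real-valued context cues; the function names, the small smoothing constant, and the choice to return scores for all cues at once are ours.

```python
import numpy as np

def wc_tf_idf(X, ranked_idx, y):
    """Weighted c-tf-idf (Eq. 7) for every context cue.
    X: (T, C) corpus matrix of real-valued context cues,
    ranked_idx: indices of the N retrieved images, y: their relevance scores."""
    T = X.shape[0]
    corpus_freq = X.sum(axis=0) + 1e-12                       # freq(c) over the whole corpus
    idf = np.log(T / corpus_freq)
    weighted_tf = (y[:, None] * X[ranked_idx]).sum(axis=0)    # rank-weighted freq(c, q)
    return weighted_tf * idf

def select_cues(X, ranked_idx, y, m_prime=150):
    """Keep the M' most informative context cues for reranking."""
    scores = wc_tf_idf(X, ranked_idx, y)
    return np.argsort(-scores)[:m_prime]
```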

4. RELEVANT TAG RECOMMENDATION As discussed in [24], log analysis reveals that user queries are often too short (generally two or three words) and too imprecise to express their information needs. This is largely because it is usually difficult for users to specify words that adequately represent their needs. To paraphrase N. J. Belkin [25], “When people engage in information-seeking behavior, it’s usually because they are hoping to resolve some problem, or achieve some goal, for which their current state of knowledge is inadequate.” To resolve this difficulty, term suggestion mechanisms [26]–[28] are commonly employed in current search engine design to give users hints about other possible queries and to save effort by providing shortcuts to relevant queries. We apply wc-tf-idf to mine the co-occurring tags associated with retrieved images that are ranked high after reranking. Note that this is different from the conventional web search setting, where relevant terms are mined from web pages or query logs [26]. Tags, though noisy and few, contain abundant information that may be valuable for the user.
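Because wc-tf-idf treats any real-valued cue as a term frequency, tag recommendation reduces to applying the same measure to a binary image-by-tag occurrence matrix. The snippet below is a hedged usage example that reuses the `wc_tf_idf` sketch above; `tag_matrix` and `vocab` are hypothetical names for a precomputed tag occurrence matrix and tag vocabulary.

```python
import numpy as np

def recommend_tags(tag_matrix, vocab, ranked_idx, y, top_k=10):
    """Rank candidate tags by wc-tf-idf over a (T, V) binary tag-occurrence matrix."""
    scores = wc_tf_idf(tag_matrix, ranked_idx, y)
    best = np.argsort(-scores)[:top_k]
    return [vocab[i] for i in best]
```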

5. MULTIPLE CANONICAL IMAGE SELECTION The problem of selecting views that capture the essence of an object has been studied for over twenty years in the human vision and computer vision communities [23]. Defining precisely what makes a view canonical is still a topic of debate. A typical approach is to use methods such as k-means to cluster the images into visually similar groups and then apply a number of heuristic criteria to select images from the clusters [17], [18], [23]. For example, [17] ranks clusters by several criteria such as visual coherence and cluster connectivity, and then selects images from highly ranked clusters. On the other hand, [23] proposes the following three criteria (but only uses the first one in their work): similarity to other images in the input set, coverage of important features in the scene, and orthogonality to previously selected images. Unlike previous work that mines a representative image from a static collection (especially for landmark images, e.g., [17], [23]), we aim to recommend multiple search-related canonical images for arbitrary text queries at query time. However, as noted by [24], clustering or computing the similarity of hundreds of images is not efficient enough for practical online processes. Moreover, approaches that involve matching local interest point descriptors (e.g., local coherence [17] or point-wise linking [18]) are exceedingly time-consuming. To reduce the computational cost, we take a fundamentally different approach. Instead of computing any similarity measure, we greedily select images according to the last two criteria proposed in [23], i.e., coverage and orthogonality. Below we first introduce the adopted feature representation, the visual word (VW) [29], and then describe the proposed greedy algorithm, cannoG. VW is adopted as the feature set for its effectiveness in capturing salient object characteristics and its invariance to changes in image/video capture conditions such as variation in scale, viewing angle and lighting [29].

Figure 5: More results of selected informative visual words (yellow circles) for the query “golden gate bridge” by wc-tf-idf. It is evident that wc-tf-idf effectively selects the visual words corresponding to the bridge structure.

Figure 4: Procedure of computing the coverage scores of informative visual words for a text query (i.e., “golden gate bridge”): (a) an image from the returned result, (b) its extracted visual words, (c) top 50 selected visual words by wc-tf-idf, and (d) the covered areas of informative visual words relevant to the query.



The construction of VWs involves image feature point detection, description and quantization. Feature points are detected by either the Difference-of-Gaussian (vwd) or Hessian-Affine detector (vwh) [30] and then described by the scale-invariant feature transform (SIFT) [31]. Typically an image contains hundreds to thousands of interest points. The descriptors from all images are quantized into 3500 clusters using k-means, and the resulting centroid of a cluster is defined as a VW. A feature point is assigned to a VW by comparing its descriptor to the cluster centroids. Therefore, every image can be viewed as a collection of (discrete) VWs and represented as a frequency histogram (FH) by counting the occurrences of each VW (empirically, the FH is normalized so that its largest value is one). Since an interest point can be visualized as an ellipse according to its SIFT parameters, we can naturally compute the coverage of a VW by summing the areas of the interest points that correspond to that VW. A more frequent VW may cover a larger area in the image. The coverages of all VWs of an image form its coverage vector (CV). Note that both FH and CV can be pre-computed off-line. We now turn to a formal description of the proposed canonical image selection algorithm, cannoG. Given the N retrieved images for a query and the associated VWs, we first select the M' most informative VWs using wc-tf-idf (cf. Figs. 4(b) and 4(c)). Let us denote the importance of the selected VWs as bj (computed by Eq. (7)), the corresponding FH and CV of each image as Hij and Cij, j = 1…M', i = 1…N, and the set of selected canonical images as Ψ, which is initially empty. The coverage score of image i is defined as $\sum_j b_j C_{ij}$, the weighted sum of the CV, which aggregates informativeness and coverage. A naïve strategy would be to select the images with the highest coverage scores, but the major problem with this strategy is that the selected images would be visually similar, as they contain similar sets of VWs. An alternative is to set the importance of the frequently appearing VWs of previously selected images to zero and select the next image whose updated coverage score is the highest among the remaining ones. This strategy has the advantage that the selected images are more likely to be dissimilar, as they contain different sets of VWs. In this way both coverage and orthogonality are considered. In brief, cannoG is composed of the following steps. In each iteration, the image s with the highest coverage score is greedily selected and added to Ψ. We then examine the FH of s and set the importance bj of the j-th VW to zero if Hsj exceeds a predefined threshold θ, θ ∈ [0, 1]. Note that if θ is set too low, the importance of most informative VWs will be set to zero and the selected images will not be representative. Empirically, θ is set to 0.75 to balance coverage and orthogonality. The coverage score of each image is updated and the next canonical image is selected according to the new coverage scores. The iterative process ends when K images have been selected, where K is the number of images the user wishes to obtain. It is also worth mentioning that Fig. 4(c) demonstrates the effectiveness of wc-tf-idf in selecting important context cues. It is evident that the selected VWs do correspond to the bridge structure, which is the most salient object for the query. More examples are shown in Fig. 5.

6. FLICKR550 DATA SET Since there is no public benchmark for shared consumer photo collections, we resort to building the dataset ourselves.

Over 0.5 million consumer photos are collected from Flickr and a number of queries are annotated for facilitating evaluation. The new dataset Flickr550 is made publicly available6 as a benchmark.

6.1 Data Acquisition 540,321 public photos are downloaded from Flickr through the Flickr API7. We start with the “European Travel” Flickr group, in which 486 members had uploaded their photos before 2007/10/25. From these 486 members, we further trace their own public photos and download the images and the associated metadata, including tags, text descriptions, time stamps and geo tags. Visual feature extractors and concept detectors are then applied to generate context cues. The low-level visual features adopted by our system include grid color moments (gcm), Gabor textures (gbr), edge direction histograms (edh) [7] and VWs (vwd and vwh). For detection of high-level semantic concepts, we utilize the SVM-based models provided by Columbia University8 to compute concept scores for the LSCOM lexicon (cpt) [22], a set of 374 visual concepts annotated over an 80-hour subset of the TRECVID data. The associated geo tags and time stamps are also utilized. See Section 7.1.4 for more details.

6.2 Query Design and Annotation To evaluate the search results quantitatively, we define 21 queries covering several types of information needs and manually annotate them. The queries can be classified into four categories:
1. Landmarks: Colosseum, Eiffel Tower, Golden Gate Bridge, Louvre, Tower Bridge, Torre Pendente Di Pisa, Triomphe
2. Objects: beer, coffee cup, football, horse, pyramid, ramen, spaghetti, Starbucks, Van Gogh painting
3. Scenes: beach, church, park, snow
4. Activity: Oktoberfest
Since annotating 21 queries over 0.5 million images is almost infeasible, we employ the pooling strategy to sample a small subset of candidate images for human annotation, similar to the tasks in the TRECVID benchmark [13]. Several baseline models are employed in the search process, and the top images retrieved by different models are fused to generate the candidate set. In this work, we build five baseline models using different context cues: Mtext, Mgcm, Mgbr, Medh and Mcpt. Mtext is simply the text-based baseline, which is also used in Section 7.1.2. We employ the MySQL9 full-text search model and extract text information from associated tags, titles, descriptions and comments. We further run example-based search using the top five retrieved images of Mtext with different features (gcm, gbr, edh and cpt) to generate the other baselines. The top 1000 images retrieved by each baseline model are then fused by linear combination to form the candidate set, from which we finally choose the top 3000 images for human annotation. See [13] for more explanation of the pooling method.
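The pooling step can be sketched as follows. This is a schematic illustration only: the per-model score dictionaries, equal fusion weights, and cutoff names are assumptions, not details given in the paper.

```python
def build_annotation_pool(baseline_scores, weights=None, pool_size=3000, per_model=1000):
    """baseline_scores: {model_name: {image_id: score}} from each baseline model.
    Keep each model's top `per_model` images, fuse their scores linearly,
    and return the top `pool_size` image ids for human annotation."""
    weights = weights or {name: 1.0 for name in baseline_scores}
    fused = {}
    for name, scores in baseline_scores.items():
        top = sorted(scores.items(), key=lambda kv: -kv[1])[:per_model]
        for img_id, score in top:
            fused[img_id] = fused.get(img_id, 0.0) + weights[name] * score
    return sorted(fused, key=fused.get, reverse=True)[:pool_size]
```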

7. EXPERIMENTS 7.1 Evaluation of Reranking ListNet is implemented in MATLAB.

6 http://mpac.ee.ntu.edu.tw/~yihsuan/reranking/ 7 http://www.flickr.com/services/api/flickr.people.getPublicPhotos.html 8 http://www.ee.columbia.edu/dvmm/

9 http://www.mysql.com/



Empirically, we set the fusion weight α to 0.5 for simplicity and use five-fold cross-validation to conduct reranking. The learning rate η and convergence threshold δ are empirically set to 0.005 and 1e-4, respectively (parameter sensitivity tests are presented later). The programs are executed on a regular Intel Pentium server.

7.1.1 Comparison with existing methods using the broadcast news video dataset To compare against existing reranking approaches, we first conduct experiments on the TRECVID 2005 (TV05) data set [13], which consists of 277 international broadcast news video programs, accumulating 170 hours of video from 6 channels in 3 languages (Arabic, English, and Chinese). The automatic speech recognition and machine translation transcripts are provided by NIST [13]. The video data are segmented into shots, and each shot is represented by a few keyframes (subshots). The performance is evaluated in terms of average precision (AP), which approximates the area under a non-interpolated recall/precision curve. Since AP only shows the performance of a single query, we use mean average precision (MAP) [13], simply the mean of the APs over multiple queries, to measure average performance over sets of queries. TV05 contains 24 query topics and evaluates AP over the top 1000 shots (i.e., the AP depth is 1000). We compare ListNet with two existing reranking methods: IB reranking [7], which uses low-level visual features without feature selection, and SVM reranking [9], which employs mutual information to select 75 concepts from “cpt.” While the MAP of the text baseline “text-okapi” is 0.087, the existing methods improve the MAP to 0.105 and 0.112, respectively, as shown in Table 1. Table 1 also shows that with the 150 most informative concepts selected by wc-tf-idf, ListNet improves the MAP to 0.122 (a 40.2% relative performance gain over the text baseline), which significantly outperforms the existing methods. Moreover, thanks to the linear kernel, ListNet is surprisingly efficient and takes less than one second to rerank a single query. This query-time efficiency is vital since reranking is an online process. Table 1 also indicates that wc-tf-idf is more effective than c-tf-idf in selecting informative context cues.
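For reference, the AP and MAP measures used throughout the experiments can be computed as below. This is a standard, non-interpolated formulation evaluated at a fixed depth; it is our own sketch of the metric, with relevance judgments assumed to be given as sets of relevant ids.

```python
def average_precision(ranked_ids, relevant_ids, depth=1000):
    """Non-interpolated average precision over the top `depth` results."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranked_ids[:depth], start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank      # precision at each relevant position
    return precision_sum / max(len(relevant_ids), 1)

def mean_average_precision(runs, judgments, depth=1000):
    """runs: {query: ranked id list}, judgments: {query: set of relevant ids}."""
    aps = [average_precision(runs[q], judgments[q], depth) for q in runs]
    return sum(aps) / len(aps)
```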

7.1.2 Reranking for Flickr550 using visual content and concept scores Having seen in the previous section that ordinal reranking is more efficient and effective than existing methods for broadcast news videos, we can move on with confidence to the novel domain of shared consumer photos. We use the text-based model Mtext as a baseline, and first apply reranking with visual features and concept scores to Flickr550. We rerank the top 1000 retrieved images of the text baseline and evaluate AP at depth 1000.

Table 2 shows the performance of ordinal reranking using various context cues across query categories. It can be observed that reranking based on VWs generally exhibits the best performance, especially for Landmarks and Objects, whose visual appearance is relatively well defined. In particular, “vwd” achieves the best performance and improves the baseline from 0.422 to 0.469 (an 11.0% relative performance gain). Note that the smaller relative gain compared to that achieved on TV05 is largely due to the higher initial MAP of the text baseline on Flickr550. Previous studies (e.g., [9]) also show that reranking offers less performance gain for a strong baseline model, which may be close to its performance ceiling. For queries lacking clear appearance patterns, context cues other than VWs may perform better; e.g., “cpt” performs best for the query “oktoberfest” and “gcm” for “snow.” Since different context cues tend to capture different aspects of information, it should be beneficial to fuse their results. We apply late fusion by averaging over “gcm”, “vwd” and “vwh” and find that the MAP is further improved to 0.483. A consistent improvement in each query category is observed. Fig. 6 depicts the AP of the baseline and reranked results for each query of Flickr550. Ordinal reranking improves the result of nearly every query, indicating that the approach is quite stable and robust. Salient improvements are found for the queries “colosseum,” “football” and “torre pendente di pisa.” However, as shown in [9], for queries that lack viable means of obtaining a good initial search result, such as “church,” “park” and “louvre,” ordinal reranking does not offer improvements, since the contextual patterns are difficult to discover.

Figure 6: Average precision of baseline and reranked results for each query in Flickr550. ListNet is employed, and the results of grid color moments and visual words are fused. Reranking improves almost every query and improves the MAP from 0.422 to 0.483 (14.3% relative improvement) measured at depth 1000.

Table 2: Performance of ordinal reranking with various context cues for Flickr550, across query categories.

Category      | text  | gcm   | vwd   | vwh   | cpt   | fusion (gcm+vwd+vwh)
Landmarks (7) | 0.474 | 0.472 | 0.546 | 0.512 | 0.476 | 0.563
Objects (9)   | 0.421 | 0.410 | 0.474 | 0.473 | 0.431 | 0.488
Scenes (4)    | 0.377 | 0.385 | 0.388 | 0.387 | 0.386 | 0.395
Activity (1)  | 0.255 | 0.232 | 0.210 | 0.230 | 0.312 | 0.225
All (21)      | 0.422 | 0.417 | 0.469 | 0.458 | 0.432 | 0.483
Improv. (%)   | -     | -1.2% | 11.0% | 8.4%  | 2.2%  | 14.3%

Table 1: Performance comparison of various reranking methods on the TRECVID 2005 search task in terms of MAP (mean average precision) and computation time per query.

# | reranking algorithm | feature set | # of features | feature selection | MAP   | improv. (%) | time/query
1 | baseline            | text        | -             | -                 | 0.087 | -           | -
2 | IB [7]              | gcm+gbr     | 273           | -                 | 0.105 | 20.7%       | 18s
3 | SVM [9]             | cpt         | 75            | mutual info.      | 0.112 | 28.7%       | 17s
4 | ListNet             | cpt         | 150           | c-tf-idf          | 0.118 | 35.6%       | 0.4s
5 | ListNet             | cpt         | 150           | wc-tf-idf         | 0.122 | 40.2%       | 0.4s



This may be an inherent limitation of the reranking methodology, which relies heavily on the quality of the initial model. We do not discuss this issue further, leaving it as part of future research. In addition, as shown in Fig. 7, reranking consistently improves the search result at different AP depths. Notably, the relative gain reaches 20–30% at small AP depths. Considering only the top 10 retrieved images, a 30% relative performance gain is obtained. In practice, users tend to browse only the first few pages of a search result [12], so a reranking method that moves semantically related positives upward favors user satisfaction. Fig. 8 depicts the top 5 baseline and reranked results for two queries.

7.1.3 Parameter sensitivity test We examine the sensitivity of the performance to changes in several parameters that are set empirically for ListNet. We first analyze the impact of the fusion weight α (in Eq. (5)) for fusing the text-based relevance scores Y and the context-based reranked scores Z by varying α from 0.0 (totally text-based) to 1.0 (totally reranked). With α close to 1, the reranking process relies almost entirely on the contextual similarities and ignores the text search prior; hence, the performance degrades sharply, and the totally reranked setting (α=1) yields a MAP similar to that of the purely text-based one. Both modalities actually carry complementary information, and fusing them with α=0.6 gives rise to the optimal reranking result. A similar observation is also reported in [8]. Finally, we conduct parameter sensitivity tests on the values of the learning rate η and the convergence threshold δ, which control the degree of weight updating and the convergence time. We find the performance is rather invariant to changes in η and δ, as long as the values are set moderately small. A too large value of η results in oscillation of W, and a too small value of δ may overfit the data. Therefore, we set η and δ to 0.005 and 1e-4 to balance convergence time and reranking performance.

7.1.4 Reranking for Flickr550 using time and geo tags We further examine the use of time and geo tags in the reranking framework. Time stamps are readily associated with each photo and contain both date and time information. We assume the month is the most informative granularity here and use it to quantize the time metadata into 12 bins. About one fifth of the images in our dataset contain geo tags. We employ k-means to quantize the geo tags into 36 bins according to the nearest cluster centroid. Reranking based on visual features cannot improve activity-related queries, whose appearances are often not well defined. However, since activity-related queries are often events that only take place on specific days or in specific months of the year, it is possible to exploit the contextual patterns from time metadata. For example, a spike-like distribution of the time metadata for “oktoberfest” is shown in Fig. 9. We thus apply ordinal reranking to “oktoberfest” using solely the time metadata. Remarkably, such a uni-modal method improves the AP from 0.255 to 0.292 (a 14.5% relative gain), and we find that most of the top-ranked images after reranking are indeed associated with time stamps in October. We then utilize geo tags to rerank the query category Landmarks, where geographical information intuitively plays a more important role. Experimental results show that the geo tags associated with the top-ranked images after reranking are indeed correct, but the MAP does not improve much (from 0.352 to 0.372, a 5.9% relative gain). The performance does not increase as we increase the number of cluster centroids for quantizing geo tags. We find it reasonable to ascribe the less salient improvement to our query design and annotation processes, which have been carried out in a rigorous manner such that an image is considered positive only if the whole query object is exactly in sight. Images that are taken beneath, from, or of only a small portion of the target object are not counted as correct. This strategy ensures the credibility of our experiment, yet suppresses the power of geo tags, since they only carry geographical information.
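A small sketch of the time and geo quantization described above follows. The one-hot bin encoding is our assumption (the paper only states that the metadata are quantized into 12 and 36 bins, respectively), and k-means is taken from scikit-learn for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def time_feature(months):
    """One-hot month bins (12) from photo time stamps; `months` holds values 1-12."""
    feats = np.zeros((len(months), 12))
    feats[np.arange(len(months)), np.asarray(months) - 1] = 1.0
    return feats

def geo_feature(latlon, n_bins=36):
    """Quantize (lat, lon) geo tags into `n_bins` one-hot bins via k-means."""
    labels = KMeans(n_clusters=n_bins, n_init=10).fit_predict(latlon)
    feats = np.zeros((len(labels), n_bins))
    feats[np.arange(len(labels)), labels] = 1.0
    return feats
```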

7.2 Evaluation of Recommendation 7.2.1 Recommending relevant tags Table 3 shows the recommended relevant tags for a number of queries. Scanning through this table, we can observe a number of interesting relationships that may be hard to discover lexically. The first, and perhaps most obvious, is essentially the discovery of geographic relations.

Figure 8: Top five retrieved images of the baseline and the reranked result (using VWd as the feature set) for two queries, (a) “colosseum” and (b) “coffee cup.”

Figure 7: MAPs of ordinal reranking at different AP depths. A consistent improvement is observed. Notably, reranking improves the MAP from 0.5 to 0.65 over the top 10 images and offers 30% relative performance gain.

Figure 9: Distribution of the time metadata for the images tagged with “oktoberfest” in 2005. For such activity-related queries, a spike-like distribution can easily be observed.



Examples include Paris for “triomphe,” Thames for “tower bridge,” and Munich for “oktoberfest.” These are the relationships we expect to find, as Flickr550 is mainly composed of travel photos. Sometimes further geographic names are mined, such as Egypt and Louvre for “pyramid,” and Italian and Japanese for “spaghetti.” These tags provide broader knowledge of the query. The second type of tag relates to the intrinsic properties of the query, such as arc for “triomphe,” beer for “oktoberfest,” and monument for “pyramid.” From these tags a user gains a deeper understanding of the query. More interestingly, Jean Chalgrin, the architect of the Arc de Triomphe, is also discovered as a relevant tag for “triomphe.” Finally, some tags describe the query in other languages, e.g., Arc of Titus relates to “triomphe” in Latin, while Wiesn relates to “oktoberfest” in German. These examples demonstrate the richness of user-provided tags and the effectiveness of wc-tf-idf in selecting informative terms from them.

7.2.2 Recommending multiple canonical images We apply cannoG to the top 100 images retrieved after reranking to select canonical images, using the 200 most informative VWs selected by wc-tf-idf for each query as the feature set. Two other baseline methods are used for comparison. The first, topK, simply uses the top K retrieved images of the text-based search model. For the second baseline, we apply k-means clustering to the same top 100 images, with the number of clusters chosen such that the average number of images in each resulting cluster is around 20. We then rank the clusters by the ratio of inter-cluster distance to intra-cluster distance [17] and select the images closest to the top K cluster centroids as canonical images. The similarity between images is also computed over the 200 most informative VWs of each query. K is set to 6 in the following experiments. Sample results are shown in Fig. 10. Ideally, the canonical image set will contain a diverse set of photos, demonstrating different aspects of the target semantics with highly representative images. Motivated by [17] and [18], we carry out a subjective test and ask users to answer the following three evaluation questions:

Representative: How many photos are representative of the query (0–6)?
Redundant: How many photos are redundant (0–6)?
1st place voted: Which canonical image set (among the three) is the most satisfactory?
We present the results of the three selection methods in a random order.

The canonical images of 6 queries10 are used for evaluation, and 12 human evaluators are invited to participate in our experiment. The results for each question are averaged over all users and queries.

10 Colosseum, Eiffel Tower, Golden Gate Bridge, Torre Pendente Di Pisa, spaghetti, and Starbucks.

Table 4 summarizes the subjective evaluation, along with the time needed by each method for a query. Evidently, cannoG performs the best on all scores, especially for 1st place voted. cannoG even obtains a higher representative score than topK, which selects the top retrieved photos. This implies that our strategy of selecting images that have high coverage of important context cues is reasonable and practical. In addition, the consideration of orthogonality also helps cannoG obtain the lowest redundant score, indicating that the selected images are not only representative but also diverse. More importantly, cannoG is extremely efficient and requires only 0.01 second (implemented in MATLAB) to select canonical images. Again, this query-time efficiency is essential for a system searching over large-scale databases. Though the performance of k-means can be improved by considering other cluster ranking methods as described in [17], the time inefficiency of clustering or other advanced matching approaches (e.g., [18]) has prevented them from practical use.

8. DISCUSSIONS
A few issues arise from the nature of the reranking framework. First, since reranking considers only the top N retrieved images, can it retrieve other positive images not in the initial list? The answer is definitely yes. It is possible to improve the recall rate by applying the learnt contextual patterns to the whole corpus to retrieve photos that are not included in the initial list; for example, the text-like indexing method [29] can efficiently leverage the selected VWs (cf. Fig. 4(c)) to retrieve further related photos in the database (a sketch of this idea is given at the end of this section). Second, how do we know the users' search targets? This issue not only affects the way experimental evaluations are designed, but also has a profound impact on the success of such online systems. The evaluation of reranking reported in this paper does not address this issue, since we have followed a rigorous procedure to define and annotate the queries. The complex nature of information seeking makes the proper design of a search system debatable, and this question is left for further investigation.

Table 3: Sample results of relevant tag recommendation

query          relevant tags
Triomphe       Paris, France, arc, Jean Chalgrin, Roman, Arc of Titus
Tower Bridge   London, bridge, UK, England, Thames, night, Paris
pyramid        Egypt, Paris, Louvre, monument, ancient, temple
spaghetti      food, pasta, Italian, Japanese, garlic, tomato, sauce
Oktoberfest    Munich, Wiesn, Wirteeinzug, Trachten, beer, Germany

Table 4: Comparison of the canonical image selection methods by time efficiency and subjective evaluation.

method    representative  redundant  1st place voted  time/query
topK      4.31            1.13       34.7%            0 s
kmeans    3.21            1.17       5.6%             2.43 s
cannoG    4.54            1.03       59.7%            0.01 s

Figure 10: Canonical images for the query “starbucks” generated by three methods: (a) the top retrieved images of the text baseline, (b) k-means clustering, and (c) the proposed algorithm, cannoG.



Currently, we address this issue by recommending contextually relevant tags and multiple canonical images, yet we acknowledge the need to develop more advanced methods for probing user intentions and presenting satisfactory search results.
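To illustrate the recall-expansion idea mentioned above, the following sketch shows a toy text-like inverted index over visual words in the spirit of [29]; the class layout, the min_hits threshold, and the scoring are assumptions for illustration, not the implementation used in ContextSeer.

```python
from collections import defaultdict

class VisualWordIndex:
    """Toy inverted index mapping visual words to the photos containing them,
    in the spirit of the text-like retrieval of Sivic & Zisserman [29]."""

    def __init__(self):
        self.postings = defaultdict(set)   # VW id -> set of photo ids

    def add(self, photo_id, visual_words):
        for vw in visual_words:
            self.postings[vw].add(photo_id)

    def expand(self, selected_vws, exclude=frozenset(), min_hits=2):
        """Retrieve photos outside the initial list that contain at least
        `min_hits` of the visual words selected by reranking."""
        hits = defaultdict(int)
        for vw in selected_vws:
            for photo in self.postings.get(vw, ()):
                if photo not in exclude:
                    hits[photo] += 1
        return sorted((p for p, h in hits.items() if h >= min_hits),
                      key=hits.get, reverse=True)
```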

9. CONCLUSION
In this paper we propose and evaluate a novel system, ContextSeer, for searching over large-scale collections of shared consumer photos. The proposed methods introduce the use of the rich context cues of consumer photos and the ability to automatically improve search results (by reranking) and recommend additional information (i.e., query-related tags and canonical images) at query time. By employing the ranking algorithm ListNet in the reranking framework, the underlying ordinal information can be better exploited, resulting in a more efficient and effective reranking method. A novel context cue selection measure, wc-tf-idf, is developed to improve the performance of reranking. In addition, we propose an efficient canonical image selection algorithm, cannoG, to generate multiple representative views related to the query without a time-consuming clustering procedure. To evaluate ContextSeer, we have collected 540,321 photos from Flickr and manually annotated a diverse set of queries. We make the benchmark, Flickr550, publicly available. In experiments on both the Flickr550 and TRECVID benchmarks, ordinal reranking achieves significant performance gains over the text baselines. Moreover, the proposed ordinal reranking and cannoG are both faster than the rival methods by orders of magnitude. Finally, we conduct a subjective test and show that the set of canonical images selected by cannoG is representative and satisfactory.

One promising direction for future work is to extend ContextSeer to the domain of community-contributed videos, which have an extra temporal dimension and pose additional challenges for discovering contextual cues and selecting representative views.

10. REFERENCES
[1] L. Kennedy et al., “How Flickr helps us make sense of the world: Context and content in community-contributed media collections,” ACM Multimedia, pp. 631–640, 2007.

[2] M. Ames et al, “Why we tag: Motivations for annotation in mobile and online media,” ACM CHI, pp. 971–980, 2007.

[3] L. Kennedy, S.-F. Chang, and I. Kozintsev, “To search or to label?: Predicting the performance of search-based automatic image classifiers,” Proc. ACM Int. workshop on Multimedia information retrieval, pp. 249–258, 2006.

[4] A. K. Dey, “Understanding and using context,” Personal and Ubiquitous Computing, vol. 5, no. 1, 2001.

[5] M. Naphade et al, “Large-scale concept ontology for multimedia,” IEEE Multimedia Magazine, vol. 13, no. 3, pp. 86–91, 2006.

[6] K. Toyama et al, “Geographic location tags on digital images,” ACM Multimedia, pp. 156–166, 2003.

[7] W. Hsu et al, “Video search reranking via information bottleneck principle,” ACM Multimedia, pp. 35–44, 2006.

[8] W. Hsu, L. Kennedy, and S.-F. Chang, “Video search reranking through random walk over document-level context graph,” Proc. ACM Multimedia, pp. 971–980, 2007.

[9] L. Kennedy and S.-F. Chang, “A reranking approach for context-based concept fusion in video indexing and retrieval,” ACM CIVR, pp. 333–340, 2007.

[10] Y.-H. Yang and W.-H. Hsu, “Video search reranking via online ordinal reranking,” IEEE ICME, 2008.

[11] A. Natsev et al, “Semantic concept-based query expansion and re-ranking for multimedia retrieval,” ACM Multimedia, pp. 991–1000, 2007.

[12] J. Battelle, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture, 2006.

[13] NIST TREC Video Retrieval Evaluation. [online] http://www-nlpir.nist.gov/projects/trecvid/.

[14] X. Li et al, “Video search in concept subspace: a text like paradigm,” ACM CIVR, pp. 603–610, 2007.

[15] A. Aizawa, “An information-theoretic perspective of tf-idf measures,” Information Processing and Management, vol. 39, pp. 45–65, 2003.

[16] S. Palmer et al, “Canonical perspective and the perception of objects,” Attention and Performance IX, pp. 135–151, 1981.

[17] L. Kennedy et al, “Generating diverse and representative image search results for landmarks,” WWW, 2008.

[18] Y. Jing et al, “Canonical image selection from the web,” ACM CIVR, pp. 280–287, 2007.

[19] R. Yan, A. Hauptmann, and R. Jin, “Multimedia search with pseudo-relevance feedback,” ACM CIVR, 2003.

[20] R. Herbrich et al, “Support vector learning for ordinal regression,” IEEE ICANN, pp. 97–102, 1999.

[21] Z. Cao et al, “Learning to rank: from pairwise approach to listwise approach,” IEEE ICML, pp. 129–136, 2007.

[22] S.-F. Chang et al, “Columbia University TRECVID-2005 video search and high-level feature extraction,” NIST TRECVID workshop, 2005.

[23] I. Simon et al, “Scene summarization for online image collections,” IEEE ICCV, pp. 1–8, 2007.

[24] S. Wang et al, “IGroup: presenting web image search results in semantic clusters,” ACM CHI, 2007, pp. 377–384.

[25] N. J. Belkin, “Helping people find what they don’t know,” Communication of the ACM, vol. 43, no. 8, pp. 58–61, 2000.

[26] C.-K. Huang et al, “Relevant term suggestion in interactive web search based on contextual information in query session logs,” Journal of the American Society for Information Science and Technology, vol. 54, no. 7, pp. 638–649, 2003.

[27] R. Jones, B. Rey, O. Madani, and W. Greiner, “Generating query substitutions,” ACM WWW, 2006.

[28] J. Xu and W. Croft, “Query expansion using local and global document analysis,” ACM SIGIR, pp. 4–11, 1996.

[29] J. Sivic and A. Zisserman, “Video Google: A text retrieval approach to object matching in videos,” ICCV, 2003.

[30] K. Mikolajczyk and C. Schmid, “Scale & affine invariant interest point detectors,” IJCV, vol.60, no.1, pp. 63–86, 2004.

[31] D. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.

[32] S. Sontag, On Photography, Picador USA, 2001.

[33] L. Page et al., “The PageRank citation ranking: Bringing order to the web,” Stanford University, 1998.
