

IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 14, NO. 4, APRIL 2005 511

A Memory Learning Framework for Effective Image Retrieval

Junwei Han, King N. Ngan, Fellow, IEEE, Mingjing Li, and Hong-Jiang Zhang, Fellow, IEEE

Abstract—Most current content-based image retrieval systems are still incapable of providing users with their desired results. The major difficulty lies in the gap between low-level image features and high-level image semantics. To address the problem, this study reports a framework for effective image retrieval by employing a novel idea of memory learning. It forms a knowledge memory model to store the semantic information by simply accumulating user-provided interactions. A learning strategy is then applied to predict the semantic relationships among images according to the memorized knowledge. Image queries are finally performed based on a seamless combination of low-level features and learned semantics. One important advantage of our framework is its ability to efficiently annotate images and also propagate the keyword annotation from the labeled images to unlabeled images. The presented algorithm has been integrated into a practical image retrieval system. Experiments on a collection of 10 000 general-purpose images demonstrate the effectiveness of the proposed framework.

Index Terms—Annotation propagation, image retrieval, memory learning, relevance feedback, semantics.

I. INTRODUCTION

DUE to the rapidly growing amount of digital image data on the Internet and in digital libraries, there is a great need for large image database management and effective image retrieval tools. Content-based image retrieval (CBIR) is the set of techniques for searching for similar images in an image database using automatically extracted image features.

Tremendous research has been devoted to CBIR, and a variety of solutions have been proposed within the past ten years. By and large, research activities in CBIR have progressed in three major directions [5]: global features based, object/region-level features based, and relevance feedback. Initially developed systems [1], [2] are usually based on carefully selected global image features, such as color, texture, or shape, and a prefixed similarity measure. They are easy to implement and perform well for images that are either simple or contain few semantic contents (for example, medical images and face images). However, for these systems, it is impossible to search for objects or regions of an image. Therefore, the second group of systems [3]–[5] is built on image segmentation. In contrast to global feature-based systems, these schemes are designed to search for "things" by extracting local features from segmented regions and describing images at the region or object level [5]. The performance of these systems relies mainly on the results of segmentation. Therefore, they cannot deliver very good performance, since image segmentation is still an open problem in computer vision.

Manuscript received November 3, 2003; revised March 24, 2004. This work was supported in part by Nanyang Technological University, Singapore. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Gopal Pingali.

J. Han and K. N. Ngan are with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong (e-mail: [email protected]; [email protected]).

M. Li and H.-J. Zhang are with Microsoft Research Asia, Beijing 100080, China (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TIP.2004.841205

The limited retrieval accuracy of image-centric retrieval systems is essentially due to the inherent gap between semantic concepts and low-level features. In order to reduce this gap, interactive relevance feedback (RF) is introduced into CBIR. RF, originally developed for textual document retrieval [8], is a supervised learning algorithm used to improve the performance of information systems. Its basic idea is to incorporate human perception subjectivity into the query process and provide users with the opportunity to evaluate the retrieval results. The similarity measures are automatically refined on the basis of these evaluations. After RF for CBIR was first proposed by Rui et al. [9], this area of research has attracted much attention and become active in the CBIR community. Many groups have reported their RF techniques [5], [7], [9]–[14].
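The refinement of similarity measures from user evaluations is, in text retrieval at least, classically realized by a query-update rule such as Rocchio's. The sketch below is purely illustrative: the weights, vectors, and function name are our assumptions, not the update rule of any system cited here.

```python
# Rocchio-style query refinement: move the query vector toward the mean
# of positive examples and away from the mean of negative examples.
# Illustrative sketch only; weights alpha/beta/gamma are conventional
# text-retrieval defaults, not values from the cited CBIR systems.

def rocchio(query, positives, negatives, alpha=1.0, beta=0.75, gamma=0.15):
    dim = len(query)
    new_q = [alpha * q for q in query]
    if positives:
        for d in range(dim):
            new_q[d] += beta * sum(p[d] for p in positives) / len(positives)
    if negatives:
        for d in range(dim):
            new_q[d] -= gamma * sum(n[d] for n in negatives) / len(negatives)
    return new_q

# One positive example pulls the query toward the second feature dimension;
# one negative example pushes it away from both.
refined = rocchio([1.0, 0.0], positives=[[0.0, 1.0]], negatives=[[1.0, 1.0]])
```

Each feedback round re-runs the update with the newly labeled examples, which is how the loop "automatically refines" the similarity over iterations.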

Recently, many researchers have begun to consider RF as a learning or classification problem. That is, a user provides positive and/or negative examples, and the system either learns from such examples to refine the retrieval results or trains a classifier on the labeled examples to separate all data into relevant and irrelevant groups. Hence, many classical machine learning schemes may be applied to RF, including decision tree learning [17], Bayesian learning [7], [10], support vector machines (SVMs) [14], boosting [18], and so on. For the latest developments in RF, please refer to [15] and [16].
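To make the classification view concrete: the classifier is trained on the session's labeled examples and then scores every database image. The cited systems use decision trees, Bayesian learning, SVMs, or boosting; a minimal nearest-centroid classifier stands in for any of them below, and the toy feature vectors are our own.

```python
# RF as classification: score each image by how much closer it lies to
# the centroid of the user's positive examples than to the centroid of
# the negatives. A nearest-centroid rule is used purely for brevity.

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def relevance_score(image, positives, negatives):
    """Positive score: image is nearer the positive centroid."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return dist2(image, centroid(negatives)) - dist2(image, centroid(positives))

pos = [[1.0, 1.0], [0.9, 1.1]]   # user-labeled relevant images
neg = [[0.0, 0.0], [0.1, -0.1]]  # user-labeled irrelevant images
score_near_pos = relevance_score([1.0, 0.9], pos, neg)
score_near_neg = relevance_score([0.0, 0.1], pos, neg)
```

Ranking the whole database by this score and returning the top images is the refinement step; each new feedback round enlarges `pos` and `neg` and retrains.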

Although RF can significantly improve the retrieval performance, its applicability still suffers from three inherent drawbacks.

1) Incapability of capturing semantics. Most RF techniques in CBIR directly copy ideas from textual information retrieval. They simply replace keywords with low-level features and then adopt the vector model for document retrieval to perform interactions. This strategy works well under the premise that low-level features are as powerful in representing the semantic content of images as keywords are in representing textual information. Unfortunately, this requirement is often not satisfied. Therefore, it is difficult to capture the high-level semantics of images when only low-level features are used in RF.

2) Scarcity and imbalance of feedback examples. Very few users are willing to go through endless iterations of feedback in the hope of getting the best results. Hence, the number of feedback examples labeled by users during an RF session is far smaller than the dimension of the low-level features that characterize an image. With such small training data sizes, many classical learning algorithms cannot produce satisfactory results. Furthermore, in the RF scenario, the number of labeled negative examples is usually greater than the number of labeled positive examples. As pointed out in [19], the imbalance of training data always makes classification learning less reliable. Thus, the scarcity of feedback examples, especially positive examples, definitely limits the accuracy of RF.

3) Lack of a memory mechanism. A disadvantage of traditional RF is that the semantic knowledge potentially obtained in the feedback processes of one query session is not memorized to continuously improve the retrieval performance [6], [15]. Even with the same query, a user will have to go through the same, often tedious, feedback process to get the same result, despite the fact that the user has given the same query and feedback before. Hence, there is an urgent need to build a memory mechanism that accumulates and learns the semantic information provided by past user interactions.

To overcome the aforementioned difficulties, another school of thought [6], [20]–[23], generally called long-term learning, has emerged in recent years. These methods memorize and accumulate users' preferences in the RF process. The historical retrieval experience is then used to guide new users' queries. Such long-term learning algorithms are mainly based on previous users' behaviors, which embody more semantic information than low-level features. They try to narrow the well-known gap through other users' subjective judgments, because image content understanding is very difficult given the present state of computer vision and image processing technology.

Actually, the idea of long-term learning in CBIR is borrowed from the work on collaborative filtering [24]–[26] and link structure analysis [27], [28] in web information retrieval. Collaborative filtering is a technique for predicting the preferences of unknown users by using the known attitudes of other users. It is built on the assumption that a good way to find interesting things is to discover other people who have similar interests, and then recommend objects that those similar people like. Unlike collaborative filtering, many web search engines search for web pages by link structure analysis. Two basic assumptions of link structure analysis are: pages that are co-cited by a certain page are likely to relate to the same topic, and pages that are often visited in succession by a certain user are possibly similar. The common idea behind these two techniques is to estimate the similarity between objects from users' behaviors instead of object contents.
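The collaborative filtering idea can be sketched in a few lines: weight other users' known ratings by their similarity to the target user. The toy ratings, the cosine weighting, and the function names are our illustrative assumptions, not taken from the cited systems [24]–[26].

```python
import math

# User-based collaborative filtering in miniature: predict an unknown
# preference as a similarity-weighted vote of other users' preferences.

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den if den else 0.0

def predict_rating(target_ratings, other_users, item):
    """Weight each other user's rating of `item` by taste similarity."""
    num = den = 0.0
    for ratings in other_users:
        w = cosine(target_ratings, ratings[:len(target_ratings)])
        num += w * ratings[item]
        den += w
    return num / den if den else 0.0

# Target user has rated items 0-2; item 3 is unrated.
target = [5.0, 4.0, 1.0]
others = [[5.0, 5.0, 1.0, 4.0],   # similar taste, liked item 3
          [1.0, 1.0, 5.0, 1.0]]   # opposite taste, disliked item 3
pred = predict_rating(target, others, item=3)
```

The prediction lands close to the rating of the like-minded user, which is exactly the "recommend what similar people like" assumption stated above.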

Without doubt, the long-term learning methods can achieve better retrieval precision than traditional RF techniques. However, they inevitably encounter two problems in practice. One is the sparsity of memorized feedback information. The quality of long-term learning relies strongly on the amount of user log that the system has stored so far. Because of the large database and limited interactions, it is not easy to collect sufficient log information. Hence, the long-term learning algorithms are not very useful when the user log is scarce with respect to the scale of the image database. The other problem is that most long-term learning approaches only recommend the memorized semantic knowledge to users but lack the learning ability to predict hidden semantics from the acquired semantics. Strictly speaking, there is no learning, or only limited learning, in such existing long-term learning systems.

In practical image retrieval systems, many users prefer using keywords to conduct queries [33]. Hence, images must first be annotated to support keyword searches. In general, two approaches are employed to annotate images: full annotation and partial annotation [31]. The former manually labels all images in the database. Although manual annotation is the best choice in terms of accuracy, it is not a feasible solution because human labeling is tedious and expensive; this is what motivated CBIR research a few years ago. The latter first manually marks only a small subset of images. The annotations are then propagated from the small number of marked images to the large number of unmarked images according to a similarity measure or classical learning algorithms.

Lately, research on image annotation and annotation propagation has been attracting growing interest [15], [29]–[33], [37], [38]. In [29] and [30], annotations are propagated by visual similarity measures. Zhang et al. [32] and Chang et al. [33] focus on active learning for annotation propagation. Liu et al. [31] use RF to improve annotation performance. Zhang et al. [15] further perform annotation propagation by integrating RF with a Bayesian model. Li et al. [37] and Barnard et al. [38] apply machine learning to predict words for images.

Despite many efforts, the accuracy of propagated annotation is still limited. The problem stems from the fact that images close to each other in low-level feature space do not necessarily share the same semantic meaning. Nevertheless, most of the above-mentioned systems assume that images located near each other in the feature space are likely to relate to similar keywords.

As analyzed above, when designing an effective image retrieval system, at least the following two issues should be considered: how to reduce the gap between low-level features and high-level semantic concepts, and how to annotate images and propagate the annotations efficiently. In this paper, we propose a novel memory learning framework to address these two issues. For the first issue, we introduce a feedback knowledge memory model to accumulate previous users' preferences. Furthermore, a learning strategy is presented to predict hidden semantics using the memorized information, which is able to reduce the limitation of user log sparsity to a certain extent. The feedback knowledge memory model and the learning strategy are jointly known as memory learning. In the process of image retrieval, memory learning can capture semantics from previous users' behaviors instead of image contents. Memory learning and low-level feature-based RF are then combined to improve the retrieval performance. According to the memorized knowledge, memory learning provides additional positive examples to low-level feature-based RF, which alleviates the problem of scarcity and imbalance of feedback examples. In the meantime, the improved low-level feature-based RF supplies fresh knowledge for memory learning to memorize. In other words, the mutual reinforcement of memory learning and low-level feature-based RF enhances the system's ability to grasp semantics.

To address the second issue, we propose an annotation propagation scheme that works on a semantic level via memory learning. It annotates images and propagates the annotations using both memorized and learned semantic information.

Here, we summarize our contributions as follows.

1) A feedback knowledge memory model is presented to gather the users' feedback information during the process of image search and feedback. It is efficient and simple to implement.

2) A learning strategy based on the memorized information is proposed. It can estimate the hidden semantic relationships among images. Consequently, this technique can address the problem of user log sparsity to a certain extent.

3) During the interactive process, a seamless combination of normal RF (low-level feature based) and memory learning (semantics based) is proposed to improve the retrieval performance. Notice that this combination is not a pure linear summation. Memory learning provides the normal RF with a pool of positive examples according to its captured knowledge, which helps the normal RF alleviate the problem of scarcity and imbalance of feedback examples.

4) A semantics-based image annotation propagation scheme is proposed using both memorized and learned semantics. Its precision is much better than that of existing algorithms that propagate annotations by visual similarity.

The rest of this paper is organized as follows. In Section II, we briefly review the related work. In Section III, we present the feedback knowledge memory model. In Section IV, the learning strategy to estimate the hidden semantics is described. In Section V, the image retrieval framework by memory learning is explained. The experimental results are shown in Section VI. Finally, concluding remarks are given in Section VII.

II. REVIEW OF RELATED WORK

This section discusses some previous work in long-term learning and image annotation propagation.

A. Related Work in Long-Term Learning

As previously stated, most traditional RF methods take into account only the current query session, while the semantics captured from past users is lost. Thus, in recent years, a number of long-term learning models [6], [20]–[23] have been presented to gradually improve the retrieval performance by accumulating the user query log.

The information-embedding framework [20] was probably the first attempt to explicitly memorize users' behaviors to improve retrieval accuracy. Its basic idea is to embed semantic information into CBIR processes through RF using a semantic correlation matrix and low-level feature distances. The semantic relationships among images are gained and embedded into the system by splitting/merging image clusters and updating the correlation matrix. Experiments have shown that this framework is effective. However, it is complex in terms of computation and implementation, and the system may take a long time to converge.

In [21], Bartolini et al. reported a system called FeedbackBypass. It assumes the existence of a static mapping from each retrieval sample to "optimal" parameters, including the query point and distance function. The "optimal" parameters are learned through feedback loops over time. Afterwards, it is possible either to "bypass" the feedback loop completely for already-seen queries or to "predict" near-optimal parameters for new queries.

Li et al. [6] described a bigram correlation model to capture the semantic relationships among images from statistics of users' RF information. The algorithm is simple but effective. Experimental results on a database of 100 000 images demonstrate its ability to improve the retrieval performance.

In [22], a long-term similarity learning algorithm was applied to CBIR. In this model, user feedback refines the current search results. The interaction information is stored to build the semantic similarity among images. This similarity is updated with queries and incorporated into the content-based similarity.

Recently, He et al. [23] introduced the idea of learning a semantic space from users' RF. It assumes that images relevant to a query belong to one semantic class. By aggregating many feedback iterations, a semantic space is incrementally constructed. In addition, this paper discusses singular value decomposition (SVD) to reduce the dimensionality of the semantic space.
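The SVD step mentioned for [23] amounts to a standard rank-k truncation: rows of the matrix stand for images, columns for accumulated feedback dimensions. The data below is invented; only the mechanics of the truncation are shown, not the construction used in [23] itself.

```python
import numpy as np

# Reduce an image-by-feedback matrix to k semantic dimensions: keep the
# top-k singular directions and use the scaled left singular vectors as
# low-dimensional image coordinates.

def reduce_rank(matrix, k):
    u, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return u[:, :k] * s[:k]   # k-dimensional coordinates per image

# Hypothetical accumulated feedback: images 0 and 1 always co-occur as
# positives, image 2 never does.
feedback = np.array([[1.0, 1.0, 0.0],
                     [1.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
coords = reduce_rank(feedback, 2)
```

Images with identical feedback histories collapse to the same point in the reduced space, while the unrelated image stays far away, which is the behavior a learned semantic space is after.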

All of these systems have obtained good empirical results. However, their performance strongly depends on the amount of gathered user log. A common problem of these methods is that they ignore the reality that it is hard to collect sufficient user interaction data for a non-web-based image retrieval system. In fact, the same issue arises in the area of web information retrieval. Researchers in this area have recognized the problem and suggested some solutions [26]; however, to the best of our knowledge, very little such work has been done in CBIR. Hence, this paper attempts to address this problem to a certain extent by using a learning strategy.

B. Related Work in Image Annotation Propagation

Often, users are accustomed to querying by keywords. It is very tedious and expensive to manually label all images in a database. Therefore, a challenge for CBIR systems is how to effectively propagate annotations from labeled images to the rest of the unlabeled images.

Picard and Minka [29] introduced an algorithm to propagate annotations by image texture. It consists of two steps. First, humans label a patch of an image. Then, the label is propagated to other images with similar patches of texture.

In [30], Saber and Tekalp provided a region-based image retrieval and annotation propagation framework. It first segments the image into regions and merges neighboring regions to form objects. Afterward, each object is compared to a set of given templates; if the match is successful, the annotations of the template are shared by the matched object.

The above two methods propagate annotations in terms of visual similarity, yet some researchers consider this issue from the active learning and classification perspective. Zhang et al. [32] proposed an active learning system to probabilistically propagate annotations. It estimates the attribute probabilities of unannotated images by means of their annotated neighbors, which is accomplished by kernel regression. The next sample for labeling is the one with a high density of unannotated neighbors.

In [33], Chang et al. recommended a soft annotation model (CBSA), also using active learning. It starts by labeling a small set of training images, each with one semantic keyword. An ensemble of binary classifiers is then trained to predict labels for each unlabeled image, giving the image multiple soft keywords. For the binary classifiers, the paper experiments with two learning algorithms, SVMs and Bayes point machines (BPMs).

Another class of ideas takes advantage of RF to implement image annotation and annotation propagation. A semi-automatic image annotation framework was suggested in [31]. This work embeds the annotation process in the process of image retrieval and RF. When the user submits a keyword query and then offers RF, the search keyword is automatically added to, or strengthened for, the positive examples, and removed from, or weakened for, the negative examples. The performance of image annotation improves progressively as the iterations of search and feedback increase.

In [15], Zhang et al. discussed a probabilistic progressive keyword propagation scheme integrating RF with a Bayesian model. While the user is providing feedback in a query session, it supposes that all positive examples belong to the same semantic class and that the features from the same semantic class follow a Gaussian distribution. Therefore, all positive examples in a query session are used to calculate and update the parameters of the corresponding semantic Gaussian class. Then, the probability of each image in the database belonging to this semantic class is estimated by the Bayesian model. The common keywords in the positive examples are propagated to the images with a very high probability of belonging to this class.
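The core of the scheme in [15] can be sketched as follows. The diagonal covariance, the variance floor, and the example thresholding are our simplifications for illustration; [15] additionally maintains and updates the class parameters across sessions.

```python
import math

# Sketch of Gaussian-class keyword propagation: fit a Gaussian to the
# session's positive examples, then propagate the session's keywords to
# images that score a high likelihood under that Gaussian.

def gaussian_params(positives):
    dim = len(positives[0])
    mean = [sum(p[d] for p in positives) / len(positives) for d in range(dim)]
    # Diagonal variance with a small floor to avoid division by zero.
    var = [max(sum((p[d] - mean[d]) ** 2 for p in positives) / len(positives),
               1e-6) for d in range(dim)]
    return mean, var

def likelihood(x, mean, var):
    """Unnormalized Gaussian density; enough for ranking/thresholding."""
    return math.exp(-0.5 * sum((x[d] - mean[d]) ** 2 / var[d]
                               for d in range(len(x))))

positives = [[0.0, 0.0], [0.2, 0.0], [0.0, 0.2]]   # one session's positives
mean, var = gaussian_params(positives)
near = likelihood([0.1, 0.1], mean, var)   # candidate close to the class
far = likelihood([5.0, 5.0], mean, var)    # candidate far from the class
```

An image like `near` would receive the session's common keywords; `far` would not, since its probability of belonging to the class is negligible.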

Of late, there is a novel trend of using machine learning algorithms to learn concepts from images and automatically translate the content of images into text descriptions. Notable work on automatic linguistic indexing of images was presented by Li et al. [37]. In this system, categorized images are used to train a dictionary of two-dimensional multiresolution hidden Markov models (2-D MHMMs), each representing a concept. Because each image category in the training set is annotated by humans, a mapping between 2-D MHMMs and groups of words can be built. To annotate a new image, the likelihood of the image being generated by each 2-D MHMM is first estimated, and then words are picked from those categories yielding the highest likelihoods. This work achieves success on a database of 600 image categories. Another representative work on linking images and words was done by Barnard et al. [38]. It explores a variety of latent variable models to predict words for both entire images and particular image regions.

It can easily be seen that the underlying assumption of most of the earlier methods is that images with similar visual features should be associated with the same keywords. Obviously, this assumption often does not hold. Accordingly, this paper tries to propagate annotations using memorized and learned semantic information.

III. FEEDBACK KNOWLEDGE MEMORY MODEL

In this section, we propose a simple statistical model to transform users' preferences during query sessions into semantic correlations among images. Then, a semantic image link network is formed to store those semantic correlations.

The key assumption of the proposed model is that two images share similar semantics if they are jointly labeled as positive examples in a query session. Intuitively, we can estimate the semantic correlation between two images by means of the number of query sessions in which both images are positive examples. A query session contains a query phase and possibly several rounds of feedback. For the sake of simplicity, the number of times that two images are jointly relevant to the same query is referred to as the co-positive-feedback frequency, while the number of times that both are labeled as feedback images and at least one of them is positive is referred to as the co-feedback frequency. Consequently, the correlation strength between two images is defined as the ratio between their co-positive-feedback frequency and their co-feedback frequency. According to this definition, the correlation value lies in the interval [0, 1]. The larger the correlation is, the more likely these two images are semantically similar to each other. If two images are never marked as positive examples together in a single query session, their correlation value is zero. It is important to note that the proposed model does not assume any correlation between two negative examples, because they may be irrelevant to the user's query in many different ways.

The semantic correlation between image $i$ and image $j$ is formally defined as follows. In each query session, the two frequencies are updated as

$$(n_{ij},\, m_{ij}) \leftarrow \begin{cases} (n_{ij}+1,\; m_{ij}+1), & \text{if } i \text{ and } j \text{ are both positive examples}\\ (n_{ij},\; m_{ij}+1), & \text{if both are feedback images and exactly one is positive}\\ (n_{ij},\; m_{ij}), & \text{otherwise} \end{cases} \tag{1}$$

and the semantic correlation is

$$R_{ij} = \begin{cases} n_{ij}/m_{ij}, & \text{if } m_{ij} \neq 0\\ 0, & \text{otherwise} \end{cases} \tag{2}$$

where $n_{ij}$ is the co-positive-feedback frequency and $m_{ij}$ denotes the co-feedback frequency.

The semantic correlation is created and updated by accumulating user feedback information, as described below.

1) Initialize all $n_{ij} = 0$ and $m_{ij} = 0$.

2) After the $k$th query session offered by a user, collect all feedback images (the query image is treated as a positive example).

3) For each feedback image pair $(i, j)$ in this query session, update $n_{ij}$ and $m_{ij}$ as follows: if both $i$ and $j$ are positive examples, $n_{ij} \leftarrow n_{ij} + 1$ and $m_{ij} \leftarrow m_{ij} + 1$; if exactly one of them is a positive example, $m_{ij} \leftarrow m_{ij} + 1$; otherwise, leave them unchanged.

4) Recalculate the semantic correlations for all feedback image pairs according to (2).

5) Repeat steps 2)–4) once a new query session is completed.
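The accumulation procedure above can be sketched in code. The class name and the symbols `n` (co-positive-feedback frequency) and `m` (co-feedback frequency) are ours; unordered pair keys exploit the symmetry of the correlation, and pairs never fed back on are simply never stored.

```python
from collections import defaultdict

# A minimal sketch of the feedback knowledge memory model: accumulate
# per-pair frequencies over query sessions and derive correlations.

class MemoryModel:
    def __init__(self):
        self.n = defaultdict(int)   # co-positive-feedback frequency
        self.m = defaultdict(int)   # co-feedback frequency

    @staticmethod
    def _key(i, j):
        return (i, j) if i < j else (j, i)   # unordered pair

    def record_session(self, positives, negatives):
        """One query session's feedback; the query image itself should be
        included among the positives, as step 2) prescribes."""
        feedback = [(img, True) for img in positives] + \
                   [(img, False) for img in negatives]
        for a in range(len(feedback)):
            for b in range(a + 1, len(feedback)):
                (i, pi), (j, pj) = feedback[a], feedback[b]
                if pi and pj:                       # both positive
                    self.n[self._key(i, j)] += 1
                    self.m[self._key(i, j)] += 1
                elif pi or pj:                      # exactly one positive
                    self.m[self._key(i, j)] += 1
                # two negatives: no correlation is assumed

    def correlation(self, i, j):
        if i == j:
            return 1.0
        m = self.m[self._key(i, j)]
        return self.n[self._key(i, j)] / m if m else 0.0

model = MemoryModel()
model.record_session(positives=["beach1", "beach2"], negatives=["car1"])
model.record_session(positives=["beach1"], negatives=["beach2"])
```

After these two sessions, "beach1" and "beach2" were co-positive once out of two co-feedback occurrences, so their correlation is 0.5, while a positive-negative pair stays at zero.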


Fig. 1. Simple graphical representation of the semantic image link network.

According to the semantic correlations, a semantic image link network can easily be constructed, in which images have links to other images in the database. A simple graphical representation is shown in Fig. 1. The link intensity on each individual link stands for the degree of semantic relevance between two images. Hence, the link intensity is assigned its corresponding semantic correlation. In the network, we say there is a "direct link" between two images $i$ and $j$ if $R_{ij} > 0$; otherwise, we say there is no "direct link" between them.

We summarize the characteristics of the feedback knowledge memory model in the following three points.

1) It is able to automatically collect and analyze the users' historical judgments offline without additional cost of user interaction. Also, it hardly influences the speed of the real-time retrieval system.

2) Since the user log accumulates feedback knowledge from various users, the semantic correlations can reflect the preferences of the majority of users. In addition, by using a large amount of user-log data, the model calculates the semantic correlations from a statistical point of view. Therefore, a small number of erroneous feedbacks does not greatly affect the final results.

3) Due to the symmetry of the semantic correlation, a triangular matrix is sufficient to keep all information. To further reduce the memory size, all items with zero value are excluded. Thus, the representation of the model is simple but highly efficient.

IV. SEMANTIC CORRELATION ANALYSIS

BY A LEARNING STRATEGY

Most existing long-term learning algorithms also utilize memory models similar to ours to gather knowledge. They then directly apply the memorized knowledge to improve image retrieval performance. Unsurprisingly, they obtain good results, since the recorded information contains user-perceived semantics. Nevertheless, a problem arises in practice. As can easily be seen from our memory model, only the "direct links" embody the users' preferences; many images having no "direct link" with each other convey no information about one another. Should we then conclude that two images without a "direct link" are not similar at all? Actually, many cases of two similar images lacking a "direct link" are due to the sparsity of the user log; that is, the model does not have enough feedback data to find the semantic relevance between

them. Hence, given the reality of limited user logs, a good model should not only memorize retrieved relevant images, but also learn to discover relevant images that have not been memorized. In this section, we discuss a learning strategy to estimate the hidden semantic correlation between two images without a "direct link." Strictly speaking, the so-called learning in most long-term learning methods corresponds to the memory process of our framework; this is why our framework is named memory learning. The main objective of our work is to make the limited user log play its fullest role.

The learning strategy proceeds in four steps. First, images in the database are grouped into semantically relevant clusters using the gathered semantic correlations. Next, we assume that each cluster is associated with a semantic topic; within each semantic topic, the authoritative rank of each image is calculated. Third, the hidden semantic correlation between two images is estimated from the authoritative ranks. Finally, the hidden semantic correlation between an image and the feedback examples is approximated by a probabilistic scheme.

A. Image Semantic Clustering

This subsection introduces a clustering approach to grouping images into a few semantically correlated clusters using the memorized semantic correlations among images. Because of its simplicity, the k-means algorithm is adopted. To use the k-means algorithm, two key issues have to be addressed: how to determine the initial cluster centers and how to measure the similarity between an unclassified image and a cluster. The following is our solution.

Assume there are N images in the database. For each image x_i, a measure of cluster centrality is defined as

    m(x_i) = Σ_{j=1, j≠i}^{N} SC(x_i, x_j)                                   (3)

which is the sum of link intensities with all images in the link network. Intuitively, images with strong links to many others are likely representative of a specific topic. Thus, images are ranked in descending order of m, and the top k images that have no "direct link" with each other are selected as the initial cluster centers.

Assume C_j (j = 1, ..., k) is an image cluster containing images {y_1, ..., y_{|C_j|}}. The similarity between an unclassified image x and cluster C_j is defined as

    Sim(x, C_j) = Σ_{y ∈ C_j} SC(x, y)                                       (4)

which is the sum of link intensities between image x and all images in cluster C_j.

After the above two issues have been addressed, the k-means algorithm is performed to group the images. Each unclassified image is assigned to the cluster with the maximal similarity to it. This process is repeated until convergence. Notice that an image is assigned to an "unknown" cluster if it has no "direct links" to any other image in the database. However, once an image of the "unknown" cluster gets "direct links"


from users’ interactions, it takes part in the above semanticclustering and is assigned to a semantic class.

B. Image Authoritative Rank

In general, we might assume that an image cluster corresponds to a specific semantic topic. In a cluster, a member image may share this topic to a greater or lesser degree. To reflect how likely an image within the cluster contains its corresponding concept, an authoritative rank is estimated by

    AR(x) = Σ_{y ∈ C_i, y ≠ x} SC(x, y) / Σ_{y ∈ X, y ≠ x} SC(x, y)         (5)

where C_i (i = 1, ..., k) represents the i-th image cluster, x ∈ C_i represents an image of the i-th cluster, and X denotes the whole image set.

As can be seen from the definition of the authoritative rank, an image with strong links to images of its own cluster but weak links to images of other clusters always has a high authoritative rank. Intuitively, the larger the authoritative rank of an image, the more likely this image represents the corresponding concept of its cluster.
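Under the ratio reading of (5) given above, the authoritative rank can be sketched as follows; the function signature and the toy data in the usage are assumptions.

```python
def authoritative_rank(sc, clusters, home, x):
    """Authoritative rank of image x in its cluster `home` (cf. (5)):
    the share of x's total link mass that stays inside its own cluster."""
    intra = sum(sc(x, y) for y in clusters[home] if y != x)
    total = sum(sc(x, y) for members in clusters.values()
                for y in members if y != x)
    return intra / total if total else 0.0
```

An image linked only inside its cluster gets rank 1; cross-cluster links lower the rank, matching the intuition stated above.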

C. Hidden Semantic Correlation Between Two Images

In contrast to the explicit relevance between images with a "direct link," the hidden semantic correlation is the potential semantic relevance between images without a "direct link."

We first introduce the definition of semantic similarity between two clusters, which will be used to estimate the hidden semantic correlation between two images. Intuitively, for two different clusters, stronger semantic links between them indicate that they are more semantically similar. Hence, the sum of semantic correlations between members of two clusters could be used as the similarity measure. However, a regular similarity should be within the range [0, 1]. Accordingly, we adopt a linear scaling algorithm to normalize it. After the k-means algorithm has converged, for each image x in the cluster C_i, the following inequality holds:

    Sim(x, C_i) ≥ Sim(x, C_j),  for all j ≠ i                                (6)

which illustrates that the image is more similar to its own class than to other classes. Thus, the semantic similarity between two clusters C_i and C_j is formally defined as (7), shown at the bottom of the page. As mentioned before, each image cluster could be considered to correspond to a specific concept. Within each image cluster, the authoritative rank of an image describes

how likely it contains the specific concept. Thus, if the authoritative rank of an image is higher, it is reasonable to assume that this image is more semantically similar to other images in the same cluster. Consequently, the hidden semantic correlation between two images could be simply estimated by

    HSC(x, y) = AR(x) · AR(y) · CS(C_i, C_j),  x ∈ C_i, y ∈ C_j              (8)
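Eqs. (7) and (8) can be sketched together. Normalizing by the largest inter-cluster link mass is our reading of the "linear scaling" mentioned above; all names are illustrative.

```python
def cluster_similarity(sc, clusters):
    """Cluster similarity of (7): inter-cluster link mass, linearly scaled
    so the strongest pair maps to 1; a cluster's similarity to itself is 1."""
    ids = list(clusters)
    raw = {(i, j): sum(sc(x, y) for x in clusters[i] for y in clusters[j])
           for i in ids for j in ids if i != j}
    peak = max(raw.values(), default=0.0)

    def cs(i, j):
        if i == j:
            return 1.0
        return raw[(i, j)] / peak if peak else 0.0
    return cs

def _ar(sc, clusters, home, x):
    # Authoritative rank: intra-cluster link mass over total link mass (cf. (5)).
    intra = sum(sc(x, y) for y in clusters[home] if y != x)
    total = sum(sc(x, y) for m in clusters.values() for y in m if y != x)
    return intra / total if total else 0.0

def hidden_correlation(sc, clusters, home_of, x, y):
    """Hidden semantic correlation of (8): HSC(x, y) = AR(x) AR(y) CS(C_x, C_y)."""
    cs = cluster_similarity(sc, clusters)
    return (_ar(sc, clusters, home_of[x], x)
            * _ar(sc, clusters, home_of[y], y)
            * cs(home_of[x], home_of[y]))
```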

D. Hidden Semantic Correlation Between an Image and the Feedback Examples

During a query session, after the user provides a set of feedback examples, a question emerges: if a database image has no "direct link" to any of the feedback examples, how do we determine the correlation degree between this image and the feedback example set? To address this problem, in this subsection, a probabilistic approach is suggested to approximate the hidden semantic correlation between an image and the feedback examples. Only positive examples are used in the probabilistic approach.

Assume that Q = {q, p_1, ..., p_n} refers to the positive feedback example set containing the query image q and n positive examples; x stands for an image without "direct links" to any member of Q; S denotes the set of semantic classes that the query image and positive examples belong to; and c represents one semantic class in S. We define the hidden correlation between the image and the feedback examples as the conditional probability P(x | Q). The probability can be determined as follows:

    P(x | Q) = Σ_{c ∈ S} P(x | c, Q) P(c | Q)                                (9)

When we consider the feedback example set as a condition to qualify the conditional probability of semantic classes and the image, the image is independent of Q given its class. Hence, it is reasonable to assume that P(x | c, Q) = P(x | c); then

    P(x | Q) = Σ_{c ∈ S} P(x | c) P(c | Q)                                   (10)

P(c | Q) is the conditional probability that the feedback examples belong to the class c when Q is provided by the user.

P(x | c) is the conditional probability of the occurrence of x if the class c is selected. Supposing x belongs to cluster C_x, P(x | c)

    CS(C_i, C_j) = 1,                                                                               if i = j
    CS(C_i, C_j) = Σ_{x ∈ C_i} Σ_{y ∈ C_j} SC(x, y) / max_{p ≠ q} Σ_{x ∈ C_p} Σ_{y ∈ C_q} SC(x, y), if i ≠ j    (7)


may be estimated by integrating the feedback example set Q, the authoritative rank AR(x), and the semantic similarity between C_x and c. Thus

    P(c | Q) = n_c / n_Q                                                     (11)

    P(x | c) = AR(x) · CS(C_x, c)                                            (12)

where

• n_c is the number of examples that are in Q as well as belong to the class c;

• n_Q is the number of examples in Q.

By combining (9)–(12), we obtain the hidden semantic correlation between image x and the feedback examples:

    HSC(x, Q) = Σ_{c ∈ S} (n_c / n_Q) · AR(x) · CS(C_x, c)                   (13)

The hidden semantic correlation between an image and the feedback examples can be dynamically updated with each round of the user's feedback. The objective of this probabilistic scheme is to make images with higher similarity to most positive examples more likely to be similar to the query image. The similarity is estimated by a combination of the image semantic clusters and the image authoritative ranks.
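A compact sketch of the estimate in (13), assuming the per-image classes, authoritative ranks, and cluster similarities have been precomputed; the argument names are illustrative.

```python
def hidden_corr_to_feedback(class_of, ar, cs, positives, x):
    """HSC(x, Q) per (13): sum over classes c of (n_c / n_Q) * AR(x) * CS(C_x, c).

    `class_of` maps images to their semantic class, `ar` maps images to
    authoritative ranks, and `cs(i, j)` is the cluster similarity of (7).
    """
    n = len(positives)
    counts = {}
    for p in positives:  # n_c: positives per semantic class
        counts[class_of[p]] = counts.get(class_of[p], 0) + 1
    return sum((nc / n) * ar[x] * cs(class_of[x], c)
               for c, nc in counts.items())
```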

V. IMAGE RETRIEVAL FRAMEWORK

BY THE MEMORY LEARNING

We have incorporated the memory learning framework into the iFind image retrieval system [34] developed at Microsoft Research Asia. It supports query by example, query by keywords, and RF. Fig. 2 displays the basic architecture of the memory learning framework. In the following, we discuss the key techniques of this framework one by one.

A. Image Similarity Measure

A problem of typical long-term learning models is the sparsity of the user log. When not enough memorized knowledge is available, their performance is poor. On the contrary, content-based schemes are entirely insensitive to the sparsity of the user log. Hence, considering that memory learning is limited by an insufficient amount of user log, a combination of a content-based method and memory learning can lead to better retrieval performance. Moreover, the combined system may recommend fresh images that have not yet received any previous user's query or feedback.

Therefore, the similarity between the query image q and an image x in the database is defined as the weighted sum of the low-level feature-based similarity and the learned semantic similarity:

    S(q, x) = w_l · S_l(q, x) + w_s · S_s(q, x)                              (14)

Fig. 2. Basic architecture of the proposed memory learning framework.

where S_l(q, x) is the low-level feature-based similarity and S_s(q, x) stands for the semantic similarity. S_l(q, x) is calculated from the distance between the feature vectors of q and x. S_s(q, x) is defined as

    S_s(q, x) = SC(q, x),   if there is a direct link between q and x
    S_s(q, x) = HSC(q, x),  otherwise                                        (15)

If the query is a new image outside the database or in the "unknown" class, S_s(q, x) = 0, and only the low-level features are used to produce the initial retrieval results.

In (14), the weights w_l and w_s could be either predefined or dynamically adjusted using Rui's weight refining method [9], which treats S_l and S_s as two different features.
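The combination in (14)-(15) reduces to a few lines; here `w_low` stands for the low-level weight, with the semantic weight taken as its complement (a simplification for the sketch).

```python
def semantic_similarity(sc, hsc, q, x):
    """Semantic similarity of (15): memorized correlation on a direct link,
    hidden correlation otherwise (SC > 0 signals a direct link)."""
    direct = sc(q, x)
    return direct if direct > 0 else hsc(q, x)

def combined_similarity(w_low, s_low, s_sem, q, x):
    """Combined similarity of (14): weighted sum of the two components."""
    return w_low * s_low(q, x) + (1.0 - w_low) * s_sem(q, x)
```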

B. Relevance Feedback Integrating SVM Learning With Memory Learning

When the user submits a query q to the system, the similarity between each image and q is calculated using (14), and the images with the highest similarities are returned as the result. If the user offers any feedback information, the similarity measure is refined by integrating low-level feature-based normal RF with semantics-based memory learning.

Due to its excellent capability in dealing with classification problems with small sample sizes, this paper adopts SVM learning as the low-level feature-based RF tool. SVMs are a family of machine learning techniques originally invented by Vapnik [35]. Let us consider SVMs in a binary classification setting. Given a set of linearly separable training data {x_1, ..., x_m} with labels y_i ∈ {+1, -1}, an SVM is trained to form the hyperplane that separates the training data by a maximal margin. Data points lying on one side of the hyperplane are labeled +1, and points lying on the other side are labeled -1. When a new data point is input for classification, a label (+1 or -1) is assigned according to its relationship to the decision boundary, that is

    f(x) = sign(w · x + b)                                                   (16)

When the data are not linearly separable, SVMs first project the original data into a higher dimensional space via a Mercer kernel


function K(·, ·), and then linearly separate the data in this space. The corresponding nonlinear decision boundary is

    f(x) = sign( Σ_i α_i y_i K(x_i, x) + b )                                 (17)

Commonly used kernel functions include Gaussian radial basis function (RBF) kernels, polynomial kernels, and Laplacian RBF kernels.

SVM learning can easily be applied to image retrieval. During the RF process, positive and negative examples are used as training data to construct a binary SVM classifier. Images are ranked by their distances to the separating hyperplane, and the top-ranked images are returned as the retrieval results.
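The ranking step can be sketched with a hand-rolled kernel decision function in the form of (17); the support vectors, multipliers, and bias are assumed to come from an already trained SVM.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    # Gaussian RBF kernel K(a, b) = exp(-gamma * ||a - b||^2).
    return math.exp(-gamma * sum((u - v) ** 2 for u, v in zip(a, b)))

def svm_decision(alphas, labels, svs, b, x, kernel=rbf_kernel):
    """Decision value of (17): sum_i alpha_i y_i K(x_i, x) + b.

    The sign gives the predicted label; the magnitude is the distance
    used for ranking database images.
    """
    return sum(a * y * kernel(sv, x)
               for a, y, sv in zip(alphas, labels, svs)) + b

def rank_by_margin(alphas, labels, svs, b, images):
    """RF step: order database images by signed distance to the boundary."""
    return sorted(images,
                  key=lambda x: svm_decision(alphas, labels, svs, b, x),
                  reverse=True)
```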

In our framework, once any RF examples are provided, both the low-level feature-based similarity and the semantics-based similarity are refined. For the low-level feature part, SVM learning is used as the RF scheme. After each round of feedback, a binary SVM classifier is trained using the positive and negative examples. Thereafter, the low-level feature-based similarity between an image in the database and the query image is estimated by the distance from the image to the SVM decision boundary.

For the semantic part, memory learning is employed. If the image x has no "direct links" with any member of the positive feedback example set Q = {q, p_1, ..., p_n}, the hidden semantic correlation HSC(x, Q) is used to refine S_s(q, x). On the contrary, if the image has "direct links" with positive examples, a multi-query is carried out, which means S_s(q, x) is updated as follows:

    S_s(q, x) = max_{p ∈ Q} SC(p, x)                                         (18)

By combining both cases above, the semantic similarity is refined by (19), shown at the bottom of the page. As discussed in Section I, classical RF approaches suffer from the bottleneck of insufficient positive examples. Memory learning may make use of memorized knowledge to lighten this burden. In the general RF process, only very limited positive examples are offered by the user, which results in the poor performance of many classical learning-based retrieval schemes. However, in our framework, database images with a large semantic correlation to any one of the positive examples are also regarded as positive examples. That is, images that satisfy

    max_{p ∈ Q} SC(x, p) > T                                                 (20)

are automatically added as positive examples. In (20), Q = {q, p_1, ..., p_n} is the positive feedback example set, x indicates an image of the database, and T is a threshold. Hence, by this "help" from memorized knowledge, the system can collect many more positive examples without additional user effort. Moreover, this "help" can improve the retrieval performance of traditional RF techniques.

In the proposed framework, the normal RF and the memory learning are not merely a linear combination; in fact, they reinforce each other. On the one hand, memory learning can provide additional positive examples to the SVM learning according to its memorized knowledge. This helps traditional RF techniques overcome the scarcity and imbalance of feedback examples. On the other hand, the improved SVM learning gives memory learning more chances to discover new images and memorize them.

C. Image Annotation and Annotation Propagation

Many modern image retrieval systems support image annotation and annotation propagation, yet most propagate keywords relying on visual similarity between images, even though two semantically different images may lie close to each other in the visual space. In this subsection, we apply the memory learning framework to accomplish image annotation and probabilistic propagation of annotations on a semantic level.

Basically, there are two major issues in annotating images and propagating keywords: which subset of images should be initially labeled, and with which probability keywords should be propagated from an annotated image to an unannotated one. We solve them as follows.

In the work of [32] and [33], active learning is adopted to select the initial labeling samples. The samples are chosen based on how much information the annotation of each sample can provide to decrease the uncertainty of the system: the sample that, once annotated, gives the maximum information or knowledge gain is selected [32], [33]. In the memory learning model, the authoritative rank and the direct-link number of an image are the two factors that determine the initial labeling samples. The image authoritative ranks reflect how likely images contain their associated semantic topic. The direct-link number of an image measures how many images in its semantic class it has a direct link with. Intuitively, if an image with a higher authoritative rank and a larger direct-link number is annotated, its annotations can be propagated to more unannotated images with high confidence. For simplicity, the initial annotation measure of an image is formulated by

    IA(x) = AR(x) · DL(x)                                                    (21)

where DL(x) is the direct-link number of image x. For each semantic class, we pick the image with the highest initial annotation measure for labeling. That is, the images with the highest initial annotation measure in their corresponding semantic classes constitute the initial annotation set.
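Under the product reading of (21), selecting the initial annotation set is one pass per class; the dictionaries of ranks and direct-link counts are assumed precomputed, and all names are illustrative.

```python
def initial_annotation_set(ar, direct_links, clusters):
    """Per (21): score each image by AR(x) * DL(x) and pick the top image
    of each semantic class as its initial annotation sample."""
    return {cid: max(members, key=lambda x: ar[x] * direct_links[x])
            for cid, members in clusters.items()}
```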

    S_s(q, x) = HSC(x, Q),             if there are no direct links between x and Q
    S_s(q, x) = max_{p ∈ Q} SC(p, x),  otherwise                             (19)


Let us next discuss the annotation propagation probability. In the memory learning model, for any two images in the database, their semantic relevance can be estimated by the semantic correlation SC or the hidden semantic correlation HSC between them. Accordingly, once an image is annotated, the probability of this annotation propagating to another unannotated image is set to the semantic relevance between them. In this way, the annotated keywords are propagated to the whole database. Every annotated image is assigned a label vector. Each element in the vector is a keyword, and the value for that keyword indicates the probability of this image having it. The value of a keyword in the label vector can reasonably be set to the probability of propagating that keyword. A typical label vector may be {(flower, 0.8), (bird, 0.5), (mountain, 0.4), ...}.

Image annotation can be improved and updated by RF. The work of [31] may be used to accomplish this task. Its principal idea is briefly described as follows. After a user submits a query consisting of one or more keywords, the system automatically searches the database for images relevant to the keywords. Among the retrieved images, the user may tell the system which images are relevant or irrelevant using RF. Then, positive instances append the query keywords or strengthen their weights; on the contrary, negative instances remove or weaken the keywords. Please refer to [31] for details.

Adding new images into the database is a very common operation for a retrieval system. An unconfirmed annotation algorithm is proposed in [31] to automatically annotate new images. It adopts each new image as a query and performs a low-level feature-based image retrieval process. For the top

N similar images to the query, a list of keywords sorted by their frequency in these images is stored as an unconfirmed keyword list. The new image is thus labeled with the unconfirmed keywords. The unconfirmed annotation may be refined through future query sessions. In this paper, we also use this algorithm to annotate new images.

The memory learning-based image annotation and annotation propagation thus proceed as follows.

1) Select the images with the highest initial annotation measure of each semantic topic as the initial annotation set.

2) For each sample of the initial annotation set, manually label keywords and propagate them to other unannotated images according to the semantic relevance.

3) During a query session, after the user marks feedback instances, reweight the image annotations for the positive and negative instances using [31].

VI. EXPERIMENTAL RESULTS

We tested the memory learning framework with a general-purpose image database that consists of 10 000 images of 79 categories from the Corel Image Gallery. Corel images have been widely used by the image processing and CBIR research communities. They cover a variety of topics, such as "flower," "tiger," "eagle," "gun," "horse," etc. In all experiments, we use the color correlogram [36] with 144 dimensions as the low-level

feature. Like [6], [7], and [23], the retrieval accuracy is defined as

    Accuracy = (relevant images retrieved in the top N returns) / N          (22)

A retrieved image is considered relevant if it belongs to the same category as the query. In all experiments, we determine the weights of (14) by Rui's reweighting algorithm [9]. The value of the image class number k is set to the number of categories in the image database used in the experiment.
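Measure (22) is straightforward to compute; `category_of` is an assumed mapping from image to its Corel category.

```python
def retrieval_accuracy(returned, query_category, category_of):
    """Accuracy of (22): fraction of the top-N returned images that share
    the query image's category."""
    if not returned:
        return 0.0
    hits = sum(1 for img in returned if category_of[img] == query_category)
    return hits / len(returned)
```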

Six sets of experiments were conducted to evaluate the proposed framework. In Section VI-A, we test its retrieval performance. In Section VI-B, we evaluate the performance of the hidden semantics learning strategy. Section VI-C examines the system's robustness to user errors. Section VI-D shows how traditional RF algorithms improve performance with the help of the memory learning model. Afterwards, experiments on image semantic clustering are presented in Section VI-E. Finally, the evaluation of memory learning-based annotation propagation is reported in Section VI-F.

A. Retrieval Performance Test

To build the feedback knowledge memory model, we asked eight real-world users to retrieve images using our system. Seven of the users were each required to perform 200 query sessions, and the last user provided 100 query sessions. Each query session consisted of four iterations of feedback. At each iteration, the users marked positive and negative examples according to their preferences. In total, we collected 1 500 query sessions in the user log.

One hundred images were randomly chosen from the image database as the query set. Based on the query set, the average retrieval accuracy of the top 100 images [N = 100 in (22)] is used as the performance evaluation measure (P100). In the experiment, the RF process is conducted automatically. In the first round of feedback, the top 30 images are checked and labeled as either positive or negative examples. A retrieved image is considered relevant if it belongs to the same category as the query. In the following rounds of feedback, the labeled positive images are placed at the beginning, the negative ones at the end, and the top 30 images in the rest of the list are checked again.

To test the effectiveness of the proposed framework, we compared its retrieval performance with two classic nonmemory approaches: the SVM learning-based RF described in Section V-B and MARS [9]. To show the effect of the amount of memorized knowledge, the 1 500 queries were used in three stages: ML500 using 500 queries, ML1000 using 1 000 queries, and ML1500 using 1 500 queries. Fig. 3 presents the experimental results. Clearly, the proposed framework improves retrieval accuracy substantially. Moreover, the more feedback information memorized from the users' interactions, the better the performance of the retrieval system.

B. Performance Evaluation of the Hidden Semantics Learning Strategy

To illustrate the function of the hidden semantics learning strategy, we compared the results with and without the hidden semantics learning in all three stages. Considering queries


Fig. 3. Performance comparison between the memory learning framework and nonmemory RF.

Fig. 4. Performance comparison with and without the hidden semantics learning.

without the hidden semantics learning, S_s(q, x) in (14) is only determined by the memorized semantic correlation SC. The results are displayed in Fig. 4. As can be seen from the experimental results, the hidden semantics learning clearly improves the retrieval performance. It is not surprising that the hidden semantics learning plays a less important role as the stored feedback knowledge increases. We can imagine that the hidden semantics learning would become unnecessary once the system has collected enough user-log data that there is a "direct link" between any two images. From another point of view, this experiment verifies that the hidden semantics learning strategy can indeed alleviate the problem of feedback knowledge sparsity.

C. Robustness to User Errors

In real-world RF processes, user errors may occur. For instance, a user carelessly labels images of "flower" as relevant examples while he/she is actually interested in images of "bird." Hence, a good CBIR system should be able to tolerate moderate levels of noise from user errors. The proposed system can handle mild noise for two reasons. First, our semantic correlations are calculated over a great amount of user feedback knowledge. From a statistical perspective, the overwhelming majority of correct feedback information can filter out a small number of user errors. Second, in our empirical study, we employ feedback information offered by real-world users, which inevitably

Fig. 5. Performance comparison with and without the simulated user errors.

contains noisy data. Therefore, the promising experimental results presented in Section VI-A partially confirm that the proposed system is robust in a noisy environment.

In this subsection, an experiment was conducted to further examine the robustness of the proposed work to user errors. We generated 150 erroneous query sessions to simulate user errors (here, user errors mean that images labeled as positive examples are not in the same category as the query image) and merged them into the 1 500 real-world query sessions. In this way, the simulated user error rate is around 10%. Under this simulated noisy environment, we tested the retrieval performance of the proposed system again. As can be seen from Fig. 5, the proposed system suffers little performance degradation in this setting. This experiment demonstrates that the memory learning framework is robust to mild user errors.

D. Image Retrieval by Traditional RF Techniques With the Help of Memory Learning

In the proposed framework, the traditional RF and memory learning are not a purely linear combination. Memory learning can automatically provide additional positive examples to the traditional RF. More specifically, given a query session, after the user offers feedback examples, database images with a large semantic correlation [see (20)] to any one of the positive examples are also automatically regarded as positive examples. We argue this "help" can alleviate the scarcity and imbalance of feedback examples. An experiment was designed to examine this "help." The experiment used two traditional RF algorithms: SVM learning and MARS [9]. We compared the performance with and without the memory learning's "help." In this test, the feedback knowledge memory model was built using 1 000 query sessions. Figs. 6–8 report the experimental results. Clearly, both RF methods achieved better retrieval performance with the help of memory learning.

E. Experiments About Image Semantic Clustering

To demonstrate the effectiveness of our image semantic clustering algorithm, we compared it with the SVM-OPC (one per class) classification scheme used in the CBSA system [33]. The so-called SVM-OPC implements image classification in terms of low-level features. For k classes, it trains k SVM classifiers, each distinguishing one class from the other k − 1 classes. For each point x, the k classifiers output f_1(x), ..., f_k(x), and the class of the point is assigned to arg max_i f_i(x).


Fig. 6. Retrieval accuracy (P20) comparison with and without the help of memory learning.

Fig. 7. Retrieval accuracy (P50) comparison with and without the help of memory learning.

Fig. 8. Retrieval accuracy (P100) comparison with and without the help of memory learning.

To fulfill our experimental goals, we used two datasets: one contains 2 500 images from 25 Corel categories; the other contains 10 000 images from 79 Corel categories. For the former database, we invited three subjects to collect 150 query sessions. We further divided the user log into two phases: 100 queries and 150 queries. For the latter database, we adopted the same user log used in the foregoing experiments. The SVM-OPC classifiers were trained, respectively, with 10%, 20%, and 30% of the images randomly picked from the database. The number of SVM-OPC classifiers equals the number of image categories used in the experiment. An image's category in the database is its ground truth, and a classified image is regarded as correct if it is assigned to its ground-truth category. Fig. 9 gives the classification precision comparison using the 2.5-K

Fig. 9. Classification precision comparison on the 2 500-image database.

Fig. 10. Classification precision comparison on the 10 000-image database.

dataset, and Fig. 10 shows the comparison results using the 10-K dataset. Clearly, our algorithm achieves better classification performance on both databases. In contrast, SVM-OPC handles the small database well but does not scale to the large database.

F. Evaluation of Memory Learning-Based Annotation Propagation

In this subsection, we evaluate the annotation propagation performance of our framework by comparing it with two recently published systems: CBSA [33] and the Bayesian model [15]. The CBSA system first defines a set of labels, each corresponding to the semantics of an image category. Then, each unannotated image is classified against those defined categories using SVM-OPC. This produces a ranked list of categories, with each category assigned a confidence probability. The labels, together with their probabilities, become the annotation of the image. The Bayesian model performs annotation propagation during the RF process. After the user has provided feedback in a query session, it assumes that all positive examples belong to one semantic class and that the features from the same semantic class follow a Gaussian distribution. The parameters of a semantic Gaussian class can be estimated from the feature vectors of all positive examples. Then, the posterior probability of each image in the database belonging to that semantic class is estimated by the Bayesian formulation. The common keywords of the positive examples are propagated to the image with the estimated posterior probability.

In the experiment, 10 000 Corel images from 79 semanticcategories were adopted. For the proposed strategy, the feed-


Fig. 11. Annotation propagation examples for the keyword "flower." (a) Thirteen matches; top 15 images using the memory learning framework after 1 000 queries. (b) Twelve matches; top 15 images using CBSA with 20% training data. (c) Six matches; top 15 images using the Bayesian model after five feedback iterations.

Fig. 12. Valid propagation precision comparison of the memory learning framework, CBSA, and the Bayesian model.

back knowledge memory model was constructed from 1 000 and 1 500 logged query sessions, respectively. The 79 images with the highest initial annotation measure in their corresponding semantic classes composed the initial annotation image set. We assume that each image of the initial annotation set is labeled by only one keyword, which is exactly its category name. Thereafter, the keywords were propagated from annotated images to unannotated images based on the semantic correlation or the hidden semantic correlation between them. For CBSA, 20% of the images in each category were randomly selected as training data and annotated with their category names. Then, 79 SVM-OPC classifiers were trained using the training data, each classifier associated with a semantic category. The trained classifiers produce a set of probabilities for each unannotated image, depicting the likelihood of each category label describing the image. For the Bayesian model, 100 randomly selected images constituted the query set and were annotated with their category names. Only one keyword was associated with each query image, and the other images in the database had no keyword annotation. The system used each query image for image retrieval. The RF was performed automatically: in each round of feedback, the top 30 images were checked and labeled as positive examples or not, a retrieved image being considered relevant if it belongs to the same category as the query. After five iterations of feedback, a Bayesian model was formed from the positive examples, and the keyword of the query image was propagated to other unannotated images according to the posterior probability.

Due to the limitation of space, we present only one keyword propagation example. Fig. 11 shows the top 15 images that are propagated to the keyword “flower” by the memory learning framework, CBSA, and the Bayesian model, respectively. In Fig. 11, (a) displays the results using the memory learning framework, (b) shows the results using CBSA, and (c) shows the results using the Bayesian model. A match happens when an image in the top 15 belongs to the category represented by the propagated keyword. The visual results demonstrate that our framework outperforms the other two systems.

To objectively and comprehensively evaluate the annotation propagation function of the proposed framework, we define a quantitative performance metric: valid propagation precision. This measure shows how often the valid propagations are correct. A valid propagation is one whose propagation probability exceeds a confidence threshold; a propagation probability below the threshold implies that the propagation happens at a low confidence level. Moreover, in the practical query process, images with a low value for a keyword are hardly retrieved by the system when the user submits that keyword as a query. Therefore, only valid propagations are used to calculate the final accuracy. In the experiment, a fixed confidence threshold was used, and the ground-truth keyword of an image is its corresponding category name. Fig. 12 shows the comparison results. The accuracy of the Bayesian model is the average accuracy over 100 queries. As can be seen from Fig. 12, the proposed framework is able to provide more trustworthy annotation propagations.
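The metric can be sketched as follows. The record format and the threshold values are illustrative assumptions (the paper's own threshold value is not recoverable from this copy); only the definition from the text is implemented: filter out low-confidence propagations, then take the fraction of the remaining ones whose keyword matches the category name.

```python
def valid_propagation_precision(propagations, threshold):
    """propagations: iterable of (probability, propagated_keyword, true_keyword)
    records, where true_keyword is the image's category name.

    A propagation is valid only if its probability exceeds the confidence
    threshold; precision is the fraction of valid propagations whose
    propagated keyword matches the ground truth."""
    valid = [(pred, truth) for p, pred, truth in propagations if p > threshold]
    if not valid:
        return 0.0  # no valid propagations at this threshold
    correct = sum(1 for pred, truth in valid if pred == truth)
    return correct / len(valid)
```

Raising the threshold discards more low-confidence propagations, so precision is computed over a smaller but more trustworthy set, which is exactly the trade-off Fig. 12 compares across the three systems.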

VII. CONCLUSION

In order to supply effective image retrieval to users, this paper has presented a new memory learning framework in which low-level feature-based RF and semantics-based memory learning are combined to help each other achieve better retrieval performance. There are two novel characteristics that distinguish the memory learning framework from existing RF techniques. First, it creates a feedback knowledge memory model to accumulate users' preferences. More importantly, a learning strategy is introduced to infer the hidden semantics according to the gathered semantic information. In addition, a semantics-based image annotation propagation scheme is described.

The proposed framework is easy to implement and can be efficiently incorporated into an image retrieval system. Experimental evaluations on a large-scale image database have already shown very promising results. However, a limitation of the proposed work is that it somewhat lacks sufficient theoretical justification. Our future work will investigate the possibility of developing more sophisticated and theoretically grounded learning schemes.



ACKNOWLEDGMENT

The authors would like to thank Dr. L. Zhang of MSRA for his help in the system work. They would also like to thank F. Jing and G. Xue for some valuable discussions.

REFERENCES

[1] M. Flickner, H. Sawhney, and W. Niblack, “Query by image and video content: The QBIC system,” IEEE Computer, vol. 28, no. 9, pp. 23–32, Sep. 1995.

[2] A. P. Pentland, R. W. Picard, and S. Sclaroff, “Photobook: Content-based manipulation of image databases,” Int. J. Comput. Vis., vol. 18, no. 3, pp. 233–254, 1996.

[3] W. Y. Ma and B. Manjunath, “NETRA: A toolbox for navigating large image databases,” Multimedia Syst., vol. 7, no. 3, pp. 184–198, 1999.

[4] J. Z. Wang, J. Li, and G. Wiederhold, “SIMPLIcity: Semantics-sensitive integrated matching for picture libraries,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 23, no. 9, pp. 947–963, Sep. 2001.

[5] A. Gaurav, T. V. Ashwin, and G. Sugata, “An image retrieval system with automatic query modification,” IEEE Trans. Multimedia, vol. 4, no. 2, pp. 201–213, Jun. 2002.

[6] M. Li, Z. Chen, and H. Zhang, “Statistical correlation analysis in image retrieval,” Pattern Recognit., vol. 35, pp. 2687–2693, 2002.

[7] Z. Su, H. Zhang, S. Li, and S. Ma, “Relevance feedback in content-based image retrieval: Bayesian framework, feature subspaces, and progressive learning,” IEEE Trans. Image Process., vol. 12, no. 8, pp. 924–936, Aug. 2003.

[8] J. J. Rocchio, “Relevance feedback in information retrieval,” in The SMART Retrieval System: Experiments in Automatic Document Processing, G. Salton, Ed. Upper Saddle River, NJ: Prentice-Hall, 1971, pp. 313–323.

[9] Y. Rui, T. S. Huang, and S. Mehrotra, “Relevance feedback: A powerful tool in interactive content-based image retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 5, pp. 644–655, May 1998.

[10] I. J. Cox, M. L. Miller, T. P. Minka, T. V. Papathomas, and P. N. Yianilos, “The Bayesian image retrieval system, PicHunter: Theory, implementation, and psychophysical experiments,” IEEE Trans. Image Process., vol. 9, no. 1, pp. 20–37, Jan. 2000.

[11] T. P. Minka and R. W. Picard, “Interactive learning with a ‘society of models’,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, San Francisco, CA, Jun. 1996, pp. 447–452.

[12] Y. Rui and T. S. Huang, “Optimizing learning in image retrieval,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 2000, pp. 236–243.

[13] X. S. Zhou and T. S. Huang, “Small sample learning during multimedia retrieval using BiasMap,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, Dec. 2001, pp. 8–14.

[14] S. Tong and E. Chang, “Support vector machine active learning for image retrieval,” in Proc. ACM Int. Conf. Multimedia, Ottawa, ON, Canada, Oct. 2001, pp. 107–118.

[15] H. Zhang and Z. Su, “Relevance feedback in CBIR,” presented at the Int. Workshop on Visual Databases, 2002.

[16] X. S. Zhou and T. S. Huang, “Relevance feedback in image retrieval: A comprehensive review,” Multimedia Syst., vol. 8, pp. 536–544, 2003.

[17] S. D. MacArthur, C. E. Brodley, and C. R. Shyu, “Relevance feedback decision trees in content-based image retrieval,” in Proc. IEEE Workshop Content-Based Access of Image and Video Libraries, Jun. 2000, pp. 68–72.

[18] K. Tieu and P. Viola, “Boosting image retrieval,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 2000, pp. 228–235.

[19] E. Chang, B. Li, G. Wu, and K. S. Goh, “Statistical learning for effective visual information retrieval,” in Proc. IEEE Conf. Image Processing, Barcelona, Spain, Sep. 2003, pp. III-609–III-612.

[20] C. Lee, W. Y. Ma, and H. Zhang, “Information embedding based on user’s relevance feedback for image retrieval,” presented at the SPIE Conf. Multimedia Storage and Archiving Systems IV, Boston, MA, 1999.

[21] I. Bartolini, P. Ciaccia, and F. Waas, “FeedbackBypass: A new approach to interactive similarity query processing,” in Proc. Int. Conf. Very Large Data Bases, Rome, Italy, Jun. 2001, pp. 201–210.

[22] F. Fournier and M. Card, “Long-term similarity learning in content-based image retrieval,” in Proc. IEEE Conf. Image Processing, New York, Sep. 2002, pp. 22–25.

[23] X. He, O. King, W. Y. Ma, M. Li, and H. Zhang, “Learning a semantic space from user’s relevance feedback for image retrieval,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 39–48, Jan. 2003.

[24] D. Goldberg, D. Nichols, B. Oki, and D. Terry, “Using collaborative filtering to weave an information tapestry,” Commun. ACM, vol. 35, no. 12, pp. 218–277, 1992.

[25] A. Kohrs and B. Merialdo, “Improving collaborative filtering with multimedia indexing techniques to create user-adapting web sites,” in Proc. ACM Int. Conf. Multimedia, Seattle, WA, Nov. 1999, pp. 27–36.

[26] A. Kohrs and B. Merialdo, “Clustering for collaborative filtering applications,” presented at the Int. Conf. Computational Intelligence for Modeling Control and Automation, Feb. 1999.

[27] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999.

[28] R. Lempel and A. Soffer, “PicASHOW: Pictorial authority search by hyperlinks on the web,” in Proc. 10th Int. WWW Conf., 2001, pp. 438–448.

[29] R. W. Picard and T. P. Minka, “Vision texture for annotation,” Multimedia Syst., vol. 3, pp. 3–14, 1995.

[30] E. Saber and A. M. Tekalp, “Region-based affine shape matching for automatic image annotation and query-by-example,” J. Vis. Commun. Image Rep., vol. 8, no. 1, pp. 3–20, Mar. 1997.

[31] W. Liu, S. Dumais, Y. Sun, and H. Zhang, “Semi-automatic image annotation,” in Proc. Conf. Human-Computer Interaction, Jul. 2001, pp. 326–333.

[32] C. Zhang and T. Chen, “An active learning framework for content-based information retrieval,” IEEE Trans. Multimedia, vol. 4, no. 2, pp. 260–268, Jun. 2002.

[33] E. Chang, K. Goh, G. Sychay, and G. Wu, “CBSA: Content-based soft annotation for multimodal image retrieval using Bayes point machines,” IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 1, pp. 26–38, Jan. 2003.

[34] H. Zhang, W. Liu, and C. Hu, “iFind: A system for semantics and feature based image retrieval over Internet,” in Proc. ACM Int. Conf. Multimedia, Los Angeles, CA, 2000, pp. 477–478.

[35] V. Vapnik, The Nature of Statistical Learning Theory. New York: Springer, 1995.

[36] J. Huang, S. R. Kumar, M. Mitra, W. Zhu, and R. Zabih, “Image indexing using color correlograms,” in Proc. IEEE Conf. Computer Vision Pattern Recognition, Jun. 1997, pp. 762–768.

[37] J. Li and J. Z. Wang, “Automatic linguistic indexing of pictures by a statistical modeling approach,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 25, no. 9, pp. 1075–1088, Sep. 2003.

[38] K. Barnard, P. Duygulu, D. Forsyth, N. de Freitas, D. M. Blei, and M. I. Jordan, “Matching words and pictures,” J. Mach. Learn. Res., vol. 3, pp. 1107–1135, 2003.

Junwei Han received the Ph.D. degree from Northwestern Polytechnical University, Xi’an, China, in 2003.

He is currently a Postdoctoral Fellow with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. His research interests include content-based image/video retrieval and image/video segmentation.

King N. Ngan (M’79–SM’91–F’00) received the Ph.D. degree in electrical engineering from Loughborough University of Technology, Loughborough, U.K.

He is a Chair Professor with the Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, Hong Kong. Previously, he was a Full Professor with Nanyang Technological University, Singapore, and the University of Western Australia, Crawley, Australia. He is an Associate Editor of the Journal on Visual Communications and Image Representation and an Area Editor of the EURASIP Journal of Image Communication and Journal of Applied Signal Processing. He has chaired a number of prestigious international conferences on video signal processing and communications and has served on the advisory and technical committees of numerous professional organizations. He has published extensively, including three authored books, five edited volumes, and over 200 refereed technical papers in the areas of image/video coding and communications.

Prof. Ngan is a Fellow of the IEE (U.K.) and a Fellow of IEAust (Australia). He was an Associate Editor of the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.



Mingjing Li received the B.S. degree in electrical engineering from the University of Science and Technology of China, Hefei, and the Ph.D. degree in pattern recognition from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 1989 and 1995, respectively.

He joined Microsoft Research Asia, Beijing, China, in July 1999. His research interests include handwriting recognition, statistical language modeling, search engines, and image retrieval.

Hong-Jiang Zhang (F’03) received the Ph.D. degree in electrical engineering from the Technical University of Denmark, Lyngby, and the B.S. degree in electrical engineering from Zhengzhou University, Zhengzhou, China, in 1991 and 1982, respectively.

From 1992 to 1995, he was with the Institute of Systems Science, National University of Singapore, Singapore, where he led several projects in video and image content analysis and retrieval and computer vision. He was also with the Massachusetts Institute of Technology Media Laboratory, Cambridge, as a Visiting Researcher in 1994. From 1995 to 1999, he was a Research Manager at Hewlett-Packard Laboratories, Palo Alto, CA, where he was responsible for research and technology transfers in the areas of multimedia management, intelligent image processing, and Internet media. In 1999, he joined Microsoft Research Asia, Beijing, China, where he is currently a Senior Researcher and Assistant Managing Director in charge of media computing and information processing research. He has authored three books, over 260 refereed papers, seven special issues of international journals on image and video processing, content-based media retrieval, and computer vision, as well as over 50 patents or pending applications.

Dr. Zhang currently serves on the editorial boards of five IEEE/ACM journals and a dozen committees of international conferences.