  • 1280 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 26, NO. 5, MAY 2014

    Using Personalization to Improve XML Retrieval

    Luis M. de Campos, Juan M. Fernández-Luna, Juan F. Huete, and Eduardo Vicente-López

    Abstract: As the amount of information increases every day and users normally formulate short and ambiguous queries, personalized search techniques are becoming almost a must. Using the information about the user stored in a user profile, these techniques retrieve results that are closer to the user preferences. On the other hand, information is increasingly being stored in a semi-structured way, and XML has emerged as a standard for representing and exchanging this type of data. XML search allows a higher retrieval effectiveness, due to its ability to retrieve and show the user specific parts of the documents instead of the full document. In this paper we propose several personalization techniques in the context of XML retrieval. We try to combine the different approaches where personalization may be applied: query reformulation, re-ranking of results and retrieval model modification. The experimental results obtained from a user study using a parliamentary document collection support the validity of our approach.

    Index Terms: Information retrieval, XML, personalization, query expansion, reranking, CAS queries

    1 INTRODUCTION

    Over the last few years the amount of digital information has increased exponentially. Therefore, the use of Information Retrieval Systems (IRS) has become crucial in finding relevant information within this huge amount of data. These systems have been providing very good results for the majority of users. However, in addition to the aforementioned rise of digital information, users do not always specify their information needs accurately enough (they tend to formulate short and ambiguous queries). It is then inevitable that access to the relevant information becomes more difficult each day.

    These factors have led to a growing interest in personalization techniques [19], [44], [48]. In this context, personalization can be defined as the process by which, using information about the user (generally stored in a user profile) and the issued query, the most appropriate results are provided with respect to the user's interests and preferences. In this way, personalization minimizes the information overload of users, making it possible to better satisfy their information needs. Thanks to this potential, personalization has become one of the key challenges and hot research areas in the information retrieval field [1], [4].

    Another key aspect of this amount of digital information is the increasing use of different types of documents, whose textual content is organised around a well-defined structure. XML (eXtensible Markup Language) has recently emerged as the document standard for representing and

    The authors are with the Departamento de Ciencias de la Computación e Inteligencia Artificial, E.T.S.I. Informática y de Telecomunicación, CITIC-UGR, Universidad de Granada, 18071-Granada, Spain. E-mail: {lci, jmfluna, jhg, evicente}@decsai.ugr.es.

    Manuscript received 19 Nov. 2012; revised 25 Apr. 2013; accepted 27 Apr. 2013. Date of publication 12 May 2013; date of current version 7 May 2014. Recommended for acceptance by J. Pei. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier 10.1109/TKDE.2013.75

    exchanging this type of semi-structured data. XML data is self-describing through content-oriented tags, which let computers interpret the meaning of the stored data. XML allows us to explicitly represent the internal structure of documents, which should be considered as aggregates of interrelated units instead of atomic entities. Classical IR is not able to exploit this characteristic to carry out a more focused retrieval. In fact, the main asset of XML-IR [29] is that it takes advantage of the documents' internal structure, allowing one to retrieve both specific parts of the documents (we will call these parts Structural Units, SU) and complete documents. The choice will depend on the user's needs and the distribution of relevant material across the different parts of the XML document.

    This new structural characteristic requires new designs and/or adaptations of the traditional IR techniques and evaluation metrics. They cannot simply be reused under this new approach, because of the dependency between XML document components. This dependency causes the following two main difficulties intrinsic to XML [28]: (1) near-misses, which are document components that are structurally related to relevant components, such as a neighbouring paragraph or a container section; (2) overlap, which refers to the situation when the same text fragment is referenced multiple times, for example where a paragraph and its container section are both retrieved. Due to these dependencies, the development of retrieval (and also personalization) techniques over XML documents involves some difficulties in terms of design and evaluation.

    The main goal of this article¹ is to develop and evaluate new personalization strategies designed for XML documents, which is a relatively unexplored area. We have considered approaches to be used in the three different steps where personalization may be applied (and their combinations): before search (query reformulation, in our case, query expansion and transformation on content-and-structure queries), after search (reranking

    1. This paper is an extended version of the conference paper [12].

    1041-4347 © 2013 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

  • DE CAMPOS ET AL.: USING PERSONALIZATION TO IMPROVE XML RETRIEVAL 1281

    of results) and within the retrieval process (modification of the retrieval model). We focus on the effective use of the information provided by the user profile rather than on the construction of the profile itself. Our personalization techniques are mainly designed for document collections such as digital libraries or corpora of big organizations, rather than for the web, because of the web's great structural heterogeneity. However, most of the proposed personalization techniques could also be applied to flat (non-structured) documents with almost no changes.

    Our main contribution is the proposal and evaluation of several new personalization techniques in the context of XML retrieval. Most of them include new personalization aspects, such as the use of two retrieved lists of results in the reranking process, a modification in the search engine, or even the use of content-and-structure queries for personalization purposes. From the experimental results obtained, we can conclude that all of them provide a very good performance improvement over using no personalization. We suggest using the proposed techniques, if possible, in this order: retrieval model modification, the content-and-structure approach and the reranking approach.

    The remainder of the article is organized as follows. We first give, in Section 2, an overview of the different personalization strategies existing in the literature. Then, in Section 3, we present our proposed personalization approaches. Section 4 describes the experimental methodology. This includes a description of the document collection considered, how we have obtained the user profiles and the relevance assessments, our evaluation method, and the obtained results and our conclusions. Finally, we finish in Section 5 with some general conclusions and proposals for future work.

    2 RELATED WORK

    In this section, we give a general overview of the different personalization techniques found in the specific literature. We should focus on XML personalization, since it is the field our work belongs to. However, we believe that it is useful to first give a broader view of personalization, not necessarily confined to XML documents. We do not wish to make an exhaustive analysis but merely to show the main types of existing personalization techniques and, for those more similar to our proposals, to present the main differences between them.

    We will start with some ideas about the representation of users' preferences by means of user profiles, and then review some of the existing work on general personalization methods. Next, focusing on XML, we will comment on the different ways of querying XML documents and then describe specific XML personalization methods.

    2.1 User Profiles

    The first requirement when dealing with personalization is to have a good representation of the user information (his/her interests and preferences), which is stored in a user

    profile. The more accurately this information represents the user, the better the retrieved results for this user.

    An accurate representation of the user profile is very important in order to obtain good retrieval results, but there is another key component: how to use this information, i.e. how well the whole retrieval process exploits the information stored in the user profile. In this article we focus on the effective use of the information provided by the user profile (the personalization strategies) rather than on the construction of the profile itself. Nevertheless, some comments about building profiles are necessary.

    There are many studies on how to build a good profile representation, a process that has two key points: (1) information sources and acquisition techniques, and (2) user profile representation and updating.

    The first point is beyond the scope of our research. Considering the second point, there are two main user profile representations: a set of weighted keywords or terms [45] and rich semantic-based structures, sometimes enhanced with the use of ontologies [43], [50]. The most common representation for user profiles is the first one, which we will use. The (weighted) keywords can be automatically extracted from documents or other kinds of sources, or directly provided by the user. In our experiments, the keywords will be extracted from the document collection being considered, either automatically or using expert assessments, as explained in Section 4.2. Each keyword has an associated numerical weight representing the importance of the term for the user.
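    As an illustration of the weighted-keyword representation described above, the following minimal sketch stores a profile as a mapping from terms to numerical weights. The class name `UserProfile`, the method `top_terms` and the example terms and weights are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class UserProfile:
    """Keyword-based profile: each term maps to a numerical importance weight."""
    weights: dict[str, float] = field(default_factory=dict)

    def top_terms(self, k: int) -> list[tuple[str, float]]:
        """Return the k most important (term, weight) pairs, heaviest first."""
        ranked = sorted(self.weights.items(), key=lambda item: item[1], reverse=True)
        return ranked[:k]


# Hypothetical profile; terms and weights are invented for illustration only.
profile = UserProfile({"agriculture": 0.9, "olive": 0.7, "subsidy": 0.4, "irrigation": 0.2})
print(profile.top_terms(2))
```

    Keeping the weights in a plain dictionary makes it easy to sort the terms by importance, which is the only operation the keyword-based strategies discussed later need from the profile.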

    2.2 General Personalization Methods

    We will classify the different personalization techniques according to where they utilize the user profile information within the retrieval process [46].

    Query reformulation. In this case, personalization is applied before searching. The most used technique is query expansion with the user profile terms (called query augmentation in [38]), which is an easy and powerful technique. For example, Shen et al. [42] select appropriate terms from related preceding queries and corresponding search results to expand the current query. Chirita et al. [17] generate the additional query terms by analysing user data at increasing granularity levels and using external thesauri.

    Query expansion is a technique in itself [13], which can be used for personalization but also for other purposes, such as relevance and pseudo-relevance (blind) feedback². In these cases, the expansion terms are extracted either from the documents judged as relevant by the user or from the first retrieved documents, instead of from a profile. In general, query expansion improves recall, but usually at the expense of precision. However, according to recent experimental studies [13], query expansion results in better retrieval effectiveness when a combined recall/precision measure is used. Query expansion suffers from the so-called query-drift problem [31], [51]. It confuses the user because the retrieved results may not contain

    2. Relevance feedback can be viewed as a method of short-term personalization [18].


    the query terms the user was looking for in the original query. This is due to the change in the underlying intent between the original query and its expanded form. Its effect may be particularly serious when applying query expansion to personalization. The reason is that the number of terms in the profile may be high and these terms can be highly unrelated to the original query terms. The usual way of dealing with this problem, especially in feedback applications, is to emphasize the original query terms with respect to the expansion terms, for example by giving lower weights to the expansion terms. Our basic expansion method will also follow this approach, by applying a global normalization factor over the profile terms used for expansion.

    Reranking of results. In this strategy personalization is applied after the search has been executed. It tries to improve precision by reranking the top results retrieved from the original query, taking into account the user profile information. There are plenty of studies following this strategy. For example, Sugiyama et al. [45] use a keyword-based user profile and rerank the results based on the similarity between each web page and the user profile. Chirita et al. [16] focus on reranking the web search output according to the cosine distance between each page and a set of Desktop terms describing user interests. In these cases, the original ranking is not taken into account. Teevan et al. [47] and Matthijs and Radlinski [33] also propose methods to incorporate the original rank within the final ranking. In most of the studies about personalization using reranking, only the original query is submitted to the search engine. The effort of comparing the retrieved results with the profile information, in order to rerank these results, is carried out outside the retrieval model implemented by the search engine (which is almost unavoidable in the case of web search).
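    The profile-based reranking idea described above can be sketched as follows. This is only an illustrative sketch, not the method of any of the cited studies: results and the profile are assumed to be sparse term-weight dictionaries, similarity is measured with the cosine, and the `mix` parameter (an assumption introduced here) linearly interpolates between the original retrieval score and the profile similarity. With `mix=0.0` the original ranking is ignored.

```python
import math


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rerank(results, profile, mix=0.0):
    """Rerank (result_id, original_score, term_vector) triples.

    mix = 0.0 uses only the similarity to the profile;
    mix = 1.0 reproduces the original ordering (hypothetical interpolation).
    """
    scored = [
        (rid, mix * score + (1.0 - mix) * cosine(vec, profile))
        for rid, score, vec in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical data for illustration only.
results = [("a", 0.9, {"tax": 1.0}), ("b", 0.5, {"olive": 1.0})]
profile = {"olive": 1.0}
print(rerank(results, profile))           # profile similarity only
print(rerank(results, profile, mix=1.0))  # original scores only
```

    Note that all the comparison work happens outside the search engine, which is exactly the cost that the text attributes to this family of approaches.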

    In a context different from personalization, namely that of methods for fusion of retrieved lists, Meister et al. [34], [35] rerank a list retrieved in response to a query utilizing a second list. This list is retrieved by using a different retrieval method and/or query representation, and exploiting inter-document similarities between the lists, so as to improve precision in the very top ranks. Their methods can be used in the context of blind feedback-based automatic query expansion by reranking the list produced by blind feedback using the list retrieved in response to the original query [34]. Similarly, Zighelnic and Kurland [51] fuse the results retrieved in response to the original query and to its expanded form. This helps to alleviate the query-drift problem. Our approach to personalization using reranking is close to these studies but used in the opposite way: we use the results from the expanded query to rerank the results obtained by the original query.

    Retrieval model modification. There are not many articles which modify the search engine retrieval model in order to account for the user profile. Most of them are focused on link analysis. Haveliwala [22] computed a topic-oriented PageRank, in which 16 PageRank vectors, biased on each of the main topics of the Open Directory, were initially calculated off-line. Then they were combined at run-time based on the similarity between the user query and each of the

    16 topics. Jeh and Widom [26] were able to manage arbitrary topic vectors instead of a predefined set of topics. In order to generate topic-oriented rankings, Nie et al. [36] distributed the PageRank of a page across the topics it contains. Chang et al. [14] personalized HITS instead of PageRank. Lastly, outside the field of link analysis, Teevan et al. [47] modified the probabilistic ranking function BM25 by weighting terms appearing in the user profile higher. We will also propose a modification of the search engine used in our experiments, in order to treat the terms appearing in the profile differently from those appearing in the original query.
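    The run-time combination step mentioned above for topic-oriented PageRank (vectors precomputed off-line, then combined according to the similarity between the query and each topic) can be illustrated with the sketch below. It assumes, for illustration only, that the combination is a weighted sum with the query-topic similarities normalised to sum to one; the function name, topics and numbers are hypothetical.

```python
def combine_topic_ranks(topic_ranks, query_topic_sim):
    """Combine per-topic rank vectors into a single query-sensitive vector.

    topic_ranks:     {topic: {page: rank}} computed off-line.
    query_topic_sim: {topic: similarity between the query and the topic}.
    The similarities are normalised so that the mixing weights sum to one.
    """
    total = sum(query_topic_sim.values())
    if total == 0:
        return {}
    combined = {}
    for topic, ranks in topic_ranks.items():
        share = query_topic_sim.get(topic, 0.0) / total
        for page, rank in ranks.items():
            combined[page] = combined.get(page, 0.0) + share * rank
    return combined


# Two hypothetical topics instead of the 16 used in the cited work.
topic_ranks = {
    "sports": {"p1": 0.8, "p2": 0.2},
    "science": {"p1": 0.1, "p2": 0.9},
}
query_topic_sim = {"sports": 0.25, "science": 0.75}
print(combine_topic_ranks(topic_ranks, query_topic_sim))
```

    Because the expensive per-topic vectors are computed once, only this cheap weighted sum has to run at query time.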

    2.3 Querying XML Documents

    The most straightforward and effective querying method for non-structured document collections is the well-known keyword search. One of its key advantages is simplicity, since users only need to specify the keywords they are interested in. However, XML document collections have both content and structure, and may be queried by content, structure or both. In the terminology used within the Initiative for the Evaluation of XML Retrieval (INEX), keyword queries are known as content-only (CO) queries. Content-and-structure (CAS) queries are those containing both structure and content constraints. There are state-of-the-art querying languages, such as XQuery³ (supported by XPath⁴) or NEXI [49], that allow us to retrieve XML documents based on content and structure. But they have two key disadvantages: (1) they are complex to learn and use, and (2) the users must know the structure of the documents, which most of the time is not the case. These query languages are more suitable for expert users, letting them specify the kinds of SUs that will best satisfy their information needs, in contrast to the classic keyword search.

    In this paper, we keep the simple keyword search query interface, although we exploit XML structure during the query processing, so that the retrieved results can be any kind of document component. We have decided to take this approach because the users in the user study we carried out knew neither the structure of the underlying XML document collection nor any of the complex querying languages. Perhaps for these reasons, although there are many IRS able to deal with XML documents⁵, these systems can often only process CO queries. Nevertheless, a general method to convert some of these CO-only systems into fully structured IRS, which can process CAS queries, has recently been proposed by de Campos et al. [8]. This may be useful in our case because some of the proposed personalization strategies internally transform the original CO queries submitted by the users into CAS queries.

    2.4 XML Personalization Methods

    XML search personalization is not yet a well-explored research area, and we have found very few studies dealing

    3. http://www.w3.org/standards/techs/xquery
    4. http://www.w3.org/standards/techs/xpath
    5. The series of INEX Workshop proceedings is an excellent source of information, see [20], [21].


    with this topic. Amer-Yahia et al. [2] developed their XML personalization system PIMENT. This is a system which enables query personalization by query rewriting and answer ranking. It is composed of a profile repository that stores user profiles, a query customizer that rewrites user queries based on user profiles and a ranking module to rank query answers. In PIMENT a user profile is a set of rules in the form (condition, action, conclusion). The condition and conclusion parts are XQuery Full Text⁶ expressions, and the action can be to add, remove or replace. Whenever a query matches a rule condition, it is rewritten accordingly. However, the generation of the rules in the user profile requires the user's active participation. Chernishev [15] takes the PIMENT architecture as the base, adding a feedback module which tries to extract, from the query history, the user's awareness of the document structure. The query history contains user queries, query results, and user responses (e.g. the set of chosen items or the time the user takes to examine a particular item). The user's knowledge about the structure of the documents is stored in the user repository, which will be used in the query rewriting process. For query rewriting it uses a mechanism based on a modified version of a well-known query rewriting technique called relaxations. Amer-Yahia et al. [3] extended their previous work to a new framework called PIMENTO. With this approach, the user profile is a set of scoping and ordering rules (SRs and ORs, respectively). SRs allow for narrowing or broadening the scope of the query, while ORs are used to enforce ranking preferences by reranking the results of the queries previously modified by the SRs. SRs may be conflicting due to their order of application and ORs may be ambiguous, although the authors describe an algorithm to detect and resolve conflicting SRs and ambiguous ORs. They also define an OR-aware top-k pruning algorithm to guarantee an efficient query personalization process. Our approach to XML personalization is fairly different from these studies, as we use a keyword-based user profile to expand the query (which is a much simpler process than using a set of rules), together with reranking methods and modification of the retrieval model.

    As we have already discussed, relevance and blind feedback techniques, although different, are related in several ways to personalization. Therefore, it is also interesting to briefly review existing work on relevance and blind feedback over XML documents. Within this area, Mass and Mandelbrod [32] propose a component ranking algorithm for XML retrieval and show how to apply known relevance feedback algorithms from traditional IR on top of it, to achieve relevance feedback for XML. Pan [37] proposes query expansion based on ontological similarities. A query is first expanded with the use of a global ontology. Then, after the first round of feedback from the user, a specific ontology is built from some parts of the global ontology and the query itself. This new ontology is then used for each round of query expansion and modified according to the user feedback. De Campos et al. [7], [9] propose probabilistic methods for reweighting and expanding both CO and CAS queries (adding terms extracted from relevant

    6. http://www.w3.org/TR/xpath-full-text-10/

    components instead of terms extracted from complete documents). Hsu et al. [24] devise a context-aware approach for searching XML to improve the effectiveness of keyword search on XML via query expansion. They find a set of XML path expressions that capture the contextual meaning of a keyword query based on pseudo-feedback. Paths in the contexts of the query are used to expand the original query.

    Schenkel and Theobald [40], [41] present a formal framework to integrate different dimensions of feedback, beyond content-based feedback, into XML retrieval. Concretely, they present methods that expand a CO query into a CAS query based on relevance feedback, by taking into account the structured dimension of XML. Further advances in this direction have been more recently proposed by Hlaoua et al. [23]. One of our proposals for personalization is also based on transforming the original CO query into a CAS query that incorporates the profile terms.

    3 PERSONALIZATION STRATEGIES

    In this section we describe the different approaches considered to perform personalization on XML documents. More specifically, we have designed several personalization strategies based on: query expansion (addition of terms coming from the user profile to the original query); reranking (combination of the output of two queries: the original and the expanded queries); conversion of CO queries to CAS queries, making the most of the structure of the documents; and finally, modification of the retrieval model in order to natively differentiate original query terms from profile terms. These strategies are applied in the three typical scenarios where personalization is implemented: before and after search time, and changing the search engine.

    These approaches will be experimentally compared in Section 4. One of the principles guiding our research is that we want most of the work to be carried out by the search engine of the IRS. We do not want to use expensive additional processes or calculations in order to integrate the user profile information (as in many of the reranking strategies mentioned in Section 2).

    We shall assume that we have an XML IRS that, given a query, returns a list of results ordered by decreasing values of the Relevance Status Value (RSV) or retrieval score assigned by the IRS. Each result is an SU of an XML document in the collection. The list of results contains, at most, a fixed number of SUs (1500 in the experiments in Section 4) and follows the Focused INEX Task specification [27], i.e. overlap has been eliminated.
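    To make the notion of an overlap-free result list concrete, the sketch below filters a list of (path, RSV) pairs so that no retained SU is an ancestor or a descendant of another retained SU. The greedy highest-score-first policy and the path notation (document identifier followed by element steps) are assumptions made here for illustration; they are not taken from the paper or from the INEX specification.

```python
def is_ancestor_or_self(a: str, b: str) -> bool:
    """True if path a equals path b or is a structural ancestor of b."""
    return a == b or b.startswith(a + "/")


def overlaps(a: str, b: str) -> bool:
    """Two structural units overlap if one contains the other."""
    return is_ancestor_or_self(a, b) or is_ancestor_or_self(b, a)


def remove_overlap(results, limit=1500):
    """Greedily keep the best-scoring units that do not overlap (assumed policy).

    results: iterable of (path, rsv) pairs; returns at most `limit` pairs,
    ordered by decreasing RSV.
    """
    kept = []
    for path, rsv in sorted(results, key=lambda r: r[1], reverse=True):
        if len(kept) == limit:
            break
        if not any(overlaps(path, other) for other, _ in kept):
            kept.append((path, rsv))
    return kept


# Hypothetical result list: a section and one of its own paragraphs overlap.
results = [
    ("d1/sec[1]", 0.8),
    ("d1/sec[1]/p[2]", 0.9),
    ("d1/sec[2]", 0.5),
    ("d2/sec[1]", 0.7),
]
print(remove_overlap(results))
```

    Appending "/" before the prefix test avoids treating `sec[1]` as an ancestor of `sec[10]`.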

    3.1 Normalized Query Expansion

    The first approach we are going to use is simply query expansion: concretely, we add to the original query the first k terms in the profile. The profile terms are ranked in descending order of importance, so that we select the k terms of greatest importance. The number k of added terms is a parameter that should be adjusted. This is a very easy and efficient technique, which only requires running a longer query. But its main drawback is the aforementioned query-drift problem. The expanded (original+profile) query could retrieve results closer to the user


    profile itself than to the original query (which represents his/her current information need). Moreover, as we are dealing with XML documents, the added profile terms could also provoke an increase in the size of the retrieved SUs, as a bigger SU is probably necessary to accommodate the increased number of query terms. Both problems will become more pronounced as more profile terms are added. On the other hand, adding too few terms may cause a poor representation of the true preferences of the user, so that some kind of trade-off becomes necessary.
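    The basic expansion step described above (appending the k most important profile terms to the original query) can be sketched as follows. Skipping profile terms that already occur in the query is an assumption added here, and the function name and example data are hypothetical. The weighting of the added terms is deliberately left out of this sketch.

```python
def expand_query(query_terms, profile, k):
    """Append the k heaviest profile terms to the original query terms.

    profile: {term: weight}. Terms already in the query are skipped
    (an assumption of this sketch, not stated in the text).
    """
    ranked = sorted(profile.items(), key=lambda item: item[1], reverse=True)
    extra = [term for term, _ in ranked if term not in query_terms][:k]
    return list(query_terms) + extra


# Hypothetical query and profile, for illustration only.
query = ["water", "law"]
profile = {"olive": 0.7, "agriculture": 0.9, "water": 0.8, "subsidy": 0.4}
print(expand_query(query, profile, k=2))
```

    Varying `k` in such a sketch makes the trade-off discussed above tangible: a larger `k` represents the user better but pulls the query further from its original intent.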

    To alleviate these problems, we propose the use of a global normalization factor applied to the weights of the profile terms, making their influence over the expanded query weaker. It is a kind of upper bound for the weights of the profile terms, in order to differentiate their importance with respect to the original query terms. More precisely, let $t_1, \ldots, t_m$ be the original terms in the query and $t_{m+1}, \ldots, t_{m+k}$ be the first $k$ terms in the profile, whose weights, within the profile, are $w_{m+1}, \ldots, w_{m+k}$. Let 0 ...
