Social Tagging in Query Expansion: a new Way for Personalized Web Search
Claudio Biancalana, Alessandro Micarelli, Francesco Saverio Profiti
Department of Computer Science and Automation
Artificial Intelligence Laboratory
Roma Tre University
and
LAzio Innovazione Tecnologica S.p.A.
Motivations
• Information overload
• Human Factors
• Semantics
• Vocabulary problem
• Ineffectiveness of Short Queries
Query Expansion (1/2)
• The process of expanding a user query with additional related words and phrases
• In the context of web search engines, query expansion involves evaluating a user input typed into the search query area and expanding the search query to match additional documents
Personalized Web Search Architecture
[Figure: architecture diagram. Components: User, Query, Search Engine, Visited pages, Implicit Feedback, User Model, Personalization. Query labels: "a,b" and "a,b,c,d,e".]
Query Expansion (2/2)
Original Query: Q = {q1, q2, …, qk, qk+1, …, qn}
Terms to add: Q+ = {e1, e2, …, em}
Terms to remove: Q- = {qk+1, …, qn}
Expanded Query:
EQ = (Q ∪ Q+) - Q- = {q1, q2, …, qk, e1, e2, …, em}
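The set operations above can be sketched in a few lines of Python. This is only an illustration: the function name `expand_query` and the example terms are assumptions, not taken from the slides.

```python
def expand_query(query, to_add, to_remove):
    """Return EQ = (Q ∪ Q+) - Q-, keeping the order in which terms first appear."""
    removed = set(to_remove)
    expanded = []
    for term in list(query) + list(to_add):
        if term not in removed and term not in expanded:
            expanded.append(term)
    return expanded


# Hypothetical terms, chosen only to illustrate the set operations.
Q = ["amazon", "cheap", "the"]
Q_plus = ["buy", "book"]
Q_minus = ["the"]
EQ = expand_query(Q, Q_plus, Q_minus)
```

A list is used instead of a set so that the original query terms stay in front of the added ones.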
User modeling and personalization in web
• Personalization is tailoring a consumer product, electronic or written medium to a user, based on personal details or characteristics that the user provides
• In web search engines, personalization is tailoring search results based on the interests of a user
Pre-processing
• HTML Tag Elimination
• Semantic Analysis: Monty POS tagger
◦ adjective, noun, proper noun, preposition
• Stop Word Elimination
• Stemming
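A minimal sketch of such a pipeline, using only the Python standard library, might look as follows. It is not the authors' implementation: the stop-word list is a tiny illustrative one, `naive_stem` is a crude stand-in for a real stemmer, and the POS tagging step is left out because it needs an external tagger.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and ignores the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


STOP_WORDS = {"the", "a", "an", "of", "in", "is", "and", "to"}  # illustrative only


def strip_html(html):
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def naive_stem(word):
    """Crude suffix stripping; an assumed placeholder for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(html):
    text = strip_html(html)
    tokens = re.findall(r"[a-z]+", text.lower())
    # POS filtering would go here; it is omitted in this sketch.
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("<p>The <b>rivers</b> of the Amazon</p>")
```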
User Model and Co-occurrence
• Co-occurrence is the extent to which two terms tend to appear simultaneously in the same context
• User Model: Co-occurrence terms matrix
[Figure: 5 × 5 co-occurrence matrix with rows and columns t1 … t5; cell values range from 0.0 to 9.0]
How do we build the matrix values?
Personalization and Query Expansion
• Method I
◦ Bigrams
• Method II
◦ Hyperspace Analogue to Language
• Method III
◦ Page Level co-occurrence
• Method IV
◦ Page Level co-occurrence and term proximity
Method I - Bigrams
• The user model is built around the concept of bigrams, namely pairs of adjacent terms in the text of a web page. Two terms are considered co-occurring only if adjacent.
• The context of a term is thus exclusively limited to the term that is directly next to it, either to the left or to the right.
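As a rough illustration of the bigram method, the sketch below counts adjacent term pairs in an already pre-processed page. Storing the counts symmetrically and skipping a term that is adjacent to itself are assumptions of this sketch; the example page is hypothetical.

```python
from collections import defaultdict


def bigram_cooccurrence(terms):
    """Count co-occurrences of adjacent terms; counts are stored symmetrically."""
    matrix = defaultdict(lambda: defaultdict(float))
    for left, right in zip(terms, terms[1:]):
        if left == right:
            continue  # assumption: a term does not co-occur with itself
        matrix[left][right] += 1.0
        matrix[right][left] += 1.0
    return matrix


page = ["amazon", "river", "basin", "amazon", "river"]  # hypothetical page
m = bigram_cooccurrence(page)
```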
Method II - HAL
• Given a window of N terms that can be scrolled inside a page text, two terms are considered co-occurring only if they are within such window.
• The co-occurrence value is inversely proportional to the distance between the two terms within the window.
Example N=5
t1 t2 t3 t4 t5 t6
Weights of t2 … t6 with respect to t1: 1/1 1/2 1/3 1/4 1/5
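The windowed weighting can be sketched as below, assuming that a pair of terms d positions apart contributes 1/d and that the contribution is stored symmetrically. The symmetric storage and the function name are assumptions of this sketch, not details given on the slide.

```python
from collections import defaultdict


def hal_cooccurrence(terms, window=5):
    """Add 1/d for every pair of terms that are d positions apart, 1 <= d <= window."""
    matrix = defaultdict(lambda: defaultdict(float))
    for i, term in enumerate(terms):
        for d in range(1, window + 1):
            j = i + d
            if j >= len(terms):
                break
            other = terms[j]
            if other == term:
                continue  # assumption: no self co-occurrence
            matrix[term][other] += 1.0 / d
            matrix[other][term] += 1.0 / d
    return matrix


m = hal_cooccurrence(["t1", "t2", "t3", "t4", "t5", "t6"], window=5)
```

With N=5 this reproduces the slide's example for t1: the neighbours t2 … t6 receive 1/1, 1/2, 1/3, 1/4 and 1/5.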
Method III – Page Level co-oc (1/2)
• Within this method, the context of a term is expanded to the entire page considered.
• Two terms are then deemed co-occurring only if they are both present, simultaneously, in the same page.
Method III – Page Level co-oc (2/2)
• For each document, a co-occurrence matrix is generated; the matrices are then summed up into a single matrix
• The POS tagger extracts the nouns, proper nouns and adjectives
• Only the first k keywords are used, following an order based on tf*idf
[Figure: co-occurrence matrix and weighted co-occurrence matrix]
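A possible reading of the page-level method is sketched below. The slides do not spell out several details, so these are assumptions: tf*idf uses a natural-log idf over the given pages, each page contributes 1.0 per keyword pair, and both the POS filter and the weighting step shown in the figure are left out.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def top_k_keywords(doc_terms, doc_freq, n_docs, k):
    """Rank a page's terms by tf*idf and keep the first k."""
    tf = Counter(doc_terms)
    score = {t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf}
    ranked = sorted(score, key=lambda t: (-score[t], t))
    return ranked[:k]


def page_level_cooccurrence(pages, k=3):
    """Build one binary co-occurrence matrix per page and sum them all."""
    doc_freq = Counter(t for page in pages for t in set(page))
    total = defaultdict(lambda: defaultdict(float))
    for page in pages:
        keywords = top_k_keywords(page, doc_freq, len(pages), k)
        for a, b in combinations(sorted(keywords), 2):
            total[a][b] += 1.0
            total[b][a] += 1.0
    return total


# Hypothetical pre-processed pages.
pages = [
    ["amazon", "river", "river", "basin"],
    ["amazon", "book", "book", "buy"],
    ["amazon", "river", "forest"],
]
m = page_level_cooccurrence(pages, k=3)
```

Lowering k drops terms that appear in every page, since their idf (and therefore their tf*idf score) is zero in this sketch.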
Method IV – Page Level co-oc and term proximity
Co-occurrence at page level (method III)
+
Term Proximity
t1 t2 t3 t4 t5 t6
1/1 1/2 1/3 1/4 1/5
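One way to read the "+" on this slide is to add a proximity bonus of 1/d to the page-level value. The sketch below assumes exactly that, for a single page; how the two components are actually combined is not stated in the slides, so treat the formula as hypothetical.

```python
from collections import defaultdict


def page_and_proximity(terms, window=5):
    """Page-level indicator (1.0 per distinct pair) plus 1/d for nearby occurrences."""
    matrix = defaultdict(lambda: defaultdict(float))
    distinct = sorted(set(terms))
    # Page-level part: every pair of distinct terms on the page co-occurs once.
    for i, a in enumerate(distinct):
        for b in distinct[i + 1:]:
            matrix[a][b] += 1.0
            matrix[b][a] += 1.0
    # Proximity part: pairs within the window get an extra 1/d.
    for i, a in enumerate(terms):
        for d in range(1, window + 1):
            if i + d >= len(terms):
                break
            b = terms[i + d]
            if a != b:
                matrix[a][b] += 1.0 / d
                matrix[b][a] += 1.0 / d
    return matrix


m = page_and_proximity(["t1", "t2", "t3", "t4", "t5", "t6", "t7"], window=5)
```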
Personalization and Query Expansion
• In this research we implement method III
• Method III performs better than the others, as we show in:
[1] C. Biancalana, A. Micarelli, A. Lapolla, "Personalized Web Search using Correlation Matrix for Query Expansion", in Joaquim Filipe, José Cordeiro, Vitor Pedrosa (Eds.): "Web Information Systems and Technologies", LNBIP, 2009
Query Expansion
• In the Query Expansion process, we select the rows representing the original query terms (e.g. t2, t3).
[Figure: the 5 × 5 co-occurrence matrix over t1 … t5, with the rows for Q = t2, t3 selected]
Query Expansion
• Sum up the selected rows
• Select the first N terms (highest values) of the new vector.
N = 1, Q = t2, t3
Summed vector (t1 … t5): 1.0 3.0 3.0 11 0.0
Selected term: t4
[Figure: the 5 × 5 co-occurrence matrix over t1 … t5]
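The row-sum selection can be sketched as follows. The matrix here is illustrative: rows t2 and t3 are chosen so that their sum matches the vector shown on the slide, while the remaining rows are made up. Excluding terms that are already in the query is also an assumption of this sketch.

```python
TERMS = ["t1", "t2", "t3", "t4", "t5"]

# Illustrative symmetric matrix; only the sum of rows t2 and t3 is taken
# from the slide, the individual cell values are assumed.
MATRIX = [
    [0.0, 1.0, 0.0, 2.0, 1.0],  # t1
    [1.0, 0.0, 3.0, 9.0, 0.0],  # t2
    [0.0, 3.0, 0.0, 2.0, 0.0],  # t3
    [2.0, 9.0, 2.0, 0.0, 4.0],  # t4
    [1.0, 0.0, 0.0, 4.0, 0.0],  # t5
]


def expand(query, n):
    """Sum the rows of the query terms and return the n best new terms."""
    rows = [MATRIX[TERMS.index(q)] for q in query]
    summed = [sum(col) for col in zip(*rows)]
    # Assumption: terms already in the query are not proposed again.
    candidates = [(v, t) for v, t in zip(summed, TERMS) if t not in query]
    candidates.sort(key=lambda vt: -vt[0])
    return summed, [t for _, t in candidates[:n]]


summed, added = expand(["t2", "t3"], n=1)
```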
Co-occurrence matrices limits
• Semantic aspects:
◦ In particular polysemy and homonymy
• For example:
◦ User query: “amazon”
◦ Expanded query: “amazon buy river”
◦ Possible results:
http://www.amazon.com/
http://en.wikipedia.org/wiki/Amazon_River
Our solution for Co-occurrence matrices limits
• Extension of the Co-occurrence matrix:
◦ Introduction of metadata as third dimension of the matrix
• Use of Social Bookmarking services for metadata retrieval:
◦ del.icio.us
◦ stumbleupon.com
◦ …
Three-dimensional co-occurrence matrix structure
Example
• User query: amazon
Category | Expanded Query | Results
e-commerce | amazon AND buy AND (book OR books) | http://www.amazon.com/, http://www.amazon.co.uk/, …
nature | amazon AND (river OR rivers) | http://en.wikipedia.org/wiki/Amazon_River, http://www.mbarron.net/Amazon/, …
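The idea of a tag as third dimension can be sketched with nested dictionaries indexed as `cooc[tag][term][other]`. Everything in the sketch is an assumption made for illustration: the tags and pages are hypothetical data rather than output of a bookmarking service, and co-occurrence is counted at page level.

```python
from collections import defaultdict


def build_tagged_matrix(tagged_pages):
    """cooc[tag][term][other]: page-level co-occurrence within pages carrying tag."""
    cooc = defaultdict(lambda: defaultdict(lambda: defaultdict(float)))
    for tags, terms in tagged_pages:
        distinct = sorted(set(terms))
        for tag in tags:
            for a in distinct:
                for b in distinct:
                    if a != b:
                        cooc[tag][a][b] += 1.0
    return cooc


def expand_by_category(cooc, query_term, n=2):
    """Return one expanded term list per tag under which the query term was seen."""
    result = {}
    for tag, matrix in cooc.items():
        row = matrix.get(query_term)
        if not row:
            continue
        best = sorted(row, key=lambda t: (-row[t], t))[:n]
        result[tag] = [query_term] + best
    return result


# Hypothetical (tags, terms) pairs standing in for bookmarked pages.
pages = [
    (["e-commerce"], ["amazon", "buy", "book"]),
    (["e-commerce"], ["amazon", "buy", "price"]),
    (["nature"], ["amazon", "river", "forest"]),
    (["nature"], ["amazon", "river"]),
]
expansions = expand_by_category(build_tagged_matrix(pages), "amazon", n=1)
```

Keeping one matrix slice per tag is what lets the ambiguous query produce a separate expansion for each category instead of a single mixed one.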
Experimentations
• The employed benchmark: Lazio Region Portal Data (LRPD)
• An example of a topic is Top/Sala Stampa/Presidente/Biografia:
◦ Level I: Sala Stampa;
◦ Level II: Presidente;
◦ Level III: Biografia
• Given the large quantity of links contained in LRPD, we decided to consider only level III links
Experimentations
• Lazio is a region situated in central Italy, whose largest city is Rome
• Lazio Region is also the name of the public administration that governs the citizens of this region
• Like most Italian public administrations, Lazio has a web portal through which it provides e-government services to its citizens
Experimentations
[Figure: portal hierarchy with levels Top, Level I, Level II, Level III]
Experimentations
• Each topic’s links were then subdivided into a training set, corresponding to 25% of the links, and a test set, corresponding to 75% of the links
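A 25%/75% split of one topic's links could be done as in the sketch below. The slides do not say how links were assigned to the two sets, so the seeded random shuffle is an assumption, and the link identifiers are placeholders.

```python
import random


def split_links(links, train_fraction=0.25, seed=0):
    """Shuffle a topic's links and split them into training and test sets."""
    shuffled = list(links)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


links = [f"link-{i}" for i in range(20)]  # placeholder identifiers
train, test = split_links(links)
```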
Experimentations
• For the experiments we use:
◦ the Page Level Co-oc metric for the construction of the co-occurrence matrix
◦ the del.icio.us Social Bookmarking service
◦ the F1-measure as performance indicator, defined in terms of two quantities: the number of returned links belonging to topic t (only the first 50 pages are taken into consideration in our tests) and the overall number of test links belonging to topic t present in the index.
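For reference, a per-topic F1 can be computed as in the sketch below. It assumes the standard definition, F1 = 2PR / (P + R), with precision taken over the returned links and recall over the test links of the topic, and it treats the cutoff of 50 as the number of results inspected. The exact formula on the slide is not recoverable from the transcript, so this may differ from the authors' definition.

```python
def f1_for_topic(returned_links, topic_test_links, cutoff=50):
    """F1 over the first `cutoff` returned links for one topic."""
    top = returned_links[:cutoff]
    relevant = set(topic_test_links)
    hits = sum(1 for link in top if link in relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(top)      # share of returned links that belong to the topic
    recall = hits / len(relevant)    # share of the topic's test links that were returned
    return 2 * precision * recall / (precision + recall)


# Hypothetical link identifiers.
score = f1_for_topic(["a", "b", "x", "y"], ["a", "b", "c"])
```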
Experimentations
• We have compared our system with:
◦ a system based on a traditional content-based user-modeling approach, where documents are represented in the Vector Space Model, without Query Expansion (no QE)
◦ a system that focuses on updating the user model by means of Relevance Feedback (RF) techniques (no Social Bookmarking)
Experimentation
• The following table shows the results obtained: