Automatic Query Expansion in Information Retrieval
Transcript of Automatic Query Expansion in Information Retrieval
BY RYAN HERBECK
What is Automatic Query Expansion (AQE)?
“A process which consists of selecting and adding terms to the user's query with the goal of minimizing query-document mismatch and thereby improving retrieval performance.”
Takes a user’s original query and selects and adds related words to it
Used to increase effectiveness of relevant document retrieval in information retrieval systems
Current Information Retrieval (IR) Systems
Standard interface (one textbox, accepts keywords)
Keywords matched against keyword collection
Results are sorted and returned
Using multiple topic-specific keywords returns quality results
Issues:
User queries are usually short
Natural language is ambiguous
Queries are prone to errors and omissions as a result
Vocabulary Problem
System indexers and users often use different words: “saltines” vs. “crackers”
Polysemy: same word, different meanings (“Java” the language vs. the island, “Ruby” the language vs. the gemstone)
Synonymy: different words, same meaning (“TV” and “television,” “CD” and “compact disc”)
Synonymy + word inflections => decrease in recall (recall: the ability to retrieve all relevant documents)
Polysemy => decrease in precision (precision: the ability to retrieve only relevant documents)
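The precision/recall trade-off above can be made concrete with a small sketch; the document IDs below are hypothetical, purely for illustration:

```python
# Precision and recall for a single query, computed over sets of document IDs.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved set against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)               # relevant documents actually retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Synonymy hurts recall: relevant docs phrased with other words are missed.
# Polysemy hurts precision: docs about the wrong sense are retrieved.
p, r = precision_recall(retrieved=["d1", "d2", "d3", "d4"],
                        relevant=["d1", "d2", "d5", "d6"])
```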
Proposed Solutions
Interactive query refinement
Relevance feedback
Word sense disambiguation
Search results clustering
AQE
Early AQE
Suggested as early as 1960
Investigated a variety of techniques:
Vector feedback
Term-term clustering
Comparative analysis of term distributions
Experimented on small-scale collections
Yielded inconclusive results about effectiveness: gains in recall were often offset by losses in precision
Queries Today
Volume of data has increased significantly
Number of terms in a user’s query has remained low
2009: average query length was 2.30 words, the same as in 1999
Most common queries are 1-3 words in length
Vocabulary problem is worse:
Scarcity of query terms reduces synonymy handling
Diversity and size of data increase the effects of polysemy
The need for and scope of AQE have increased
Applications of AQE
Question Answering
Goal: provide direct responses as opposed to whole documents
Expand the question with related terms expected to be found in documents containing answers
Multimedia Information Retrieval
IR systems search over metadata (annotations, captions, etc.)
When no metadata exists, IR systems use content analysis (automatic speech recognition, visual features), which can be combined with AQE techniques
Applications of AQE
Information Filtering
Monitor a stream of documents and select relevant ones
Documents arrive continuously (e-news, blogs, e-mail, etc.)
Cross-Language Information Retrieval
Retrieve documents in a language differing from that of the query
Issues: insufficient language coverage, untranslatable terms, translation ambiguity
Related Techniques
Interactive Query Refinement
Relevance Feedback
Word Sense Disambiguation
Search Results Clustering
Interactive Query Refinement (IQE)
Example: Google Suggest
System suggests several formulations of the query
Decision of query formulation is made by the user
Does not handle feature selection and query reformulation issues
Potential for producing better results than AQE, but requires user expertise
Relevance Feedback
Returns initial query results
Receives user feedback about the relevancy of results
Performs a new query based on user feedback
Makes the new query more similar to the relevant documents retrieved, whereas AQE forms a query more similar to the user’s intentions
Data sources for relevance feedback may be more reliable than those used by AQE
Word Sense Disambiguation (WSD)
Identifies word meanings in context
Approaches:
Represent words by their text definitions
Use WordNet: an English lexical database that groups words into synonym sets (synsets), gives general definitions, and records semantic relations between synsets
Find all of a word’s contexts and cluster similar ones
Computational and effectiveness limitations
Typical queries may be too short for WSD
Example: “CD”
Search Results Clustering (SRC)
Organizes and groups search results by topic
Attempts to optimize clustering structure and label quality
Labels could be seen as query refinements, but are intended to help the user browse through results
Example: http://clusty.com
How AQE Works
Data Preprocessing
Feature Generation and Ranking
Feature Selection
Query Reformulation
Data Preprocessing
Reformat data source for more effective subsequent processing
Index the collection of documents and run the query against the collection index:
1. Extract text from documents
2. Extract words without punctuation, ignoring case
3. Remove articles and prepositions
4. Reduce word inflections and derivations
5. Assign a weighted importance value to each word
Data Preprocessing
Example HTML:
‘<b>Automatic query expansion</b> expands queries automatically.’
Indexed representation (weights determined by frequency): automat 0.33, queri 0.33, expan 0.16, expand 0.16
Each document is represented as a collection of weighted terms
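The preprocessing steps above can be sketched as follows. The stop-word list and suffix-stripping rules are toy assumptions standing in for a real stemming algorithm, so the stems and weights will not exactly match the slide’s example output:

```python
# Tokenize, drop stop words, crudely stem, and weight each stem by its
# relative frequency in the document.
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "on", "to"}   # toy article/preposition list

def crude_stem(word):
    # Strip one of a few common suffixes; a placeholder, not a real stemmer.
    for suffix in ("ically", "ations", "ation", "ion", "ies", "ing", "ed", "ic", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def index_document(text):
    words = re.findall(r"[a-z]+", text.lower())    # strip punctuation, ignore case
    stems = [crude_stem(w) for w in words if w not in STOP_WORDS]
    counts = Counter(stems)
    total = sum(counts.values())
    return {stem: count / total for stem, count in counts.items()}

weights = index_document("Automatic query expansion expands queries automatically.")
```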
Feature Generation and Ranking
Input: original query, transformed data source
Output: set of candidate expansion features (terms that could be added to the original query)
Original query may be preprocessed to have common words removed and/or important words extracted
Techniques: One-to-One Associations One-to-Many Associations Analysis of Top-Ranked Documents Query Language Modeling
Feature Generation and Ranking
One-to-One Associations
Between expansion features and query terms: one feature is related to one query term
One or more features are generated and ranked for each term
Approaches:
Stemming algorithm: reduces words to root form
WordNet: synonym sets (synsets) and recorded semantic relations; selecting one synset per query term prevents ambiguity
Compute term-to-term similarities in a document collection
Mine user query logs
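One of the one-to-one approaches above, term-to-term similarity in a document collection, might look like this sketch. It scores candidates by cosine similarity between binary document-occurrence vectors; the four-document collection and stop-word list are fabricated for illustration:

```python
# Rank candidate expansion features for one query term by cosine similarity
# of document-occurrence vectors across a tiny toy collection.
import math

docs = [
    "cheap flight tickets to rome",
    "book a flight and hotel",
    "last minute flight deals and tickets",
    "hotel rooms in rome",
]
STOP = {"a", "and", "to", "in"}
vocab = sorted({w for d in docs for w in d.split() if w not in STOP})

def doc_vector(term):
    # Binary vector: 1 if the term occurs in document i, else 0.
    return [1 if term in d.split() else 0 for d in docs]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expansion_candidates(query_term, k=3):
    scores = {t: cosine(doc_vector(query_term), doc_vector(t))
              for t in vocab if t != query_term}
    return sorted(scores, key=scores.get, reverse=True)[:k]

top = expansion_candidates("flight")   # "tickets" co-occurs with "flight" most
```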
Feature Generation and Ranking
One-to-Many Associations
One feature is related to one or more query terms
Approaches:
Extend one-to-one association techniques to the other query terms: generate a feature only if it is related to more than one term, which filters out weakly related features
Combine multiple relationships between term pairs
Construct a term network for the query, containing word pairs linked by relations (synonyms, stems, etc.)
Feature Generation and Ranking
Analysis of Top-Ranked Documents
Retrieve the top results for the original query
Generate expansion features from related terms in these documents
Features are related to the query as a whole, as opposed to individual query terms
Approach: Pseudo-Relevance Feedback
Score each term in the top documents by applying a weighting function over the whole collection of documents
Sum the weights of each term and sort the terms by these sums
Issue: weights reflect importance over the collection more than importance to the query
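A minimal sketch of pseudo-relevance feedback scoring along the lines above, assuming tf × idf as the per-document weighting function (one of many possibilities) and a fabricated four-document collection:

```python
# Treat the top-ranked documents as relevant, weight every term in them by
# tf x idf (idf computed over the whole collection), sum the weights, and
# rank the candidate expansion features.
import math
from collections import Counter

collection = [
    "query expansion adds extra terms".split(),
    "expansion terms come from top documents".split(),
    "unrelated text about cooking pasta".split(),
    "pasta recipes and cooking tips".split(),
]

def idf(term):
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

def prf_candidates(query_terms, top_docs, k=5):
    scores = Counter()
    for doc in top_docs:
        for term, tf in Counter(doc).items():
            scores[term] += tf * idf(term)         # weighting function per document
    for term in query_terms:                       # don't re-suggest query terms
        scores.pop(term, None)
    return [t for t, _ in scores.most_common(k)]

top_docs = collection[:2]                          # pretend these ranked highest
features = prf_candidates({"query", "expansion"}, top_docs)
```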
Feature Generation and Ranking
Query Language Modeling
Generate a probability distribution over query terms; the best features have the highest probabilities
Approaches:
Mixture Model: builds a model from the top-ranked documents as a set, extracts the part most distinct from the rest of the document collection, and uses an expectation-maximization algorithm to estimate probabilities
Relevance Model: builds a model from the top-ranked documents individually; documents further down the list have less and less influence on word probabilities
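A relevance-model-style estimate in the spirit described above might be sketched as follows; the 1/(rank+1) rank decay and the two tiny documents are illustrative assumptions, not the model’s prescribed weighting:

```python
# Estimate P(w) as a weighted average of per-document language models from
# the top-ranked documents, where a document's weight decays with its rank
# so lower-ranked documents have less influence.
from collections import Counter

def relevance_model(top_docs):
    prob = Counter()
    total_weight = sum(1.0 / (rank + 1) for rank in range(len(top_docs)))
    for rank, doc in enumerate(top_docs):
        doc_weight = (1.0 / (rank + 1)) / total_weight   # decays with rank
        counts = Counter(doc)
        length = sum(counts.values())
        for word, c in counts.items():
            prob[word] += doc_weight * (c / length)      # P(w|d) * P(d)
    return dict(prob)

pw = relevance_model([
    "expansion improves recall".split(),
    "expansion adds terms".split(),
])
```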
Feature Selection
Select top features for query expansionFeatures are not evaluated further, simply
selected based on rankLimited number of features selected for rapid
processingUsing all features is not necessarily better
than using only a fewTypically select 10-30 featuresCould select features only within a certain
rank range
Query Reformulation
Modify the original query by adding the selected features to it, then perform the search
Approaches:
Query reweighting: assign a weight to each feature using a weighting formula
Simply add the selected features to the original query without weighting
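The reweighting approach above can be sketched as follows: original query terms keep full weight while selected features receive a reduced weight, so expansion cannot dominate the user’s own words. The 0.4 constant is an arbitrary assumption in the spirit of Rocchio-style reweighting, not a recommended value:

```python
# Merge original query terms with selected expansion features, down-weighting
# the added features.

def reformulate(query_terms, expansion_features, feature_weight=0.4):
    weighted = {term: 1.0 for term in query_terms}     # original terms: full weight
    for feature in expansion_features:
        if feature not in weighted:                    # never overwrite a query term
            weighted[feature] = feature_weight
    return weighted

expanded = reformulate(["jaguar", "speed"], ["car", "automobile", "speed"])
```

Setting `feature_weight=1.0` reduces this to the second approach, adding features without weighting.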
Classification of AQE Techniques
Linguistic Analysis
Corpus-Specific Global Techniques
Query-Specific Local Techniques
Search Log Analysis
Web Data
Linguistic Analysis
Focus on morphological, lexical, syntactic, and semantic relationships for expansion
Analysis based on dictionaries, thesauri, or sources such as WordNet
Sensitive to word sense ambiguity
Examples:
Stemming algorithm: reduce terms to root form
Ontology browsing: paraphrase the user’s query in context
Syntactic analysis: extract relations between terms to find features that appear in related relations
Corpus-Specific Global Techniques
Corpus: a large structured set of texts
Analyze the contents of a full database to find features used similarly
Find correlations between term pairs at the document level or within paragraphs or sentences
Data-driven; results may not have a simple interpretation
Query-Specific Local Techniques
Utilize local context provided by the query
Make use of top-ranked documents
Examples:
Analysis of feature distribution differences
Model-based AQE
Top-document preprocessing: removes irrelevant features before applying the term-ranking function
Search Log Analysis
Mines users’ search logs for implicit query associations
Search logs contain queries and the URLs of clicked pages
Example: a user searches “apple”; a past related query “iPhone” is found
May encode implicit relevance feedback instead of retrieval feedback
Examples:
Extract features from past queries related to the current query
Use top documents from past related queries; extract terms directly from visited documents
Web Data
Use of anchor texts to generate features
Anchor text: the visible, clickable text of a hyperlink
Most anchor texts are similar to real user queries
Anchor texts typically describe the contents of the target document
Issues: “click here” links, one-word/short anchor texts
Use of Wikipedia documents and hyperlinks
Critical Issues
Parameter Setting
Efficiency
Usability
Parameter Setting
AQE techniques rely on several parameters:
Number of pseudo-relevant documents
Number of expansion terms
Variables within term-ranking and weighting functions
Could use fixed values for key parameters, but fixed values may not work well for all queries
Efficiency
Need to deliver real-time results to a large volume of users
Must balance processing time against result quality
Good AQE is computationally expensive
Expanded queries also take longer to execute
Usability
Implementation is hidden from users
Users may receive high-ranked documents that contain none of the query terms:
Query terms substituted entirely by synonyms
Irrelevant document contains the query terms only in anchor text
Could increase user control:
Show the user the features used
Allow the user to revise the expanded query
AQE is better suited to non-expert users
Conclusions
No perfect solution exists for the vocabulary problem
AQE overcomes users’ reluctance and difficulty in refining queries to meet their needs
Variety of implementations
Efficiency is gradually increasing
Near the end of its experimental stage, but not yet ready for large-scale IR systems such as web search engines
Questions?
References
Carpineto, C. and Romano, G. 2012. A survey of automatic query expansion in information retrieval. ACM Comput. Surv. 44, 1, Article 1 (January 2012), 50 pages.
Furnas, G. W., Landauer, T. K., Gomez, L. M., and Dumais, S. T. 1987. The vocabulary problem in human-system communication. Comm. ACM 30, 11, 964-971.
Mitra, M., Singhal, A., and Buckley, C. 1998. Improving automatic query expansion. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM Press, 206-214.
Vechtomova, O. 2009. Query expansion for information retrieval. In Encyclopedia of Database Systems, L. Liu and M. T. Özsu Eds., Springer, 2254-2257.