From federated to aggregated search
Fernando Diaz, Mounia Lalmas and Milad Shokouhi
Outline
  Introduction and Terminology
  Architecture
  Resource Representation
  Resource Selection
  Result Presentation
  Evaluation
  Open Problems
  Bibliography
Introduction
What is federated search?
What is aggregated search?
Motivations
Challenges
Relationships
A classical example of federated search
www.theeuropeanlibrary.org
[Figure: one query is sent to the collections to be searched, and a merged list of results is returned.]
Motivation for federated search
Search a number of independent collections, with a focus on hidden web collections
  Collections that are not easily crawlable (and often should not be crawled)
Access to up-to-date information and data
Parallel search over several collections
Effective tool for enterprise and digital library environments
Challenges for federated search
How to represent collections, so that we know what documents each contains?
How to select the collection(s) to be searched for relevant documents?
How to merge results retrieved from several collections, to return one list of results to the user?
Cooperative environment
Uncooperative environment
From federated search to aggregated search
“Federated search on the web”
Peer-to-peer network: connects distributed peers (usually for file sharing), where each peer can be both server and client
Metasearch engine: combines the results of different search engines into a single result list
Vertical search (also known as aggregated search): adds the top-ranked results from relevant verticals (e.g. images, videos, maps) to typical web search results
A classical example of aggregated search
News
Homepage
Wikipedia
Real-time results
Video
Structured data
Motivation for aggregated search
Increasingly different types of information are available, sought and relevant
  e.g. news, image, wiki, video, audio, blog, map, tweet
Search engines allow access to these through so-called verticals
Two “ways” to search
  Users can directly search the verticals
  Or rely on so-called aggregated search
Google universal search 2007: [ … ] search across all its content sources, compare and rank all the information in real time, and deliver a single, integrated set of search results [ … ] will incorporate information from a variety of previously separate sources – including videos, images, news, maps, books, and websites – into a single set of results. http://www.google.com/intl/en/press/pressrel/universalsearch_20070516.html
Motivation for aggregated search
(Arguello et al, 09)
25K editorially classified queries
Challenges in aggregated search
Extremely heterogeneous collections
What is/are the vertical intent(s)?
  Handling ambiguous (query | vertical) intent
  Handling non-stationary intent (e.g. news, local)
How many results from each to return and where to position them in the result page?
  Slotting results
  Users looking at 1st result page
Page optimization and its evaluation
Ambiguous non-stationary intent
Query - Travel - Molusk - Paul
Vertical - Wikipedia - News - Image
Recap – Introduction
                            federated search    aggregated search
heterogeneity               low                 high
scale (documents, users)    small               large
user feedback               little              a lot
Terminology
1. federated search, distributed information retrieval, data fusion, aggregated search, universal search, peer-to-peer network
2. resource, vertical, database, collection, source, server, domain, genre
3. merging, blending, fusion, aggregation, slotted, tiled
Problem definition
Present the “querier” with a summary of search results from one or more resources.
[Figure: the user sends a raw query to a search interface/portal/broker, which sends queries to several sources/servers/verticals.]
General architecture
Peer-to-peer network
Peer Directory Server
Peer-to-peer (P2P) networks
Broker-based
  Single centralized broker with document lists shared from peers (e.g. Napster, original version)
Decentralized
  Each peer acts as both client and server (e.g. Gnutella v0.4)
Structure-based
  Use distributed hash tables (DHT) (e.g. Chord (Stoica et al, 03))
Hierarchical
  Use local directory services for routing and merging (e.g. Swapper.NET)
[Figure: federated search. A broker holding summaries (SumA to SumE) of collections A to E sends the query to the collections and returns merged results.]
Federated search
Also known as distributed information retrieval (DIR) system
Provides one portal for searching information from multiple sources
  corporate intranets, fee-based databases, library catalogues, internet resources, user-specific digital storage
Funnelback, Westlaw, FedStats, Cheshire, etc (see also http://federatedsearchblog.com/)
http://funnelback.com/pdfs/brochures/enterprise.pdf
[Figure: metasearch. The user sends a raw query to a metasearch engine, which sends queries to several search engines on the WWW.]
Metasearch
A search engine that queries several different search engines and either combines their results (blended) or displays them separately (non-blended)
Does not crawl the web but relies on data gathered by other search engines
Dogpile, Metacrawler, Search.com, etc (see http://www.cryer.co.uk/resources/searchengines/meta.htm)
[Figure: aggregated search. Labels: User, Query, WWW, Index (text), Angelina Jolie results.]
Aggregated search
Specific to a web search engine
“Increasingly” more than one type of information is relevant to an information need
  mostly web page + image, map, blog, etc
These types of information are indexed and ranked using dedicated approaches (verticals)
Presenting the results from verticals in an aggregated way is believed to be more useful
All major search engines do some level of aggregated search
[Figure: data fusion (e.g. Voorhees et al, 95). One document collection (GOV2) is queried with different retrieval models (BM25, KL, Inquery) and different document representations (anchor only, title only); merging produces one ranked list of results.]
Data fusion
Search one collection
Document can be indexed in different ways
  Title index, abstract index, etc (poly-representation)
  Weighting scheme
Different retrieval models
Rankings generated by different retrieval models (or different document representations) are merged to produce the final ranking
Has often been shown to improve retrieval performance (TREC)
Terminology - Resource
Source, Server, Database, Collection (federated search)
Server, Vertical (aggregated search)
Domain, Genre
Terminology - Aggregation
Merging, Blending, Fusion
Slotted, Tiled
Aggregated search (tiled)
http://au.alpha.yahoo.com/
Aggregated search (tiled)
Naver.com
Aggregated search (slotted)
Others
Clustering
Faceted search
Multi-document summarization
Document generation
Entity search
(see special issue – in press – on “Current research in focused retrieval and result aggregation”, Journal of Information Retrieval (Trotman et al, 10))
Yippy – Clustering search engine from Vivisimo
clusty.com
Faceted search
Multi-document summarization
http://newsblaster.cs.columbia.edu/
“Fictitious” document generation
(Paris et al, 10)
Entity search
http://sandbox.yahoo.com/Correlator
Recap
Showed the relations between federated search, aggregated search, and others
Exposed the various terminologies used
In the rest of the tutorial, we concentrate on federated search and aggregated search
Focus is on “effective search”
Architecture: what are the general components of federated and aggregated search systems.
Federated search architecture
Aggregated search architecture
Pre-retrieval aggregation: decide verticals before seeing results
Post-retrieval aggregation: decide verticals after seeing results
Pre-web aggregation: decide verticals before seeing web results
Post-web aggregation: decide verticals after seeing web results
Post-retrieval, pre-web
Pre and post-retrieval, pre-web
Resource representation: how to represent resources, so that we know what documents each contains.
Resource representation in federated search
(Also known as resource summary/description)
Resource representation
Cooperative environments
  Comprehensive term statistics
  Collection size information
Uncooperative environments
  Query-based sampling
  Collection size estimation
Resource representation (cooperative environments)
STARTS Protocol (Gravano et al, 97)
  Source metadata
  Rich query language
Different types of term statistics (Callan et al, 95; Gravano et al, 94a,b,99; Meng et al, 01; Yuwono and Lee, 97; Xu and Callan, 98; Zobel, 97)
Anchor-text
  HARP (Hawking and Thomas, 05)
Resource representation (uncooperative environments)
Query-based sampling (Callan and Connell, 01)
  Select a query, probe the collection
  Download the top n documents
  Select the next query, repeat
[Figure: a query selector picks the next query; the returned documents are added to the sampled documents.]
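The sampling loop above can be sketched in a few lines. This is a minimal illustration, not the procedure of the cited paper: the `search` callable, the seed query, and the defaults for `top_n`, `max_docs` and `max_probes` are assumptions, and the next query is drawn uniformly from terms already seen in the sample, which is only one of several possible query selectors.

```python
import random
from typing import Callable, List, Optional, Set


def query_based_sampling(
    search: Callable[[str, int], List[str]],  # hypothetical interface: (query, n) -> top-n document texts
    seed_query: str,
    top_n: int = 4,
    max_docs: int = 300,
    max_probes: int = 1000,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Collect a document sample using only the collection's search interface."""
    rng = rng or random.Random(0)
    sampled: List[str] = []
    seen_docs: Set[str] = set()
    vocabulary: List[str] = [seed_query]  # unique terms learned so far
    seen_terms: Set[str] = {seed_query}
    query = seed_query
    for _ in range(max_probes):
        if len(sampled) >= max_docs:
            break
        # Probe the collection and keep the returned documents not seen before.
        for doc in search(query, top_n):
            if doc not in seen_docs:
                seen_docs.add(doc)
                sampled.append(doc)
                for term in doc.lower().split():
                    if term not in seen_terms:
                        seen_terms.add(term)
                        vocabulary.append(term)
        # Next query: a random term from the vocabulary learned so far (an assumed selector).
        query = rng.choice(vocabulary)
    return sampled[:max_docs]
```

The sampled documents then stand in for the collection when building its summary.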
Query selector (Callan and Connell, 01)
  Other resource description (ord)
  Learned resource description (lrd)
  • Average tf, random, df, ctf
Query logs (Craswell, 00; Shokouhi et al, 07d)
Focused probing (Ipeirotis and Gravano, 02)
Resource representation (uncooperative environments)
Adaptive sampling
  (Shokouhi et al, 06a): rate of visiting new vocabulary
  (Baillie et al, 06a): rate of sample quality improvement (reference query log)
  (Caverlee et al, 06): proportional document ratio (PD), proportional vocabulary ratio (PV), vocabulary growth (VG)
Resource representation (uncooperative environments)
Improving incomplete samples
  Shrinkage (Ipeirotis, 04; Ipeirotis and Gravano, 04): topically related collections should share similar terms
  Q-pilot (Sugiura and Etzioni, 00): sampled documents + backlinks + front page
Resource representation (uncooperative environments)
Capture-recapture (Liu et al, 01)
Resource representation (collection size estimation)
Sample A (capture)
Sample B (recapture)
http://www.dorlingkindersley-uk.co.uk/static/cs/uk/11/clipart/nature/image_nature040.html
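As a worked illustration of the capture-recapture idea, the sketch below uses the classical two-sample (Lincoln-Petersen) form: if sample A and sample B are drawn independently, the collection size is roughly |A| x |B| / |A ∩ B|. This is an assumption about the general technique, and it may differ from the exact estimator used in the cited work; the function name and the use of document identifiers are hypothetical.

```python
from typing import Iterable, Optional


def capture_recapture_size(sample_a: Iterable[str], sample_b: Iterable[str]) -> Optional[float]:
    """Estimate collection size from two independent samples of document ids."""
    a, b = set(sample_a), set(sample_b)
    overlap = len(a & b)  # documents "recaptured" in the second sample
    if overlap == 0:
        return None  # no recaptures: the estimate is undefined
    return len(a) * len(b) / overlap


# Two samples of 100 documents sharing 50 suggest a collection of about 200 documents.
first = [f"doc{i}" for i in range(0, 100)]
second = [f"doc{i}" for i in range(50, 150)]
estimate = capture_recapture_size(first, second)
```

Samples obtained through a search interface are rarely uniform, which is one reason the samplers listed on the following slides exist.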
Resource representation (collection size estimation)
Multiple queries sampler (Thomas and Hawking, 07)
Random-walk sampler, and pool-based sampler (Bar-Yossef and Gurevich, 06)
Collection overlap estimation (Shokouhi and Zobel, 07)
Resource representation (updating summaries)
(Ipeirotis et al, 05) (Shokouhi et al, 07a)
Resource representation in aggregated search
Vertical content
  samples or access to vertical API
  represents content supply
Vertical query logs
  samples or access to historic vertical searches
  represents content demand
Vertical content includes text
NEWS
Vertical content includes structure
SPORTS
Vertical content includes images
IMAGES
Issues with vertical content
Dynamics
  some verticals become stale fast
Heterogeneous content
  heterogeneous ranking algorithms
Non-free text APIs
  affects query-based sampling
Addressing content dynamics
sample most recently indexed documents (Diaz 09)
assumes users are more likely to be interested in recent content
in practice, only a fraction of the corpus is needed to perform well
(Konig et al, 09)
Addressing heterogeneous content
1. use text available with documents (e.g. captions)
2. manually map to surrogates (e.g. wikipedia pages)
(Arguello et al, 09)
performance of two different methods of dealing with heterogeneous content
Vertical query logs
Queries issued directly to a vertical represent explicit vertical intent
This is similar to having a large body of labeled queries
Issues with vertical query logs
Dynamics
  some verticals require temporally-sensitive sampling
  for example, we do not want to sample news query logs for a whole year
Non-free text APIs
  affects query modeling
Hybrid approaches
Should only sample documents likely to be useful for vertical selection/merging
  e.g. a document which is never requested is not useful for representing a vertical
Suggests log-biased sampling (Shokouhi et al, 06; Arguello et al, 09)
Recap – Resource representation
                               federated search                aggregated search
Representation completeness    low                             low-high
Representation generation      sampling/shared dictionaries    sampling, API
Freshness                      important                       critical
Resource selection: how to select the resource(s) to be searched for relevant documents.
Resource selection for federated search
[Figure: the broker holds summaries (SumA to SumE) of collections A to E and sends the query only to the selected collections.]
“Big-document” bag-of-words summaries
  CORI (Callan et al, 95)
  GlOSS (Gravano et al, 94b)
  CVV (Yuwono and Lee, 97)
Resource selection (lexicon-based methods)
[Figure: the broker samples collections A, B and C.]
Resource selection (lexicon-based methods)
CORI
GlOSS
Sample documents with retained boundaries
  ReDDE (Si and Callan, 03a)
  CRCS (Shokouhi, 07a)
  SUSHI (Thomas and Shokouhi, 09)
Resource selection (document-surrogate methods)
[Figure: the broker samples collections A, B and C.]
Resource selection (document-surrogate methods)
ReDDE
  ReDDE assumes that the top-ranked sampled documents are relevant.
  ReDDE estimates the size of collections by sample-resample
  Assuming that all collections have the same size we have: yellow > blue > red
CRCS is inspired by ReDDE but assigns different probabilities of relevance based on document position: red > yellow, blue
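A minimal sketch of the two document-surrogate ideas above, under stated assumptions. `ranked_sample` is assumed to be the list of collection ids of the sampled documents, in the order the broker's sample index ranks them for the query. The ReDDE-style function counts each top-ranked sampled document as standing for (estimated size / sample size) documents of its collection, stopping once a fixed fraction (`ratio`, an assumed parameter) of the combined collection is covered. The CRCS-style function uses a linearly decaying position weight over the top `gamma` documents. Function names and defaults are hypothetical, and neither function claims to reproduce the published algorithms exactly.

```python
from typing import Dict, List


def redde_like_scores(
    ranked_sample: List[str],
    sample_sizes: Dict[str, int],
    est_sizes: Dict[str, float],
    ratio: float = 0.003,
) -> Dict[str, float]:
    """Every top-ranked sampled document counts equally, scaled by collection size."""
    threshold = ratio * sum(est_sizes.values())
    scores = {c: 0.0 for c in est_sizes}
    covered = 0.0  # documents assumed to rank above the current one in a full central ranking
    for c in ranked_sample:
        if covered >= threshold:
            break
        scale = est_sizes[c] / sample_sizes[c]
        scores[c] += scale
        covered += scale
    return scores


def crcs_like_scores(
    ranked_sample: List[str],
    sample_sizes: Dict[str, int],
    est_sizes: Dict[str, float],
    gamma: int = 50,
) -> Dict[str, float]:
    """Position-dependent weights: higher-ranked sampled documents contribute more."""
    max_size = max(est_sizes.values())
    scores = {c: 0.0 for c in est_sizes}
    for rank, c in enumerate(ranked_sample[:gamma]):
        scores[c] += gamma - rank
    return {c: s * est_sizes[c] / (max_size * sample_sizes[c]) for c, s in scores.items()}
```

With equal collection sizes, the first function reduces to counting how many of the top-ranked sampled documents each collection contributes, while the second rewards collections whose documents appear near the top.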
[Figure: broker, query, ranking.]
SUSHI
Resource selection (document-surrogate methods)
Different regression functions for each collection and query
Scores are comparable (estimated over the same index)
http://www.monthly.se/nucleus/index.php?itemid=1464
Resource selection (supervised methods)
Utility maximization techniques
  Model the search effectiveness
  DTF (Nottelmann and Fuhr, 03), UUM (Si and Callan, 04a), RUM (Si and Callan, 05b)
Classification-based methods
  Classify collections/queries for better selection
  Classification-aware server selection (Ipeirotis and Gravano, 08), classification-based resource selection (Arguello et al, 09a), learning from past queries (Cetintas et al, 09)
Resource selection in aggregated search
Content-based predictors
  derived from (sampled) vertical content
Query string-based predictors
  derived from query text, independent of any resource associated with a vertical
Query log-based predictors
  derived from previous requests issued by users to the vertical portal
Content-based predictors
Distributed information retrieval (DIR) predictors
Simple result set predictors
  numresults, score distributions, etc (Diaz 09; Konig et al, 09)
Complex result set predictors
  Clarity (Cronen-Townsend et al, 02)
  Autocorrelation (Diaz, 07)
  Many, many more (Hauff, 10)
Issues with content-based predictors
DIR (usually) assumes homogeneous content types
performance predictors (usually) assume text corpora
assumes ranking function consistency
  between verticals
  between vertical selector machine and vertical ranker machine
verticals have different dynamics (e.g. news vs. image)
String-based predictors
Dictionary lookups
  terms correlated with a vertical (e.g., movie titles)
Regular expressions
  patterns correlated with explicit vertical requests (e.g., obama news)
Named entities
  automatically-detected entity types (e.g., geographic entities)
String-based predictors
Issues
  curating lists and expressions (manual or automatic)
  terms included in dictionary manually vetted for relevance
  high precision/low recall
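Two of the string-based predictors above can be sketched as simple query features. The dictionary contents, the regular expression and the feature names are hypothetical placeholders chosen for illustration; a real system would curate them as discussed under "Issues".

```python
import re
from typing import Dict, Set

# Hypothetical, hand-curated dictionary of terms correlated with a movie vertical.
MOVIE_TITLES: Set[str] = {"avatar", "inception"}

# Hypothetical pattern for explicit news requests such as "obama news".
NEWS_PATTERN = re.compile(r"\b(news|headlines)\b", re.IGNORECASE)


def string_features(query: str) -> Dict[str, bool]:
    """Compute string-based predictors from the query text alone."""
    q = query.lower().strip()
    return {
        "movie_dictionary": q in MOVIE_TITLES,                  # dictionary lookup
        "explicit_news_request": bool(NEWS_PATTERN.search(q)),  # regular expression
    }
```

For example, `string_features("obama news")` fires the news pattern, while `string_features("newsletter")` does not because of the word boundaries. The exact-match dictionary lookup illustrates the high precision/low recall trade-off: "inception trailer" would not match.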
Log-based predictors
Classification approaches (Beitzel et al 07; Li et al, 08)
Language model approaches (Arguello et al, 09)
Issues
  verticals with structured queries (e.g. local)
  query logs with dynamics (e.g. news)
(Diaz, 09)
Comparing predictor performance
(Arguello et al, 09)
Predictor cost
Pre-retrieval predictors
  computed without sending the query to the vertical
  no network cost
Post-retrieval predictors
  computed on the results from the vertical
  requires vertical support of web scale query traffic
  incurs network latency
  can be mitigated with vertical content caches
Combining predictors
Use predictors as features for a machine-learned model
Training data
1. editorial data
2. behavioral data (e.g. clicks)
3. other vertical data
(Diaz, 09; Arguello etal, 09; Konig etal, 09)
Editorial data
Data: <query,vertical,{+,-}>
Features: predictors based on f(query,vertical)
Models:
  log-linear (Arguello et al, 09)
  boosted decision trees (Arguello et al, 10)
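To make the "predictors as features" setup concrete, here is a minimal sketch of a log-linear (logistic regression) model trained on editorial labels with plain stochastic gradient ascent. The feature vectors, labels, learning rate and epoch count are illustrative assumptions, not the configuration of the cited papers.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_log_linear(
    features: Sequence[Sequence[float]],  # one predictor vector f(query, vertical) per example
    labels: Sequence[int],                # 1 = vertical relevant (+), 0 = not relevant (-)
    epochs: int = 200,
    lr: float = 0.5,
) -> Tuple[List[float], float]:
    """Fit weights and bias by stochastic gradient ascent on the log-likelihood."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            err = y - p
            bias += lr * err
            weights = [w + lr * err * xi for w, xi in zip(weights, x)]
    return weights, bias


def predict(weights: Sequence[float], bias: float, x: Sequence[float]) -> float:
    """Probability that the vertical is relevant for the query with predictors x."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


# Toy data: the first predictor is informative, the second is noise.
X = [[1.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
y = [1, 1, 0, 0]
w, b = train_log_linear(X, y)
```

The same interface works for click data by replacing the editorial labels with click/skip outcomes.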
Combining predictors
(Arguello etal, 09)
Click data
Data: <query,vertical,{click,skip}>, <query,vertical,click through rate>
Features: predictors based on f(query,vertical)
Models:
  log-linear (Diaz, 09)
  boosted decision trees (Konig et al, 09)
Gathering click data
Exploration bucket: show suboptimal presentations in order to gather positive (and negative) click/skip data
Cold start problem: without a basic model, the best exploration is random
Random exploration results in poor user experience
Gathering click data
Solutions
  reduce impact to small fraction of traffic/users
  train a basic high-precision non-click model (perhaps with editorial data)
Other issues
  Presentation bias: different verticals have different click-through rates a priori
  Position bias: different presentation positions have different click-through rates a priori
Click precision and recall
(Konig etal, 09)
ability to predict queries using thresholded click-through rate to infer relevance
Non-target data
have training data no data
Non-target data
Data: <query,source vertical,{+,-}>
Features: predictors based on f(query,target vertical)
Models:
  generic model+adaptation
(Arguello etal, 10)
Non-target data
(Arguello etal, 10)
Generic model
Objective
  train a single model that performs well for all source verticals
Assumption
  if it performs well across all source verticals, it will perform well on the target vertical
(Arguello etal, 10)
Non-target data
(Arguello etal, 10)
adapted model
Adapted model
Objective
  learn non-generic relationship between features and the target vertical
Assumption
  can bootstrap from labels generated by the generic model
(Arguello etal, 10)
Non-target query classification
(Arguello etal, 10)
average precision on target query classification; red (blue) indicates statistically significant improvements (degradations) compared to the single predictor
Training set characteristics
What is the cost of generating training data?
  how much money?
  how much time?
  how many negative impressions as a result of exploration?
Are targets normalized?
  can we compare classifier output?
Training set cost summary
Online adaptation
Production vertical selection systems receive a variety of feedback signals
  clicks, skips
  reformulations
A machine-learned system can adjust predictions based on real time user feedback
  very important for dynamic verticals
(Diaz, 09; Diaz and Arguello, 09)
Online adaptation
Passive feedback: adjust prediction/parameters in response to feedback
  allows recovery from false positives
  difficult to recover from false negatives
Active feedback/explore-exploit: opportunistically present suboptimal verticals for feedback
  allows recovery from both errors
  incurs exploration cost
(Diaz, 09; Diaz and Arguello, 09)
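The difference between passive and active feedback can be illustrated with a small sketch. It is not the method of the cited papers: the class, the additive per-query correction, the learning rate and the epsilon-greedy exploration rule are all assumptions chosen for brevity. With `epsilon=0` the selector is purely passive and only receives feedback for verticals it decides to show; with `epsilon>0` it occasionally shows the vertical anyway.

```python
import random
from collections import defaultdict
from typing import Callable, DefaultDict, Optional


class OnlineVerticalSelector:
    """Adjusts an offline vertical-relevance score with click/skip feedback (illustrative only)."""

    def __init__(
        self,
        offline_score: Callable[[str], float],  # hypothetical offline model: query -> score in [0, 1]
        threshold: float = 0.5,
        learning_rate: float = 0.2,
        epsilon: float = 0.0,                   # probability of exploring (active feedback)
        rng: Optional[random.Random] = None,
    ) -> None:
        self.offline_score = offline_score
        self.threshold = threshold
        self.learning_rate = learning_rate
        self.epsilon = epsilon
        self.rng = rng or random.Random(0)
        self.correction: DefaultDict[str, float] = defaultdict(float)

    def score(self, query: str) -> float:
        return self.offline_score(query) + self.correction[query]

    def should_show(self, query: str) -> bool:
        if self.rng.random() < self.epsilon:
            return True  # explore: show the vertical regardless of the current score
        return self.score(query) >= self.threshold

    def feedback(self, query: str, clicked: bool) -> None:
        """Move the query's score towards 1 on a click and towards 0 on a skip."""
        target = 1.0 if clicked else 0.0
        self.correction[query] += self.learning_rate * (target - self.score(query))
```

A passive selector that starts with a false positive stops showing the vertical after a few skips. A false negative never generates feedback unless exploration is enabled, which is the exploration cost mentioned above.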
Online adaptation
Issues
  setting learning rate for dynamic intent verticals
  normalizing feedback signal across verticals
  resolving feedback and training signal (click≠relevance)
(Diaz, 09; Diaz and Arguello, 09)
Recap – Resource selection
Result presentation: how to return results retrieved from several resources to users.
Result merging (metasearch engines)
Same source (web), different overlapped indexes
Document scores may not be available
Title, snippet, position and timestamps
  D-WISE (Yuwono and Lee, 96)
  Inquirus (Glover et al., 99)
  SavvySearch (Dreilinger and Howe, 1997)
Result merging (data fusion)
Same corpus
Different retrieval models
Document scores/positions available
Unsupervised techniques
  CombSUM, CombMNZ (Fox and Shaw, 93, 94)
  Borda fuse (Aslam and Montague, 01)
Supervised techniques
  Bayes-fuse, weighted Borda fuse (Aslam and Montague, 01)
  Segment-based fusion (Lillis et al 06, 08; Shokouhi 07b)
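The two unsupervised score-combination rules named above can be sketched as follows. Each run is assumed to be a mapping from document id to retrieval score. Min-max normalization per run is an assumed preprocessing step, and CombMNZ is implemented here as the summed score multiplied by the number of runs that retrieved the document, a common reading of the rule.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List

Run = Dict[str, float]  # document id -> retrieval score from one system


def min_max_normalize(run: Run) -> Run:
    """Rescale one run's scores to [0, 1] so that runs are comparable."""
    if not run:
        return {}
    lo, hi = min(run.values()), max(run.values())
    if hi == lo:
        return {doc: 1.0 for doc in run}
    return {doc: (score - lo) / (hi - lo) for doc, score in run.items()}


def comb_sum(runs: List[Run]) -> Run:
    """CombSUM: sum of a document's normalized scores across runs."""
    fused: DefaultDict[str, float] = defaultdict(float)
    for run in runs:
        for doc, score in min_max_normalize(run).items():
            fused[doc] += score
    return dict(fused)


def comb_mnz(runs: List[Run]) -> Run:
    """CombMNZ: CombSUM multiplied by the number of runs that retrieved the document."""
    sums = comb_sum(runs)
    hits: DefaultDict[str, int] = defaultdict(int)
    for run in runs:
        for doc in run:
            hits[doc] += 1
    return {doc: sums[doc] * hits[doc] for doc in sums}


run_1 = {"a": 3.0, "b": 2.0, "c": 1.0}
run_2 = {"b": 10.0, "d": 0.0}
fused_sum = comb_sum([run_1, run_2])
fused_mnz = comb_mnz([run_1, run_2])
```

In this toy example document "b" is retrieved by both runs, so CombMNZ boosts it further relative to "a" than CombSUM does.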
Result merging in federated search
[Figure: the user's query goes to the broker, which holds summaries (SumA to SumE) of collections A to E, forwards the query to the selected collections and returns merged results.]
Result merging
CORI (Callan et al, 95)
  Normalized collection score + normalized document score.
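Result merging
SSL (Si and Callan, 2003b)
[Figure: the broker's ranking of sampled documents (A to H) for the query, next to the results returned by the selected resources (L, R, D, F).]
[Figure: linear regression, with source-specific score on the x-axis and broker score on the y-axis.]
http://upload.wikimedia.org/wikipedia/en/1/13/Linear_regression.png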
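The regression idea in the figure can be sketched as follows: for each selected source, documents that appear both in the source's result list and in the broker's sample index give (source-specific score, broker score) pairs; a least-squares line fitted to these pairs maps every source score onto the broker's scale, after which results can be merged in one list. The data layout, function names and the rule of skipping sources with fewer than two overlapping documents are assumptions of this sketch.

```python
from typing import Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        raise ValueError("need at least two distinct source scores")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
    return slope, mean_y - slope * mean_x


def regression_merge(
    source_results: Dict[str, Dict[str, float]],  # source -> {doc id: source-specific score}
    broker_scores: Dict[str, float],              # doc id -> score in the broker's sample index
) -> List[Tuple[str, float]]:
    """Map each source's scores to the broker's scale and merge into one ranking."""
    merged: Dict[str, float] = {}
    for results in source_results.values():
        overlap = [doc for doc in results if doc in broker_scores]
        if len(overlap) < 2:
            continue  # this sketch simply skips sources without enough overlap
        slope, intercept = fit_line(
            [results[doc] for doc in overlap],
            [broker_scores[doc] for doc in overlap],
        )
        for doc, score in results.items():
            estimate = slope * score + intercept
            merged[doc] = max(estimate, merged.get(doc, float("-inf")))
    return sorted(merged.items(), key=lambda item: item[1], reverse=True)


results = {
    "A": {"a1": 10.0, "a2": 20.0, "a3": 30.0},
    "B": {"b1": 1.0, "b2": 2.0, "b3": 3.0},
}
broker = {"a1": 0.1, "a3": 0.3, "b1": 0.25, "b2": 0.5}
ranking = regression_merge(results, broker)
```

Here source A scores on a scale of tens and source B on a scale of units, yet both end up comparable after mapping.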
Result merging - Miscellaneous scenarios
Multi-lingual result merging
  SSL with logistic regression (Si and Callan, 05a; Si et al, 08)
Personalized metasearch (Thomas, 08)
Merging overlapped collections
  COSCO (Hernandez and Kambhampati 05): exact duplicates
  GHV (Bernstein et al, 06; Shokouhi et al, 07b): exact/near duplicates
Images on top, images in the middle, images at the bottom
Images at top-right, images on the left, images at the bottom-right
Slotted vs tiled result presentation
3 verticals
3 positions
3 degrees of vertical intent (Sushmita et al, 10)
Designers of aggregated search interfaces should account for the aggregation styles
  for both, vertical intent is key for deciding on position and type of “vertical” results
  slotted: accurate estimation of the best position of “vertical” result
  tiled: accurate selection of the type of “vertical” result
Slotted vs tiled
Recap – Result presentation
                   federated search                aggregated search
Content type       homogeneous (text documents)    heterogeneous
Document scores    depends on environment          heterogeneous
Oracle             centralized index               none
Evaluation
Evaluation: how to measure the effectiveness of federated and aggregated search systems.
Resource representation (summaries) evaluation – Federated search
CTF ratio (Callan and Connell, 01)
Spearman rank correlation coefficient (SRCC) (Callan and Connell, 01)
Kullback-Leibler divergence (KL) (Baillie et al, 06b; Ipeirotis et al, 2005), topical KL (Baillie et al, 09)
Predictive likelihood (Baillie et al, 06a)
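As an illustration of the first metric, the CTF ratio can be read as the fraction of term occurrences in the actual collection that are covered by the vocabulary of the learned (sampled) summary. The sketch below follows that reading; the argument names and the toy counts are assumptions.

```python
from typing import Dict, Iterable


def ctf_ratio(learned_vocabulary: Iterable[str], actual_ctf: Dict[str, int]) -> float:
    """Share of the collection's term occurrences covered by the learned vocabulary."""
    total = sum(actual_ctf.values())
    if total == 0:
        return 0.0
    covered = sum(actual_ctf.get(term, 0) for term in set(learned_vocabulary))
    return covered / total


# Toy collection term frequencies (ctf) and a vocabulary learned from a sample.
actual = {"the": 50, "search": 30, "vertical": 15, "napster": 5}
learned = ["the", "search", "unseen"]
coverage = ctf_ratio(learned, actual)
```

Because frequent terms dominate the ratio, a small sample can score high even when many rare terms are missing, which is one motivation for the rank- and distribution-based measures listed alongside it.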
Resource selection evaluation – Federated search
Result merging evaluation – Federated search
Oracle
  Correct merging (centralized index ranking) (Hawking and Thistlewaite, 99)
  Perfect merging (ordered by relevance labels) (Hawking and Thistlewaite, 99)
Metrics
  Precision
  Correct matches (Chakravarthy and Haase, 95)
Vertical Selection Evaluation – Aggregated search
Majority of publications focus on single vertical selection
  vertical accuracy, precision, recall
Evaluation data
  editorial data
  behavioral data
single vertical selection
Editorial data
Guidelines
  judge relevance based on vertical results (implicit judging of retrieval/content quality)
  judge relevance based on vertical description (assumes idealized retrieval/content quality)
Evaluation metric derived from binary or graded relevance judgments
(Arguello etal, 09; Arguello et al, 10)
Behavioral data
Infer relevance from behavioral data (e.g. click data)
Evaluation metric
  regression error on predicted CTR
  infer binary or graded relevance
(Diaz, 09; Konig etal, 09)
Test collections (a la TREC)
quantity/media         text          image      video     total
size (G)               2125          41.1       445.5     2611.6
number of documents    86,186,315    670,439    1,253*    86,858,007
* There are on average more than 100 events/shots contained in each video clip (document)
Statistics on Topics
number of topics                                       150
average rel docs per topic                             110.3
average rel verticals per topic                        1.75
ratio of “General Web” topics                          29.3%
ratio of topics with two vertical intents              66.7%
ratio of topics with more than two vertical intents    4.0%
(Zhou & Lalmas, 10)
Test collections (a la TREC)
[Figure: existing test collections (ImageCLEF photo retrieval track, TREC web track, INEX ad-hoc track, TREC blog track, ...), each with topics, documents and relevance judgments (R/N), are used as (simulated) verticals (blog, reference (encyclopedia), image, general web, shopping, ...); for each topic, every vertical V1 ... Vk has its own documents and judgments.]
Recap – Evaluation
                   federated search                 aggregated search
Editorial data     document relevance judgments     query labels
Behavioral data    none                             critical
Open problems in federated search
Beyond big document
  Classification-based server selection (Arguello et al, 09a)
  Topic modeling
Query expansion
  Previous techniques had little success (Ogilvie and Callan, 01; Shokouhi et al, 09)
Evaluating federated search
  Confounding factors
Federated search in other contexts
  Blog search (Elsas et al, 08; Seo and Croft, 08)
Effective merging
  Supervised techniques
Open problems in aggregated search
Evaluation metrics
  slotted presentation
  tiled presentation
  metrics based on behavioral signals
Models for multiple verticals
Minimizing the cost for new verticals, markets
Bibliography
J. Arguello, F. Diaz, J. Callan, and J.-F. Crespo, Sources of evidence for vertical selection. In SIGIR 2009 (2009).
J. Arguello, J. Callan, and F. Diaz. Classification-based resource selection. In Proceedings of the ACM CIKM, Pages 1277--1286, Hong Kong, China, 2009a.
J. Arguello, F. Diaz, J.-F. Paiement, Vertical Selection in the Presence of Unlabeled Verticals. In SIGIR 2010 (2010).
J. Aslam and M. Montague. Models for metasearch. In Proceedings of ACM SIGIR, pages 276--284, New Orleans, LA, 2001.
M. Baillie, L. Azzopardi, and F. Crestani. Adaptive query-based sampling of distributed collections, In Proceedings of SPIRE, Pages 316--328, Glasgow, UK, 2006a.
M. Baillie, L. Azzopardi, and F. Crestani. Towards better measures: evaluation of estimated resource description quality for distributed IR. In X. Jia, editor, Proceedings of the First International Conference on Scalable Information systems, page 41, Hong Kong, 2006b.
M. Baillie, M. Carman, and F. Crestani. A topic-based measure of resource description quality for distributed information retrieval. In Proceedings of ECIR, pages 485--496, Toulouse, France, 2009.
Z. Bar-Yossef and M. Gurevich. Random sampling from a search engine's index. Proceedings of WWW, pages 367--376, Edinburgh, UK, 2006.
S. M. Beitzel, E. C. Jensen, D. D. Lewis, A. Chowdhury, and O. Frieder. Automatic classification of web queries using very large unlabeled query logs. ACM Trans. Inf. Syst. 25, 2 (2007), 9.
Y. Bernstein, M. Shokouhi, and J. Zobel. Compact features for detection of near-duplicates in distributed retrieval. Proceedings of SPIRE, Pages 110--121, Glasgow, UK, 2006.
J. Callan and M. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems, 19(2):97--130, 2001.
J. Callan, Z. Lu, and B. Croft. Searching distributed collections with inference networks. In Proceedings of ACM SIGIR, pages 21--28. Seattle, WA, 1995
J. Caverlee, L. Liu, and J. Bae. Distributed query sampling: a quality-conscious approach. In Proceedings of ACM SIGIR, pages 340--347. Seattle, WA, 2006.
S. Cetintas, L. Si, and H. Yuan. Learning from past queries for resource selection. In Proceedings of ACM CIKM, pages 1867--1870, Hong Kong, China, 2009.
B.T. Bartell, G.W. Cottrell, and R.K. Belew. Automatic Combination of Multiple Ranked Retrieval Systems, ACM SIGIR, pp 173-181, 1994.
C. Baumgarten. A Probabilistic Solution to the Selection and Fusion Problem in Distributed Information Retrieval, ACM SIGIR, pp 246-253, 1999.
N. Craswell. Methods for Distributed Information Retrieval. PhD thesis, Australian National University, 2000.
S. Cronen-Townsend, Y. Zhou, and W. B. Croft. Predicting query performance. ACM SIGIR, pp 299–306, 2002.
A. Chakravarthy and K. Haase. NetSerf: using semantic knowledge to find internet information archives, ACM SIGIR, pp 4-11, Seattle, WA, 1995.
F. Diaz. Performance prediction using spatial autocorrelation. ACM SIGIR, pp. 583–590, 2007.
F. Diaz. Integration of news content into web results. ACM International Conference on Web Search and Data Mining, 2009.
F. Diaz and J. Arguello. Adaptation of offline vertical selection predictions in the presence of user feedback, ACM SIGIR, 2009.
D. Dreilinger and A. Howe. Experiences with selecting search engines using metasearch. ACM Transaction on Information Systems, 15(3):195-222, 1997.
J. Elsas, J. Arguello, J. Callan, and J. Carbonell. Retrieval and feedback models for blog feed search, ACM SIGIR, pp 347-354, Singapore, 2008.
E. Glover, S. Lawrence, W. Birmingham, and C. Giles. Architecture of a metasearch engine that supports user information needs, ACM CIKM, pp 210--216, 1999.
L. Gravano, H. García-Molina, and A. Tomasic. Precision and recall of GlOSS estimators for database discovery. Third International conference on Parallel and Distributed Information Systems, pp 103--106, Austin, TX, 1994a.
L. Gravano, H. García-Molina, and A. Tomasic. The effectiveness of GlOSS for the text database discovery problem. ACM SIGMOD, pp 126--137, Minneapolis, MN, 1994b.
L. Gravano, C. Chang, H. García-Molina, and A. Paepcke. STARTS:Stanford proposal for internet metasearching, ACM SIGMOD, pp 207--218, Tucson, AZ, 1997.
L. Gravano, H. García-Molina, and A. Tomasic. GlOSS: text-source discovery over the internet, ACM Transactions on Database Systems, 24(2):229--264, 1999.
E. Fox and J. Shaw. Combination of multiple searches. Second Text REtrieval Conference, pp 243-252, Gaithersburg, MD, 1993.
E. Fox and J. Shaw. Combination of multiple searches, Third Text REtrieval Conference, pp 105-108, Gaithersburg, MD, 1994.
J. French, and A. Powell. Metrics for evaluating database selection techniques, World Wide Web, 3(3):153--163, 2000.
C. Hauff. Predicting the Effectiveness of Queries and Retrieval Systems, PhD thesis, University of Twente, 2010.
D. Hawking and P. Thomas. Server selection methods in hybrid portal search, ACM SIGIR, pp 75-82, Salvador, Brazil, 2005.
D. Hawking and P. Thistlewaite. Methods for information server selection, ACM Transactions on Information Systems, 17(1):40-76, 1999.
T. Hernandez and S. Kambhampati. Improving text collection selection with coverage and overlap statistics. WWW, pp 1128-1129, Chiba, Japan, 2005.
P. Ipeirotis and L. Gravano. When one sample is not enough: improving text database selection using shrinkage. ACM SIGMOD, pp 767-778, Paris, France, 2004.
P. Ipeirotis and L. Gravano. Distributed search over the hidden web: Hierarchical database sampling and selection. VLDB, pages 394-405, Hong Kong, China, 2002.
P. Ipeirotis and L. Gravano. Classification-aware hidden-web text database selection. ACM Transactions on Information Systems, 26(2):1-66, 2008.
P. Ipeirotis, A. Ntoulas, J. Cho, and L. Gravano. Modeling and managing content changes in text databases, 21st International Conference on Data Engineering, pp 606-617, Tokyo, Japan, 2005.
A. C. König, M. Gamon, and Q. Wu. Click-through prediction for news queries, ACM SIGIR, 2009.
X. Li, Y.-Y. Wang, and A. Acero. Learning query intent from regularized click graphs, ACM SIGIR, pp. 339–346, 2008.
D. Lillis, F. Toolan, R. Collier, and J. Dunnion. ProbFuse: a probabilistic approach to data fusion, ACM SIGIR, pp 139-146, Seattle, WA, 2006.
K. Liu, C. Yu, and W. Meng. Discovering the representative of a search engine. ACM CIKM, pp 652-654, McLean, VA, 2002.
N. Liu, J. Yan, W. Fan, Q. Yang, and Z. Chen. Identifying Vertical Search Intention of Query through Social Tagging Propagation, WWW, Madrid, 2009.
W. Meng, Z. Wu, C. Yu, and Z. Li. A highly scalable and effective method for metasearch, ACM Transactions on Information Systems, 19(3):310-335, 2001.
W. Meng, C. Yu, and K. Liu. Building efficient and effective metasearch engines. ACM Computing Surveys, 34(1):48-89, 2002.
V. Murdock and M. Lalmas. Workshop on aggregated search, SIGIR Forum 42(2): 80-83, 2008.
H. Nottelmann and N. Fuhr. Combining CORI and the decision-theoretic approach for advanced resource selection, ECIR, pp 138--153, Sunderland, UK, 2004.
P. Ogilvie and J. Callan. The effectiveness of query expansion for distributed information retrieval, ACM CIKM, pp 183--190, Atlanta, GA, 2001.
C. Paris, S. Wan and P. Thomas. Focused and aggregated search: a perspective from natural language generation, Journal of Information Retrieval, Special Issue, 2010.
S. Park. Analysis of characteristics and trends of Web queries submitted to NAVER, a major Korean search engine, Library & Information Science Research 31(2): 126-133, 2009.
F. Schumacher and R. Eschmeyer. The estimation of fish populations in lakes and ponds, Journal of the Tennessee Academy of Science, 18:228-249, 1943.
M. Shokouhi. Central-rank-based collection selection in uncooperative distributed information retrieval, ECIR, pp 160-172, Rome, Italy, 2007a.
J. Seo and B. Croft. Blog site search using resource selection, ACM CIKM, pp 1053-1062, Napa Valley, CA, 2008.
M. Shokouhi. Segmentation of search engine results for effective data-fusion, ECIR, pp 185-197, Rome, Italy, 2007b.
M. Shokouhi and J. Zobel. Robust result merging using sample-based score estimates, ACM Transactions on Information Systems, 27(3):1-29, 2009.
M. Shokouhi and J. Zobel. Federated text retrieval from uncooperative overlapped collections, ACM SIGIR, pp 495-502, Amsterdam, Netherlands, 2007.
M. Shokouhi, F. Scholer, and J. Zobel. Sample sizes for query probing in uncooperative distributed information retrieval, Eighth Asia Pacific Web Conference, pp 63--75, Harbin, China, 2006a.
M. Shokouhi, J. Zobel, F. Scholer, and S. Tahaghoghi. Capturing collection size for distributed non-cooperative retrieval, ACM SIGIR, pp 316-323, Seattle, WA, 2006b.
M. Shokouhi, J. Zobel, S. Tahaghoghi, and F. Scholer. Using query logs to establish vocabularies in distributed information retrieval, Information Processing and Management, 43(1):169-180, 2007d.
M. Shokouhi, P. Thomas, and L. Azzopardi. Effective query expansion for federated search, ACM SIGIR, pp 427-434, Singapore, 2009.
L. Si and J. Callan. Unified utility maximization framework for resource selection, ACM CIKM, pp 32-41, Washington, DC, 2004a.
L. Si and J. Callan. CLEF2005: multilingual retrieval by combining multiple multilingual ranked lists. Sixth Workshop of the Cross-Language Evaluation Forum, Vienna, Austria, 2005a. http://www.cs.purdue.edu/homes/lsi/publications.htm
L. Si, J. Callan, S. Cetintas, and H. Yuan. An effective and efficient results merging strategy for multilingual information retrieval in federated search environments, Information Retrieval, 11(1):1--24, 2008.
L. Si and J. Callan. Relevant document distribution estimation method for resource selection, ACM SIGIR, pp 298-305, Toronto, Canada, 2003a.
L. Si and J. Callan. Modeling search engine effectiveness for federated search, ACM SIGIR, pp 83-90, Salvador, Brazil, 2005b.
L. Si and J. Callan. A semisupervised learning method to merge search engine results, ACM Transactions on Information Systems, 21(4):457-491, 2003b.
A. Sugiura and O. Etzioni. Query routing for web search engines: architectures and experiments, WWW, pp 417-429, Amsterdam, Netherlands, 2000.
S. Sushmita, H. Joho and M. Lalmas. A Task-Based Evaluation of an Aggregated Search Interface, SPIRE, Saariselkä, Finland, 2009.
S. Sushmita, H. Joho, M. Lalmas, and R. Villa. Factors Affecting Click-Through Behavior in Aggregated Search Interfaces, ACM CIKM, Toronto, Canada, 2010.
S. Sushmita, B. Piwowarski, and M. Lalmas. Dynamics of Genre and Domain Intents, Technical Report, University of Glasgow, 2010.
S. Sushmita, H. Joho, M. Lalmas and J.M. Jose. Understanding domain "relevance" in web search, WWW 2009 Workshop on Web Search Result Summarization and Presentation, Madrid, Spain, 2009.
P. Thomas and D. Hawking. Evaluating sampling methods for uncooperative collections, ACM SIGIR, pp 503-510, Amsterdam, Netherlands, 2007.
P. Thomas. Server characterisation and selection for personal metasearch, PhD thesis, Australian National University, 2008.
P. Thomas and M. Shokouhi. SUSHI: scoring scaled samples for server selection, ACM SIGIR, pp 419-426, Singapore, 2009.
A. Trotman, S. Geva, J. Kamps, M. Lalmas and V. Murdock (eds). Current research in focused retrieval and result aggregation, Special Issue in the Journal of Information Retrieval, Springer, 2010.
T. Tsikrika and M. Lalmas. Merging Techniques for Performing Data Fusion on the Web, ACM CIKM, pp 181-189, Atlanta, Georgia, 2001.
E. Voorhees, N. Gupta, and B. Johnson-Laird. Learning collection fusion strategies, ACM SIGIR, pp 172-179, 1995.
B. Yuwono and D. Lee. WISE: A world wide web resource database system. IEEE Transactions on Knowledge and Data Engineering, 8(4):548--554, 1996.
B. Yuwono and D. Lee. Server ranking for distributed text retrieval systems on the internet. Fifth International Conference on Database Systems for Advanced Applications, 6, pp 41-50, Melbourne, Australia, 1997.
J. Xu and J. Callan. Effective retrieval with distributed collections, ACM SIGIR, pp 112-120, Melbourne, Australia, 1998.
A. Zhou and M. Lalmas. Building a Test Collection for Aggregated Search, Technical Report, University of Glasgow, 2010.
J. Zobel. Collection selection via lexicon inspection, Australian Document Computing Symposium, pp 74--80, Melbourne, Australia, 1997.