Resource Selection in Distributed Information Retrieval – an Experimental Study
Hans Friedrich Witschel
(formerly) University of Leipzig
(now) SAP Research CEC Karlsruhe
H.F. Witschel, Global and Local Resources for
P2PIR
Overview
1. Motivation
2. Problem definition
3. Solutions to be explored
4. Experimental setup
5. Results
6. Conclusions
Motivation
Resource selection
treatment surgery radiation oncology diagnostic bone marrow urology
statics building project landscape anti-seismic design cubature
client server servent p2p terms algorithm ranking
combine harvester cattle crops tractor agricultural acres
Whom could I ask about "information retrieval"?
Resource selection
Reason for selecting only a subset of all available resources/peers: cost reduction
- Distributed IR (DIR): time and load on databases
- Peer-to-peer IR (P2PIR): number of messages; we will concentrate on P2PIR here
Basic approach: treat peers/resources as giant documents, use existing (slightly modified) retrieval functions to rank them, and visit the top-ranked ones
Problem definition
Assumptions
Peers have profiles = lists of terms with weights (unigram language models)
Two options:
- Represent peers by what they have → extract terms from a peer's shared documents
- Represent peers by queries for which they provide relevant documents
Profiles have to be compact in order to reduce communication overhead
- Absolute size of profiles is dictated by available (network) resources
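A minimal sketch of the first option (representing a peer by what it has), assuming whitespace tokenisation and a maximum-likelihood unigram estimate. The function name `build_profile` and the tokenisation are illustrative assumptions, not taken from the slides:

```python
from collections import Counter


def build_profile(documents):
    """Build a unigram profile {term: P(t|p)} from a peer's shared documents.

    Assumption: lower-cased whitespace tokens and maximum-likelihood
    estimates; the slides do not specify either.
    """
    counts = Counter(tok for doc in documents for tok in doc.lower().split())
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: count / total for term, count in counts.items()}


profile = build_profile(["peer to peer", "peer ranking"])
```

Such a profile is a plain term-to-weight mapping, so it can be truncated to whatever size the available network resources allow.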
Research questions
How much will profile pruning degrade the quality of resource selection? That is, how many terms can we prune from a profile and still have acceptable results?
What can be done to improve peer selection?
- Improve queries → query expansion?
- Improve profiles → profile adaptation?
Solutions to be explored
Preliminaries
Profiles: use CORI for weighting terms t in the collection of peer p, rank by P(t|p)
- Compression: apply simple thresholding
- Profile sizes: 10, 20, 40, 80, 160, 320, 640, unpruned
Global term weights (I component of CORI)
- Use an external reference corpus for estimating idf values
Local retrieval function at each peer: BM25
- Uses the same idf estimations as above
=> document scores are comparable across all peers
=> can concentrate on the resource selection process; results are not blurred by result merging effects
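The compression step can be sketched as follows, assuming that "simple thresholding" means keeping the highest-weighted terms up to a fixed profile size. The function name `prune_profile` and the alphabetical tie-breaking are illustrative assumptions:

```python
def prune_profile(profile, size=None):
    """Keep only the `size` highest-weighted terms of a profile.

    `profile` maps term -> weight. `size=None` stands for the unpruned
    profile. Ties are broken alphabetically (an assumption, made only
    to keep the result deterministic).
    """
    if size is None:
        return dict(profile)
    top = sorted(profile.items(), key=lambda kv: (-kv[1], kv[0]))[:size]
    return dict(top)


# Profile sizes used in the experiments, plus the unpruned case.
PROFILE_SIZES = [10, 20, 40, 80, 160, 320, 640, None]
```

Pruning to a size larger than the profile simply returns the full profile, so small peers are unaffected by the larger thresholds.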
Baselines
Random: Rank peers in random order
By-size: rank peers by the number of documents they hold, independent of offered content
Base CORI: rank peers by the sum of CORI weights of terms contained in both the query and the peer's profile
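The three baselines can be sketched in a few lines, assuming profiles are stored as term-to-weight dictionaries that already hold CORI weights. All function names are illustrative, and ties are broken by peer name only for determinism:

```python
import random


def rank_random(peers, seed=None):
    """Random baseline: peers in random order."""
    order = list(peers)
    random.Random(seed).shuffle(order)
    return order


def rank_by_size(sizes):
    """By-size baseline: `sizes` maps peer -> number of documents held."""
    return sorted(sizes, key=lambda p: (-sizes[p], p))


def rank_base_cori(query_terms, profiles):
    """Base CORI: score = sum of profile weights of the query terms
    that also occur in the peer's profile."""
    terms = set(query_terms)

    def score(peer):
        return sum(profiles[peer].get(t, 0.0) for t in terms)

    return sorted(profiles, key=lambda p: (-score(p), p))


profiles = {
    "p1": {"retrieval": 0.8, "peer": 0.2},
    "p2": {"tractor": 0.9},
    "p3": {"retrieval": 0.3, "information": 0.4},
}
ranking = rank_base_cori(["information", "retrieval"], profiles)
```

Note that a pruned profile can only lower a peer's Base CORI score, because a pruned term contributes nothing to the sum.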
Query expansion
All methods use Local Context Analysis. Input passages are taken from:
- The web: top 10 result snippets returned by the Yahoo! API for the query
- Local documents: best 10 documents returned by the highest-ranked peer (local pseudo feedback)
For comparison ("upper QE baseline"): use a global view of the collection (global pseudo feedback)
Profile adaptation
Idea: boost the weight of term t in peer p's profile if p has successfully answered a query containing t
Aim: the profile allows the peer to answer popular queries for which it has many relevant documents
Can be done using a query log
Extensions: collaborative tagging approach, allow user interaction etc. (hard to evaluate)
Profile adaptation
Update formula for term i in profile of peer p
Dp = documents returned by p
Do = documents returned by all peers contacted
AVGRP = average relative precision (RP) over all peers the query has reached
Update is only executed if ratio > 1, i.e. if p's results are "better" than the average
For evaluation purposes: split a query log into training and test set, use the training set for updating profiles
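The gating condition can be illustrated with a small sketch. The concrete update rule is not given in the text of this slide, so the additive boost below is a purely hypothetical stand-in, as are the definition of `ratio` as the peer's RP divided by AVGRP, the `rate` parameter, and the function name `adapt_profile`:

```python
def adapt_profile(profile, query_terms, rp_peer, avg_rp, rate=0.1):
    """Boost query terms in a peer's profile if the peer beat the average.

    Assumptions (not from the slides): ratio = rp_peer / avg_rp, and the
    boost is additive and proportional to (ratio - 1).
    """
    if avg_rp <= 0:
        return dict(profile)
    ratio = rp_peer / avg_rp
    if ratio <= 1.0:
        # Peer was not better than average: leave the profile unchanged.
        return dict(profile)
    updated = dict(profile)
    for term in set(query_terms):
        updated[term] = updated.get(term, 0.0) + rate * (ratio - 1.0)
    return updated


unchanged = adapt_profile({"ir": 0.5}, ["ir"], rp_peer=0.2, avg_rp=0.4)
boosted = adapt_profile({"ir": 0.5}, ["ir", "p2p"], rp_peer=0.6, avg_rp=0.3)
```

In an evaluation along the lines described above, such a function would be applied once per training-set query, and the resulting profiles would then be used to rank peers for the test-set queries.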
Experimental Setup
Simplifying it…
Evaluate distributed IR only, instead of running a full P2PIR simulation
- Decouples query routing from other aspects (overlay topology etc.)
- Considerably reduces the number of free parameters
Underlying assumption: a resource selection algorithm A that works better than algorithm B for DIR will also be better for P2PIR (i.e. when only a subset of all resources is visible)
- A DIR scenario corresponds to a fully connected P2P overlay (e.g. PlanetP)
Parameterising it…
DIR evaluation, but with parameters typical of P2PIR settings:
- Pruned profiles
- >> 1000 peers
- Peer collections: small and semantically (relatively) homogeneous
All this as opposed to DIR
Applying it…
Basic evaluation procedure:
- Obtain a ranking R of all peers w.r.t. query q
- Visit the top 100 peers in the order implied by R
- After visiting each peer: merge the documents found so far into a ranking S, and judge the quality of R by the quality of S using e.g. relevance judgments for documents
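The evaluation procedure above can be sketched as a loop over the ranked peers. The sketch assumes that each peer returns (document, score) pairs with globally comparable scores, and that `quality` is any function mapping a merged document ranking to a number; the names `evaluate_peer_ranking`, `peer_results` and `quality` are illustrative:

```python
def evaluate_peer_ranking(peer_ranking, peer_results, quality, max_peers=100):
    """Return the quality of the merged ranking S after each visited peer.

    peer_ranking: peers in the order implied by R
    peer_results: peer -> list of (doc_id, score) pairs
    quality:      callable taking a list of doc_ids (best first)
    """
    merged = {}   # doc_id -> best score seen so far
    curve = []
    for peer in peer_ranking[:max_peers]:
        for doc, score in peer_results.get(peer, []):
            merged[doc] = max(score, merged.get(doc, float("-inf")))
        # S: all documents found so far, ordered by score.
        s = sorted(merged, key=lambda d: (-merged[d], d))
        curve.append(quality(s))
    return curve


relevant = {"a", "c"}


def precision_at_2(ranking):
    return sum(d in relevant for d in ranking[:2]) / 2


curve = evaluate_peer_ranking(
    ["p1", "p2"],
    {"p1": [("a", 2.0), ("b", 1.0)], "p2": [("c", 3.0)]},
    precision_at_2,
)
```

Plotting the resulting curve against the number of peers visited shows how quickly a peer ranking leads to good documents, which is what separates a good ranking R from a poor one.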
Test collections
a) Digital library scenario: peers = topics
- Ohsumed: medical abstracts, annotated with Medical Subject Headings (MeSH)
- GIRT: German sociology abstracts, annotated with terms from a thesaurus
- For both collections, queries and relevance judgments are available
b) Individuals sharing publications:
- CiteSeer abstracts with peers = (co-)authors
- Query log available, but no relevance judgments
Evaluation measures
Missing relevance judgements: introduce new measure relative precision (RP)
Idea: compare a given ranking D with ranking C of a reference retrieval system (here: centralised system)
Probability of relevance of a document estimated as inverse rank in reference ranking
RP@k = average probability of relevance among first k documents of ranking D
Example: C = [K, L, M, N, O, P], D = [L, M, O]
$$\mathrm{RP@3} = \frac{1}{3}\left(\frac{1}{2} + \frac{1}{3} + \frac{1}{5}\right) \approx 0.34$$
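The measure can be written down directly from its definition. The sketch below assumes that a document missing from the reference ranking C gets probability 0 and that the average is taken over k; both are assumptions, and the function name is illustrative:

```python
def relative_precision_at_k(reference, ranking, k):
    """RP@k of `ranking` (D) with respect to `reference` (C).

    The probability of relevance of a document is estimated as the
    inverse of its rank in the reference ranking (1-based).
    Assumption: documents absent from the reference contribute 0.
    """
    rank_of = {doc: i for i, doc in enumerate(reference, start=1)}
    top = ranking[:k]
    return sum(1.0 / rank_of[d] for d in top if d in rank_of) / k


C = ["K", "L", "M", "N", "O", "P"]
D = ["L", "M", "O"]
rp3 = relative_precision_at_k(C, D, 3)  # (1/2 + 1/3 + 1/5) / 3
```

For the example rankings C and D, L, M and O sit at ranks 2, 3 and 5 of the reference ranking, which gives 31/90, or roughly 0.34.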
Results
Profile pruning, CiteSeer
Profile pruning, GIRT
Profile pruning, space savings
Qualitative analysis
Query expansion
M = intervals where QE runs significantly better than the baseline
M' = intervals where QE runs significantly worse
Profile adaptation
Profile adaptation, delayed updates
Conclusions
Profile pruning:
- Pruning profiles hurts performance less than expected
- Whether or not pruning to a predefined size hurts does not necessarily depend on the original profile size
- In the experiments, it was always safe to prune for (total) space savings of 90%
"Advanced" techniques:
- Query expansion: more often hurts than improves performance
- Profile adaptation:
  - Stable improvement of over 10% among the first 15 peers visited
  - Especially high improvement for the highest-ranked peer
  - Delayed updates do not hurt effectiveness (weak locality)
Questions?