
TREC-2003 Web Track

Azadeh Shakery

TREC-2003 Web Track

The Web Track features search tasks on a document set that is a snapshot of the World Wide Web.
The TREC 2003 track will use the .GOV collection.
This year’s aims:
  Investigate methods for effective topic distillation.
  Investigate methods for effective navigational search, with a mixture of homepage and named page queries.

.GOV Test Collection

A crawl of .gov Web sites (early 2002).
Stopped after 1 million text/html pages.
Also includes text/plain and the extracted text of pdf, doc and ps.
Supplied with duplicate (URL==URL) and redirect (URL->URL) tables, in case these are useful, e.g. for link-based ranking.
Documents truncated to 100k.
Fewer documents than wt10g, but larger average docsize.

.GOV Test Collection

Documents: 1247753 (1.25 million)
  text/html                                1053372
  application/pdf                           131333
  text/plain                                 43754
  application/msword                         13842
  application/postscript                      5673
  other stuff which turned out to be text       44
Bundles: 4613
Total size: 19455030550 = 18.1G
Average bundle size: 4217435 = 4.0M
Average docsize: 15592 = 15.2k
Doc truncation length: 100kb
Docs without words: 55

Topic Distillation Task

Involves finding a list of “key resources” for a particular topic.
Find as many different websites (represented by their entry pages) as possible within the first ten results.
A “key resource” is a good entry page to a website which:
  Is principally devoted to the topic
  Provides credible information on the topic
  Is not part of a larger site also principally devoted to the topic


Example topic format:
<top>
<num> Number:
<title> science
<desc> Description: Find key government websites (represented by their home page) on the subject of ‘science’.
</top>

Only the title field should be supplied to the system as the query.
Systems will be judged according to how many good answers they find in the top ten results.
Likely measures are precision at 10 and average precision at 10.
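Precision at 10 for a single topic can be sketched as follows. This is a minimal illustration: the document IDs and the set of judged key resources are made up, and the function name is not taken from any official evaluation tool.

```python
def precision_at_k(ranked_ids, key_resources, k=10):
    """Fraction of the top-k results that are judged key resources."""
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in key_resources)
    return hits / k


# Hypothetical run and judgments for one topic.
run = ["G01", "G07", "G03", "G11", "G02", "G09", "G05", "G12", "G04", "G08"]
judged_key = {"G01", "G03", "G04", "G99"}
print(precision_at_k(run, judged_key))  # 3 of the top 10 are key resources
```

Dividing by k rather than by the number of results returned means a system that returns fewer than ten results is not rewarded for it.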

The home/named page finding task

The user is searching for a page by name.
An effective search system should return the page at or near rank one.
This year’s task involves a mixture of home page finding and named page finding.
In both cases, there is only one target page and user queries are often the name of the page.
Systems will be compared on the basis of the rank of the first correct answer.
Likely measures:
  Mean reciprocal rank of first correct answer (if r is the rank of the first correct page, get a score of 1/r)
  Success rate at N (percentage of cases in which the correct answer or an equivalent URL occurred in the first N documents)
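The two measures above can be sketched in a few lines. The query IDs, URLs and function names below are illustrative assumptions; a query whose correct page is never returned is assumed to score 0.

```python
def reciprocal_rank(ranked_urls, correct_urls):
    """1/r for the first correct answer at rank r, or 0 if none is returned."""
    for rank, url in enumerate(ranked_urls, start=1):
        if url in correct_urls:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, answers):
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(runs[q], answers[q]) for q in runs) / len(runs)


def success_at_n(runs, answers, n):
    """Percentage of queries with a correct (or equivalent) URL in the top n."""
    hits = sum(1 for q in runs if any(u in answers[q] for u in runs[q][:n]))
    return 100.0 * hits / len(runs)


# Hypothetical results: answers[q] holds the target URL and its equivalents.
runs = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"], "q3": ["m", "n"]}
answers = {"q1": {"a"}, "q2": {"y"}, "q3": {"k"}}
print(mean_reciprocal_rank(runs, answers))  # (1 + 1/2 + 0) / 3
print(success_at_n(runs, answers, 1))
```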

Submissions

All submissions are due at NIST on or before August 6, 2003.
Submission information:
  Topic distillation:
    Up to 5 runs
    For each query, the top 1000 results
  Home/named page finding:
    Up to 5 runs
    For each query, the top 50 results

TREC-2002 Web Track

Topic Distillation Task: Finding relevant “key pages”.
Named Page Finding Task: Finding a particular page.
23 research groups participated.

Premise: Some quality, in addition to relevance, is desirable.
This quality has been called authority, quality or definitiveness in previous studies.
Search algorithms should strike a balance between relevance and quality.
The main measure was precision at 10.

Topic Distillation Task Results

71 official runs were submitted from 17 participating groups.

Rank  P@10    Group             D?  A?  L?
1.    0.2510  tsinghua          D   A   -
2.    0.2408  city-pliers       -   -   -
3.    0.2306  chinese-academy   -   -   -
4.    0.2286  ibm-haifa         D   A   L
5.    0.2224  glasgow           D   A   L
6.    0.2163  irit              -   -   -
7.    0.1959  neuchatel         D   A   -
8.    0.1939  fudan             D   -   L
9.    0.1939  umelbourne        -   -   -
10.   0.1755  uva               -   -   -
11.   0.1510  yonsei            D   A   L
12.   0.1143  umbc-cost         -   -   -
13.   0.1082  cuny              D   A   L
14.   0.1041  illinois-chicago  -   -   L
15.   0.1000  csiro             -   -   L
16.   0.0714  dgic-stokoe       -   -   -
17.   0.0571  ajou              -   -   L

Topic Distillation Task Results

[Figure: bar chart of precision at 10 (axis 0 to 0.3) for the top ten groups: TsingHua, City-Pliers, Chinese-academy, IBM-haifa, glasgow, irit, neuchatel, fudan, umelbourne, uva]

1. TsingHua University

Use link structure to estimate whether the document is a key resource:
  Kleinberg’s hub score
  Kleinberg’s authority score + hub score
  Out degree
Results of using link structure:
  The experiment on the training examples showed some improvement
  The result was disappointing on the 50 topics of the web track

1. TsingHua University

Site Uniting (SU) approaches:
  Used to select proper pages as the representation of one server.
  A document which has index characteristics and a high enough similarity is kept as a key resource.
  Documents from the same server in the result list are given different reliability factors, which decay as similarity decreases.
The SU algorithm (F1=1.03, F2=1.01, F3=1.005):
  Divide the list into sub-lists, so that all pages in one sub-list come from an identical site.
  Within each sub-list, give the first, second and third highest similarities the weights F1, F2 and F3 respectively.
  Merge all the sub-lists into one and re-rank it.
Better results were obtained by combining the SU approach and the out-degree factor.
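A rough sketch of the SU re-ranking steps, under several assumptions that the slides do not confirm: the weights are applied multiplicatively to the similarity score, pages beyond the third in a sub-list keep their score unchanged, and a "site" is identified by the URL's host name. The URLs and scores are invented.

```python
from collections import defaultdict
from urllib.parse import urlsplit

FACTORS = (1.03, 1.01, 1.005)  # F1, F2, F3 as given on the slide


def site_unite(results, factors=FACTORS):
    """results: list of (url, similarity). Returns the re-ranked list."""
    # Step 1: divide the list into per-site sub-lists (site = host, assumed).
    by_site = defaultdict(list)
    for url, sim in results:
        by_site[urlsplit(url).netloc].append((url, sim))

    # Step 2: weight the three highest similarities in each sub-list.
    merged = []
    for pages in by_site.values():
        pages.sort(key=lambda p: p[1], reverse=True)
        for i, (url, sim) in enumerate(pages):
            weight = factors[i] if i < len(factors) else 1.0  # assumption
            merged.append((url, sim * weight))

    # Step 3: merge all sub-lists and re-rank.
    merged.sort(key=lambda p: p[1], reverse=True)
    return merged


results = [
    ("http://www.a.gov/", 0.50),
    ("http://www.a.gov/p2", 0.49),
    ("http://www.a.gov/p3", 0.48),
    ("http://www.a.gov/p4", 0.47),
    ("http://www.b.gov/", 0.465),
]
for url, score in site_unite(results):
    print(url, round(score, 4))
```

In this example the single page from the second site moves above the fourth page of the first site, which is the kind of site diversity the task rewards.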

1. TsingHua University

The URL is used in retrieval:
  Searching and shrinking
  Scoring and selecting
Searching and shrinking:
  Give a ‘right level’ to the returned results within a server
  Shrinking is based on three principles:
    1. Pages with more keywords matched in the URL are more important
    2. The location of the match affects the importance of the page: the further right, the better
    3. Among pages that are equal on principles 1 and 2, a shorter URL is better
Scoring and selecting:
  A keyword search is performed on URLs
  The result list is useful for re-ranking content-based retrieval results
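One way to read the three shrinking principles is as a sort key over URLs. The sketch below is only an interpretation: measuring "location" as the character offset of the right-most matched keyword is an assumption, as are the function name and the example URLs.

```python
def url_sort_key(url, keywords):
    """Sort key implementing one reading of the three principles."""
    lowered = url.lower()
    positions = [lowered.rfind(k.lower()) for k in keywords]
    matched = [p for p in positions if p >= 0]
    n_matched = len(matched)                   # principle 1: more matches first
    rightmost = max(matched) if matched else -1  # principle 2: further right first
    return (-n_matched, -rightmost, len(url))  # principle 3: shorter URL first


keywords = ["child", "labor"]
urls = [
    "http://www.dol.gov/labor/",
    "http://www.dol.gov/child/labor/faq.html",
    "http://www.dol.gov/child/labor/",
]
ranked = sorted(urls, key=lambda u: url_sort_key(u, keywords))
print(ranked)
```

The two URLs matching both keywords come first; they tie on match position, so the shorter one wins.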

1. TsingHua University

Used Okapi to find similarities:
  There are a few parameters to be set
  Relevance judgments are not available while retrieving, thus supervised learning algorithms do not help
  Used a genetic algorithm (GA) for the learning process
  The fitness function in the GA determines whether each set of parameters is good or not
  They used the summation of the similarity scores of the top n relevant documents as the fitness function:
    $\mathrm{fit\_fun} = \sum_{i=1}^{50} \sum_{j=1}^{1000} \mathrm{sim}_{i,j}$
Results:
  Anchor text was useful
  Out-degree was not useful
  Site uniting methods, which worked well on the small number of training examples, improved average precision, but not P@10.

2. City University, London

Used a straightforward content retrieval run based on Okapi BM25.
Used stemming.
No relevance feedback.
Distillation: Only the highest ranked document from a web site is retained in the top 10 results.
Conclusion:
  Non-distillation runs did better than the distillation runs.
  A straight BM25 term weighting with no relevance feedback compares very well indeed with methods which use document/link structures and anchor text.
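The distillation step described above amounts to a one-result-per-site filter over the ranked list. A minimal sketch, assuming a "web site" is identified by the URL's host name (the slides do not say how sites were delimited); the URLs are invented.

```python
from urllib.parse import urlsplit


def distill(ranked_urls, k=10):
    """Keep only the highest ranked document per site, up to k results."""
    seen_sites = set()
    kept = []
    for url in ranked_urls:
        site = urlsplit(url).netloc  # assumption: site == host name
        if site in seen_sites:
            continue
        seen_sites.add(site)
        kept.append(url)
        if len(kept) == k:
            break
    return kept


ranked_list = [
    "http://www.nsf.gov/",
    "http://www.nsf.gov/news/",
    "http://www.science.gov/",
    "http://www.nsf.gov/funding/",
    "http://www.nasa.gov/",
]
print(distill(ranked_list))
```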

3. Chinese Academy

The first experiment is based on the HITS algorithm:
  Extract the page that had the maximum Hub+Authority value from each group of pages and add it to the final result
  The average result of this method was disappointing
In the second experiment:
  After the first retrieval, they scan the page list
  If they find a page’s URL containing another’s, they re-weight the latter page by adding the former’s weight to the latter’s.

3. Chinese Academy

Results:
  The re-weighting method is not effective
  The baseline method worked the best
Their retrieval system is based on SMART, with the Lnu-Ltu weighting method added to it.
Recently, they reported that they had made a mistake: some top results were not output for evaluation. After the correction, their results outperformed TsingHua University’s results.

4. IBM Haifa

Maintain a knowledge base (KAB)
  Stores information about a domain (user-determined)
Main pages in KAB
  Root pages from search engines
  Pages inserted by the user
  KAB size is limited, so only the best pages are maintained
  A page’s score is updated every time a search occurs
Also satellite pages
  Pages that they point to and that point to them
  Provides anchor text information
Frequent terms in those pages and terms that co-occur with them

4. IBM Haifa

Site compression
  Do not permit more than three results from the same “logical site” to appear in the top 10
  Logical site is basically everything between www and gov
Title filtering
  Pages with no query words in the title are considered frail
  The three frail pages with the lowest scores in the KAB are replaced by the three non-frail pages outside the KAB with the highest scores
Duplicate elimination
  Threshold-based vector-space similarity
  Only applied to the top 10 pages because of the effectiveness measure
  Applied after the previous two filters
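The site compression filter can be sketched as a capped pass over the ranked list. The way the "logical site" is extracted here (host name with a leading "www." and trailing ".gov" stripped) is an assumed reading of "everything between www and gov"; the helper names and URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit


def logical_site(url):
    """Assumed reading: the part of the host between 'www.' and '.gov'."""
    host = urlsplit(url).netloc.lower()
    if host.startswith("www."):
        host = host[len("www."):]
    if host.endswith(".gov"):
        host = host[:-len(".gov")]
    return host


def compress_sites(ranked_urls, per_site=3, k=10):
    """Allow at most per_site results from one logical site in the top k."""
    counts = Counter()
    kept = []
    for url in ranked_urls:
        site = logical_site(url)
        if counts[site] >= per_site:
            continue
        counts[site] += 1
        kept.append(url)
        if len(kept) == k:
            break
    return kept


ranked_pages = ["http://www.nsf.gov/p%d" % i for i in range(5)] + ["http://www.epa.gov/"]
print(compress_sites(ranked_pages))
```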

4. IBM Haifa

Results:
  Site compression and title filters improved precision significantly
  Duplicate elimination deteriorated the overall precision

5. Glasgow University

Used a probabilistic framework for combining link and content, called the Absorbing Model.
Static absorbing model (SAM):
  Calculate an authority score for each page
  Combine it with the content score using trained parameters:
    $RSV'_d = w_{content} \cdot RSV_d + w_{link} \cdot auth_{SAM}$
  In their experiments: $w_{content} = 1.0$, $w_{link} = 0.1$
Dynamic absorbing model (DAM):
  Applied to the set of top-ranked pages, B
  Assume the top of that set, A, is most authoritative
  Break links from A to B-A (but not from B-A to A)
  Calculate the authority score for each set
  Note that A’s scores are boosted by links from B-A
  Scores in B-A do not get any credit from set A
  Maintains the authoritativeness of the very top-ranked pages

5. Glasgow University

Spreading activation:
  A portion of a page’s score is propagated from its children
  Here, done only when pages are in the same domain
  Done only when the source of the link (child) is deeper within the domain
    $RSV'_d = RSV_d + \lambda \sum_{s \in S} RSV_s$
Query-biased spreading activation:
  More important to do activation for ambiguous queries
  Assumption: Generic queries may benefit more from the link structure analysis
  Calculate the generality of the query
    Position of query words in the WordNet hierarchy
  Change $\lambda$ to be the query scope value
    $RSV'_d = RSV_d + \text{query-scope} \cdot \sum_{s \in S} RSV_s$
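A single step of spreading activation, in which each page receives a fraction of its children's scores, might look like the sketch below. The parameter name `fraction`, its value, and the toy link structure are assumptions for illustration; the same-domain and deeper-child conditions are assumed to have been applied already when the `children` map was built.

```python
def spread_activation(scores, children, fraction):
    """One propagation step: RSV'(d) = RSV(d) + fraction * sum of children's RSV.

    scores:   {page: RSV}
    children: {page: [child pages]} (same domain, deeper than the parent)
    """
    return {
        page: rsv + fraction * sum(scores.get(c, 0.0) for c in children.get(page, []))
        for page, rsv in scores.items()
    }


scores = {"/": 1.0, "/a": 0.5, "/a/b": 0.25}
children = {"/": ["/a"], "/a": ["/a/b"]}
print(spread_activation(scores, children, fraction=0.5))
```

For the query-biased variant, `fraction` would be replaced by a per-query scope value, so that more generic queries propagate more of the children's scores.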

5. Glasgow University

Results:
  Content of pages is the most important part of the process
  Can make DAM provide slightly better results
  Use of anchor and title text was detrimental
  Query-biased spreading activation was better than static
  But both were worse than not using it at all
  Query expansion hurts effectiveness

14. University of Illinois at Chicago

Used a modified Okapi weighting scheme
  Replace the original parameter K (length of a passage) by parameter K’ (norm of a passage)
Proximity re-ranking method: documents covering more query terms in a certain window size are ranked higher.
PageRank of a document is combined with the similarity of the document to obtain the overall ranking.
A document is assigned a value which is the sum of its similarity and a weighted sum of the similarities of its descendants within the same host.

Conclusions of Results

Anchor text is useful
Out-degree is not useful
Site-uniting helped with average precision, but not prec@10
Content based retrieval can work as well
Eliminating documents without query words in title helped
Removing near-duplicates by similarity hurt
Query expansion hurt
PageRank hurt
Link analysis helped

Named Page Finding

Objective: Finding a particular Web page in .GOV, given a query which describes it by name.
Topic is a single phrase, for example:
  US password renewal
  Child labor stamp

Main measure: Mean reciprocal rank of the first correct answer.

Named Page Finding Task Results

70 official runs were submitted from 18 participating groups.

Rank  MRR    Group             D?  A?  L?
1.    0.719  tsinghua          D   A   -
2.    0.676  cmu_lti           D   A   -
3.    0.671  yonsei            D   A   L
4.    0.654  glasgow           D   A   -
5.    0.636  neuchatel         D   A   L
6.    0.626  hummingbird       D   -   -
7.    0.613  chinese-academy   D   A   -
8.    0.587  iit               -   -   -
9.    0.578  lit-singapore     D   A   L
10.   0.576  umelbourne        D   A   -
11.   0.573  csiro             -   -   -
12.   0.564  illinois-chicago  -   -   -
13.   0.535  waterloo          -   A   L
14.   0.432  uva               -   A   -
15.   0.418  city-pliers       -   -   -
16.   0.263  cuny              D   A   -
17.   0.132  ajou              D   -   -
18.   0.010  kasetsart         -   A   -

Named Page Finding Task Results

[Figure: bar chart of MRR (axis 0 to 0.8) for ten groups: TsingHua, Cmu-lti, Chinese-academy, yonsei, glasgow, iit, neuchatel, hummingbird, umelbourne, Lit-singapore]


1. TsingHua University

Built a collection of surrogate documents comprising keywords, titles and incoming anchor text.
Ranks obtained with these surrogates were combined with ranks from the original documents, using
  $S' = a \cdot \frac{1}{rank_1} + (1 - a) \cdot \frac{1}{rank_2}$
The combined score outperformed the original score, which in turn outperformed the surrogate score.
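The reciprocal-rank combination above can be tried out with a short sketch. The slides give neither the value of a nor which of the two ranks belongs to the surrogate run, so the value 0.5, the assignment of rank1 to the original documents, and the document names are all assumptions.

```python
def combined_score(rank1, rank2, a):
    """S' = a * 1/rank1 + (1 - a) * 1/rank2."""
    return a * (1.0 / rank1) + (1 - a) * (1.0 / rank2)


# Hypothetical ranks of two pages in the two runs.
original_rank = {"pageX": 1, "pageY": 2}   # assumed to play the role of rank1
surrogate_rank = {"pageX": 4, "pageY": 1}  # assumed to play the role of rank2

a = 0.5  # illustrative value only
fused = sorted(
    original_rank,
    key=lambda d: combined_score(original_rank[d], surrogate_rank[d], a),
    reverse=True,
)
print(fused)
```

With these numbers pageY overtakes pageX, because a first place in one run and second place in the other outweighs a first and a fourth place.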


Results:
  Using keywords, bold text and title fields of the in-link pages does help on the named page finding task.
  URL classification (Root, Sub root, Path and File) does not help.

2. CMU LTI

Their basic model is a generative language model.
Hypothesis: the user’s query is what the user believes to be a reasonable estimate of the name of the page he is seeking.
When estimating a language model, they want to estimate a model for the page’s “name” instead of the entire document.
The language model for the document is a linear interpolation of several language models (title, in-link text, full text, meta tag text, url text, large fonts).
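The linear interpolation of field language models can be illustrated with a small query-likelihood sketch. This is not CMU's implementation: the field names, the interpolation weights, the maximum-likelihood field estimates and the probability floor (used here in place of proper collection smoothing) are all assumptions.

```python
import math
from collections import Counter


def field_model(text):
    """Maximum-likelihood unigram model for one field."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return lambda term: counts[term] / total if total else 0.0


def query_log_likelihood(query, fields, weights, floor=1e-6):
    """log P(query | document) under a weighted mixture of field models."""
    models = {name: field_model(text) for name, text in fields.items()}
    score = 0.0
    for term in query.lower().split():
        p = sum(weights[name] * models[name](term) for name in fields)
        score += math.log(max(p, floor))  # floor stands in for smoothing
    return score


weights = {"title": 0.4, "inlink": 0.4, "body": 0.2}  # illustrative weights
doc_a = {"title": "passport renewal", "inlink": "renew your passport",
         "body": "forms and fees for passport services"}
doc_b = {"title": "child labor stamp", "inlink": "stamp history",
         "body": "postal service stamps"}

query = "passport renewal"
print(query_log_likelihood(query, doc_a, weights))
print(query_log_likelihood(query, doc_b, weights))
```

Giving the title and in-link fields most of the weight reflects the idea of modelling the page's "name" rather than its entire text.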

2. CMU LTI

Results:
  Using document structure in this way did improve performance over just using a simple language model.
  Document in-link text is important for named-page finding.

3. Yonsei

Their major focus was on the use of sentential information in IR.
Idea: A sentence in a document that is highly relevant to the query can support the relevance of the document to the query.
  Obtain similarity values between the sentences of a document and the query
  Use these values for computing the retrieval score of the document
They use the number of common words between the sentence and the query as the measure of similarity.

3. Yonsei

Similarity computation:
  Adopt the vector-space model to compute the document-query similarity sim(D, Q) using the cosine coefficient
  Use anchor texts
    $RSV(D, Q) = sim(D, Q) + \sum_{i=1}^{n} C(S_i, Q) + \sum_{i=1}^{l} sim(L_i, Q) + \sum_{i=1}^{l} C(L_i, Q)$
Results:
  Using sentence-query similarity enables the system to achieve a significant increase in performance
  Use of anchor texts improves the performance noticeably


4. Glasgow University

Retrieve the top 1000 pages.
Re-rank them based on whether the page contains the query as a phrase:
  $RSV'_d = c \cdot RSV_d$, with $c = 1.3$ if the page contains the phrase and $c = 1.0$ otherwise
Run DAM with smaller sets: |B| = 10, |A| = 5
  Here, the goal is to find a single named page
Try various ways of using title and anchor text.


Results:
  DAM resulted in a slight improvement over content
  Anchor text significantly improved reciprocal precision
  Body only indexing and link analysis without anchors worked well.
  Spreading activation on sites was equivocal.
  Query expansion and PageRank were detrimental.

5. U. Neuchatel

A second representation of each document in the .GOV collection was created, comprising the document title and all its incoming anchor text.
Okapi scores were computed for both representations.
  $\text{Final Score} = \alpha \cdot S_{content} + (1 - \alpha) \cdot S_{anchortitle}$
The best results were obtained with $\alpha = 0.6$.

12. University of Illinois at Chicago

The title and the anchor text are used to construct a surrogate index.
In one run, they combined the document similarity with the surrogate similarity.
Using only passage retrieval gives the best result.

Conclusion

URL-type analysis did not bring an improvement in performance on the Named Page Finding task.
Several leading participants reported an improvement in performance by adding anchor text and structural information to a content-only run.

Thank you…
