Information Retrieval Search Engine Technology (3) Prof. Dragomir R. Radev.

Post on 17-Jan-2016


Information Retrieval

Search Engine Technology

(3)

http://tangra.si.umich.edu/clair/ir09

Prof. Dragomir R. Radev

radev@umich.edu

SET/IR – W/S 2009

…5. Evaluation of IR systems Reference collections TREC…

Relevance

• Difficult to change: fuzzy, inconsistent

• Methods: exhaustive, sampling, pooling, search-based

Contingency table

                 retrieved     not retrieved
relevant         w = tp        x = fn           n1 = w + x
not relevant     y = fp        z = tn
                 n2 = w + y                     N

Precision and Recall

Recall: R = w / (w + x)

Precision: P = w / (w + y)
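A minimal sketch of the two ratios, using the contingency-table cells w, x and y. The function name and the example counts are illustrative assumptions, not from the slides.

```python
def precision_recall(w, x, y):
    """Return (precision, recall) from contingency counts.

    w = relevant and retrieved (tp)
    x = relevant but not retrieved (fn)
    y = retrieved but not relevant (fp)
    """
    precision = w / (w + y) if (w + y) else 0.0
    recall = w / (w + x) if (w + x) else 0.0
    return precision, recall


# Hypothetical counts: 30 relevant retrieved, 20 relevant missed, 10 false hits.
p, r = precision_recall(w=30, x=20, y=10)
print(f"P = {p:.2f}, R = {r:.2f}")
```

Returning 0.0 when a denominator is zero is a design choice of this sketch; some evaluations leave the value undefined instead.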

Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by the search engine.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.

 n   Doc. no   Relevant?   Recall   Precision
 1   588       x           0.2      1.00
 2   589       x           0.4      1.00
 3   576                   0.4      0.67
 4   590       x           0.6      0.75
 5   986                   0.6      0.60
 6   592       x           0.8      0.67
 7   984                   0.8      0.57
 8   988                   0.8      0.50
 9   578                   0.8      0.44
10   985                   0.8      0.40
11   103                   0.8      0.36
12   591                   0.8      0.33
13   772       x           1.0      0.38
14   990                   1.0      0.36

[From Salton’s book]
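The recall and precision columns can be recomputed rank by rank. The sketch below assumes five relevant documents in total (inferred from recall rising by 0.2 per relevant hit); the function name is illustrative.

```python
def precision_recall_at_ranks(relevant_flags, total_relevant):
    """Return (rank, recall, precision) after each position of a ranked list."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(relevant_flags, start=1):
        if is_relevant:
            hits += 1
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows


# Relevance marks for ranks 1..14: "x" = relevant, "." = not relevant.
flags = [c == "x" for c in "xx.x.x......x."]
rows = precision_recall_at_ranks(flags, total_relevant=5)
for rank, recall, precision in rows:
    print(f"{rank:2d}  R={recall:.1f}  P={precision:.2f}")
```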

P/R graph

[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]

P/R graph

[Figure: precision (vertical axis, 0 to 1) plotted against recall (horizontal axis, 0 to 1)]

Interpolated average precision (e.g., 11pt)

Interpolation – what is precision at recall=0.5?
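One common convention, assumed here because the slide does not spell it out, takes the interpolated precision at recall level r to be the highest precision observed at any recall greater than or equal to r. A sketch on the ranked list from Salton's example (relevant documents at ranks 1, 2, 4, 6, 13; five relevant in total); function names are illustrative.

```python
def interpolated_precision(points, level):
    """Highest precision at any recall >= level (0.0 if there is none)."""
    candidates = [p for r, p in points if r >= level - 1e-9]
    return max(candidates, default=0.0)


def eleven_point_average(points):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interpolated_precision(points, lv) for lv in levels) / len(levels)


# (recall, precision) after each rank of the example ranking.
relevant_ranks = {1, 2, 4, 6, 13}
points, hits = [], 0
for rank in range(1, 15):
    if rank in relevant_ranks:
        hits += 1
    points.append((hits / 5, hits / rank))

print(interpolated_precision(points, 0.5))
print(round(eleven_point_average(points), 3))
```

Under this convention the precision at recall 0.5 is 0.75, the value reached at rank 4, where recall first exceeds 0.5.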

Issues

• Why not use accuracy A=(w+z)/N?

• Average precision

• Average P at given “document cutoff values”

• Report when P=R

• F measure: F = (β² + 1)PR / (β²P + R)

• F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
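A short sketch of two of these points: why accuracy says little when almost every document is non-relevant, and the F measure with a weighting parameter beta (beta = 1 gives F1). The collection sizes are hypothetical and the function names are illustrative.

```python
def accuracy(w, x, y, z):
    """A = (w + z) / N over the four contingency-table cells."""
    return (w + z) / (w + x + y + z)


def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of P and R; beta > 1 favors recall."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * precision * recall / (b2 * precision + recall)


# Hypothetical collection: 10,000 documents, 10 relevant, nothing retrieved.
# Accuracy looks excellent although recall is zero.
print(accuracy(w=0, x=10, y=0, z=9990))

# F1 for P = 0.75, R = 0.6 equals the harmonic mean 2 / (1/R + 1/P).
print(round(f_measure(0.75, 0.6), 3))
```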

Kappa

• N: number of items (index i)

• n: number of categories (index j)

• k: number of annotators

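$$K = \frac{P(A) - P(E)}{1 - P(E)}$$

$$P(A) = \frac{1}{N k (k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^{2} - \frac{1}{k-1}$$

$$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{N k} \right)^{2}$$

where m_ij is the number of annotators who assign item i to category j.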

Kappa example

J1+ J1- TOTAL

J2+ 300 10 310

J2- 20 70 90

TOTAL 320 80 400

Kappa (cont’d)

• P(A) = 370/400 = 0.925

• P(-) = (10+20+70+70)/800 = 0.2125

• P(+) = (10+20+300+300)/800 = 0.7875

• P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665

• K = (0.925-0.665)/(1-0.665) = 0.776

• Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
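The worked example can be checked with a few lines of code. The sketch assumes the pooled-marginals formulation (each item contributes a row of counts m_ij, the number of annotators choosing category j for item i), which reproduces the numbers above; the function name is illustrative.

```python
def kappa(m, k):
    """Kappa from per-item category counts m[i][j] with k annotators per item."""
    num_items = len(m)
    num_categories = len(m[0])
    # Observed agreement P(A).
    p_a = sum(c * c for row in m for c in row) / (num_items * k * (k - 1)) - 1 / (k - 1)
    # Chance agreement P(E) from pooled category proportions.
    p_e = sum(
        (sum(row[j] for row in m) / (num_items * k)) ** 2
        for j in range(num_categories)
    )
    return (p_a - p_e) / (1 - p_e)


# Two judges, categories (+, -): 300 items both "+", 70 both "-", 30 disagreements.
items = [[2, 0]] * 300 + [[1, 1]] * 30 + [[0, 2]] * 70
print(round(kappa(items, k=2), 3))
```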

Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles?
<narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

LA031689-0177
FT922-1008
LA090190-0126
LA101190-0218
LA082690-0158
LA112590-0109
FT944-136
LA020590-0119
FT944-5300
LA052190-0048
LA051689-0139
FT944-9371
LA032390-0172

LA042790-0172
LA021790-0136
LA092289-0167
LA111189-0013
LA120189-0179
LA020490-0021
LA122989-0063
LA091389-0119
LA072189-0048
FT944-15615
LA091589-0101
LA021289-0208

<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>

TREC (cont’d)

• http://trec.nist.gov/tracks.html

• http://trec.nist.gov/presentations/presentations.html

Most used reference collections

• Generic retrieval: OHSUMED, CRANFIELD, CACM

• Text classification: Reuters, 20newsgroups

• Question answering: TREC-QA

• Web: DOTGOV, wt100g

• Blogs: Buzzmetrics datasets

• TREC ad hoc collections, 2-6 GB

• TREC Web collections, 2-100 GB

Comparing two systems

• Comparing A and B

• One query?

• Average performance?

• Need: A to consistently outperform B

[this slide: courtesy James Allan]

The sign test

• Example 1:
– A > B (12 times)
– A = B (25 times)
– A < B (3 times)
– p < 0.035 (significant at the 5% level)

• Example 2:
– A > B (18 times)
– A < B (9 times)
– p < 0.122 (not significant at the 5% level)
– http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html

[this slide: courtesy James Allan]
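A sketch of the computation behind the two examples, assuming a two-sided exact sign test in which ties (A = B) are discarded. The function name is illustrative; `math.comb` requires Python 3.8 or later.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided exact sign test p-value; ties must be dropped beforehand."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins for the weaker system under a fair coin.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(sign_test_p(12, 3), 4))   # Example 1 (25 ties ignored)
print(round(sign_test_p(18, 9), 4))   # Example 2
```

The values come out near 0.035 and 0.122, in line with the figures quoted in the examples.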

The t-test

• Takes into account the actual performances, not just which system is better

• http://www.socialresearchmethods.net/kb/stat_t.php
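For two systems run on the same queries, a paired t-test is a natural choice; that pairing is an assumption here, since the slide does not name the variant. The sketch computes only the t statistic and degrees of freedom, and the per-query scores are made up for illustration.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(scores_a, scores_b):
    """Return (t, degrees_of_freedom) for per-query score differences.

    Raises ZeroDivisionError if all differences are identical.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


# Hypothetical average-precision scores for systems A and B on four queries.
a = [0.5, 0.6, 0.7, 0.8]
b = [0.4, 0.5, 0.5, 0.7]
t, df = paired_t_statistic(a, b)
print(f"t = {t:.2f}, df = {df}")
```

The statistic is then compared against a t distribution with the given degrees of freedom to obtain a p-value.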

SET/IR – S/W 2009

…6. Automated indexing/labeling Compression…

Indexing methods

• Manual: e.g., Library of Congress subject headings, MeSH

• Automatic: e.g., TF*IDF based

LOC subject headings

http://www.loc.gov/catdir/cpso/lcco/lcco.html

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)

Medicine

CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine

Automatic methods

• TF*IDF: pick terms with the highest TF*IDF scores

• Centroid-based: pick terms that appear in the centroid with high scores

• The maximal marginal relevance principle (MMR)

• Related to summarization, snippet generation
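A minimal sketch of the first method, picking the index terms with the highest TF*IDF scores for one document. The weighting (raw term frequency times log(N/df)), the whitespace tokenizer, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, k=3):
    """Return the k terms of docs[doc_index] with the highest TF*IDF."""
    tokenized = [d.lower().split() for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    tf = Counter(tokenized[doc_index])
    scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the cat chased the dog",
]
print(top_tfidf_terms(docs, 0, k=3))
```

Terms that occur in every document (here "the") get a score of zero, so they are never chosen as labels ahead of rarer terms.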

Compression

• Methods
– Fixed length codes
– Huffman coding
– Ziv-Lempel codes

Fixed length codes

• Binary representations
– ASCII
– Representational power (2^k symbols where k is the number of bits)
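Read the other way around, the representational-power rule gives the code length needed for a given alphabet: the smallest k with 2^k at least the number of symbols. A small sketch (the function name is illustrative):

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    return (num_symbols - 1).bit_length()


for size in (2, 26, 128, 256):
    print(size, "symbols ->", bits_needed(size), "bits")
```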

Variable length codes

• Alphabet:

A  .-      N  -.      0  -----
B  -...    O  ---     1  .----
C  -.-.    P  .--.    2  ..---
D  -..     Q  --.-    3  ...--
E  .       R  .-.     4  ....-
F  ..-.    S  ...     5  .....
G  --.     T  -       6  -....
H  ....    U  ..-     7  --...
I  ..      V  ...-    8  ---..
J  .---    W  .--     9  ----.
K  -.-     X  -..-
L  .-..    Y  -.--
M  --      Z  --..

• Demo:
– http://www.scphillips.com/morse/

Most frequent letters in English

• Most frequent letters:
– E T A O I N S H R D L U

• Demo:
– http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm

• Also: bigrams:
– TH HE IN ER AN RE ND AT ON NT

Huffman coding

• Developed by David Huffman (1952)

• Average of 5 bits per character (37.5% compression)

• Based on frequency distributions of symbols

• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols

Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8

[Figure: Huffman tree for the ten symbols; branches labeled 0 and 1, leaves a through j]

Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111
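The tree-building step can be sketched with a priority queue: repeatedly merge the two least frequent nodes until one tree remains. This is an illustrative implementation, not necessarily how the slide's tree was drawn; ties may be broken differently, so individual codes can differ from the table while the total encoded length stays the same.

```python
import heapq
from itertools import count


def huffman_codes(freqs):
    """Build a prefix code from a {symbol: frequency} dict."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent node
        f2, _, right = heapq.heappop(heap)   # second least frequent node
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
codes = huffman_codes(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(codes)
print(total_bits)
```

With these frequencies the total comes to 227 bits for 72 symbols, the same cost as the code table in the example.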

Exercise

• Consider the bit string: 01101101111000100110001110100111000110101101011101

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding.
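A small decoder makes it easy to experiment with the exercise. The sketch below uses the code table from the example and is demonstrated on a different, made-up bit string so as not to give away the answer; the function name is illustrative.

```python
CODES = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
         "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}


def huffman_decode(bits, codes=CODES):
    """Decode a bit string; return (decoded_text, undecoded_leftover_bits)."""
    inverse = {code: sym for sym, code in codes.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        # The code is prefix-free, so the first match is the only match.
        if buffer in inverse:
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded), buffer


encoded = "".join(CODES[s] for s in "BADGE")
print(huffman_decode(encoded))
# Dropping the first bit shows how an error affects the decoded symbols.
print(huffman_decode(encoded[1:]))
```

A non-empty second element in the result means the input ended in the middle of a codeword.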

Extensions

• Word-based

• Domain/genre dependent models

Ziv-Lempel coding

• Two types - one is known as LZ77 (used in GZIP)

• Code: set of triples <a,b,c>

• a: how far back in the decoded text to look for the upcoming text segment

• b: how many characters to copy

• c: new character to add to complete segment

• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> peter_pi
• <8,2,r> peter_piper
• <6,3,c> peter_piper_pic
• <0,0,k> peter_piper_pick
• <7,1,d> peter_piper_picked
• <7,1,a> peter_piper_picked_a
• <9,2,e> peter_piper_picked_a_pe
• <9,2,_> peter_piper_picked_a_peck_
• <0,0,o> peter_piper_picked_a_peck_o
• <0,0,f> peter_piper_picked_a_peck_of
• <17,5,l> peter_piper_picked_a_peck_of_pickl
• <12,1,d> peter_piper_picked_a_peck_of_pickled
• <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
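Decoding the triples takes only a few lines: for each <a,b,c>, copy b characters starting a positions back in the text decoded so far, then append c. The sketch below (function name illustrative) replays the trace above.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into a string."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        # Copy one character at a time so overlapping copies also work.
        for i in range(length):
            out.append(out[start + i])
        out.append(char)
    return "".join(out)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]
print(lz77_decode(triples))
```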

Links on text compression

• Data compression:
– http://www.data-compression.info/

• Calgary corpus:
– http://links.uwaterloo.ca/calgary.corpus.html

• Huffman coding:
– http://www.compressconsult.com/huffman/
– http://en.wikipedia.org/wiki/Huffman_coding

• LZ
– http://en.wikipedia.org/wiki/LZ77

100 alternative search engines

• http://rss.slashdot.org/~r/Slashdot/slashdot/~3/83468703/article.pl

Readings

• 2: MRS9

• 3: MRS13, MRS14

• 4: MRS15, MRS16