Information Retrieval Search Engine Technology (3) Prof. Dragomir R. Radev.


Information Retrieval
Search Engine Technology (3)

http://tangra.si.umich.edu/clair/ir09

Prof. Dragomir R. Radev


SET/IR – W/S 2009

…
5. Evaluation of IR systems
   Reference collections
   TREC
…


Relevance

• Difficult to change: fuzzy, inconsistent

• Methods: exhaustive, sampling, pooling, search-based


Contingency table

                 retrieved     not retrieved
relevant         w = tp        x = fn            n1 = w + x
not relevant     y = fp        z = tn
                 n2 = w + y                      N


Precision and Recall

Recall: $R = \frac{w}{w + x}$

Precision: $P = \frac{w}{w + y}$
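As a rough illustration of the contingency table and the two ratios, the sketch below counts w, x, y, and z from sets of document IDs. The function names and the toy collection are assumptions made for this example, not part of the course material.

```python
def contingency(retrieved, relevant, collection):
    """Return (w, x, y, z) = (tp, fn, fp, tn) for sets of document IDs."""
    retrieved, relevant, collection = set(retrieved), set(relevant), set(collection)
    w = len(retrieved & relevant)               # relevant and retrieved
    x = len(relevant - retrieved)               # relevant but missed
    y = len(retrieved - relevant)               # retrieved but not relevant
    z = len(collection - retrieved - relevant)  # correctly left out
    return w, x, y, z


def precision(w, y):
    """P = w / (w + y); defined as 0.0 when nothing was retrieved."""
    return w / (w + y) if (w + y) else 0.0


def recall(w, x):
    """R = w / (w + x); defined as 0.0 when nothing is relevant."""
    return w / (w + x) if (w + x) else 0.0


# Hypothetical toy collection of 20 documents.
collection = range(1, 21)
relevant = {1, 2, 3, 4, 5}
retrieved = {1, 2, 3, 6, 7, 8}

w, x, y, z = contingency(retrieved, relevant, collection)
print(w, x, y, z)                      # 3 2 3 12
print(precision(w, y), recall(w, x))   # 0.5 0.6
```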


Exercise

Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Tolkien”, +“JRR Tolkien” +“Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by the search engine.

Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.

Later, try different queries.


n    Doc. no   Relevant?   Recall   Precision
1    588       x           0.2      1.00
2    589       x           0.4      1.00
3    576                   0.4      0.67
4    590       x           0.6      0.75
5    986                   0.6      0.60
6    592       x           0.8      0.67
7    984                   0.8      0.57
8    988                   0.8      0.50
9    578                   0.8      0.44
10   985                   0.8      0.40
11   103                   0.8      0.36
12   591                   0.8      0.33
13   772       x           1.0      0.38
14   990                   1.0      0.36

[From Salton’s book]
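The recall and precision columns can be recomputed from the relevance marks alone. The sketch below assumes the collection has exactly 5 relevant documents (implied by recall 0.2 after the first hit); the function name is made up for this example.

```python
def pr_at_each_rank(relevance_flags, total_relevant):
    """Return (rank, recall, precision) after each retrieved document."""
    rows = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        hits += int(is_relevant)
        rows.append((rank, hits / total_relevant, hits / rank))
    return rows


# Relevance marks ("x") of the 14 ranked documents in the table.
flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]

for rank, r, p in pr_at_each_rank(flags, total_relevant=5):
    print(f"{rank:2d}  recall={r:.1f}  precision={p:.2f}")
```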


P/R graph

[Figure: precision (y-axis) plotted against recall (x-axis), both axes from 0 to 1]


P/R graph

[Figure: precision (y-axis) plotted against recall (x-axis), both axes from 0 to 1]

Interpolated average precision (e.g., 11pt)

Interpolation – what is precision at recall=0.5?
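One common convention, assumed here, takes the interpolated precision at recall level r to be the highest precision observed at any recall greater than or equal to r. The sketch below applies that rule to the 14-document ranking from the earlier table (5 relevant documents assumed) and averages over the 11 levels 0.0, 0.1, ..., 1.0. Integer comparisons are used to avoid floating-point surprises at the recall levels.

```python
def interpolated_precision(relevance_flags, total_relevant, levels=11):
    """Interpolated precision at `levels` evenly spaced recall levels."""
    points = []  # (number of relevant docs seen, precision) at each rank
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        hits += int(is_relevant)
        points.append((hits, hits / rank))

    result = []
    for level in range(levels):
        # recall >= level/(levels-1)  <=>  hits*(levels-1) >= level*total_relevant
        candidates = [p for h, p in points
                      if h * (levels - 1) >= level * total_relevant]
        result.append(max(candidates, default=0.0))
    return result


flags = [True, True, False, True, False, True, False,
         False, False, False, False, False, True, False]

interp = interpolated_precision(flags, total_relevant=5)
print(interp[5])                   # 0.75, interpolated precision at recall 0.5
print(sum(interp) / len(interp))   # 11-point average, about 0.782
```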


Issues

• Why not use accuracy A=(w+z)/N?

• Average precision

• Average P at given “document cutoff values”

• Report when P=R

• F measure: $F_\beta = (\beta^2 + 1)PR / (\beta^2 P + R)$

• F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
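To see why accuracy is a poor yardstick, consider a system that retrieves nothing from a collection in which almost every document is irrelevant. The sketch below uses made-up counts (10 relevant documents out of 10,000) and the standard weighted F measure, $(\beta^2 + 1)PR / (\beta^2 P + R)$, which reduces to the harmonic mean for $\beta = 1$. The function names are assumptions for this example.

```python
def accuracy(w, x, y, z):
    """A = (w + z) / N with N = w + x + y + z."""
    return (w + z) / (w + x + y + z)


def f_measure(p, r, beta=1.0):
    """Weighted F measure; beta = 1 gives the harmonic mean of P and R."""
    if p == 0.0 and r == 0.0:
        return 0.0
    b2 = beta * beta
    return (b2 + 1) * p * r / (b2 * p + r)


# Hypothetical system that retrieves nothing: w = 0, x = 10, y = 0, z = 9990.
print(accuracy(0, 10, 0, 9990))   # 0.999, although recall is 0
print(f_measure(0.0, 0.0))        # 0.0

# F1 agrees with 2 / (1/R + 1/P).
p, r = 0.5, 1.0
print(f_measure(p, r), 2 / (1 / r + 1 / p))
```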


Kappa

• N: number of items (index i)

• n: number of categories (index j)

• k: number of annotators

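• $m_{ij}$: number of annotators who assigned item $i$ to category $j$

$$\kappa = \frac{P(A) - P(E)}{1 - P(E)}$$

$$P(A) = \frac{1}{Nk(k-1)} \sum_{i=1}^{N} \sum_{j=1}^{n} m_{ij}^2 - \frac{1}{k-1}$$

$$P(E) = \sum_{j=1}^{n} \left( \frac{\sum_{i=1}^{N} m_{ij}}{Nk} \right)^2$$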


Kappa example

J1+ J1- TOTAL

J2+ 300 10 310

J2- 20 70 90

TOTAL 320 80 400


Kappa (cont’d)

• P(A) = 370/400 = 0.925

• P(-) = (10+20+70+70)/800 = 0.2125

• P(+) = (10+20+300+300)/800 = 0.7875

• P(E) = 0.2125 * 0.2125 + 0.7875 * 0.7875 = 0.665

• K = (0.925-0.665)/(1-0.665) = 0.776

• Kappa higher than 0.67 is tentatively acceptable; higher than 0.8 is good
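The worked example can be checked with a short script. It is a sketch under one assumption about the data layout: each item is represented by a row of counts, one per category, saying how many of the k annotators chose that category. P(A) is the observed agreement and P(E) the chance agreement computed from the pooled category proportions.

```python
def kappa(item_counts, num_annotators):
    """Kappa from per-item category counts (rows sum to num_annotators)."""
    n_items = len(item_counts)
    k = num_annotators
    n_categories = len(item_counts[0])

    # Observed agreement P(A).
    sum_sq = sum(m * m for row in item_counts for m in row)
    p_a = sum_sq / (n_items * k * (k - 1)) - 1 / (k - 1)

    # Chance agreement P(E) from pooled category proportions.
    p_e = 0.0
    for j in range(n_categories):
        share = sum(row[j] for row in item_counts) / (n_items * k)
        p_e += share * share

    return (p_a - p_e) / (1 - p_e)


# The 2x2 table from the example, with categories ordered (+, -):
# 300 items both "+", 70 items both "-", 30 items with one vote each.
items = [[2, 0]] * 300 + [[0, 2]] * 70 + [[1, 1]] * 30

print(round(kappa(items, num_annotators=2), 3))  # 0.776
```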


Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles

<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?

<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

LA031689-0177
FT922-1008
LA090190-0126
LA101190-0218
LA082690-0158
LA112590-0109
FT944-136
LA020590-0119
FT944-5300
LA052190-0048
LA051689-0139
FT944-9371
LA032390-0172

LA042790-0172
LA021790-0136
LA092289-0167
LA111189-0013
LA120189-0179
LA020490-0021
LA122989-0063
LA091389-0119
LA072189-0048
FT944-15615
LA091589-0101
LA021289-0208


<DOC>
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>


TREC (cont’d)

• http://trec.nist.gov/tracks.html

• http://trec.nist.gov/presentations/presentations.html


Most used reference collections

• Generic retrieval: OHSUMED, CRANFIELD, CACM

• Text classification: Reuters, 20newsgroups

• Question answering: TREC-QA

• Web: DOTGOV, wt100g

• Blogs: Buzzmetrics datasets

• TREC ad hoc collections, 2-6 GB

• TREC Web collections, 2-100 GB


Comparing two systems

• Comparing A and B

• One query?

• Average performance?

• Need: A to consistently outperform B

[this slide: courtesy James Allan]


The sign test

• Example 1:
  – A > B (12 times)
  – A = B (25 times)
  – A < B (3 times)
  – p < 0.035 (significant at the 5% level)

• Example 2:
  – A > B (18 times)
  – A < B (9 times)
  – p < 0.122 (not significant at the 5% level)
  – http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html

[this slide: courtesy James Allan]
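The two p-values can be approximated with a few lines of code. The sketch assumes the usual two-sided sign test: ties are discarded, and under the null hypothesis each remaining query is a fair coin flip. The function name is made up for this example.

```python
from math import comb


def sign_test_p(wins_a, wins_b):
    """Two-sided sign test p-value; ties are assumed to be already dropped."""
    n = wins_a + wins_b
    k = min(wins_a, wins_b)
    # Probability of k or fewer "heads" in n fair coin flips, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(round(sign_test_p(12, 3), 4))   # 0.0352 (Example 1)
print(round(sign_test_p(18, 9), 4))   # 0.1221 (Example 2)
```

Both values agree with the slide's figures to roughly three decimal places.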


The t-test

• Takes into account the actual performances, not just which system is better

• http://www.socialresearchmethods.net/kb/stat_t.php
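A minimal sketch of the paired t statistic, assuming each query yields one score per system (for instance average precision). The scores below are hypothetical. The statistic is the mean per-query difference divided by its standard error, so the size of each difference matters, not only its sign.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Return (t statistic, degrees of freedom) for paired per-query scores."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    standard_error = stdev(diffs) / sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical per-query scores for systems A and B.
system_a = [0.5, 0.6, 0.7, 0.9]
system_b = [0.4, 0.4, 0.6, 0.5]

t, dof = paired_t(system_a, system_b)
print(round(t, 3), dof)  # compare t against a t table with dof degrees of freedom
```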


SET/IR – W/S 2009

…
6. Automated indexing/labeling
   Compression
…


Indexing methods

• Manual: e.g., Library of Congress subject headings, MeSH

• Automatic: e.g., TF*IDF based


LOC subject headings

http://www.loc.gov/catdir/cpso/lcco/lcco.html

A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)


Medicine

CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine


Automatic methods

• TF*IDF: pick terms with the highest TF*IDF scores

• Centroid-based: pick terms that appear in the centroid with high scores

• The maximal marginal relevance principle (MMR)

• Related to summarization, snippet generation
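As an illustration of the first method, the sketch below picks index terms for one document by TF*IDF. It assumes one common weighting variant, raw term frequency times log(N/df), together with whitespace tokenization. The three toy documents and the function name are made up for this example.

```python
import math
from collections import Counter


def top_tfidf_terms(docs, doc_index, top_k=3):
    """Return the top_k terms of docs[doc_index] ranked by tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    tf = Counter(tokenized[doc_index])
    scores = {term: tf[term] * math.log(n_docs / df[term]) for term in tf}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda term: (-scores[term], term))[:top_k]


docs = [
    "the ford bronco rolled over",
    "the suzuki samurai rolled over",
    "the ford bronco bronco study",
]

print(top_tfidf_terms(docs, doc_index=2, top_k=2))  # ['study', 'bronco']
```

The term "the" occurs in every document, so its IDF is log(1) = 0 and it is never chosen.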


Compression

• Methods
  – Fixed length codes
  – Huffman coding
  – Ziv-Lempel codes


Fixed length codes

• Binary representations
  – ASCII
  – Representational power ($2^k$ symbols, where $k$ is the number of bits)
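Read the other way around, an alphabet of a given size needs the smallest k for which 2 to the power k is at least the number of symbols. A small sketch (the function name is an assumption for this example):

```python
def bits_needed(num_symbols):
    """Smallest k such that 2**k >= num_symbols."""
    if num_symbols < 1:
        raise ValueError("need at least one symbol")
    return (num_symbols - 1).bit_length()


print(bits_needed(26))    # 5 bits cover the 26 letters
print(bits_needed(128))   # 7 bits cover 7-bit ASCII
```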


Variable length codes

• Alphabet:

A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..

• Demo:
  – http://www.scphillips.com/morse/


Most frequent letters in English

• Most frequent letters:
  – E T A O I N S H R D L U

• Demo:
  – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm

• Also: bigrams:
  – TH HE IN ER AN RE ND AT ON NT


Huffman coding

• Developed by David Huffman (1952)

• Average of 5 bits per character (37.5% compression)

• Based on frequency distributions of symbols

• Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols


Symbol Frequency

A 7

B 4

C 10

D 5

E 2

F 11

G 15

H 3

I 7

J 8
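A minimal sketch of the tree-building step for the frequencies above, using a priority queue so that the two least frequent nodes are merged first. Which branch gets 0 or 1, and how ties are broken, are arbitrary choices here, so individual codewords may differ from the ones on the slides. The total encoded length of an optimal code does not depend on those choices.

```python
import heapq
from itertools import count


def huffman_code(freqs):
    """Build a prefix code from a {symbol: frequency} mapping."""
    tiebreak = count()  # keeps heap entries comparable when frequencies tie
    heap = [(f, next(tiebreak), sym) for sym, f in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)    # least frequent node
        f2, _, right = heapq.heappop(heap)   # second least frequent node
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (left, right)))

    codes = {}

    def walk(node, prefix):
        if isinstance(node, tuple):          # internal node: (left, right)
            walk(node[0], prefix + "0")
            walk(node[1], prefix + "1")
        else:                                # leaf: a symbol
            codes[node] = prefix or "0"

    walk(heap[0][2], "")
    return codes


freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
codes = huffman_code(freqs)
total_bits = sum(freqs[s] * len(codes[s]) for s in freqs)
print(total_bits, total_bits / sum(freqs.values()))  # 227 bits, about 3.15 per symbol
```

A fixed-length code for these 10 symbols would need 4 bits per symbol, or 288 bits for the same 72 occurrences.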


[Figure: Huffman tree with branches labeled 0 and 1 and leaves a, b, c, d, e, f, g, h, i, j]


Symbol Code

A 0110

B 0010

C 000

D 0011

E 01110

F 010

G 10

H 01111

I 110

J 111
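Because no codeword in this table is a prefix of another, a bit string can be decoded greedily from left to right with no separators. The sketch below encodes and decodes with the table as given. The sample word BADGE and the function names are assumptions for this example.

```python
CODE = {
    "A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
    "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111",
}


def encode(text):
    """Concatenate the codewords of the symbols in text."""
    return "".join(CODE[symbol] for symbol in text)


def decode(bits):
    """Greedy prefix decoding; returns (decoded text, leftover bits)."""
    inverse = {word: symbol for symbol, word in CODE.items()}
    decoded, buffer = [], ""
    for bit in bits:
        buffer += bit
        if buffer in inverse:        # a complete codeword has been read
            decoded.append(inverse[buffer])
            buffer = ""
    return "".join(decoded), buffer


bits = encode("BADGE")
print(bits)           # 0010011000111001110
print(decode(bits))   # ('BADGE', '')
```

Any leftover bits returned by `decode` signal that the input ended in the middle of a codeword, which is one way corrupted input shows up.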


Exercise

• Consider the bit string: 01101101111000100110001110100111000110101101011101

• Use the Huffman code from the example to decode it.

• Try inserting, deleting, and switching some bits at random locations and try decoding.


Extensions

• Word-based

• Domain/genre dependent models


Ziv-Lempel coding

• Two types - one is known as LZ77 (used in GZIP)

• Code: set of triples <a,b,c>

• a: how far back in the decoded text to look for the upcoming text segment

• b: how many characters to copy

• c: new character to add to complete segment


• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> peter_pi
• <8,2,r> peter_piper
• <6,3,c> peter_piper_pic
• <0,0,k> peter_piper_pick
• <7,1,d> peter_piper_picked
• <7,1,a> peter_piper_picked_a
• <9,2,e> peter_piper_picked_a_pe
• <9,2,_> peter_piper_picked_a_peck_
• <0,0,o> peter_piper_picked_a_peck_o
• <0,0,f> peter_piper_picked_a_peck_of
• <17,5,l> peter_piper_picked_a_peck_of_pickl
• <12,1,d> peter_piper_picked_a_peck_of_pickled
• <16,3,p> peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
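The triples can be replayed with a very small decoder. This sketch follows the <a,b,c> convention described earlier: look back a characters in the text decoded so far, copy b characters, then append the new character c. It illustrates that triple format only and is not the actual GZIP format.

```python
def lz77_decode(triples):
    """Decode a list of (back, length, char) triples into a string."""
    out = []
    for back, length, char in triples:
        start = len(out) - back
        for i in range(length):
            out.append(out[start + i])   # copy from the already decoded text
        out.append(char)                 # then add the new character
    return "".join(out)


triples = [
    (0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
    (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d"),
    (7, 1, "a"), (9, 2, "e"), (9, 2, "_"), (0, 0, "o"), (0, 0, "f"),
    (17, 5, "l"), (12, 1, "d"), (16, 3, "p"), (3, 2, "r"), (0, 0, "s"),
]

print(lz77_decode(triples))  # peter_piper_picked_a_peck_of_pickled_peppers
```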


Links on text compression

• Data compression:
  – http://www.data-compression.info/

• Calgary corpus:
  – http://links.uwaterloo.ca/calgary.corpus.html

• Huffman coding:
  – http://www.compressconsult.com/huffman/
  – http://en.wikipedia.org/wiki/Huffman_coding

• LZ
  – http://en.wikipedia.org/wiki/LZ77


100 alternative search engines

• http://rss.slashdot.org/~r/Slashdot/slashdot/~3/83468703/article.pl


Readings

• 2: MRS9

• 3: MRS13, MRS14

• 4: MRS15, MRS16