Information Retrieval
Transcript of Information Retrieval
(C) 2005, The University of Michigan
Information Retrieval
Dragomir R. Radev
University of Michigan
September 19, 2005
About the instructor
• Dragomir R. Radev
• Associate Professor, University of Michigan
– School of Information
– Department of Electrical Engineering and Computer Science
– Department of Linguistics
• Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan
• Treasurer, North American Chapter of the ACL
• Ph.D., 1998, Computer Science, Columbia University
• Email: [email protected]
• Home page: http://tangra.si.umich.edu/~radev
Introduction
IR systems
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
• Nutch
Examples of IR systems
• Conventional (library catalog): search by keyword, title, author, etc.
• Text-based (Lexis-Nexis, Google, FAST): search by keywords; limited search using queries in natural language.
• Multimedia (QBIC, WebSeek, SaFe): search by visual appearance (shapes, colors, …).
• Question answering systems (AskJeeves, NSIR, AnswerBus): search in (restricted) natural language.
Need for IR
• Advent of the WWW: more than 8 billion documents indexed on Google
• How much information? 200 TB according to Lyman and Varian (2003). http://www.sims.berkeley.edu/research/projects/how-much-info/
• Search, routing, filtering
• User’s information need
Some definitions of Information Retrieval (IR)
Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.”
Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects.”
Sample queries (from Excite)
In what year did baseball become an offical sport?
play station codes . com
birth control and depression
government
"WorkAbility I"+conference
kitchen appliances
where can I find a chines rosewood
tiger electronics
58 Plymouth Fury
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
emeril Lagasse
Hubble
M.S Subalaksmi
running
Mappings and abstractions
Reality → Data

Information need → Query
From Korfhage’s book
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
Key Terms Used in IR
• QUERY: a representation of what the user is looking for - can be a list of words or a phrase.
• DOCUMENT: an information entity that the user wants to retrieve
• COLLECTION: a set of documents
• INDEX: a representation of information that makes querying easier
• TERM: word or concept that appears in a document or a query
Documents
Documents
• Not just printed paper
• collections vs. documents
• data structures: representations
• Bag of words method
• document surrogates: keywords, summaries
• encoding: ASCII, Unicode, etc.
Document preprocessing
• Formatting
• Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc)
• Casing (cat vs. CAT)
• Stemming (computer, computation)
• Soundex
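A minimal sketch of these preprocessing steps in Python. The tokenizer pattern, the suffix list, and the simplified Soundex below are illustrative choices of ours, not the exact rules of any particular system:

```python
import re

def tokenize(text):
    # Keep word-internal apostrophes, hyphens, and dots (Paul's, 555-1212).
    return re.findall(r"[A-Za-z0-9]+(?:['\-.][A-Za-z0-9]+)*", text)

def normalize_case(token):
    return token.lower()   # "cat" and "CAT" fall together

def stem(token):
    # Crude suffix stripping, for illustration only (not a real stemmer).
    for suffix in ("ation", "er", "ing", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def soundex(word):
    # Simplified Soundex: keep the first letter, map consonants to digits,
    # collapse runs of equal digits, pad/truncate to 4 characters.
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        codes.update(dict.fromkeys(group, digit))
    w = word.lower()
    result, prev = w[0].upper(), codes.get(w[0], "")
    for ch in w[1:]:
        d = codes.get(ch, "")
        if d and d != prev:
            result += d
        if ch not in "hw":   # h and w do not break a run of equal digits
            prev = d
    return (result + "000")[:4]
```

Note that this toy stemmer already conflates "computer" and "computation" (both become "comput"), which is exactly the behavior stemming is meant to provide.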
Document representations
• Term-document matrix (m x n)
• term-term matrix (m x m x n)
• document-document matrix (n x n)
• Example: 3,000,000 documents (n) with 50,000 terms (m)
• sparse matrices
• Boolean vs. integer matrices
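As a concrete illustration on a toy corpus (real collections, such as the 3,000,000 × 50,000 case above, are far too large and sparse for dense matrices), a Boolean term-document matrix and the derived term-term matrix can be built directly:

```python
docs = ["information retrieval systems",
        "retrieval of web documents",
        "web search systems"]

terms = sorted({w for d in docs for w in d.split()})

# Boolean m x n term-document matrix: A[i][j] = 1 iff term i occurs in doc j.
A = [[1 if t in d.split() else 0 for d in docs] for t in terms]

# Term-term matrix (m x m): number of documents a pair of terms shares.
m, n = len(terms), len(docs)
C = [[sum(A[i][k] * A[j][k] for k in range(n)) for j in range(m)]
     for i in range(m)]
```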
Document representations
• Term-document matrix
– Evaluating queries (e.g., (A AND B) OR C)
– Storage issues
• Inverted files
– Storage issues
– Evaluating queries
– Advantages and disadvantages
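A sketch of an inverted file and of Boolean query evaluation over its postings lists (the toy corpus and helper names are our own):

```python
from collections import defaultdict

docs = {1: "information retrieval systems",
        2: "retrieval of web documents",
        3: "web search systems"}

# Inverted file: term -> sorted postings list of document IDs.
index = defaultdict(list)
for doc_id in sorted(docs):
    for term in sorted(set(docs[doc_id].split())):
        index[term].append(doc_id)

def p_and(p1, p2):
    return [d for d in p1 if d in p2]      # intersect postings lists

def p_or(p1, p2):
    return sorted(set(p1) | set(p2))       # merge postings lists

# Evaluate (retrieval AND web) OR search:
result = p_or(p_and(index["retrieval"], index["web"]), index["search"])
```

Only the query terms' postings lists are touched, which is the main advantage over scanning a term-document matrix.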
IR models
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy retrieval
• Latent semantic indexing
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken document retrieval
• Multimedia retrieval
Venn diagrams
[Venn diagram: documents D1 and D2 with term regions x, w, y, z]
Boolean model
[Venn diagram: sets A and B]
Boolean queries

restaurants AND (Mideastern OR vegetarian) AND inexpensive
• What types of documents are returned?
• Stemming
• thesaurus expansion
• inclusive vs. exclusive OR
• confusing uses of AND and OR
dinner AND sports AND symphony
4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
Boolean queries (cont’d)

• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croissant OR muffin
raincoat AND umbrella OR sunglasses
• Use of negation: potential problems
• Conjunctive and Disjunctive normal forms
• Full CNF and DNF
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parentheses
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 = “information”
• D4 = “computer information”
• Q1 = “information retrieval”
• Q2 = “information ¬computer”
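Under a Boolean reading of these queries — treating juxtaposed terms as an AND and ¬ as NOT, which is our interpretation of the notation — the answers can be checked mechanically:

```python
docs = {
    "D1": {"computer", "information", "retrieval"},
    "D2": {"computer", "retrieval"},
    "D3": {"information"},
    "D4": {"computer", "information"},
}

def matches(doc_terms, required=(), forbidden=()):
    # All required terms present, no forbidden term present.
    return (all(t in doc_terms for t in required)
            and not any(t in doc_terms for t in forbidden))

# Q1 = information AND retrieval
q1 = [d for d, t in docs.items()
      if matches(t, required=["information", "retrieval"])]
# Q2 = information AND NOT computer
q2 = [d for d, t in docs.items()
      if matches(t, required=["information"], forbidden=["computer"])]
```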
Exercise

0 (none)
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Milton Shakespeare
7 Milton Shakespeare Swift
8 Chaucer
9 Chaucer Swift
10 Chaucer Shakespeare
11 Chaucer Shakespeare Swift
12 Chaucer Milton
13 Chaucer Milton Swift
14 Chaucer Milton Shakespeare
15 Chaucer Milton Shakespeare Swift
((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
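The sixteen numbered rows enumerate all subsets of {Swift, Shakespeare, Milton, Chaucer}; row n contains an author exactly when the corresponding bit of n is set (Swift = bit 0, Shakespeare = bit 1, Milton = bit 2, Chaucer = bit 3, matching the listing). With that encoding assumed, a short script evaluates the query against every row:

```python
AUTHORS = ["swift", "shakespeare", "milton", "chaucer"]  # bit 0 .. bit 3

def subset(n):
    return {a for i, a in enumerate(AUTHORS) if n >> i & 1}

def query(s):
    swift, shakespeare, milton, chaucer = (a in s for a in AUTHORS)
    # ((chaucer OR milton) AND (NOT swift)) OR
    # ((NOT chaucer) AND (swift OR shakespeare))
    return (((chaucer or milton) and not swift)
            or (not chaucer and (swift or shakespeare)))

matching = [n for n in range(16) if query(subset(n))]
```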
Stop lists

• 250-300 most common words in English account for 50% or more of a given text.
• Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%.
• Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%).
• Token/type ratio: 2256/859 = 2.63
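The same token/type arithmetic can be reproduced on any text; with a short sample the ratio is lower than for the full chapter, but the calculation is identical (the sample below is just the opening sentence of Moby Dick):

```python
from collections import Counter

text = ("Call me Ishmael. Some years ago - never mind how long precisely - "
        "having little or no money in my purse, and nothing particular to "
        "interest me on shore, I thought I would sail about a little and see "
        "the watery part of the world.")

# Tokens: lowercase words with surrounding punctuation stripped.
tokens = [w for w in (t.strip(".,;:-").lower() for t in text.split()) if w]
types = Counter(tokens)          # distinct words with their frequencies

token_type_ratio = len(tokens) / len(types)
# Fraction of all tokens covered by the 5 most frequent types:
coverage = sum(f for _, f in types.most_common(5)) / len(tokens)
```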
Vector models

[3-D vector space: axes Term 1, Term 2, Term 3, with vectors for Doc 1, Doc 2, Doc 3]
Vector queries
• Each document is represented as a vector
• inefficient representations (bit vectors)
• dimensional compatibility
[Document vector: weights W1 … W10 assigned to terms/concepts C1 … C10]
The matching process
• Document space
• Matching is done between a document and a query (or between two documents)
• distance vs. similarity
• Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
Miscellaneous similarity measures
• The Cosine measure:

σ(D,Q) = Σᵢ (dᵢ × qᵢ) / √(Σᵢ dᵢ² × Σᵢ qᵢ²) = |X ∩ Y| / √(|X| × |Y|)

• The Jaccard coefficient:

σ(D,Q) = |X ∩ Y| / |X ∪ Y|
Exercise
• Compute the cosine measures (D1,D2) and (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1>
• Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
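A sketch of the four measures applied to the exercise vectors (for the Jaccard coefficient we use the weighted min/max generalisation to real-valued vectors — one of several definitions in use):

```python
import math

def cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) *
                  math.sqrt(sum(b * b for b in y)))

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard(x, y):
    # Weighted generalisation: sum of mins over sum of maxes.
    return (sum(min(a, b) for a, b in zip(x, y)) /
            sum(max(a, b) for a, b in zip(x, y)))

D1, D2, D3 = (1, 3), (100, 300), (3, 1)
```

Note that D1 and D2 point in exactly the same direction, so their cosine is 1 even though their Euclidean distance is large — the key property of the cosine measure.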
Evaluation
Relevance
• Difficult to define: fuzzy, inconsistent
• Methods: exhaustive, sampling, pooling, search-based
Contingency table
|              | retrieved | not retrieved |
|--------------|-----------|---------------|
| relevant     | w         | x             |
| not relevant | y         | z             |

n1 = w + x (relevant), n2 = w + y (retrieved), N = w + x + y + z
Precision and Recall
Recall: R = w / (w + x) — the fraction of the relevant documents that are retrieved.

Precision: P = w / (w + y) — the fraction of the retrieved documents that are relevant.
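With the contingency-table labels (w = relevant and retrieved, x = relevant but missed, y = retrieved but not relevant, z = neither), both measures are one-liners:

```python
def precision(w, x, y, z):
    return w / (w + y)   # fraction of retrieved docs that are relevant

def recall(w, x, y, z):
    return w / (w + x)   # fraction of relevant docs that are retrieved

# Example: 10 documents retrieved, 4 of them relevant;
# 1 relevant document was missed; 89 irrelevant documents not retrieved.
p = precision(4, 1, 6, 89)
r = recall(4, 1, 6, 89)
```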
Exercise
Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned.
Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs.
Later, try different queries.
| n  | Doc. no | Relevant? | Recall | Precision |
|----|---------|-----------|--------|-----------|
| 1  | 588     | x         | 0.2    | 1.00      |
| 2  | 589     | x         | 0.4    | 1.00      |
| 3  | 576     |           | 0.4    | 0.67      |
| 4  | 590     | x         | 0.6    | 0.75      |
| 5  | 986     |           | 0.6    | 0.60      |
| 6  | 592     | x         | 0.8    | 0.67      |
| 7  | 984     |           | 0.8    | 0.57      |
| 8  | 988     |           | 0.8    | 0.50      |
| 9  | 578     |           | 0.8    | 0.44      |
| 10 | 985     |           | 0.8    | 0.40      |
| 11 | 103     |           | 0.8    | 0.36      |
| 12 | 591     |           | 0.8    | 0.33      |
| 13 | 772     | x         | 1.0    | 0.38      |
| 14 | 990     |           | 1.0    | 0.36      |
[From Salton’s book]
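The running recall/precision columns can be recomputed from the ranked list (document numbers and relevance flags transcribed from the table):

```python
# Ranked list: (doc number, is it relevant?)
ranked = [(588, True), (589, True), (576, False), (590, True), (986, False),
          (592, True), (984, False), (988, False), (578, False), (985, False),
          (103, False), (591, False), (772, True), (990, False)]

total_relevant = sum(rel for _, rel in ranked)   # 5 relevant documents

rows, hits = [], 0
for n, (doc, rel) in enumerate(ranked, start=1):
    hits += rel
    # Recall and precision after examining the top n documents.
    rows.append((n, doc, round(hits / total_relevant, 2), round(hits / n, 2)))
```

Recall never decreases as we move down the list; precision drops every time an irrelevant document appears and jumps back up at each relevant one.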
P/R graph
[Precision–recall curve: precision on the y-axis, recall on the x-axis, both from 0 to 1]

Interpolated average precision (e.g., 11-point)
Interpolation: what is the precision at recall = 0.5?
Issues
• Why not use accuracy A = (w+z)/N?
• Average precision
• Average P at given “document cutoff values”
• Report when P = R
• F measure: Fβ = (β²+1)PR / (β²P + R)
• F1 measure: F1 = 2/(1/R + 1/P): harmonic mean of P and R
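A quick check that the two formulations agree — with β = 1 the F measure reduces to the harmonic mean of precision and recall:

```python
def f_beta(p, r, beta=1.0):
    # F measure: (beta^2 + 1) P R / (beta^2 P + R)
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)

def f1(p, r):
    return 2 / (1 / r + 1 / p)   # harmonic mean of P and R

p, r = 0.75, 0.5
```

β > 1 weights recall more heavily; here recall is the weaker of the two scores, so F2 comes out below F1.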
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-100GB
Sample TREC query

<top>
<num> Number: 305
<title> Most Dangerous Vehicles

<desc> Description:
Which are the most crashworthy, and least crashworthy, passenger vehicles?

<narr> Narrative:
A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example.
</top>

Relevant documents:
LA031689-0177, FT922-1008, LA090190-0126, LA101190-0218, LA082690-0158, LA112590-0109, FT944-136, LA020590-0119, FT944-5300, LA052190-0048, LA051689-0139, FT944-9371, LA032390-0172, LA042790-0172, LA021790-0136, LA092289-0167, LA111189-0013, LA120189-0179, LA020490-0021, LA122989-0063, LA091389-0119, LA072189-0048, FT944-15615, LA091589-0101, LA021289-0208
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE>
<SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION>
<LENGTH><P>586 words </P></LENGTH>
<HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE>
<BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE>
<TEXT>
<P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P>
<P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P>
<P>Several Fatalities </P>
<P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P>
<P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P>
...
</TEXT>
<GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC>
<SUBJECT><P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P></SUBJECT>
</DOC>
TREC (cont’d)
• http://trec.nist.gov/tracks.html
• http://trec.nist.gov/presentations/presentations.html
Word distribution models
Shakespeare
• Romeo and Juliet:
• And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60;
• …
• A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1;
http://www.mta75.org/curriculum/english/Shakes/indexx.html
The BNC (Adam Kilgarriff)

1 6187267 the det
2 4239632 be v
3 3093444 of prep
4 2687863 and conj
5 2186369 a det
6 1924315 in prep
7 1620850 to infinitive-marker
8 1375636 have v
9 1090186 it pron
10 1039323 to prep
11 887877 for prep
12 884599 i pron
13 760399 that conj
14 695498 you pron
15 681255 he pron
16 680739 on prep
17 675027 with prep
18 559596 do v
19 534162 at prep
20 517171 by prep
Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2), 1997, pp. 135-155.
Zipf’s law
Rank × Frequency ≈ Constant

| Rank | Term | Freq.  | Z     | Rank | Term | Freq.  | Z     |
|------|------|--------|-------|------|------|--------|-------|
| 1    | the  | 69,971 | 0.070 | 6    | in   | 21,341 | 0.128 |
| 2    | of   | 36,411 | 0.073 | 7    | that | 10,595 | 0.074 |
| 3    | and  | 28,852 | 0.086 | 8    | is   | 10,099 | 0.081 |
| 4    | to   | 26,149 | 0.104 | 9    | was  | 9,816  | 0.088 |
| 5    | a    | 23,237 | 0.116 | 10   | he   | 9,543  | 0.095 |
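The Z column is consistent with z = rank × frequency / N for a corpus of roughly a million tokens (N = 1,000,000 is our assumption here; small rounding differences remain):

```python
table = [(1, "the", 69971), (2, "of", 36411), (3, "and", 28852),
         (4, "to", 26149), (5, "a", 23237), (6, "in", 21341),
         (7, "that", 10595), (8, "is", 10099), (9, "was", 9816),
         (10, "he", 9543)]

N = 1_000_000   # assumed corpus size for the Z column
# Zipf's law predicts z = rank * freq / N is roughly constant.
z_values = [round(rank * freq / N, 3) for rank, _, freq in table]
```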
Zipf's law is fairly general!
• Frequency of accesses to web pages
– in particular, the access counts on the Wikipedia page, with s approximately equal to 0.3
– page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5
• Words in the English language
– for instance, in Shakespeare’s play Hamlet, with s approximately 0.5
• Sizes of settlements
• Income distributions amongst individuals
• Sizes of earthquakes
• Notes in musical performances
http://en.wikipedia.org/wiki/Zipf's_law
Zipf’s law (cont’d)
• Limitations:
– Low and high frequencies
– Lack of convergence
• Power law with coefficient c = −1:
– y = k·x^c
• Li (1992): typing words one letter at a time, including spaces
Indexing
Methods
• Manual: e.g., Library of Congress subject headings, MeSH
• Automatic
LOC subject headings
http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
Medicine

CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine
Finding the most frequent terms in a document
• Typically stop words: the, and, in
• Not content-bearing
• Terms vs. words
• Luhn’s method
Luhn’s method
[Luhn's plot: word frequency vs. words ordered by frequency, with cut-offs bounding the significant words]
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse document frequency (IDF)
IDF(w) = log (N / DF(w))
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
Vector-based matching
• The cosine measure
sim(D,C) = Σₖ (dₖ · cₖ · idf(k)) / √(Σₖ dₖ² · Σₖ cₖ²)
IDF: Inverse document frequency
N: number of documents
dk: number of documents containing term k
fik: absolute frequency of term k in document i
wik: weight of term k in document i
idfk = log2(N/dk) + 1 = log2N - log2dk + 1
TF * IDF is used for automated indexing and for topic discrimination.
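Putting the definitions together on a toy collection (the corpus and term choices are ours; the idf formula is the one above, idf_k = log2(N/d_k) + 1):

```python
import math

docs = {
    "d1": "china china trade surplus",
    "d2": "china visit talks",
    "d3": "space shuttle mission",
}

def idf(term):
    # idf_k = log2(N / d_k) + 1
    d_k = sum(term in text.split() for text in docs.values())
    return math.log2(len(docs) / d_k) + 1

def tfidf(term, doc_id):
    f_ik = docs[doc_id].split().count(term)  # absolute frequency in document
    return f_ik * idf(term)
```

Rare terms get a higher idf than common ones, so they dominate the resulting weights — which is exactly why tf-idf is useful for topic discrimination.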
Asian and European news
622.941 deng
306.835 china
196.725 beijing
153.608 chinese
152.113 xiaoping
124.591 jiang
108.777 communist
102.894 body
85.173 party
71.898 died
68.820 leader
43.402 state
38.166 people

97.487 nato
92.151 albright
74.652 belgrade
46.657 enlargement
34.778 alliance
34.778 french
33.803 opposition
32.571 russia
14.095 government
9.389 told
9.154 would
8.459 their
6.059 which
Other topics
120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 rocket
50.160 astronauts
49.722 discovery
47.782 canaveral
47.782 cape
40.889 mission
35.778 florida
27.063 center

74.652 compuserve
65.321 massey
55.989 salizzoni
29.996 bob
27.994 online
27.198 executive
15.890 interim
15.271 chief
11.647 service
11.174 second
6.781 world
6.315 president
Compression
Compression
• Methods
– Fixed length codes
– Huffman coding
– Ziv-Lempel codes
Fixed length codes
• Binary representations
– ASCII
– Representational power (2^k symbols, where k is the number of bits)
Variable length codes

• Alphabet (Morse code):

A .-      N -.      0 -----
B -...    O ---     1 .----
C -.-.    P .--.    2 ..---
D -..     Q --.-    3 ...--
E .       R .-.     4 ....-
F ..-.    S ...     5 .....
G --.     T -       6 -....
H ....    U ..-     7 --...
I ..      V ...-    8 ---..
J .---    W .--     9 ----.
K -.-     X -..-
L .-..    Y -.--
M --      Z --..

• Demo:
– http://www.babbage.demon.co.uk/morse.html
– http://www.scphillips.com/morse/
Most frequent letters in English
• Most frequent letters:
– E T A O I N S H R D L U
– http://www.math.cornell.edu/~mec/modules/cryptography/subs/frequencies.html
• Demo:
– http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also: bigrams:
– TH HE IN ER AN RE ND AT ON NT
– http://www.math.cornell.edu/~mec/modules/cryptography/subs/digraphs.html
Useful links about cryptography
• http://world.std.com/~franl/crypto.html
• http://www.faqs.org/faqs/cryptography-faq/
• http://en.wikipedia.org/wiki/Cryptography
Huffman coding
• Developed by David Huffman (1952)
• Average of about 5 bits per character on English text (37.5% compression relative to 8-bit ASCII)
• Based on the frequency distribution of symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols
![Page 73: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/73.jpg)
(C) 2005, The University of Michigan 79
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
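The iterative tree-building step can be sketched with a priority queue. This is a generic construction: tie-breaking, and therefore the individual codewords, may differ from the code table shown on a later slide, but the total weighted code length is the same.

```python
import heapq

def huffman_codes(freqs):
    """Iteratively merge the two least frequent subtrees, prepending
    0 to the codes on one side and 1 on the other."""
    # (frequency, tie-breaker, {symbol: partial code}) heap entries
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, i, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (f1 + f2, i, merged))
    return heap[0][2]

# Frequencies from the table above
freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
codes = huffman_codes(freqs)
cost = sum(freqs[s] * len(codes[s]) for s in freqs)  # total encoded bits
```

Any optimal code for these frequencies encodes the 72 symbol occurrences in 227 bits in total, the same total implied by the code lengths in the table that follows.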
![Page 74: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/74.jpg)
(C) 2005, The University of Michigan 80
[Figure: the Huffman tree built from the frequencies above; every left branch is labeled 0 and every right branch 1, with the frequent symbols (g, c, f) near the root and the rare ones (e, h) deepest.]
![Page 75: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/75.jpg)
(C) 2005, The University of Michigan 81
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
![Page 76: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/76.jpg)
(C) 2005, The University of Michigan 82
Exercise
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Try inserting, deleting, and switching some bits at random locations and try decoding.
![Page 77: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/77.jpg)
(C) 2005, The University of Michigan 83
Ziv-Lempel coding
• Two types – one is known as LZ77 (used in GZIP)
• Code: a sequence of triples <a,b,c>
  – a: how far back in the decoded text to look for the upcoming text segment
  – b: how many characters to copy
  – c: new character to add to complete the segment
![Page 78: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/78.jpg)
(C) 2005, The University of Michigan 84
• <0,0,p>   p
• <0,0,e>   pe
• <0,0,t>   pet
• <2,1,r>   peter
• <0,0,_>   peter_
• <6,1,i>   peter_pi
• <8,2,r>   peter_piper
• <6,3,c>   peter_piper_pic
• <0,0,k>   peter_piper_pick
• <7,1,d>   peter_piper_picked
• <7,1,a>   peter_piper_picked_a
• <9,2,e>   peter_piper_picked_a_pe
• <9,2,_>   peter_piper_picked_a_peck_
• <0,0,o>   peter_piper_picked_a_peck_o
• <0,0,f>   peter_piper_picked_a_peck_of
• <17,5,l>  peter_piper_picked_a_peck_of_pickl
• <12,1,d>  peter_piper_picked_a_peck_of_pickled
• <16,3,p>  peter_piper_picked_a_peck_of_pickled_pep
• <3,2,r>   peter_piper_picked_a_peck_of_pickled_pepper
• <0,0,s>   peter_piper_picked_a_peck_of_pickled_peppers
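The decoding side of this scheme is only a few lines; a sketch that replays the triples from the example above:

```python
def lz77_decode(triples):
    """Decode LZ77 triples <back-distance, copy-length, next-char>.
    Copying is done one character at a time so that overlapping
    references (length greater than distance) also work."""
    out = []
    for dist, length, char in triples:
        start = len(out) - dist
        for k in range(length):
            out.append(out[start + k])
        out.append(char)
    return "".join(out)

triples = [(0,0,'p'), (0,0,'e'), (0,0,'t'), (2,1,'r'), (0,0,'_'),
           (6,1,'i'), (8,2,'r'), (6,3,'c'), (0,0,'k'), (7,1,'d'),
           (7,1,'a'), (9,2,'e'), (9,2,'_'), (0,0,'o'), (0,0,'f'),
           (17,5,'l'), (12,1,'d'), (16,3,'p'), (3,2,'r'), (0,0,'s')]
```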
![Page 79: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/79.jpg)
(C) 2005, The University of Michigan 85
Links on text compression
• Data compression:
  – http://www.data-compression.info/
• Calgary corpus:
  – http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:
  – http://www.compressconsult.com/huffman/
  – http://en.wikipedia.org/wiki/Huffman_coding
• LZ:
  – http://en.wikipedia.org/wiki/LZ77
![Page 80: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/80.jpg)
(C) 2005, The University of Michigan 86
Relevance feedback and
query expansion
![Page 81: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/81.jpg)
(C) 2005, The University of Michigan 87
Relevance feedback
• Problem: initial query may not be the most appropriate to satisfy a given information need.
• Idea: modify the original query so that it gets closer to the right documents in the vector space
![Page 82: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/82.jpg)
(C) 2005, The University of Michigan 88
Relevance feedback
• Automatic
• Manual
• Method: identify feedback terms

  Q’ = a1Q + a2R – a3N

  where R and N are the sums of the relevant and non-relevant document vectors, respectively. Often a1 = 1, a2 = 1/|R|, and a3 = 1/|N|.
![Page 83: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/83.jpg)
(C) 2005, The University of Michigan 89
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistics” – relevant
• D2 = “liability tests safety” – relevant
• D3 = “car passengers injury reviews” – non-relevant
• R = ?
• N = ?
• Q’ = ?
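A minimal sketch of the feedback formula on these three documents, using simple binary term vectors (the function name and vector representation are illustrative, not from the slides):

```python
from collections import Counter

def rocchio(query, relevant, nonrel, a1=1.0):
    """Q' = a1*Q + (1/|R|)*sum(R) - (1/|N|)*sum(N),
    computed over binary term vectors."""
    qprime = Counter({t: a1 for t in query.split()})
    for doc in relevant:                       # add the relevant centroid
        for t in doc.split():
            qprime[t] += 1.0 / len(relevant)
    for doc in nonrel:                         # subtract the non-relevant one
        for t in doc.split():
            qprime[t] -= 1.0 / len(nonrel)
    return dict(qprime)

Q = "safety minivans"
R = ["car safety minivans tests injury statistics", "liability tests safety"]
N = ["car passengers injury reviews"]
qprime = rocchio(Q, R, N)
```

The reweighted query boosts “safety” (in the query and both relevant documents) and pushes “car” and “injury” negative, since they appear in the non-relevant document.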
![Page 84: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/84.jpg)
(C) 2005, The University of Michigan 90
Pseudo relevance feedback
• Automatic query expansion
  – Thesaurus-based expansion (e.g., using latent semantic indexing – later…)
  – Distributional similarity
  – Query log mining
![Page 85: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/85.jpg)
(C) 2005, The University of Michigan 106
String matching
![Page 86: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/86.jpg)
(C) 2005, The University of Michigan 107
String matching methods
• Index-based
• Full or approximate
  – E.g., theater = theatre
![Page 87: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/87.jpg)
(C) 2005, The University of Michigan 108
Index-based matching
• Inverted files
• Position-based inverted files
• Block-based inverted files
1 6 9 11 17 19 24 28 33 40 46 50 55 60
This is a text. A text has many words. Words are made from letters.
Text: 11, 19
Words: 33, 40
From: 55
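The position-based inverted file above can be built in a few lines; a sketch using 1-based character offsets, which reproduces the postings on the slide:

```python
import re
from collections import defaultdict

def build_index(text):
    """Position-based inverted file: map each (lowercased) word to the
    1-based character offsets at which it starts."""
    index = defaultdict(list)
    for m in re.finditer(r"[A-Za-z]+", text):
        index[m.group().lower()].append(m.start() + 1)
    return dict(index)

text = "This is a text. A text has many words. Words are made from letters."
index = build_index(text)
```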
![Page 88: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/88.jpg)
(C) 2005, The University of Michigan 109
Inverted index (trie)
Letters: 60
Text: 11, 19
Words: 33, 40
Made: 50
Many: 28
[Figure: a trie over the indexed words – the root branches on l (letters), m, t (text), and w (words); the m branch continues with a and then splits into d (made) and n (many).]
![Page 89: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/89.jpg)
(C) 2005, The University of Michigan 110
Sequential searching
• No indexing structure given
• Given: database d and search pattern p
  – Example: find “words” in the earlier example
• Brute force method
  – Try all possible starting positions
  – O(n) positions in the database and O(m) characters in the pattern, so the total worst-case runtime is O(mn)
  – Typical runtime is actually O(n), given that mismatches are easy to notice
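The brute force method in a few lines (a sketch; in practice the inner loop usually stops after one or two comparisons, which is why the typical runtime is close to O(n)):

```python
def brute_force_search(text, pattern):
    """Try every starting position; O(m*n) in the worst case."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        j = 0
        while j < m and text[i + j] == pattern[j]:
            j += 1
        if j == m:
            return i  # 0-based position of the first match
    return -1

text = "This is a text. A text has many words. Words are made from letters."
```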
![Page 90: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/90.jpg)
(C) 2005, The University of Michigan 111
Knuth-Morris-Pratt
• Average runtime similar to brute force
• Worst case runtime is linear: O(n)
• Idea: reuse knowledge
• Need preprocessing of the pattern
![Page 91: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/91.jpg)
(C) 2005, The University of Michigan 112
Knuth-Morris-Pratt (cont’d)
• Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm)
database: ABC ABC ABC ABDAB ABCDABCDABDE
pattern: ABCDABD
Partial-match table for the pattern ABCDABD:

index:  0   1   2   3   4   5   6   7
char:   A   B   C   D   A   B   D   –
pos:   -1   0   0   0   0   1   2   0
![Page 92: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/92.jpg)
(C) 2005, The University of Michigan 113
Knuth-Morris-Pratt (cont’d)

Matching ABCDABD against ABC ABC ABC ABDAB ABCDABCDABDE: the pattern is compared left to right, and at each mismatch (marked ^ on the slide) it is shifted forward according to the partial-match table instead of restarting at the next text position, until the full match is found near the end of the text.
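A compact implementation of the table construction and the search (a sketch following the Wikipedia presentation cited above, with table[0] = -1 as on the slide):

```python
def kmp_table(p):
    """Partial-match table: t[i] = length of the longest proper prefix
    of p[:i] that is also a suffix of p[:i]; t[0] = -1 as a sentinel."""
    t = [0] * len(p)
    t[0] = -1
    pos, cnd = 2, 0
    while pos < len(p):
        if p[pos - 1] == p[cnd]:
            cnd += 1
            t[pos] = cnd
            pos += 1
        elif cnd > 0:
            cnd = t[cnd]       # fall back within the pattern
        else:
            t[pos] = 0
            pos += 1
    return t

def kmp_search(text, p):
    """Return the 0-based position of the first match, or -1."""
    t = kmp_table(p)
    m = i = 0                  # m: start of attempt, i: position in p
    while m + i < len(text):
        if p[i] == text[m + i]:
            i += 1
            if i == len(p):
                return m
        else:
            if t[i] > -1:
                m += i - t[i]  # shift by the table, keep matched prefix
                i = t[i]
            else:
                m += 1
                i = 0
    return -1

text = "ABC ABC ABC ABDAB ABCDABCDABDE"
```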
![Page 93: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/93.jpg)
(C) 2005, The University of Michigan 115
Word similarity
• Hamming distance - when words are of the same length
• Levenshtein distance - number of edits (insertions, deletions, replacements)
  – color --> colour (1)
  – survey --> surgery (2)
  – com puter --> computer ?
• Longest common subsequence (LCS)
  – lcs (survey, surgery) = surey
![Page 94: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/94.jpg)
(C) 2005, The University of Michigan 116
Levenshtein edit distance
• Examples:
  – Theatre -> theater
  – Ghaddafi -> Qadafi
  – Computer -> counter
• Edit distance (inserts, deletes, substitutions)
  – Edit transcript
• Done through dynamic programming
![Page 95: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/95.jpg)
(C) 2005, The University of Michigan 117
Recurrence relation
• Three dependencies
  – D(i,0) = i
  – D(0,j) = j
  – D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)]
• Simple edit distance:
  – t(i,j) = 0 iff S1(i) = S2(j), else 1
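The recurrence translates directly into a dynamic-programming table (the same table filled in on the next slides); a sketch:

```python
def edit_distance(s1, s2):
    """Levenshtein distance via the recurrence above:
    D(i,0)=i, D(0,j)=j,
    D(i,j) = min(D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j))."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i            # delete all of s1[:i]
    for j in range(n + 1):
        D[0][j] = j            # insert all of s2[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,      # deletion
                          D[i][j - 1] + 1,      # insertion
                          D[i - 1][j - 1] + t)  # match/substitution
    return D[m][n]
```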
![Page 96: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/96.jpg)
(C) 2005, The University of Michigan 118
Example
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1
I 2 2
N 3 3
T 4 4
N 5 5
E 6 6
R 7 7
![Page 97: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/97.jpg)
(C) 2005, The University of Michigan 119
Example (cont’d)
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
![Page 98: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/98.jpg)
(C) 2005, The University of Michigan 120
Tracebacks
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 2 3 4 5 6 7
I 2 2 2 2 2 3 4 5 6
N 3 3 3 3 3 3 4 5 6
T 4 4 4 4 4 *
N 5 5
E 6 6
R 7 7
![Page 99: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/99.jpg)
(C) 2005, The University of Michigan 121
Weighted edit distance
• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – BLOSUM
  – http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
![Page 100: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/100.jpg)
(C) 2005, The University of Michigan 122
• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
![Page 101: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/101.jpg)
(C) 2005, The University of Michigan 123
Clustering
![Page 102: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/102.jpg)
(C) 2005, The University of Michigan 124
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cluster hypothesis
  – Documents in the same cluster are relevant to the same query
![Page 103: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/103.jpg)
(C) 2005, The University of Michigan 125
Representations for document clustering
• Typically: vector-based
  – Words: “cat”, “dog”, etc.
  – Features: document length, author name, etc.
• Each document is represented as a vector in an n-dimensional space
• Similar documents appear nearby in the vector space (distance measures are needed)
![Page 104: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/104.jpg)
(C) 2005, The University of Michigan 126
Hierarchical clustering: dendrograms
http://odur.let.rug.nl/~kleiweg/clustering/clustering.html
E.g., language similarity:
![Page 105: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/105.jpg)
(C) 2005, The University of Michigan 127
Another example
• Kingdom = animal
• Phylum = Chordata
• Subphylum = Vertebrata
• Class = Osteichthyes
• Subclass = Actinopterygii
• Order = Salmoniformes
• Family = Salmonidae
• Genus = Oncorhynchus
• Species = Oncorhynchus kisutch (Coho salmon)
![Page 106: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/106.jpg)
(C) 2005, The University of Michigan 128
Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left

Q: what is the equivalent Venn diagram representation?

Example: cluster the following sentences:

A B C B A
A D C C A D E
C D E F C D A
E F G F D A
A C D A B A
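The REPEAT/UNTIL loop can be sketched directly. For brevity the “documents” here are 1-D numbers rather than the letter sentences (an illustrative assumption), and single linkage (discussed on the next slide) is used as the cluster distance:

```python
def single_linkage(points):
    """Agglomerative clustering: repeatedly merge the closest pair of
    clusters until one cluster remains. Single linkage: the distance
    between clusters is the distance between their closest members."""
    clusters = [[p] for p in points]
    merges = []                         # record of the merge order
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):  # compute pairwise similarities
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)    # identify closest pair
        d, i, j = best
        merges.append((tuple(clusters[i]), tuple(clusters[j])))
        clusters[i] = clusters[i] + clusters[j]  # merge into single node
        del clusters[j]
    return merges

merges = single_linkage([1, 2, 6, 8, 20])
```

Reading the merge list bottom-up gives the dendrogram: the tight pairs merge first, the outlier 20 last.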
![Page 107: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/107.jpg)
(C) 2005, The University of Michigan 129
Methods
• Single-linkage
  – One common pair is sufficient
  – Disadvantage: long chains
• Complete-linkage
  – All pairs have to match
  – Disadvantage: too conservative
• Average-linkage
• Centroid-based (online)
  – Look at distances to centroids
• Demo:
  – /clair4/class/ir-w05/clustering
![Page 108: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/108.jpg)
(C) 2005, The University of Michigan 130
k-means
• Needed: small number k of desired clusters
• hard vs. soft decisions
• Example: Weka
![Page 109: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/109.jpg)
(C) 2005, The University of Michigan 131
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further improvement is possible do
3 for each document d do
4 find the cluster c whose centroid is closest to d
5 assign d to cluster c
6 end for
7 for each cluster c do
8 recompute the centroid of cluster c based on its documents
9 end for
10 end while
![Page 110: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/110.jpg)
(C) 2005, The University of Michigan 132
Example
• Cluster the following vectors into two groups:
  – A = <1,6>
  – B = <2,2>
  – C = <4,0>
  – D = <3,3>
  – E = <2,5>
  – F = <2,1>
![Page 111: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/111.jpg)
(C) 2005, The University of Michigan 133
Complexity
• Complexity = O(kn) per iteration, because each of the n documents has to be compared to all k centroids; over I iterations the total is O(Ikn).
![Page 112: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/112.jpg)
(C) 2005, The University of Michigan 136
Human clustering
• Significant disagreement in the number of clusters, the overlap of clusters, and the composition of clusters (Macskassy et al. 1998).
![Page 113: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/113.jpg)
(C) 2005, The University of Michigan 137
Lexical networks
![Page 114: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/114.jpg)
(C) 2005, The University of Michigan 138
Lexical Networks
• Used to represent relationships between words
• Example: WordNet - created by George Miller’s team at Princeton
• Based on synsets (synonyms, interchangeable words) and lexical matrices
![Page 115: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/115.jpg)
(C) 2005, The University of Michigan 139
Lexical matrix
Word Meanings (rows) by Word Forms (columns):

      F1    F2    F3   …   Fn
M1    E1,1  E1,2
M2          E2,2
…
Mm                         Em,n
![Page 116: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/116.jpg)
(C) 2005, The University of Michigan 140
Synsets
• Disambiguation
  – {board, plank}
  – {board, committee}
• Synonyms
  – Substitution
  – Weak substitution
  – Synonyms must be of the same part of speech
![Page 117: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/117.jpg)
(C) 2005, The University of Michigan 141
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 senses of board
Sense 1
board
  => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping

Sense 2
board
  => sheet, flat solid => artifact, artefact => object, physical object => entity, something

Sense 3
board, plank
  => lumber, timber => building material => artifact, artefact => object, physical object => entity, something
![Page 118: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/118.jpg)
(C) 2005, The University of Michigan 142
Sense 4
display panel, display board, board
  => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 5
board, gameboard
  => surface => artifact, artefact => object, physical object => entity, something

Sense 6
board, table
  => fare => food, nutrient => substance, matter => object, physical object => entity, something
![Page 119: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/119.jpg)
(C) 2005, The University of Michigan 143
Sense 7
control panel, instrument panel, control board, board, panel
  => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 8
circuit board, circuit card, board, card
  => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something

Sense 9
dining table, board
  => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
![Page 120: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/120.jpg)
(C) 2005, The University of Michigan 144
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, descend}
![Page 121: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/121.jpg)
(C) 2005, The University of Michigan 145
Other relations
• Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”.
• Hyponymy: {tree} is a hyponym of {plant}.
• Hierarchical structure based on hyponymy (and hypernymy).
![Page 122: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/122.jpg)
(C) 2005, The University of Michigan 146
Other features of WordNet
• Index of familiarity
• Polysemy
![Page 123: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/123.jpg)
(C) 2005, The University of Michigan 147
Familiarity and polysemy

board used as a noun is familiar (polysemy count = 9)

bird used as a noun is common (polysemy count = 5)

cat used as a noun is common (polysemy count = 7)

house used as a noun is familiar (polysemy count = 11)

information used as a noun is common (polysemy count = 5)

retrieval used as a noun is uncommon (polysemy count = 3)

serendipity used as a noun is very rare (polysemy count = 1)
![Page 124: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/124.jpg)
(C) 2005, The University of Michigan 148
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
basketball backboard
big board
billboard
binder's board
binder board
blackboard
board game
board measure
board meeting
board member
board of appeals
board of directors
board of education
board of regents
board of trustees
![Page 125: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/125.jpg)
(C) 2005, The University of Michigan 149
Overview of senses

1. board -- (a committee having supervisory powers; "the board has seven members")
2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows")
3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes)
4. display panel, display board, board -- (a board on which information can be displayed to public view)
5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces")
6. board, table -- (food or meals in general; "she sets a fine table"; "room and board")
7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree")
8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities)
9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
![Page 126: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/126.jpg)
(C) 2005, The University of Michigan 150
Top-level concepts{act, action, activity}
{animal, fauna}
{artifact}
{attribute, property}
{body, corpus}
{cognition, knowledge}
{communication}
{event, happening}
{feeling, emotion}
{food}
{group, collection}
{location, place}
{motive}
{natural object}
{natural phenomenon}
{person, human being}
{plant, flora}
{possession}
{process}
{quantity, amount}
{relation}
{shape}
{state, condition}
{substance}
{time}
![Page 127: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/127.jpg)
(C) 2005, The University of Michigan 151
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn reason -simsn - synonyms
wn reason -over - overview of senses
wn reason -famln - familiarity/polysemy
wn reason -grepn - compound nouns
/data2/tools/relatedwords/relate reason
![Page 128: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/128.jpg)
(C) 2005, The University of Michigan 152
System comparison
![Page 129: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/129.jpg)
(C) 2005, The University of Michigan 153
Comparing two systems
• Comparing A and B
• One query?
• Average performance?
• Need: A to consistently outperform B
[this slide: courtesy James Allan]
![Page 130: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/130.jpg)
(C) 2005, The University of Michigan 154
The sign test
• Example 1:– A > B (12 times)
– A = B (25 times)
– A < B (3 times)
– p < 0.035 (significant at the 5% level)
• Example 2:– A > B (18 times)
– A < B (9 times)
– p < 0.122 (not significant at the 5% level)[this slide: courtesy James Allan]
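Both examples can be checked with a few lines: ties (A = B) are dropped, and under the null hypothesis each remaining query is a fair coin flip, so the p-value is a two-sided binomial tail (a sketch; `sign_test` is an illustrative helper, not a library function):

```python
from math import comb

def sign_test(wins, losses):
    """Two-sided sign test. Ties are assumed to have been dropped
    before calling; under H0, P(win) = 0.5 for each remaining query."""
    n, k = wins + losses, max(wins, losses)
    tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

p1 = sign_test(12, 3)   # example 1: the 25 ties are ignored
p2 = sign_test(18, 9)   # example 2
```

This reproduces the slide: p1 is about 0.035 (significant at the 5% level), p2 about 0.122 (not significant).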
![Page 131: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/131.jpg)
(C) 2005, The University of Michigan 155
Other tests
• The t test:
  – Takes into account the actual performances, not just which system is better
  – http://nimitz.mcs.kent.edu/~blewis/stat/tTest.html
• The sign test:
  – http://www.fon.hum.uva.nl/Service/Statistics/Sign_Test.html
![Page 132: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/132.jpg)
(C) 2005, The University of Michigan 156
Techniques for dimensionality reduction: SVD and LSI
![Page 133: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/133.jpg)
(C) 2005, The University of Michigan 157
Techniques for dimensionality reduction
• Based on matrix decomposition (goal: preserve clusters, explain away variance)
• A quick review of matrices
  – Vectors
  – Matrices
  – Matrix multiplication

[Slide shows a small worked matrix-times-vector multiplication example.]
![Page 134: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/134.jpg)
(C) 2005, The University of Michigan 158
SVD: Singular Value Decomposition
• A = UΣV^T
• This decomposition exists for all matrices, dense or sparse
• If A has 3 rows and 5 columns, then U will be 3x3 and V will be 5x5
• In Matlab, use [U,S,V] = svd (A)
![Page 135: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/135.jpg)
(C) 2005, The University of Michigan 159
Term matrix normalization
[Slide shows a 0/1 term-document matrix A over documents D1–D5 and its normalized version A(n), in which each column is scaled to unit Euclidean length – e.g., entries become 0.71 = 1/√2, 0.58 = 1/√3, or 0.45 = 1/√5, depending on how many terms the document contains.]
![Page 136: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/136.jpg)
(C) 2005, The University of Michigan 160
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: home
• T6: infant
• T7: proofing
• T8: safety
• T9: toddler

• D1: infant & toddler first aid
• D2: babies & children’s room (for your home)
• D3: child safety at home
• D4: your baby’s health and safety: from infant to toddler
• D5: baby proofing basics
• D6: your guide to easy rust proofing
• D7: beanie babies collector’s guide
![Page 137: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/137.jpg)
(C) 2005, The University of Michigan 161
Document term matrix
A (terms T1–T9 by documents D1–D7):

            D1  D2  D3  D4  D5  D6  D7
baby         0   1   0   1   1   0   1
child        0   1   1   0   0   0   0
guide        0   0   0   0   0   1   1
health       0   0   0   1   0   0   0
home         0   1   1   0   0   0   0
infant       1   0   0   1   0   0   0
proofing     0   0   0   0   1   1   0
safety       0   0   1   1   0   0   0
toddler      1   0   0   1   0   0   0

A(n): the same matrix with each column normalized to unit length (non-zero entries become 0.71 = 1/√2, 0.58 = 1/√3, or 0.45 = 1/√5).
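The decomposition on the following slides can be reproduced with numpy (a sketch: the incidence matrix is rebuilt from the term and document lists above, and its columns are normalized and rounded to two decimals, matching the 0.71/0.58/0.45 entries shown on the slides):

```python
import numpy as np

# Term-document incidence matrix (terms T1..T9 x documents D1..D7)
A = np.array([
    [0, 1, 0, 1, 1, 0, 1],  # baby
    [0, 1, 1, 0, 0, 0, 0],  # child
    [0, 0, 0, 0, 0, 1, 1],  # guide
    [0, 0, 0, 1, 0, 0, 0],  # health
    [0, 1, 1, 0, 0, 0, 0],  # home
    [1, 0, 0, 1, 0, 0, 0],  # infant
    [0, 0, 0, 0, 1, 1, 0],  # proofing
    [0, 0, 1, 1, 0, 0, 0],  # safety
    [1, 0, 0, 1, 0, 0, 0],  # toddler
], dtype=float)

# Normalize each column to unit length, rounded to two decimals as on the slides
An = np.round(A / np.linalg.norm(A, axis=0), 2)

U, s, Vt = np.linalg.svd(An)      # An = U @ S @ Vt
S = np.zeros_like(An)
np.fill_diagonal(S, s)

# Rank-2 approximation: keep only the two largest singular values
A2 = U[:, :2] @ np.diag(s[:2]) @ Vt[:2, :]
```

The singular values come out as on the "s =" slide (largest about 1.5849), and the Frobenius error of the rank-2 approximation equals the energy in the discarded singular values.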
![Page 138: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/138.jpg)
(C) 2005, The University of Michigan 162
Decomposition

u =
-0.6976 -0.0945 0.0174 -0.6950 0.0000 0.0153 0.1442 -0.0000 0 -0.2622 0.2946 0.4693 0.1968 -0.0000 -0.2467 -0.1571 -0.6356 0.3098 -0.3519 -0.4495 -0.1026 0.4014 0.7071 -0.0065 -0.0493 -0.0000 0.0000 -0.1127 0.1416 -0.1478 -0.0734 0.0000 0.4842 -0.8400 0.0000 -0.0000 -0.2622 0.2946 0.4693 0.1968 0.0000 -0.2467 -0.1571 0.6356 -0.3098 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 -0.3098 -0.6356 -0.3519 -0.4495 -0.1026 0.4014 -0.7071 -0.0065 -0.0493 0.0000 -0.0000 -0.2112 0.3334 0.0962 0.2819 -0.0000 0.7338 0.4659 -0.0000 0.0000 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 0.3098 0.6356
v =
-0.1687 0.4192 -0.5986 0.2261 0 -0.5720 0.2433 -0.4472 0.2255 0.4641 -0.2187 0.0000 -0.4871 -0.4987 -0.2692 0.4206 0.5024 0.4900 -0.0000 0.2450 0.4451 -0.3970 0.4003 -0.3923 -0.1305 0 0.6124 -0.3690 -0.4702 -0.3037 -0.0507 -0.2607 -0.7071 0.0110 0.3407 -0.3153 -0.5018 -0.1220 0.7128 -0.0000 -0.0162 -0.3544 -0.4702 -0.3037 -0.0507 -0.2607 0.7071 0.0110 0.3407
![Page 139: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/139.jpg)
(C) 2005, The University of Michigan 163
Decomposition
s = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0.7100 0 0 0 0 0 0 0 0.5692 0 0 0 0 0 0 0 0.1977 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Spread on the v1 axis
![Page 140: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/140.jpg)
(C) 2005, The University of Michigan 164
Rank-4 approximation

s4 =
1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
![Page 141: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/141.jpg)
(C) 2005, The University of Michigan 165
Rank-4 approximation

u*s4*v' =
-0.0019  0.5985 -0.0148  0.4552  0.7002  0.0102  0.7002
-0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
 0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
 0.1980  0.0514  0.0064  0.2199  0.0535 -0.0544  0.0535
-0.0728  0.4961  0.6282  0.0745  0.0121 -0.0133  0.0121
 0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
 0.0003 -0.0067  0.0052 -0.0013  0.3584  0.7065  0.3584
 0.2165  0.2494  0.4367  0.2282 -0.0360  0.0394 -0.0360
 0.6337 -0.0602  0.0290  0.5324 -0.0008  0.0003 -0.0008
![Page 142: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/142.jpg)
(C) 2005, The University of Michigan 166
Rank-4 approximation

u*s4 =
-1.1056 -0.1203  0.0207 -0.5558  0  0  0
-0.4155  0.3748  0.5606  0.1573  0  0  0
-0.5576 -0.5719 -0.1226  0.3210  0  0  0
-0.1786  0.1801 -0.1765 -0.0587  0  0  0
-0.4155  0.3748  0.5606  0.1573  0  0  0
-0.2984  0.4778 -0.6015  0.1018  0  0  0
-0.5576 -0.5719 -0.1226  0.3210  0  0  0
-0.3348  0.4241  0.1149  0.2255  0  0  0
-0.2984  0.4778 -0.6015  0.1018  0  0  0
![Page 143: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/143.jpg)
(C) 2005, The University of Michigan 167
Rank-4 approximation

s4*v' =
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 -0.7150 0.5544 0.6001 -0.4686 -0.0605 -0.1457 -0.0605 0.1808 -0.1749 0.3918 -0.1043 -0.2085 0.5700 -0.2085 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
![Page 144: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/144.jpg)
(C) 2005, The University of Michigan 168
Rank-2 approximation

s2 =
1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
![Page 145: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/145.jpg)
(C) 2005, The University of Michigan 169
Rank-2 approximation

u*s2*v' =
0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.1057 0.1205 0.1239 0.1430 0.0293 -0.0341 0.0293 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.2343 0.2454 0.2685 0.3027 0.0286 -0.1073 0.0286 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048
![Page 146: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/146.jpg)
(C) 2005, The University of Michigan 170
Rank-2 approximation

u*s2 =
-1.1056 -0.1203 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.1786 0.1801 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.3348 0.4241 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0
![Page 147: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/147.jpg)
(C) 2005, The University of Michigan 171
Rank-2 approximation: s2*v'
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451
 0.5333  0.2869  0.5351  0.5092 -0.3863 -0.6384 -0.3863
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
 0       0       0       0       0       0       0
![Page 148: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/148.jpg)
(C) 2005, The University of Michigan 172
Documents to concepts and terms to concepts
>> A(:,1)'*u*s
-0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095
>> A(:,1)'*u*s4
-0.4238 0.6784 -0.8541 0.1446 0 0 0
>> A(:,1)'*u*s2
-0.4238 0.6784 0 0 0 0 0
>> A(:,2)'*u*s2
-1.1233 0.3650 0 0 0 0 0
>> A(:,3)'*u*s2
-0.6762 0.6807 0 0 0 0 0
![Page 149: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/149.jpg)
(C) 2005, The University of Michigan 173
Documents to concepts and terms to concepts
>> A(:,4)'*u*s2
-0.9972 0.6478 0 0 0 0 0
>> A(:,5)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
>> A(:,6)'*u*s2
-0.7918 -0.8121 0 0 0 0 0
>> A(:,7)'*u*s2
-1.1809 -0.4914 0 0 0 0 0
![Page 150: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/150.jpg)
(C) 2005, The University of Michigan 174
Cont’d
>> (s2*v'*A(1,:)')'
-1.7523 -0.1530 0 0 0 0 0 0 0
>> (s2*v'*A(2,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
>> (s2*v'*A(3,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(4,:)')'
-0.2831 0.2291 0 0 0 0 0 0 0
>> (s2*v'*A(5,:)')'
-0.6585 0.4768 0 0 0 0 0 0 0
![Page 151: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/151.jpg)
(C) 2005, The University of Michigan 175
Cont’d
>> (s2*v'*A(6,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
>> (s2*v'*A(7,:)')'
-0.8838 -0.7275 0 0 0 0 0 0 0
>> (s2*v'*A(8,:)')'
-0.5306 0.5395 0 0 0 0 0 0 0
>> (s2*v'*A(9,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
![Page 152: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/152.jpg)
(C) 2005, The University of Michigan 176
Properties
A*A'
1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025
0.3364 0.6728 0      0      0.6728 0      0      0.3364 0
0.5041 0      1.0082 0      0      0      0.5041 0      0
0.2025 0      0      0.2025 0      0.2025 0      0.2025 0.2025
0.3364 0.6728 0      0      0.6728 0      0      0.3364 0
0.2025 0      0      0.2025 0      0.7066 0      0.2025 0.7066
0.5041 0      0.5041 0      0      0      1.0082 0      0
0.2025 0.3364 0      0.2025 0.3364 0.2025 0      0.5389 0.2025
0.2025 0      0      0.2025 0      0.7066 0      0.2025 0.7066
A'*A
1.0082 0      0      0.6390 0      0      0
0      1.0092 0.6728 0.2610 0.4118 0      0.4118
0      0.6728 1.0092 0.2610 0      0      0
0.6390 0.2610 0.2610 1.0125 0.3195 0      0.3195
0      0.4118 0      0.3195 1.0082 0.5041 0.5041
0      0      0      0      0.5041 1.0082 0.5041
0      0.4118 0      0.3195 0.5041 0.5041 1.0082
A is a document-to-term matrix. What is A*A’, what is A’*A?
![Page 153: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/153.jpg)
(C) 2005, The University of Michigan 177
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of hidden (latent) concepts
• Query matching in latent space
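As a sketch of these two steps (not part of the slides — the toy term-document matrix, the rank k, and all variable names are made up for illustration), the truncated SVD and latent-space query matching can be written with NumPy:

```python
import numpy as np

# Toy term-document matrix (terms x documents); values are illustrative.
A = np.array([
    [1.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
])

# SVD: A = U * diag(s) * Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-2 approximation: keep only the two largest singular values.
k = 2
A2 = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Fold a query (a term vector) into the k-dimensional latent space,
# then compare it to the documents by cosine similarity there.
q = np.array([1.0, 1.0, 0.0, 0.0])            # query mentioning terms 1 and 2
q_latent = q @ U[:, :k]                       # query in concept space
docs_latent = (np.diag(s[:k]) @ Vt[:k, :]).T  # documents in concept space

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

sims = [cosine(q_latent, d) for d in docs_latent]
best = int(np.argmax(sims))                   # index of best-matching document
```

The point of matching in the latent space rather than on raw terms is that documents sharing no query words can still score well if they use co-occurring vocabulary.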
![Page 154: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/154.jpg)
(C) 2005, The University of Michigan 178
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com/
• http://www.cs.utk.edu/~lsi/
• http://javelina.cet.middlebury.edu/lsa/out/lsa_definition.htm
• http://citeseer.nj.nec.com/deerwester90indexing.html
• http://www.pcug.org.au/~jdowling/
![Page 155: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/155.jpg)
(C) 2005, The University of Michigan 179
Models of the Web
![Page 156: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/156.jpg)
(C) 2005, The University of Michigan 180
Size
• The Web is the largest repository of data, and it grows exponentially.
– 320 Million Web pages [Lawrence & Giles 1998]
– 800 Million Web pages, 15 TB [Lawrence & Giles 1999]
– 8 Billion Web pages indexed [Google 2005]
• Amount of data
– roughly 200 TB [Lyman et al. 2003]
![Page 157: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/157.jpg)
(C) 2005, The University of Michigan 181
Bow-tie model of the Web
[Figure: bow-tie structure of the Web — SCC 56M pages, IN 44M, OUT 44M, TENDRILS 44M, DISC 17M]
Bröder & al. WWW 2000, Dill & al. VLDB 2001
24% of pages reachable from a given page
![Page 158: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/158.jpg)
(C) 2005, The University of Michigan 182
Power laws
• Web site size (Huberman and Adamic 1999)
• Power-law connectivity (Barabási and Albert 1999): exponents 2.45 for the out-degree and 2.1 for the in-degree
• Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdős; the collaboration graph of actors; metabolic pathways (Jeong et al. 2000); protein networks (Maslov and Sneppen 2002). All values of gamma are around 2–3.
![Page 159: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/159.jpg)
(C) 2005, The University of Michigan 183
Small-world networks
• Diameter = average length of the shortest path between all pairs of nodes. Example…
• Milgram experiment (1967)
– Kansas/Omaha --> Boston (42/160 letters)
– diameter = 6
• Albert et al. 1999 – average distance between two vertices is d = 0.35 + 2.06 log10 n. For n = 10^9, d = 18.89.
• Six degrees of separation
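The Albert et al. fit above is easy to check numerically (a sketch, not from the slides; the function name is made up):

```python
import math

def avg_distance(n):
    """Albert et al. (1999) fit for the average shortest-path
    distance on a Web graph of n pages: d = 0.35 + 2.06 * log10(n)."""
    return 0.35 + 2.06 * math.log10(n)

d = avg_distance(10**9)   # n = one billion pages -> 0.35 + 2.06*9 = 18.89
```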
![Page 160: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/160.jpg)
(C) 2005, The University of Michigan 184
Clustering coefficient
• Cliquishness (C): the fraction of the kv (kv – 1)/2 possible pairs among a node v’s kv neighbors that are actually connected.
• Examples:
             n        k      d      d_rand   C      C_rand
Actors       225226   61     3.65   2.99     0.79   0.00027
Power grid   4941     2.67   18.7   12.4     0.08   0.005
C. elegans   282      14     2.65   2.25     0.28   0.05
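The cliquishness definition can be sketched directly (not from the slides; the toy graph and names are made up for illustration):

```python
def clustering_coefficient(adj, v):
    """Fraction of the k(k-1)/2 possible pairs among v's neighbors
    that are actually linked (Watts & Strogatz cliquishness).
    adj: dict mapping node -> set of neighbors (undirected graph)."""
    nbrs = list(adj[v])
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for i in range(k) for j in range(i + 1, k)
                if nbrs[j] in adj[nbrs[i]])
    return links / (k * (k - 1) / 2)

# Toy graph: a triangle 1-2-3 plus a pendant node 4 attached to 1.
g = {1: {2, 3, 4}, 2: {1, 3}, 3: {1, 2}, 4: {1}}
# Node 1 has 3 neighbors -> 3 possible pairs, only (2,3) is linked -> C = 1/3.
```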
![Page 161: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/161.jpg)
(C) 2005, The University of Michigan 185
Models of the Web
P(k) = e^(−pN) (pN)^k / k!   (Poisson degree distribution — random graph)
P(k) ∝ k^(−γ)                (power-law degree distribution — scale-free graph)
• Erdős/Rényi 59, 60
• Barabási/Albert 99
• Watts/Strogatz 98
• Kleinberg 98
• Menczer 02
• Radev 03
• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
![Page 162: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/162.jpg)
(C) 2005, The University of Michigan 188
Social network analysis for IR
![Page 163: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/163.jpg)
(C) 2005, The University of Michigan 189
Social networks
• Induced by a relation
• Symmetric or not
• Examples:
– Friendship networks
– Board membership
– Citations
– Power grid of the US
– WWW
![Page 164: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/164.jpg)
(C) 2005, The University of Michigan 190
Krebs 2004
![Page 165: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/165.jpg)
(C) 2005, The University of Michigan 191
Prestige and centrality
• Degree centrality: how many neighbors each node has.
• Closeness centrality: how close a node is to all of the other nodes
• Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes
• Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects.
• Prestige = same as centrality but for directed graphs.
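The first two measures in the list can be sketched in a few lines (not from the slides; pure-Python BFS, with a made-up star graph as the example):

```python
from collections import deque

def degree_centrality(adj, v):
    """Number of neighbors of v."""
    return len(adj[v])

def closeness_centrality(adj, v):
    """(n-1) divided by the sum of BFS distances from v to all
    reachable nodes -- 1.0 when v is adjacent to everyone."""
    dist = {v: 0}
    q = deque([v])
    while q:
        u = q.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# Star graph: the hub (node 0) reaches every leaf in one step.
star = {0: {1, 2, 3}, 1: {0}, 2: {0}, 3: {0}}
```

On the star, the hub gets closeness 1.0 while each leaf gets 3/5 = 0.6, matching the intuition that the hub is the most central node.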
![Page 166: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/166.jpg)
(C) 2005, The University of Michigan 192
Graph-based representations
[Figure: a graph G (V,E) on 8 nodes and its square connectivity (incidence) matrix, with a 1 in cell (i,j) for each edge from node i to node j]
![Page 167: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/167.jpg)
(C) 2005, The University of Michigan 193
Markov chains
• A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E.
• Path = sequence (x0, x1, …, xn), where xi = xi−1·E
• The probability of a path can be computed as a product of probabilities for each step i.
• Random walk = find Xj given x0, E, and j.
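The two bullets above amount to repeated vector-kernel multiplication; a minimal sketch (not from the slides — the two-state kernel and the function names are made up):

```python
def step(x, E):
    """One Markov-chain step: x_i = x_{i-1} * E (row vector times kernel)."""
    n = len(E)
    return [sum(x[j] * E[j][i] for j in range(n)) for i in range(n)]

def random_walk(x0, E, j):
    """Find X_j given x0, E, and j by applying the kernel j times."""
    x = x0
    for _ in range(j):
        x = step(x, E)
    return x

# Two-state kernel; each row sums to 1 (stochastic).
E = [[0.9, 0.1],
     [0.5, 0.5]]
x0 = [1.0, 0.0]   # start with all mass on state 0
```

Because each row of E sums to 1, the distribution stays normalized after every step.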
![Page 168: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/168.jpg)
(C) 2005, The University of Michigan 194
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions:
– E is stochastic
– E is irreducible
– E is aperiodic
• To make these conditions true:
– All rows of E add up to 1 (and no value is negative)
– Make sure that E is strongly connected
– Make sure that E is not bipartite
• Example: PageRank [Brin and Page 1998]: use “teleportation”
![Page 169: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/169.jpg)
(C) 2005, The University of Michigan 195
Example

This graph E has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.

[Figure: the 8-node example graph; bar charts of the PageRank of nodes 1–8 at t=0 and t=1]
![Page 170: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/170.jpg)
(C) 2005, The University of Michigan 196
Eigenvectors
• An eigenvector is an implicit “direction” for a matrix: Mv = λv, where v is non-zero, though λ can be any complex number in principle.
• The largest eigenvalue of a stochastic matrix E is real: λ1 = 1.
• For λ1, the left (principal) eigenvector is p; the right eigenvector is 1 (the all-ones vector).
• In other words, E^T p = p.
![Page 171: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/171.jpg)
(C) 2005, The University of Michigan 197
Computing the stationary distribution
E^T p = p
(I − E^T) p = 0

function PowerStatDist (E):
begin
  p(0) = u;   (or p(0) = [1,0,…,0])
  i = 1;
  repeat
    p(i) = E^T p(i−1);
    L = ||p(i) − p(i−1)||1;
    i = i + 1;
  until L < ε
  return p(i)
end

Solution for the stationary distribution
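PowerStatDist translates almost line-for-line into Python (a sketch, not from the slides; the two-state kernel is made up, and eps plays the role of the stopping threshold):

```python
def stationary_distribution(E, eps=1e-12):
    """Power iteration in the style of PowerStatDist:
    p(i) = E^T p(i-1), stopping when the L1 change drops below eps."""
    n = len(E)
    p = [1.0 / n] * n                        # p(0) = uniform
    while True:
        p_new = [sum(E[j][i] * p[j] for j in range(n)) for i in range(n)]
        if sum(abs(a - b) for a, b in zip(p_new, p)) < eps:
            return p_new
        p = p_new

# Stochastic, irreducible, aperiodic two-state kernel.
E = [[0.9, 0.1],
     [0.5, 0.5]]
p = stationary_distribution(E)   # converges to [5/6, 1/6]
```

Solving p = pE by hand gives 0.1·p0 = 0.5·p1, so p = (5/6, 1/6), which the iteration reproduces.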
![Page 172: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/172.jpg)
(C) 2005, The University of Michigan 198
Example

[Figure: the 8-node example graph; bar charts of the PageRank of nodes 1–8 at t=0, t=1, and t=10]
![Page 173: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/173.jpg)
(C) 2005, The University of Michigan 199
How Google works
• Crawling
• Anchor text
• Fast query processing
• Pagerank
![Page 174: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/174.jpg)
(C) 2005, The University of Michigan 200
More about PageRank
• Named after Larry Page, co-founder of Google (and UM alum)
• Reading “The anatomy of a large-scale hypertextual web search engine” by Brin and Page.
• Independent of query (although more recent work by Haveliwala (WWW 2002) has also identified topic-based PageRank).
![Page 175: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/175.jpg)
(C) 2005, The University of Michigan 201
HITS
• Query-dependent model (Kleinberg 97)
• Hubs and authorities (e.g., cars, Honda)
• Algorithm
– obtain the root set using the input query
– expand the root set by radius one
– run iterations on the hub and authority scores together
– report top-ranking authorities and hubs
a′ = E^T h        h′ = E a
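The two update rules iterate to a fixed point once the scores are normalized each round; a minimal sketch (not from the slides — the 3-page toy graph and function name are made up):

```python
def hits(E, iters=50):
    """Iterate a' = E^T h and h' = E a with L2 normalization.
    E[i][j] = 1 if page i links to page j."""
    n = len(E)
    a = [1.0] * n
    h = [1.0] * n
    for _ in range(iters):
        a = [sum(E[i][j] * h[i] for i in range(n)) for j in range(n)]
        h = [sum(E[i][j] * a[j] for j in range(n)) for i in range(n)]
        na = sum(x * x for x in a) ** 0.5 or 1.0
        nh = sum(x * x for x in h) ** 0.5 or 1.0
        a = [x / na for x in a]
        h = [x / nh for x in h]
    return a, h

# Page 0 links to pages 1 and 2: 0 is a pure hub, 1 and 2 pure authorities.
E = [[0, 1, 1],
     [0, 0, 0],
     [0, 0, 0]]
a, h = hits(E)
```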
![Page 176: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/176.jpg)
(C) 2005, The University of Michigan 202
The link-content hypothesis
• Topical locality: a page tends to be similar to the page that points to it.
• Davison (TF*IDF, 100K pages)
– 0.31 same domain
– 0.23 linked pages
– 0.19 sibling
– 0.02 random
• Menczer (373K pages, non-linear least squares fit)
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
[Garbled equation: Menczer’s non-linear exponential-decay fit of page similarity as a function of link distance; fitted parameters 1.8, 0.6, and 0.03]
![Page 177: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/177.jpg)
(C) 2005, The University of Michigan 203
Measuring the Web
![Page 178: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/178.jpg)
(C) 2005, The University of Michigan 204
Bharat and Broder 1998
• Based on crawls of HotBot, Altavista, Excite, and InfoSeek
• 10,000 queries in mid and late 1997
• Estimate is 200M pages
• Only 1.4% are indexed by all of them
![Page 179: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/179.jpg)
(C) 2005, The University of Michigan 205
Example (from Bharat & Broder)
A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
![Page 180: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/180.jpg)
(C) 2005, The University of Michigan 212
Question answering
![Page 181: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/181.jpg)
(C) 2005, The University of Michigan 213
People ask questions
• Excite corpus of 2,477,283 queries (one day’s worth)
• 8.4% of them are questions
– 43.9% factual (what is the country code for Belgium)
– 56.1% procedural (how do I set up TCP/IP) or other
• In other words, 100 K questions per day
![Page 182: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/182.jpg)
(C) 2005, The University of Michigan 214
People ask questions
In what year did baseball become an offical sport?
Who is the largest man in the world?
Where can i get information on Raphael?
where can i find information on puritan religion?
Where can I find how much my house is worth?
how do i get out of debt?
Where can I found out how to pass a drug test?
When is the Super Bowl?
who is California's District State Senator?
where can I buy extra nibs for a foutain pen?
how do i set up tcp/ip ?
what time is it in west samoa?
Where can I buy a little kitty cat?
what are the symptoms of attention deficit disorder?
Where can I get some information on Michael Jordan?
How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero?
When did the Neanderthal man live?
Which Frenchman declined the Nobel Prize for Literature for ideological reasons?
What is the largest city in Northern Afghanistan?
![Page 183: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/183.jpg)
(C) 2005, The University of Michigan 215
![Page 184: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/184.jpg)
(C) 2005, The University of Michigan 216
Question answering
What is the largest city in Northern Afghanistan?
![Page 185: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/185.jpg)
(C) 2005, The University of Michigan 217
Possible approaches
• Map?
• Knowledge base
Find x: city(x) ∧ located(x, ”Northern Afghanistan”) ∧ ¬∃y: city(y) ∧ located(y, ”Northern Afghanistan”) ∧ greaterthan(population(y), population(x))
• Database?
• World factbook?
• Search engine?
![Page 186: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/186.jpg)
(C) 2005, The University of Michigan 218
The TREC Q&A evaluation
• Run by NIST [Voorhees and Tice 2000]
• 2GB of input
• 200 questions
• Essentially fact extraction
– Who was Lincoln’s secretary of state?
– What does the Peugeot company manufacture?
• Questions are based on text
• Answers are assumed to be present
• No inference needed
![Page 187: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/187.jpg)
(C) 2005, The University of Michigan 219
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: How tall is the Matterhorn?
A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches
Q: How tall is the replica of the Matterhorn at Disneyland?
A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years
Q: If Iraq attacks a neighboring country, what should the US do?
A: ??
Question answering
![Page 188: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/188.jpg)
(C) 2005, The University of Michigan 220
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies".
Q: What is the brightest star visible from Earth?
Q: What are the Valdez Principles?
Q: Name a film that has won the Golden Bear in the Berlin Film Festival?
Q: Name a country that is developing a magnetic levitation railway system?
Q: Name the first private citizen to fly in space.
Q: What did Shostakovich write for Rostropovich?
Q: What is the term for the sum of all genetic material in a given organism?
Q: What is considered the costliest disaster the insurance industry has ever faced?
Q: What is Head Start?
Q: What was Agent Orange used for during the Vietnam War?
Q: What did John Hinckley do to impress Jodie Foster?
Q: What was the first Gilbert and Sullivan opera?
Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics?
Q: How did Socrates die?
Q: Why are electric cars less efficient in the north-east than in California?
![Page 189: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/189.jpg)
(C) 2005, The University of Michigan 221
NSIR
• Current project at U-M
– http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi
• Reading:
– [Radev et al., 2005a]
• Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005
• http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html
![Page 190: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/190.jpg)
(C) 2005, The University of Michigan 222
![Page 191: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/191.jpg)
(C) 2005, The University of Michigan 223
... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined.Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603
... died in Kano, northern Nigeria's largest city, during two days of anti-American riotsled by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/
... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significantblow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k
... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889),is the largest city in Uttar Pradesh, a state in northern India. . ... school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html
... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlyingregions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/
... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: EducationOffers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/
... within three miles of the airport at Mazar-e-Sharif, the largest city in northernAfghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html
![Page 192: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/192.jpg)
(C) 2005, The University of Michigan 224
Document retrieval
Query modulation
Sentence retrieval
Answer extraction
Answer ranking
What is the largest city in Northern Afghanistan?
(largest OR biggest) city “Northern Afghanistan”
www.infoplease.com/cgi-bin/id/A0855603www.washingtonpost.com/wp-dyn/print/world/
Gudermes, Chechnya's second largest town … location in Afghanistan's outlying regionswithin three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan
GudermesMazer-e-Sharif
Mazer-e-SharifGudermes
![Page 193: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/193.jpg)
(C) 2005, The University of Michigan 225
![Page 194: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/194.jpg)
(C) 2005, The University of Michigan 226
![Page 195: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/195.jpg)
(C) 2005, The University of Michigan 227
Research problems
• Source identification:
– semi-structured vs. text sources
• Query modulation:
– best paraphrase of a NL question given the syntax of a search engine?
– compare two approaches: noisy channel model and rule-based
• Sentence ranking:
– n-gram matching, Okapi, co-reference?
• Answer extraction:
– question type identification
– phrase chunking
– no general-purpose named entity tagger available
• Answer ranking:
– what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency
• Evaluation (MRDR):
– accuracy, reliability, timeliness
![Page 196: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/196.jpg)
(C) 2005, The University of Michigan 228
Document retrieval
• Use existing search engines: Google, AlltheWeb, NorthernLight
• No modifications to question
• CF: work on QASM (ACM CIKM 2001)
![Page 197: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/197.jpg)
(C) 2005, The University of Michigan 229
Sentence ranking

• Weighted N-gram matching:

  S = w1 · Σ_{i=1..N1} tf_i · idf_i  +  w2 · Σ_{j=1..N2} tf_j  +  w3 · Σ_{k=1..N3} tf_k

  (the three sums run over the matching unigrams, bigrams, and trigrams)

• Weights are determined empirically, e.g., 0.6, 0.3, and 0.1
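A minimal sketch of this weighted n-gram scorer (not from the slides — the function names are made up, the idf table is a hypothetical precomputed input, and the weights are the slide’s example 0.6/0.3/0.1):

```python
def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_score(query, sentence, idf, w=(0.6, 0.3, 0.1)):
    """Weighted n-gram overlap between a query and a candidate sentence:
    w1 * sum of tf*idf over the query's unigrams found in the sentence,
    plus w2 and w3 times the counts of shared bigrams and trigrams.
    Words missing from the idf table default to 1.0."""
    q, s = query.lower().split(), sentence.lower().split()
    score = w[0] * sum(s.count(t) * idf.get(t, 1.0) for t in set(q))
    for n, wn in ((2, w[1]), (3, w[2])):
        shared = set(ngrams(q, n)) & set(ngrams(s, n))
        score += wn * len(shared)
    return score

# Two shared unigrams (0.6*2) plus one shared bigram (0.3) -> 1.5.
score = sentence_score("largest city",
                       "the largest city in northern afghanistan", {})
```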
![Page 198: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/198.jpg)
(C) 2005, The University of Michigan 230
Probabilistic phrase reranking
• Answer extraction: probabilistic phrase reranking. What is:
p(ph is answer to q | q, ph)
• Evaluation: TRDR
– Example: (2,8,10) gives .725
– Document, sentence, or phrase level
• Criterion: presence of answer(s)
• High correlation with manual assessment
TRDR = Σ_{i=1..n} 1/r_i   (r_i = rank at which the i-th correct answer appears)
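The TRDR example works out directly (a sketch, not from the slides; the function name is made up):

```python
def trdr(ranks):
    """Total reciprocal document rank: sum of 1/r over the ranks
    at which correct answers appear."""
    return sum(1.0 / r for r in ranks)

# Answers found at ranks 2, 8, and 10: 0.5 + 0.125 + 0.1 = 0.725
score = trdr([2, 8, 10])
```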
![Page 199: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/199.jpg)
(C) 2005, The University of Michigan 231
Phrase types
PERSON PLACE DATE NUMBER DEFINITION ORGANIZATION DESCRIPTION ABBREVIATION KNOWNFOR RATE LENGTH MONEY REASON DURATION PURPOSE NOMINAL OTHER
![Page 200: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/200.jpg)
(C) 2005, The University of Michigan 232
Question Type Identification
• Wh-type not sufficient:
– Who: PERSON 77, DESCRIPTION 19, ORG 6
– What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc.
– How: NUMBER 33, LENGTH 6, RATE 2, etc.
• Ripper:
– 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc.
– Top 2 question types
• Heuristic algorithm:
– About 100 regular expressions based on words and parts of speech
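A hypothetical miniature of the heuristic rule set (not the slides’ actual ~100 patterns — these five rules and the type labels they return are made up for illustration):

```python
import re

# Each rule maps a regular expression over the question to a phrase type.
RULES = [
    (re.compile(r'^who\b', re.I), 'PERSON'),
    (re.compile(r'^where\b', re.I), 'PLACE'),
    (re.compile(r'^when\b', re.I), 'DATE'),
    (re.compile(r'^how (many|much|tall|far)\b', re.I), 'NUMBER'),
    (re.compile(r'^what is\b', re.I), 'DEFINITION'),
]

def question_type(q):
    """Return the type of the first matching rule, or OTHER."""
    for pattern, qtype in RULES:
        if pattern.search(q):
            return qtype
    return 'OTHER'
```

Real rule sets also look at part-of-speech tags, which is why a handful of surface patterns like these cannot fully disambiguate "What" questions.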
![Page 201: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/201.jpg)
(C) 2005, The University of Michigan 233
Ripper performance
Training     Test     Train Error Rate   Test Error Rate
TREC8,9      TREC10   17.03%             30%
TREC9        TREC8    22.4%              24%
TREC8,9,10   —        20.69%             —
![Page 202: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/202.jpg)
(C) 2005, The University of Michigan 234
Regex performance
Training     Test on TREC9   Test on TREC8   Test on TREC10
TREC8,9,10   4.6%            5.5%            7.6%
TREC8,9      7.4%            6%              18.2%
TREC9        7.8%            15%             18%
![Page 203: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/203.jpg)
(C) 2005, The University of Michigan 235
Phrase ranking
• Phrases are identified by a shallow parser (ltchunk from Edinburgh)
• Four features:– Proximity– POS (part-of-speech) signature (qtype)– Query overlap– Frequency
![Page 204: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/204.jpg)
(C) 2005, The University of Michigan 236
Proximity
• Phrasal answers tend to appear near words from the query
• Average distance = 7 words, range = 1 to 50 words
• Use linear rescaling of scores
![Page 205: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/205.jpg)
(C) 2005, The University of Michigan 237
Part of speech signature
Signature    Phrase Types
VBD          NO (100%)
DT NN        NO (86.7%), PERSON (3.8%), NUMBER (3.8%), ORG (2.5%)
NNP          PERSON (37.4%), PLACE (29.6%), DATE (21.7%), NO (7.6%)
DT JJ NNP    NO (75.6%), NUMBER (11.1%), PLACE (4.4%), ORG (4.4%)
NNP NNP      PLACE (37.3%), PERSON (35.6%), NO (16.9%), ORG (10.2%)
DT NNP       ORG (55.6%), NO (33.3%), PLACE (5.6%), DATE (5.6%)
Example: “Hugo/NNP Young/NNP”P (PERSON | “NNP NNP”) = .458
Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN”P (PERSON | “DT NNP NNP NNP NN”) = 0
Penn Treebank tagset (DT = determiner, JJ = adjective)
![Page 206: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/206.jpg)
(C) 2005, The University of Michigan 238
Query overlap and frequency
• Query overlap:
– What is the capital of Zimbabwe?
– Possible choices: Mugabe, Zimbabwe, Luanda, Harare
• Frequency:
– Not necessarily accurate but rather useful
![Page 207: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/207.jpg)
(C) 2005, The University of Michigan 239
Reranking
Rank Probability and phrase
1  0.599862  the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._.
2  0.598564  International_NNP Space_NNP Station_NNP Alpha_NNP
3  0.598398  International_NNP Space_NNP Station_NNP
4  0.598125  to_TO become_VB
5  0.594763  a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP
6  0.593933  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.587140  will_MD form_VB
8  0.585410  The_DT purpose_NN
9  0.576797  prime_JJ contracts_NNS
10 0.568013  First_NNP American_NNP
11 0.567361  this_DT bulletin_NN board_NN
12 0.565757  Space_NNP :_:
13 0.562627  'Spirit_NN '_'' of_IN
...
41 0.516368  Alan_NNP Shepard_NNP
Proximity = .5164
![Page 208: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/208.jpg)
(C) 2005, The University of Michigan 240
Reranking
Rank Probability and phrase
1  0.465012  Space_NNP Administration_NNP ._.
2  0.446466  SPACE_NNP CALENDAR_NNP _.
3  0.413976  First_NNP American_NNP
4  0.399043  International_NNP Space_NNP Station_NNP Alpha_NNP
5  0.396250  her_PRP$ third_JJ space_NN mission_NN
6  0.395956  NASA_NNP Johnson_NNP Space_NNP Center_NNP
7  0.394122  the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP
8  0.390163  the_DT Red_NNP Planet_NNP ._.
9  0.379797  First_NNP American_NNP
10 0.376336  Alan_NNP Shepard_NNP
11 0.375669  February_NNP
12 0.374813  Space_NNP
13 0.373999  International_NNP Space_NNP Station_NNP
Qtype = .7288Proximity * qtype = .3763
![Page 209: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/209.jpg)
(C) 2005, The University of Michigan 241
Reranking
Rank Probability and phrase
1  0.478857  Neptune_NNP Beach_NNP ._.
2  0.449232  February_NNP
3  0.447075  Go_NNP
4  0.437895  Space_NNP
5  0.431835  Go_NNP
6  0.424678  Alan_NNP Shepard_NNP
7  0.423855  First_NNP American_NNP
8  0.421133  Space_NNP May_NNP
9  0.411065  First_NNP American_NNP woman_NN
10 0.401994  Life_NNP Sciences_NNP
11 0.385763  Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN
12 0.381865  the_DT Moon_NNP International_NNP Space_NNP Station_NNP
13 0.370030  Space_NNP Research_NNP A_NNP Session_NNP
All four features
![Page 210: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/210.jpg)
(C) 2005, The University of Michigan 242
![Page 211: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/211.jpg)
(C) 2005, The University of Michigan 243
![Page 212: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/212.jpg)
(C) 2005, The University of Michigan 244
![Page 213: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/213.jpg)
(C) 2005, The University of Michigan 245
Document level performance
Engine   AlltheWeb   NLight   Google
Avg      0.8355      1.0495   1.3361
#>0      149         163      164
TREC 8 corpus (200 questions)
![Page 214: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/214.jpg)
(C) 2005, The University of Michigan 246
Sentence level performance
Engine   GOO    GOL    GOU    NLO    NLL    NLU    AWO    AWL    AWU
Avg      0.49   0.54   2.55   0.44   0.48   2.53   0.26   0.31   2.13
#>0      135    137    159    119    121    159    99     99     148
![Page 215: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/215.jpg)
(C) 2005, The University of Michigan 247
Phrase level performance
                  Google S+P   Google D+P   NorthernLight   AlltheWeb
Combined          0.199        0.157        0.117           0.105
Global proximity  0.0646       0.058        0.054           0.038
Appearance order  0.0646       0.068        0.048           0.026
Upperbound        1.941        2.698        2.652           2.176

Experiments performed Oct–Nov. 2001
![Page 216: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/216.jpg)
(C) 2005, The University of Michigan 248
![Page 217: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/217.jpg)
(C) 2005, The University of Michigan 249
![Page 218: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/218.jpg)
(C) 2005, The University of Michigan 250
Text classification
![Page 219: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/219.jpg)
(C) 2005, The University of Michigan 251
Introduction
• Text classification: assigning documents to predefined categories
• Hierarchical vs. flat
• Many techniques: generative (Naïve Bayes) vs. discriminative (maxent, kNN, SVM, regression)
• Generative: model the joint probability p(x,y) and use Bayesian prediction to compute p(y|x)
• Discriminative: model p(y|x) directly
![Page 220: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/220.jpg)
(C) 2005, The University of Michigan 252
k-nearest neighbors (kNN)

• K-nearest neighbors: score each class by the similarity of the query document to its k nearest labeled neighbors

• Very easy to program

• Issues: choosing k and the bias term b
$$\mathrm{score}(c, d_q) = b_c + \sum_{d \in \mathrm{kNN}(d_q),\ d \in c} s(d_q, d)$$
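A minimal sketch of this scoring rule in Python. The cosine similarity measure, the zero bias, and the toy training set are illustrative assumptions; the slide fixes neither the similarity function s nor b.

```python
import math
from collections import Counter

def cosine(a, b):
    # Cosine similarity between two bag-of-words Counters (assumed s).
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def knn_score(query, train, k=3, bias=0.0):
    # train: list of (Counter, label). Score each class c as
    # b_c + sum of s(d_q, d) over the k nearest neighbors with label c.
    neighbors = sorted(train, key=lambda dl: cosine(query, dl[0]), reverse=True)[:k]
    scores = {}
    for vec, label in neighbors:
        scores[label] = scores.get(label, bias) + cosine(query, vec)
    return max(scores, key=scores.get)

train = [
    (Counter("the space shuttle landed".split()), "space"),
    (Counter("astronauts aboard the shuttle".split()), "space"),
    (Counter("the stock market fell".split()), "finance"),
    (Counter("market prices and stocks".split()), "finance"),
]
print(knn_score(Counter("shuttle astronauts in space".split()), train))  # prints space
```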
![Page 221: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/221.jpg)
(C) 2005, The University of Michigan 253
Feature selection: The χ² test
• For a term t:
• Testing for independence: P(C=0, It=0) should equal P(C=0) P(It=0)
– P(C=0) = (k00+k01)/n
– P(C=1) = 1−P(C=0) = (k10+k11)/n
– P(It=0) = (k00+k10)/n
– P(It=1) = 1−P(It=0) = (k01+k11)/n
| | It=0 | It=1 |
|---|---|---|
| C=0 | k00 | k01 |
| C=1 | k10 | k11 |
![Page 222: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/222.jpg)
(C) 2005, The University of Michigan 254
Feature selection: The χ² test

• High values of χ² indicate lower belief in independence.

• In practice, compute χ² for all words and pick the top k among them.
$$\chi^2 = \frac{n\,(k_{11}k_{00} - k_{10}k_{01})^2}{(k_{11}+k_{10})(k_{01}+k_{00})(k_{11}+k_{01})(k_{10}+k_{00})}$$
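The statistic can be computed directly from the four cell counts. The counts in the example below are hypothetical, chosen only to show a strongly dependent term versus an independent one.

```python
def chi_squared(k00, k01, k10, k11):
    # X^2 = n (k11*k00 - k10*k01)^2 /
    #       ((k11+k10)(k01+k00)(k11+k01)(k10+k00))
    n = k00 + k01 + k10 + k11
    num = n * (k11 * k00 - k10 * k01) ** 2
    den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
    return num / den if den else 0.0

# Hypothetical counts: term present in 40 of 50 class-1 docs but
# only 10 of 150 class-0 docs -> large X^2, independence unlikely.
print(chi_squared(k00=140, k01=10, k10=10, k11=40))   # ~107.6

# Perfectly proportional counts -> X^2 = 0 (independence).
print(chi_squared(k00=90, k01=10, k10=90, k11=10))    # 0.0
```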
![Page 223: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/223.jpg)
(C) 2005, The University of Michigan 255
Feature selection: mutual information
• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model
$$MI(X, Y) = \sum_x \sum_y P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$$
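A short sketch computing MI from the same 2×2 contingency counts used for the χ² test; the counts themselves are hypothetical.

```python
import math

def mutual_information(k00, k01, k10, k11):
    # MI(X, Y) = sum_x sum_y P(x, y) log( P(x, y) / (P(x) P(y)) )
    n = k00 + k01 + k10 + k11
    joint = {(0, 0): k00 / n, (0, 1): k01 / n,
             (1, 0): k10 / n, (1, 1): k11 / n}
    px = {0: joint[0, 0] + joint[0, 1], 1: joint[1, 0] + joint[1, 1]}
    py = {0: joint[0, 0] + joint[1, 0], 1: joint[0, 1] + joint[1, 1]}
    mi = 0.0
    for (x, y), pxy in joint.items():
        if pxy > 0:  # skip empty cells (0 log 0 := 0)
            mi += pxy * math.log(pxy / (px[x] * py[y]))
    return mi

# Independent counts give MI = 0; dependent counts give MI > 0.
print(mutual_information(90, 10, 90, 10))    # prints 0.0
print(mutual_information(140, 10, 10, 40))   # positive
```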
![Page 224: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/224.jpg)
(C) 2005, The University of Michigan 256
Naïve Bayesian classifiers
• Naïve Bayesian classifier:

$$P(d \in C \mid F_1, F_2, \ldots, F_k) = \frac{P(F_1, F_2, \ldots, F_k \mid d \in C)\; P(d \in C)}{P(F_1, F_2, \ldots, F_k)}$$

• Assuming statistical independence:

$$P(d \in C \mid F_1, \ldots, F_k) = \frac{\prod_{j=1}^{k} P(F_j \mid d \in C)\; P(d \in C)}{\prod_{j=1}^{k} P(F_j)}$$
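The two equations can be sketched as a small multinomial classifier. The add-one (Laplace) smoothing is an assumption not stated on the slide, and the toy spam/ham documents are made up; the shared denominator P(F1..Fk) is dropped since it is constant across classes.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    # docs: list of (tokens, label). Collect class priors and
    # per-class word counts for the multinomial model.
    class_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab, len(docs)

def classify_nb(model, tokens):
    class_counts, word_counts, vocab, n = model
    best, best_lp = None, -math.inf
    for c in class_counts:
        # log P(d in C) + sum_j log P(F_j | d in C), add-one smoothed.
        lp = math.log(class_counts[c] / n)
        total = sum(word_counts[c].values())
        for t in tokens:
            lp += math.log((word_counts[c][t] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = c, lp
    return best

docs = [
    ("transfer funds confidential account".split(), "spam"),
    ("million dollars transfer urgent".split(), "spam"),
    ("meeting agenda for tuesday".split(), "ham"),
    ("lecture notes and homework".split(), "ham"),
]
model = train_nb(docs)
print(classify_nb(model, "urgent funds transfer".split()))  # prints spam
```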
![Page 225: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/225.jpg)
(C) 2005, The University of Michigan 257
Spam recognition

Return-Path: <[email protected]>
X-Sieve: CMU Sieve 2.2
From: "Ibrahim Galadima" <[email protected]>
Reply-To: [email protected]
To: [email protected]
Date: Tue, 14 Jan 2003 21:06:26 -0800
Subject: Gooday
DEAR SIR
FUNDS FOR INVESTMENTS
THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU

I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT

THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
![Page 226: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/226.jpg)
(C) 2005, The University of Michigan 258
Well-known datasets

• 20 newsgroups
– http://people.csail.mit.edu/u/j/jrennie/public_html/20Newsgroups/
• Reuters-21578
– Cats: grain, acquisitions, corn, crude, wheat, trade…
• WebKB
– http://www-2.cs.cmu.edu/~webkb/
– course, student, faculty, staff, project, dept, other
– NB performance (2000)
– P=26,43,18,6,13,2,94
– R=83,75,77,9,73,100,35
![Page 227: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/227.jpg)
(C) 2005, The University of Michigan 259
Support vector machines
• Introduced by Vapnik in the early 90s.
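A linear SVM can be trained by simple stochastic subgradient descent on the hinge loss. The sketch below uses the Pegasos-style update, which is a standard trainer but not from the slides, and a made-up two-class toy dataset with sparse word-count features.

```python
import random

def svm_train(data, lam=0.01, epochs=100):
    # Pegasos-style SGD for a linear SVM with hinge loss.
    # data: list of (feature dict, label in {-1, +1}).
    w = {}
    t = 0
    rng = random.Random(0)
    for _ in range(epochs):
        rng.shuffle(data)
        for x, y in data:
            t += 1
            eta = 1.0 / (lam * t)            # decaying step size
            margin = y * sum(w.get(f, 0.0) * v for f, v in x.items())
            for f in w:                      # shrink: w <- (1 - eta*lam) w
                w[f] *= (1 - eta * lam)
            if margin < 1:                   # hinge-loss subgradient step
                for f, v in x.items():
                    w[f] = w.get(f, 0.0) + eta * y * v
    return w

def predict(w, x):
    return 1 if sum(w.get(f, 0.0) * v for f, v in x.items()) >= 0 else -1

data = [
    ({"ball": 1, "goal": 1}, +1), ({"match": 1, "goal": 1}, +1),
    ({"stock": 1, "bank": 1}, -1), ({"bank": 1, "loan": 1}, -1),
]
w = svm_train(list(data))
print(predict(w, {"goal": 1, "match": 1}))   # prints 1
```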
![Page 228: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/228.jpg)
(C) 2005, The University of Michigan 260
Semi-supervised learning
• EM
• Co-training
• Graph-based
![Page 229: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/229.jpg)
(C) 2005, The University of Michigan 261
Additional topics
• Soft margins
• VC dimension
• Kernel methods
![Page 230: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/230.jpg)
(C) 2005, The University of Michigan 262
• SVMs are widely considered to be the best method for text classification (see papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters.
• NB also good in many circumstances
![Page 231: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/231.jpg)
(C) 2005, The University of Michigan 263
Readings

• Books:
1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval; Addison-Wesley/ACM Press, 1999
2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1

• Papers:
• Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999
• Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998
• Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998
• Bush "As we may think" The Atlantic Monthly 1945
• Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999
• Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998
• Davison "Topical locality on the Web" SIGIR 2000
• Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999
• Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990
![Page 232: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/232.jpg)
(C) 2005, The University of Michigan 264
Readings

• Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004
• Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999
• Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000
• Haveliwala "Topic-sensitive pagerank" WWW 2002
• Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000
• Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999
• Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998
• Menczer "Links tell us about lexical and semantic Web content" arXiv 2001
• Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998
• Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005
• Singhal "Modern Information Retrieval: an Overview" IEEE 2001
![Page 233: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/233.jpg)
(C) 2005, The University of Michigan 265
More readings

• Gerard Salton, Automatic Text Processing, Addison-Wesley (1989)
• Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997)
• Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983)
• C. J. van Rijsbergen, Information Retrieval, Butterworths (1979)
• Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994)
• ACM SIGIR Proceedings, SIGIR Forum
• ACM conferences in Digital Libraries
![Page 234: Information Retrieval](https://reader036.fdocuments.us/reader036/viewer/2022062723/56813f5b550346895daa28b6/html5/thumbnails/234.jpg)
(C) 2005, The University of Michigan 266
Thank you!
Благодаря! (Bulgarian: Thank you!)