2010 © University of Michigan
Document Preprocessing and Indexing
SI650: Information Retrieval
Winter 2010
School of Information
University of Michigan
Typical IR system architecture
[Architecture diagram: the user issues a query and receives results through the INTERFACE; documents are converted to document representations (DocRep) by INDEXING and the query to a query representation (QueryRep); SEARCHING ranks documents against the query to produce results; user judgments on the results feed a FEEDBACK loop that drives QUERY MODIFICATION.]
- From ChengXiang Zhai's slides
Overload of text content
Content type               Amount / day
Published content          3-4 GB
Professional web content   ~2 GB
User generated content     8-10 GB
Private text content       ~3 TB
- Ramakrishnan and Tomkins 2007
Data volume behind online information systems
[Figure: example online systems with daily ingestion rates ranging from ~150k to ~3M new items/day, and total collection sizes ranging from ~1M to ~100B items.]
IR Winter 2010
…
Automated indexing/labeling
Storing, indexing and searching text.
Inverted indexes.
…
Handling large collections
• Life is good when every document is mapped into a vector of words, but…
• Consider N = 1 million documents, each with about 1000 words.
• At an average of 6 bytes/word (including spaces and punctuation), that is 6 GB of data in the documents.
• Say there are M = 500K distinct terms among these.
Storage issue
• A 500K x 1M matrix has half a trillion elements.
  – At 4 bytes per integer: 500K x 1M x 4 = 2 TB (your laptop would fail)
  – For a web-scale collection: 500K x 100G x 4 = 2×10^5 TB (challenging even for Google)
• But it has no more than one billion positive entries.
  – The matrix is extremely sparse.
  – Storing only the nonzeros: 1000 x 1M x 4 = 4 GB
• What's a better representation?
Indexing
• Indexing = Convert documents to data structures that enable fast search
• The inverted index is the dominant indexing method (used by all search engines)
• Other indices (e.g., document index) may be needed for feedback
Inverted index
• Instead of an incidence vector, use a posting table
  – CLEVELAND: D1, D2, D6
  – OHIO: D1, D5, D6, D7
• Use linked lists to be able to insert new document postings in order and to remove existing postings.
• More efficient than scanning docs (why?)
Inverted index
• Fast access to all docs containing a given term (along with frequency and position information)
• For each term, we get a list of tuples – (docID, freq, pos).
• Given a query, we can fetch the lists for all query terms and work on the involved documents.
  – Boolean query: set operations
  – Natural language query: term weight summing
• Keep everything sorted! This gives you a logarithmic improvement in access.
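As a quick illustration of the "keep it sorted" point, here is a minimal Python sketch of logarithmic membership lookup in a sorted posting list (the docIDs are the Brutus list from the example that follows):

import bisect

# Sorted posting list of docIDs (the Brutus list from the next slide)
postings = [1, 2, 4, 11, 31, 45, 173, 174]

def contains(postings, doc_id):
    """Membership test in O(log n) via binary search on the sorted list."""
    i = bisect.bisect_left(postings, doc_id)
    return i < len(postings) and postings[i] == doc_id

print(contains(postings, 31))   # True
print(contains(postings, 30))   # False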
Inverted index - example
• For each term t, we must store a list of all documents that contain t.
  – Identify each document by a docID, a document serial number
Dictionary → Postings
Brutus    → 1, 2, 4, 11, 31, 45, 173, 174
Caesar    → 1, 2, 4, 5, 6, 16, 57, 132
Calpurnia → 2, 31, 54, 101
- From Chris Manning’s slides
Inverted index - example
Doc 1: "This is a sample document with one sample sentence"
Doc 2: "This is another sample document"

Dictionary:
Term      #docs   Total freq
this      2       2
is        2       2
sample    2       3
another   1       1
…         …       …

Postings:
Term      DocID   Freq   Pos
this      1       1      1
this      2       1      1
is        1       1      2
is        2       1      2
sample    1       2      4, 8
sample    2       1      4
another   2       1      3
…         …       …      …
- From ChengXiang Zhai’s slides
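As a minimal sketch (in Python, a toy version rather than any toolkit's actual implementation), such a positional index could be built like this; tokenization here is just lowercasing and splitting on whitespace:

from collections import defaultdict

def build_index(docs):
    """Map each term to a list of (docID, freq, positions) postings."""
    index = defaultdict(list)
    for doc_id, text in docs:            # visiting docs in docID order keeps postings sorted
        positions = defaultdict(list)
        for pos, token in enumerate(text.lower().split(), start=1):
            positions[token].append(pos)
        for term, plist in positions.items():
            index[term].append((doc_id, len(plist), plist))
    return index

docs = [(1, "This is a sample document with one sample sentence"),
        (2, "This is another sample document")]
index = build_index(docs)
print(index["sample"])   # [(1, 2, [4, 8]), (2, 1, [4])]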
Basic operations on inverted indexes
• Conjunction (AND) – iterative merge of the two posting lists: O(x+y) (see the sketch below)
• Disjunction (OR) – very similar
• Negation (NOT) – can we still do it in O(x+y)?
  – Example: MICHIGAN AND NOT OHIO
  – Example: MICHIGAN OR NOT OHIO
• Recursive operations
• Optimization: start with the smallest sets
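A sketch of the O(x+y) iterative merge for AND, assuming each posting list is a sorted list of docIDs (using the CLEVELAND/OHIO postings from the earlier slide):

def intersect(p1, p2):
    """Merge two sorted docID lists in O(x+y): advance the pointer of the smaller head."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

# CLEVELAND AND OHIO
print(intersect([1, 2, 6], [1, 5, 6, 7]))   # [1, 6] -> D1, D6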
Data structures for inverted index
• Dictionary: modest size
  – Needs fast random access
  – Preferred to be in memory
  – Hash table, B-tree, trie, …
• Postings: huge
  – Sequential access is expected
  – Can stay on disk
  – May contain docID, term freq., term pos., etc.
  – Compression is desirable
Constructing inverted index
• The main difficulty is to build a huge index with limited memory
• Memory-based methods: not usable for large collections
• Sort-based methods (sketched below):
  – Step 1: collect local (termID, docID, freq) tuples
  – Step 2: sort local tuples (to make "runs")
  – Step 3: pair-wise merge runs
  – Step 4: output inverted file
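A simplified in-memory sketch of steps 3-4 (real systems keep the runs on disk; here heapq.merge stands in for the pair-wise merge):

import heapq
from itertools import groupby

def invert(runs):
    """Merge sorted runs of (termID, docID, freq) tuples into an inverted file."""
    merged = heapq.merge(*runs)                    # step 3: merge the runs
    for term_id, group in groupby(merged, key=lambda t: t[0]):
        yield term_id, [(doc, freq) for _, doc, freq in group]   # step 4

# Two runs, each already sorted by (termID, docID) (steps 1-2)
run1 = [(1, 1, 3), (1, 2, 2), (2, 1, 2)]
run2 = [(1, 5, 3), (2, 4, 3)]
for term_id, postings in invert([run1, run2]):
    print(term_id, postings)
# 1 [(1, 3), (2, 2), (5, 3)]
# 2 [(1, 2), (4, 3)]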
Sort-based inversion
[Figure: sort-based inversion, with tuples of the form <termID, docID, freq>]

Term lexicon: the→1, cold→2, days→3, a→4, …
DocID lexicon: doc1→1, doc2→2, doc3→3, …

Parse & count (output sorted by doc-id):
  <1,1,3> <2,1,2> <3,1,1> … <1,2,2> <3,2,3> <4,2,2> … <1,300,3> <3,300,1> …

"Local" sort (each run sorted by term-id):
  <1,1,3> <1,2,2> <2,1,2> <2,4,3> … <1,5,3> <1,6,2> … <1,299,3> <1,300,1> …

Merge sort (all info about term 1 comes first):
  <1,1,3> <1,2,2> <1,5,2> <1,6,3> … <1,300,3> <2,1,2> … <5000,299,1> <5000,300,1> …
IR Winter 2010
…
Document preprocessing.
Tokenization. Stemming.
The Porter algorithm.
…
Can we make it even better?
• Index term selection/normalization
  – Reduces the size of the vocabulary
• Index compression
  – Reduces the storage space
Should we index every term?
• How big is English? The question has been asked in:
  – Dictionary marketing
  – Education (testing of vocabulary size)
  – Psychology
  – Statistics
  – Linguistics
• Two very different answers:
  – Chomsky: language is infinite
  – Shannon: 1.25 bits per character
• Should we care about a term if nobody uses it as a query?
What is a good indexing term?
• Specific (phrases) or general (single words)?
• Luhn found that words with middle frequency are the most useful
  – Not too specific (low utility, but still useful!)
  – Not too general (lack of discrimination; stop words)
  – Stop word removal is common, but rare words are kept
• All words or a (controlled) subset? When term weighting is used, it is a matter of weighting, not of selecting indexing terms (more later)
Term selection for indexing
• Manual: e.g., Library of Congress subject headings, MeSH
• Automatic: e.g., TF*IDF based
LOC subject headings
http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WORKS
B -- PHILOSOPHY. PSYCHOLOGY. RELIGION
C -- AUXILIARY SCIENCES OF HISTORY
D -- HISTORY (GENERAL) AND HISTORY OF EUROPE
E -- HISTORY: AMERICA
F -- HISTORY: AMERICA
G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION
H -- SOCIAL SCIENCES
J -- POLITICAL SCIENCE
K -- LAW
L -- EDUCATION
M -- MUSIC AND BOOKS ON MUSIC
N -- FINE ARTS
P -- LANGUAGE AND LITERATURE
Q -- SCIENCE
R -- MEDICINE
S -- AGRICULTURE
T -- TECHNOLOGY
U -- MILITARY SCIENCE
V -- NAVAL SCIENCE
Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
Medicine
CLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General works
R131-687 History of medicine. Medical expeditions
R690-697 Medicine as a profession. Physicians
R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc.
R711-713.97 Directories
R722-722.32 Missionary medicine. Medical missionaries
R723-726 Medical philosophy. Medical ethics
R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying
R727-727.5 Medical personnel and the public. Physician and the public
R728-733 Practice of medicine. Medical practice economics
R735-854 Medical education. Medical schools. Research
R855-855.5 Medical technology
R856-857 Biomedical engineering. Electronics. Instrumentation
R858-859.7 Computer applications to medicine. Medical informatics
R864 Medical records
R895-920 Medical physics. Medical radiology. Nuclear medicine
Automatic term selection methods
• TF*IDF: pick terms with the highest TF*IDF scores
• Centroid-based: pick terms that appear in the centroid with high scores
• The maximal marginal relevance principle (MMR)
• Related to summarization, snippet generation
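A small sketch of the TF*IDF-based option, using the common log(N/df) weighting (one of many variants; term weighting is discussed later):

import math
from collections import Counter

def top_terms(docs, k=3):
    """For each document, keep the k terms with the highest TF * IDF scores."""
    N = len(docs)
    tokenized = [d.lower().split() for d in docs]
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                  # document frequency of each term
    result = []
    for tokens in tokenized:
        tf = Counter(tokens)                    # term frequency within this document
        score = {t: tf[t] * math.log(N / df[t]) for t in tf}
        result.append(sorted(score, key=score.get, reverse=True)[:k])
    return result

docs = ["information retrieval ranks documents",
        "documents are indexed for retrieval"]
print(top_terms(docs))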
Non-English languages
– Arabic: كتاب ("book")
– Japanese: この本は重い。 ("This book is heavy.")
– Chinese: 信息檢索 ("information retrieval")
– German: Lebensversicherungsgesellschaftsangestellter ("life insurance company employee")
Document preprocessing
• What should we use to index?
• Dealing with formatting and encoding issues
• Hyphenation, accents, stemming, capitalization
• Tokenization:
  – Paul's, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc, can't
  – Example: "The New York-Los Angeles flight"
Document preprocessing
• Normalization:
  – Casing (cat vs. CAT)
  – Stemming (computer, computation)
  – String matching
  – Labeled/labelled, extraterrestrial/extra-terrestrial/extra terrestrial, Qaddafi/Kadhafi/Ghadaffi
• Index reduction:
  – Dropping stop words ("and", "of", "to")
  – Problematic for "to be or not to be"
Tokenization
• Normalize lexical units: words with similar meanings should be mapped to the same indexing term
• Stemming: mapping all inflectional forms of words to the same root form, e.g.
  – computer -> compute
  – computation -> compute
  – computing -> compute (but king -> k?)
• Porter's stemmer is popular for English
Porter’s algorithm
Example: the word “duplicatable”
duplicat    rule from step 4
duplicate   rule from step 1b1
duplic      rule from step 3

Another rule in step 4, which would remove "ic", cannot be applied, because only one rule from each step is allowed to be applied.
Porter’s algorithm
Computable      -> Comput
Intervention    -> Intervent
Retrieval       -> Retriev
Document        -> Docum
Representing    -> Repres
Representative  -> Repres
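To reproduce such stems, one option is NLTK's Porter implementation (assuming NLTK is installed; exact outputs vary slightly across Porter variants and revisions):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["computable", "intervention", "retrieval",
             "document", "representing", "representative"]:
    # e.g., "representing" and "representative" should both stem to "repres"
    print(word, "->", stemmer.stem(word))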
Links
• http://maya.cs.depaul.edu/~classes/ds575/porter.html
• http://www.tartarus.org/~martin/PorterStemmer/def.txt
IR Winter 2010
…
Approximate string matching
…
Approximate string matching
• The Soundex algorithm (Odell and Russell)
• Uses:
  – spelling correction
  – works as a hash function
  – non-recoverable (the original name cannot be reconstructed from the code)
The Soundex algorithm
1. Retain the first letter of the name, and drop all occurrences of a, e, h, i, o, u, w, y in other positions
2. Assign the following numbers to the remaining letters after the first:
   b, f, p, v : 1
   c, g, j, k, q, s, x, z : 2
   d, t : 3
   l : 4
   m, n : 5
   r : 6
The Soundex algorithm
3. If two or more letters with the same code were adjacent in the original name, omit all but the first
4. Convert to the form "LDDD" (a letter followed by three digits) by adding terminal zeros or by dropping the rightmost digits

Examples:
Euler: E460, Gauss: G200, Hilbert: H416, Knuth: K530, Lloyd: L300
(the same codes as Ellery, Ghosh, Heilbronn, Kant, and Ladd)

Some problems: Rogers and Rodgers get different codes, as do Sinclair and StClair
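A compact Python sketch of rules 1-4 (a simplified Soundex; full implementations treat "h" and "w" specially when checking adjacency):

def soundex(name):
    """Simplified Soundex following the four rules above."""
    groups = [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
              ("l", "4"), ("mn", "5"), ("r", "6")]
    code_of = {c: d for letters, d in groups for c in letters}
    name = name.lower()
    out = name[0].upper()                # rule 1: keep the first letter
    prev = code_of.get(name[0], "")
    for ch in name[1:]:
        d = code_of.get(ch, "")          # rules 1-2: vowels, h, w, y get no code
        if d and d != prev:              # rule 3: drop adjacent repeats of a code
            out += d
        prev = d
    return (out + "000")[:4]             # rule 4: pad or truncate to "LDDD"

for name in ["Euler", "Gauss", "Hilbert", "Knuth", "Lloyd"]:
    print(name, soundex(name))           # E460 G200 H416 K530 L300
print(soundex("Rogers"), soundex("Rodgers"))   # R262 vs. R326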
Levenshtein edit distance
• Examples:
  – theatre -> theater
  – Ghaddafi -> Qadafi
  – computer -> counter
• Edit distance (insertions, deletions, substitutions)
  – Edit transcript
• Done through dynamic programming
Recurrence relation
• Three dependencies:
  – D(i, 0) = i
  – D(0, j) = j
  – D(i, j) = min[ D(i-1, j) + 1, D(i, j-1) + 1, D(i-1, j-1) + t(i, j) ]
• Simple edit distance:
  – t(i, j) = 0 if S1(i) = S2(j), and 1 otherwise
• Target: D(l1, l2), the distance between the full strings
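A direct Python transcription of the recurrence (the worked example on the next slides fills the same table for "vintner" vs. "writers"):

def edit_distance(s1, s2):
    """Levenshtein distance via the dynamic-programming recurrence above."""
    m, n = len(s1), len(s2)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i                          # D(i, 0) = i
    for j in range(n + 1):
        D[0][j] = j                          # D(0, j) = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            t = 0 if s1[i - 1] == s2[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,       # deletion
                          D[i][j - 1] + 1,       # insertion
                          D[i - 1][j - 1] + t)   # match / substitution
    return D[m][n]

print(edit_distance("vintner", "writers"))   # 5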
Example
Gusfield 1997
       W  R  I  T  E  R  S
    0  1  2  3  4  5  6  7
 V  1
 I  2
 N  3
 T  4
 N  5
 E  6
 R  7
Example (cont’d)
Gusfield 1997
       W  R  I  T  E  R  S
    0  1  2  3  4  5  6  7
 V  1  1  2  3  4  5  6  7
 I  2  2  2  2  3  4  5  6
 N  3  3  3  3  3  4  5  6
 T  4  4  4  4  *
 N  5
 E  6
 R  7
Tracebacks
Gusfield 1997
       W  R  I  T  E  R  S
    0  1  2  3  4  5  6  7
 V  1  1  2  3  4  5  6  7
 I  2  2  2  2  3  4  5  6
 N  3  3  3  3  3  4  5  6
 T  4  4  4  4  *
 N  5
 E  6
 R  7

[The traceback arrows of the original figure are lost; a traceback records which of the three terms achieved the minimum in each cell, and following those pointers back from D(7,7) yields an edit transcript.]
Weighted edit distance
• Used to emphasize the relative cost of different edit operations
• Useful in bioinformatics
  – Homology information
  – BLAST
  – Blosum
  – http://eta.embl-heidelberg.de:8000/misc/mat/blosum50.html
Links
• Web sites:
  – http://www.merriampark.com/ld.htm
  – http://odur.let.rug.nl/~kleiweg/lev/
• Demo:
  – http://nayana.ece.ucsb.edu/imsearch/imsearch.html
IR Winter 2010
…
Index Compression
IR Toolkits
…
Inverted index compression
• Compress the postings
• Observations:
  – The inverted list is sorted (e.g., by docID or term freq.)
  – Small numbers tend to occur more frequently
• Implications:
  – Store "d-gaps" (differences): d1, d2-d1, d3-d2, …
  – Exploit the skewed frequency distribution: fewer bits for small (high-frequency) integers
• Binary code, unary code, γ-code, δ-code
Integer compression
• In general, the goal is to exploit a skewed distribution
• Binary: equal-length coding
• Unary: x ≥ 1 is coded as x-1 one bits followed by a 0, e.g., 3 => 110; 5 => 11110
• γ-code: x => unary code for 1 + ⌊log₂ x⌋, followed by a uniform code for x - 2^⌊log₂ x⌋ in ⌊log₂ x⌋ bits, e.g., 3 => 101, 5 => 11001
• δ-code: same as the γ-code, but replace the unary prefix with its γ-code, e.g., 3 => 1001, 5 => 10101
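A sketch of these codes in Python, applied to the d-gaps of a posting list (strings of "0"/"1" stand in for packed bits):

import math

def unary(x):
    """x >= 1 as (x-1) one bits followed by a zero: 3 -> 110."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Unary length prefix, then the offset x - 2^floor(log2 x): 5 -> 11001."""
    b = int(math.log2(x))
    offset = format(x - (1 << b), "0{}b".format(b)) if b else ""
    return unary(b + 1) + offset

def delta(x):
    """Like gamma, but the length prefix is itself gamma-coded: 5 -> 10101."""
    b = int(math.log2(x))
    offset = format(x - (1 << b), "0{}b".format(b)) if b else ""
    return gamma(b + 1) + offset

docids = [3, 8, 12, 15]
gaps = [docids[0]] + [y - x for x, y in zip(docids, docids[1:])]
print(gaps)                          # [3, 5, 4, 3]
print([gamma(g) for g in gaps])      # ['101', '11001', '11000', '101']
print(delta(3), delta(5))            # 1001 10101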
Text compression
• Compress the dictionaries
• Methods:
  – Fixed-length codes
  – Huffman coding
  – Ziv-Lempel codes
Fixed length codes
• Binary representations
  – ASCII
  – Representational power (2^k symbols, where k is the number of bits)
Variable length codes
• Alphabet:
  A .-     N -.     0 -----
  B -...   O ---    1 .----
  C -.-.   P .--.   2 ..---
  D -..    Q --.-   3 ...--
  E .      R .-.    4 ....-
  F ..-.   S ...    5 .....
  G --.    T -      6 -....
  H ....   U ..-    7 --...
  I ..     V ...-   8 ---..
  J .---   W .--    9 ----.
  K -.-    X -..-
  L .-..   Y -.--
  M --     Z --..
• Demo:– http://www.scphillips.com/morse/
Most frequent letters in English
• Some letters are more frequently used than others…
• Most frequent letters:
  – E T A O I N S H R D L U
• Demo:
  – http://www.amstat.org/publications/jse/secure/v7n2/count-char.cfm
• Also, bigrams:
  – TH HE IN ER AN RE ND AT ON NT
Huffman coding
• Developed by David Huffman (1952)
• Averages about 5 bits per character on English text (37.5% compression relative to 8-bit codes)
• Based on the frequency distribution of the symbols
• Algorithm: iteratively build a tree of symbols, starting with the two least frequent symbols (see the sketch below)
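A heapq-based sketch of the algorithm, using the frequency table on the next slide. Ties can be broken in different ways, so the codewords produced may differ from the code table shown two slides later while still being optimal:

import heapq

def huffman_codes(freqs):
    """Repeatedly merge the two least frequent subtrees into one."""
    # Heap entries: (frequency, tiebreak, tree); a tree is a symbol or a (left, right) pair
    heap = [(f, i, sym) for i, (sym, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)        # the two least frequent subtrees
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (left, right)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, str):
            codes[tree] = prefix                 # a leaf: record its codeword
        else:
            walk(tree[0], prefix + "0")          # label left branches 0 ...
            walk(tree[1], prefix + "1")          # ... and right branches 1
    walk(heap[0][2], "")
    return codes

freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
         "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}
print(huffman_codes(freqs))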
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
[Figure: "The Huffman tree" built from these frequencies; each internal node's branches are labeled 0 and 1, and the symbols a-j sit at the leaves. Reading the branch labels from the root to a leaf gives that symbol's codeword (see the code table below).]
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
Exercise
• Consider the bit string: 01101101111000100110001110100111000110101101011101
• Use the Huffman code from the example to decode it.
• Why does this work? No codeword is a prefix of another.
• Try inserting, deleting, or flipping some bits at random locations and try decoding again (a decoder sketch follows).
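A sketch of a greedy prefix-code decoder for working the exercise (using the code table from the previous slide):

codes = {"A": "0110", "B": "0010", "C": "000", "D": "0011", "E": "01110",
         "F": "010", "G": "10", "H": "01111", "I": "110", "J": "111"}

def decode(bits, codes):
    """Consume bits until the buffer matches a codeword; prefix-freeness makes each match unambiguous."""
    inverse = {code: sym for sym, code in codes.items()}
    out, buf = [], ""
    for bit in bits:
        buf += bit
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    if buf:
        out.append("?")          # leftover bits: truncated or corrupted input
    return "".join(out)

print(decode("01101101111000100110001110100111000110101101011101", codes))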
Extensions
• Word-based
• Domain/genre dependent models
Links on text compression
• Data compression:
  – http://www.data-compression.info/
• Calgary corpus:
  – http://links.uwaterloo.ca/calgary.corpus.html
• Huffman coding:
  – http://www.compressconsult.com/huffman/
  – http://en.wikipedia.org/wiki/Huffman_coding
• LZ:
  – http://en.wikipedia.org/wiki/LZ77
Open Source IR Toolkits
• Smart (Cornell)
• MG (RMIT & Melbourne, Australia; Waikato, New Zealand)
• Lemur (CMU / Univ. of Massachusetts)
• Terrier (Glasgow)
• Clair (University of Michigan)
• Lucene (open source)
• Ivory (University of Maryland – cloud computing)
Smart
• The most influential IR system/toolkit
• Developed at Cornell since the 1960's
• Vector space model with lots of weighting options
• Written in C
• The Cornell/AT&T groups have used the Smart system to achieve top TREC performance
MG
• A highly efficient toolkit for retrieval of text and images
• Developed by people at Univ. of Waikato, Univ. of Melbourne, and RMIT in the 1990's
• Written in C, running on Unix
• Vector space model with lots of compression and speed-up tricks
• People have used it to achieve good TREC performance
Lemur/Indri
• An IR toolkit emphasizing language models
• Developed at CMU and Univ. of Massachusetts in the 2000's
• Written in C++, highly extensible
• Vector space and probabilistic models, including language models
• Achieves good TREC performance with a simple language model
Terrier
• A large-scale retrieval toolkit with lots of applications (e.g., desktop search) and TREC support
• Developed at the University of Glasgow, UK
• Written in Java, open source
• "Divergence from randomness" retrieval model and other modern retrieval formulas
Lucene
• Open-source IR toolkit
• Initially developed by Doug Cutting in Java
• Has since been ported to some other languages
• Good for building IR/Web applications
• Many applications have been built using Lucene (e.g., the Nutch search engine)
• Currently, the retrieval algorithms have poor accuracy
What You Should Know
• What is an inverted index
• Why does an inverted index help make search fast
• How to construct a large inverted index
• How to preprocess documents to reduce the index terms
• How to compress an index
• IR toolkits