Introduction to Digital Libraries Information Retrieval.
![Page 1: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/1.jpg)
Introduction to Digital Libraries
Information Retrieval
![Page 2: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/2.jpg)
Sample Statistics of Text Collections
• Dialog: claims to have >12 terabytes of data in
>600 Databases, > 800 million unique records
• LEXIS/NEXIS: claims 7 terabytes, 1.7 billion
documents, 1.5 million subscribers, 11,400
databases; >200,000 searches per day; 9
mainframes, 300 Unix servers, 200 NT servers
![Page 3: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/3.jpg)
Information Retrieval
• Motivation
– the larger the holdings of the archive, the more
useful it is
– however, it is harder to find what you want
![Page 4: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/4.jpg)
Simple IR Model
[Figure: block diagram of a simple IR model. Components: User, Query, Results, Pre-Processing, Searching, Post-Processing, Storage, Collection & Processing. Labels attached to the components: Boolean, Vector; Stemming, Thesaurus, Signature; Ranking, Clustering, Weighting; Feedback; Flat Files, Inverted Files, Signature Files, PAT Trees; Stemming, Stoplist.]
![Page 5: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/5.jpg)
IR problem
• In libraries
ISBN: 0-201-12227-8
Author: Salton, Gerard
Title: Automatic text processing: the transformation, analysis, and retrieval of information by computer
Publisher: Addison-Wesley
Date: 1989
Content: <Text>
• External attributes and internal attribute (content)
• Search by external attributes = search in DB
• IR: search by content
![Page 6: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/6.jpg)
Basic concepts
• Document is described by a set of representative keywords (index terms)
• Keywords may have binary weights or weights calculated from statistics of their frequency in text
• Retrieval is a ‘matching’ process between document keywords and words in queries
![Page 7: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/7.jpg)
IR Outline
• Index Storage
– flat files, inverted files, signature files, PAT trees
• Processing
– stemming, stop-words
• Searching & Queries
– Boolean, vector (including ranking, weighting, feedback)
• Results
– clustering
![Page 8: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/8.jpg)
Flat Files Index
• Simple files, no additional processing or storage needed
• Worst case keyword search time: O(DW)
– D = # of documents
– W = # words per document
– linear search
• Clearly only acceptable for small collections
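A minimal sketch of flat-file search, assuming each document is simply a list of words held in memory. The `docs` collection and the name `linear_search` are made up for illustration. Every word of every document may be inspected, which is where the O(DW) bound comes from.

```python
def linear_search(docs: dict[int, list[str]], keyword: str) -> list[int]:
    """Return ids of documents containing keyword by scanning every word."""
    hits = []
    for doc_id, words in docs.items():      # D documents
        for word in words:                  # up to W words each
            if word == keyword:
                hits.append(doc_id)
                break                       # one hit per document is enough
    return hits

# Hypothetical three-document collection.
docs = {
    1: "parallel computer system".split(),
    2: "distributed computing".split(),
    3: "computer architecture".split(),
}
print(linear_search(docs, "computer"))
```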
![Page 9: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/9.jpg)
Inverted Files
• All input files are read, and a list of which words appear in what documents (records) is made
• Extra space required can be up to 100% of original input files
• Worst case keyword search time is now O(log(DW))
• Almost all indexing systems in popular usage use inverted files
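The construction described above can be sketched as follows. This is an illustration under stated assumptions, not a production indexer: the three records in `docs` are invented, tokenization is a plain whitespace split, and a Python `dict` stands in for the sorted term list that a binary search would use.

```python
from collections import Counter, defaultdict

def build_inverted_file(docs: dict[int, str]) -> dict[str, dict[int, int]]:
    """Map each term to {record id: frequency of the term in that record}."""
    index: dict[str, dict[int, int]] = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, freq in Counter(text.lower().split()).items():
            index[term][doc_id] = freq
    return dict(index)

# Hypothetical records.
docs = {
    1: "computer parallel computer parallel computer",
    2: "distributed computing system",
    3: "computer computer computer computer computer",
}
index = build_inverted_file(docs)
for term in sorted(index):
    for doc_id, freq in sorted(index[term].items()):
        print(term, doc_id, freq)
```

Each printed line is one (term, record, frequency) entry; a keyword lookup then touches only the entries for that term instead of every document.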
![Page 10: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/10.jpg)
Sample Inverted File
Term         Record  Frequency
computer     1       3
computer     3       5
computing    2       1
distributed  2       1
parallel     1       2
system       2       1
...          ...     ...
![Page 11: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/11.jpg)
Structure of inverted index
• May be a hierarchical set of addresses, e.g.
word number within sentence number within paragraph number within chapter number within volume number within document number
• Consider as a vector (d,v,c,p,s,w)
![Page 12: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/12.jpg)
Inverted File Index
Store appearance of terms in documents (like index of a book)
Term: postings as (document-ID, position in the doc)
alphabet: (15,42); (26,186); (31,86)
database: (41,10)
index: (15,76); (51,164); (76,641); (81,64)
information: (16,76)
retrieval: (16,88)
semistructured: (5,61); (15,174); (25,41)
XML: (1,108); (2,65); (15,741); (21,421)
XPath: (5,90); (21,301)
Answer queries like "xml and index", "information near retrieval"
But: not suitable for evaluating path expressions
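The two query types mentioned on this slide can be sketched over the postings listed there. The assignment of postings to terms is read off the slide's flattened listing. The function names and the `max_distance` threshold for "near" are assumptions made for this example, since the slide does not say how close two positions must be.

```python
# Postings from the slide: term -> [(document-ID, position in the doc)].
INDEX = {
    "alphabet": [(15, 42), (26, 186), (31, 86)],
    "database": [(41, 10)],
    "index": [(15, 76), (51, 164), (76, 641), (81, 64)],
    "information": [(16, 76)],
    "retrieval": [(16, 88)],
    "semistructured": [(5, 61), (15, 174), (25, 41)],
    "xml": [(1, 108), (2, 65), (15, 741), (21, 421)],
    "xpath": [(5, 90), (21, 301)],
}

def docs_with(term: str) -> set[int]:
    """Documents in which the term appears at least once."""
    return {doc for doc, _ in INDEX.get(term.lower(), [])}

def and_query(a: str, b: str) -> set[int]:
    """Documents containing both terms."""
    return docs_with(a) & docs_with(b)

def near_query(a: str, b: str, max_distance: int) -> set[int]:
    """Documents where the two terms occur within max_distance positions."""
    hits = set()
    for doc_a, pos_a in INDEX.get(a.lower(), []):
        for doc_b, pos_b in INDEX.get(b.lower(), []):
            if doc_a == doc_b and abs(pos_a - pos_b) <= max_distance:
                hits.add(doc_a)
    return hits

print(and_query("xml", "index"))                      # {15}
print(near_query("information", "retrieval", 20))     # {16}
```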
![Page 13: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/13.jpg)
An Inverted File
• Search for
– “databases”
– “microsoft”

term         docURL
data         http://www-inst.eecs.berkeley.edu/~cs186
database     http://www-inst.eecs.berkeley.edu/~cs186
date         http://www-inst.eecs.berkeley.edu/~cs186
day          http://www-inst.eecs.berkeley.edu/~cs186
dbms         http://www-inst.eecs.berkeley.edu/~cs186
decision     http://www-inst.eecs.berkeley.edu/~cs186
demonstrate  http://www-inst.eecs.berkeley.edu/~cs186
description  http://www-inst.eecs.berkeley.edu/~cs186
design       http://www-inst.eecs.berkeley.edu/~cs186
desire       http://www-inst.eecs.berkeley.edu/~cs186
developer    http://www.microsoft.com
differ       http://www-inst.eecs.berkeley.edu/~cs186
disability   http://www.microsoft.com
discussion   http://www-inst.eecs.berkeley.edu/~cs186
division     http://www-inst.eecs.berkeley.edu/~cs186
do           http://www-inst.eecs.berkeley.edu/~cs186
document     http://www-inst.eecs.berkeley.edu/~cs186
![Page 14: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/14.jpg)
Other indexing structures
• Signature files
– Each document has an associated signature, generated by hashing each term it contains
– Leads to possible matches; further processing is needed to resolve them
• Bitmaps
– One-to-one hash function; each distinct term in the collection has a bit vector with one bit for each document
– Special case of signature file; storage expensive
![Page 15: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/15.jpg)
Signature Files
Signature size. Number of bits in a signature, F.
Word signature. A bit pattern of size F with exactly m bits set to 1 and the others 0.
Block. A sequence of text that contains D distinct words.
Block signature. The logical or of all the word signatures in a block of text.
![Page 16: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/16.jpg)
Signature File
• Each document is divided into “logical blocks”
-- pieces of text that contain a constant number
D of distinct, non-common words
• Each word yields a “word signature” which is a
bit pattern of size F, with m bits set to 1 and the
rest to 0
– F and m are design parameters
![Page 17: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/17.jpg)
Sample Signature File
Word Signature
free 001 000 110 010
text 000 010 101 001
block signature 001 010 111 011
Figure, D=2, F=12, m=4
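The block signature in this figure can be reproduced by OR-ing the two word signatures. The sketch below takes the bit patterns from the figure as given; the helper names `to_bits` and `to_pattern` are made up, and a real system would derive each word signature from a hash function rather than hard-coding it.

```python
def to_bits(pattern: str) -> int:
    """Parse a signature such as '001 000 110 010' into an integer."""
    return int(pattern.replace(" ", ""), 2)

def to_pattern(bits: int, size: int) -> str:
    """Format an integer as a size-bit string in groups of three."""
    s = format(bits, f"0{size}b")
    return " ".join(s[i:i + 3] for i in range(0, size, 3))

F = 12  # signature size from the figure
word_signatures = {
    "free": to_bits("001 000 110 010"),
    "text": to_bits("000 010 101 001"),
}

block_signature = 0
for sig in word_signatures.values():
    block_signature |= sig          # logical OR of all word signatures

print(to_pattern(block_signature, F))   # 001 010 111 011
```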
![Page 18: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/18.jpg)
data 0000 0000 0000 0010 0000
base 0000 0001 0000 0000 0000
management 0000 1000 0000 0000 0000
system 0000 0000 0000 0000 1000
----------------------------------------
block
signature 0000 1001 0000 0010 1000
Figure, D=4, F=20, m=1
![Page 19: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/19.jpg)
Signature File
• Searching
– Examine each block signature for 1s in those bit positions where the signature of the search word has a 1.
• False drop
– A word signature may match the block signature even though the word is not in the block. This is a “false hit” or “false drop”.
– The false drop probability is the probability that the signature test fails in this way.
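A small sketch of the signature test and of a false drop, reusing the block signature for “free text” from the earlier sample figure. The signatures for “data” and “tree” are invented for this example (each has m=4 bits set); they are not taken from the slides.

```python
def to_bits(pattern: str) -> int:
    """Parse a signature such as '001 000 110 010' into an integer."""
    return int(pattern.replace(" ", ""), 2)

def may_contain(block_sig: int, word_sig: int) -> bool:
    """Signature test: every 1 bit of the word must also be set in the block."""
    return (block_sig & word_sig) == word_sig

block_words = {"free", "text"}
block_sig = to_bits("001 010 111 011")

queries = {
    "free": to_bits("001 000 110 010"),   # from the sample figure
    "data": to_bits("001 010 100 001"),   # hypothetical; word is not in the block
    "tree": to_bits("100 000 110 010"),   # hypothetical; word is not in the block
}
for word, sig in queries.items():
    candidate = may_contain(block_sig, sig)
    false_drop = candidate and word not in block_words
    print(word, candidate, false_drop)
```

Here “data” passes the signature test although it is absent from the block, so the block text must still be checked to weed out the false drop; “tree” is rejected by the signature alone.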
![Page 20: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/20.jpg)
Sistrings
• Original text:
“The traditional approach for searching a regular expression…”
• Sistrings
1. “The traditional approach for searching …”
2. “he traditional approach for searching a…”
3. “e traditional approach for searching a …”
12. “onal approach for searching a regular …”
![Page 21: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/21.jpg)
Sistrings
• Once upon a time, in a far away land ...
– sistring1: Once upon a time ...
– sistring2: nce upon a time ...
– sistring8: on a time, in a ...
– sistring11: a time, in a far ...
– sistring22: a far away land ...
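The numbering above is the 1-based character position at which each sistring starts. A minimal sketch, assuming the text is an ordinary Python string and that a sistring simply runs to the end of the available text (the function name `sistring` is made up):

```python
def sistring(text: str, position: int) -> str:
    """Sistring starting at a 1-based character position."""
    return text[position - 1:]

text = "Once upon a time, in a far away land ..."
for pos in (1, 2, 8, 11, 22):
    print(f"sistring{pos}: {sistring(text, pos)[:16]} ...")
```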
![Page 22: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/22.jpg)
PAT Trees
• PAT Tree:
– a Patricia Tree constructed over all the possible sistrings of a document
– bits of the key decide branching
• 0 is branch to left subtree
• 1 is branch to right subtree
• internal node decides which bit of the key to use
• at leaf node, check any skipped bits
• The PAT (suffix) tree of a string S is a compacted trie over the semi-infinite strings (sistrings) of S, so it represents all substrings of S.
![Page 23: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/23.jpg)
PATRICIA TREE
• A particular type of “trie”
• Example, trie and PATRICIA TREE with content ‘010’, ‘011’, and ‘101’.
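[Figure: a binary trie with levels Lv0, Lv1, Lv2 and leaves 010, 011, 101, shown next to the corresponding PATRICIA TREE. In the PATRICIA TREE the root tests Lv0: the 1 branch leads directly to 101, and the 0 branch leads to a node testing Lv2 that separates 010 from 011. Edges are labelled 0 and 1.]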
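The step from trie to PATRICIA tree for the three keys 010, 011, and 101 can be sketched as below. This is a simplified leaf-based compressed binary trie, not Sedgewick's back-pointer formulation; it assumes distinct keys of equal length, and the names `Leaf`, `Node`, `build`, and `lookup` are made up for the example.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    key: str

@dataclass
class Node:
    bit: int            # index of the key bit tested at this node (0 = Lv0)
    left: "Tree"        # followed when the bit is 0
    right: "Tree"       # followed when the bit is 1

Tree = Union[Leaf, Node]

def build(keys: list[str], bit: int = 0) -> Tree:
    """Build a compressed binary trie over distinct, equal-length bit strings."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # Skip bit positions on which all remaining keys agree.
    while len({k[bit] for k in keys}) == 1:
        bit += 1
    zeros = [k for k in keys if k[bit] == "0"]
    ones = [k for k in keys if k[bit] == "1"]
    return Node(bit, build(zeros, bit + 1), build(ones, bit + 1))

def lookup(tree: Tree, key: str) -> bool:
    """Follow the tested bits, then compare the whole key at the leaf."""
    while isinstance(tree, Node):
        tree = tree.left if key[tree.bit] == "0" else tree.right
    return tree.key == key      # this comparison checks any skipped bits

tree = build(["010", "011", "101"])
print(tree.bit, tree.right, tree.left.bit)
print(lookup(tree, "011"), lookup(tree, "111"))
```

The lookup of 111 follows the same path as 101 and is rejected only by the final comparison at the leaf, which is why skipped bits must be checked there.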
![Page 24: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/24.jpg)
PAT Tree
[Figure: PAT tree over sistrings 1-8 of the text below. Legend: one symbol marks a sistring, the other marks the position to check.]
Text:     01100100010111...
Position: 123456789....
Query: 00101
Sistrings 1-8 already indexed
![Page 25: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/25.jpg)
Try to build the Patricia tree
• A 00001
• S 10011
• E 00101
• R 10010
• C 00011
• H 01000
• I 01001
• N 01110
• G 00111
• X 11000
• M 01101
• P 10000
![Page 26: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/26.jpg)
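PAT Tree
[Figure: the Patricia tree built from the twelve keys. Node labels as they appear row by row in the drawing: A; E, S; R, X, C, H; G, I, N; M; P.]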
![Page 27: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/27.jpg)
Example
Text       01100100010111 …
sistring 1 01100100010111 …
sistring 2 1100100010111 …
sistring 3 100100010111 …
sistring 4 00100010111 …
sistring 5 0100010111 …
sistring 6 100010111 …
sistring 7 00010111 …
sistring 8 0010111 ...

[Figure: the PAT tree drawn after successive sistrings are inserted. Legend: external node = sistring (integer displacement); internal node = skip counter & pointer; total displacement of the bit to be inspected. Edges are labelled 0 and 1.]
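The answer that a PAT tree lookup should give for the earlier query 00101 can be cross-checked by brute force. The sketch below is not a tree traversal: it simply tests the query as a prefix of each of the eight indexed sistrings, which is the property the tree exploits.

```python
text = "01100100010111"
query = "00101"

# Sistrings 1-8, keyed by their 1-based starting position in the text.
sistrings = {pos: text[pos - 1:] for pos in range(1, 9)}

matches = [pos for pos, s in sistrings.items() if s.startswith(query)]
print(matches)   # [8]
```

Only sistring 8 (0010111...) begins with the query, so the query occurs in the text at position 8.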
![Page 28: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/28.jpg)
SISTRING
• Working at the bit level is too abstract; depending on the application, we rarely apply this at the bit level. The character level is a better idea!
– e.g. CUHK
– Corresponding sistrings would be
• CUHK000…
• UHK000…
• HK000…
• K000…
– We require that each sistring be at least 4 characters long.
– (Why do we pad 0/NULL at the end of each sistring?)
![Page 29: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/29.jpg)
SISTRING (USAGE)
• We may instead store the sistrings of ‘CUHK’, which requires O(n²) storage.
– CUHK <- represents C, CU, CUH, CUHK at the same time
– UHK0 <- represents U, UH, UHK at the same time
– HK00 <- represents H, HK at the same time
– K000 <- represents K only
• Prefix matching on sistrings is equivalent to exact matching on the substrings.
• Conclusion: sistrings are a better representation for storing substring information.
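The equivalence between prefix matching on sistrings and exact matching on substrings can be checked directly for ‘CUHK’. This is a brute-force illustration only; the helper name `find` is made up, and padding is omitted because Python strings carry their own length.

```python
text = "CUHK"
n = len(text)

sistrings = [text[i:] for i in range(n)]                       # n strings
substrings = {text[i:j] for i in range(n) for j in range(i + 1, n + 1)}

def find(pattern: str) -> list[int]:
    """1-based start positions of sistrings that have pattern as a prefix."""
    return [i + 1 for i, s in enumerate(sistrings) if s.startswith(pattern)]

print(len(substrings), len(sistrings))   # 10 substrings, 4 sistrings
print(find("UH"), find("HK"), find("KU"))
```

Every one of the n(n+1)/2 substrings is found as a prefix of some sistring, while a string that is not a substring (such as KU) matches none.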
![Page 30: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/30.jpg)
PAT Tree (Example)
• By digitizing the strings, we can visualize by hand what the PAT Tree looks like.
• The following are the actual bit patterns of the four sistrings:

CUHK 01000011 01010101 01001000 01001011
UHK0 01010101 01001000 01001011 00000000
HK00 01001000 01001011 00000000 00000000
K000 01001011 00000000 00000000 00000000

[Figure: PAT Tree of the four sistrings, with internal nodes labelled bit 3, bit 4, and bit 6, edges labelled 0 and 1, and leaves UHK0, K000, HK00, and CUHK.]
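The branching bits named in the figure (bit 3, bit 4, bit 6) can be recomputed from the ASCII codes, counting bits from 0 at the left. The sketch relies on a general property of compressed binary tries: with the keys sorted, each internal node corresponds to the first bit at which two neighbouring keys differ. The helper names are made up, and the padding is assumed to be NUL bytes.

```python
def to_bits(s: str, width: int = 4) -> str:
    """ASCII bit pattern of s, padded with NUL bytes to width characters."""
    padded = s.encode("ascii").ljust(width, b"\x00")
    return "".join(format(byte, "08b") for byte in padded)

def first_difference(a: str, b: str) -> int:
    """0-based index of the first bit where a and b differ."""
    return next(i for i, (x, y) in enumerate(zip(a, b)) if x != y)

sistrings = ["CUHK", "UHK", "HK", "K"]
keys = sorted(to_bits(s) for s in sistrings)
branch_bits = sorted(first_difference(a, b) for a, b in zip(keys, keys[1:]))
print(branch_bits)   # [3, 4, 6]
```

All three branching bits fall inside the first byte, so only the first character of each sistring is needed to tell the four apart.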
![Page 31: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/31.jpg)
PAT Tree (Example)
• This works! BUT…
– We still need O(n²) memory for storing those sistrings
• We may reduce the memory to O(n) by making use of pointers.

Sistring                        Bit pattern
Hello This document is simple   01001000 …
This document is simple         01010100 …
document is simple              01100100 …
is simple                       01101001 …
simple                          01110011 …

[Figure: PAT Tree of a REAL (but very simple) document. Internal nodes are labelled bit 2, bit 3, bit 3, and bit 4; edges are labelled 0 and 1; the leaves are the five sistrings “Hello. This document is simple.”, “This document is simple.”, “document is simple”, “is simple”, and “simple”.]
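A sketch of the O(n) idea: rather than copying each sistring, keep one integer offset into the text per sistring and compare against the text in place. Word-level sistrings are assumed, as in the example document; the names `offsets` and `search` are made up, and the linear scan stands in for the tree traversal.

```python
text = "Hello This document is simple"

# One integer per word start instead of a copy of each sistring.
offsets = [0] + [i + 1 for i, ch in enumerate(text) if ch == " "]

def search(prefix: str) -> list[int]:
    """Offsets whose sistring starts with prefix, compared in place."""
    return [off for off in offsets if text.startswith(prefix, off)]

print(offsets)
print(search("document is"))
```

The index now holds five integers regardless of how long each sistring is; the characters themselves live only once, in `text`.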
![Page 32: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/32.jpg)
Space/Time Tradeoffs
[Figure: the four index structures (inverted files, flat files, signature files, PAT trees) plotted on Space and Time axes.]
![Page 33: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/33.jpg)
Stemming
• Reason:
– Different word forms may bear similar meaning (e.g. search, searching): create a “standard” representation for them
• Stemming:
– Removing some endings of words
– computer, compute, computes, computing, computed, computation → comput
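The “remove some endings” idea can be sketched with a deliberately naive suffix stripper. This is not the Porter algorithm or any published stemmer: the suffix list and the minimum stem length of 3 are assumptions chosen only so that the example words above collapse to the same stem.

```python
# Suffixes are tried in this order, so longer endings win over shorter ones.
SUFFIXES = ("ation", "ing", "er", "es", "ed", "e", "s")

def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

words = ["computer", "compute", "computes", "computing", "computed", "computation"]
print({w: naive_stem(w) for w in words})
```

A rule list this crude will over-stem or under-stem many other words, which is one reason real stemmers use more careful conditions.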
![Page 34: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/34.jpg)
Inverted File, Stemmed
Term       Record  Frequency
comput     1       3
comput     3       5
comput     2       1
distribut  2       1
parallel   1       2
system     2       1
...        ...     ...
![Page 35: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/35.jpg)
Stemming
• am, are, is → be
• car, cars, car's, cars' → car
• the boy's cars are different colors → the boy car be differ color
![Page 36: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/36.jpg)
Stemming
• Manual or Automatic
• Can reduce index files up to 50%
• Effectiveness studies of stemming are mixed, but in general it has either no effect or a positive effect when measuring both precision and recall
![Page 37: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/37.jpg)
Stopwords
• Stopwords exist in stoplists or negative dictionaries
• Idea: remove words with low semantic content
– index should only have “important stuff”
• What not to index is domain dependent, but often includes:
– “small” words: a, and, the, but, of, an, very, etc.
– case is removed
– punctuation
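The three removals listed above (small words, case, punctuation) can be sketched as a single pre-processing step. The stoplist holds only the words named on the slide, and the sample sentence and the function name `index_terms` are made up; a real stoplist is domain dependent.

```python
import string

STOPLIST = {"a", "and", "the", "but", "of", "an", "very"}

def index_terms(text: str) -> list[str]:
    """Lowercase the text, strip ASCII punctuation, and drop stopwords."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [tok for tok in cleaned.split() if tok not in STOPLIST]

print(index_terms("The index of a Book, and a very small one."))
```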
![Page 38: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/38.jpg)
Stop words
• Very common words that have no discriminatory power
• e.g. in Arabic: من (from), إلى (to), في (in), ...
![Page 39: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/39.jpg)
Normalization
• Token normalization
– Canonicalizing tokens so that matches occur despite superficial differences in the character sequences of the tokens
– U.S.A vs USA
– Anti-discriminatory vs antidiscriminatory
– Car vs automobile?
![Page 40: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/40.jpg)
Capitalization/case folding
• Good for
– Allow instances of Automobile at the beginning of a sentence to match with a query of automobile
– Helps a search engine when most users type ferrari when they are interested in a Ferrari car
• Bad for
– Proper names vs common nouns
– General Motors, Associated Press, Black
• Heuristic solution: lowercase only words at the beginning of the sentence; true casing via machine learning
![Page 41: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/41.jpg)
Performance of search
• 3 major classes of performance measures
– precision / recall
• TREC conference series, http://trec.nist.gov/
– space / time
• see Esler & Nelson, JNCA for an example
• http://techreports.larc.nasa.gov/ltrs/PDF/1997/jp/NASA-97-jnca-sle.pdf
– usability
• probably the most important measure, but largely ignored
![Page 42: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/42.jpg)
Precision and Recall
• Precision = (No. of relevant documents retrieved) / (Total no. of documents retrieved)
• Recall = (No. of relevant documents retrieved) / (Total no. of relevant documents in database)
![Page 43: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/43.jpg)
Standard Evaluation Measures
Starts with a CONTINGENCY table

               retrieved     not retrieved
relevant       w             x               n1 = w + x
not relevant   y             z
               n2 = w + y                    N
![Page 44: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/44.jpg)
Precision and Recall
Recall: w / (w + x)
From all the documents that are relevant out there, how many did the IR system retrieve?
Precision: w / (w + y)
From all the documents that are retrieved by the IR system, how many are relevant?
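A worked example using the contingency-table cells, where w is relevant and retrieved, x is relevant but not retrieved, and y is retrieved but not relevant. The counts are hypothetical, and the function assumes at least one document was retrieved and at least one is relevant, so neither denominator is zero.

```python
def precision_recall(w: int, x: int, y: int) -> tuple[float, float]:
    """Return (precision, recall) from contingency-table counts."""
    precision = w / (w + y)     # share of retrieved documents that are relevant
    recall = w / (w + x)        # share of relevant documents that were retrieved
    return precision, recall

# Hypothetical counts: 30 relevant retrieved, 20 relevant missed, 10 irrelevant retrieved.
p, r = precision_recall(w=30, x=20, y=10)
print(f"precision={p:.2f} recall={r:.2f}")   # precision=0.75 recall=0.60
```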
![Page 45: Introduction to Digital Libraries Information Retrieval.](https://reader036.fdocuments.us/reader036/viewer/2022081513/56649f435503460f94c635db/html5/thumbnails/45.jpg)
User-Centered IR Evaluation
• More user-oriented measures
– Satisfaction, informativeness
• Other types of measures
– Time, cost-benefit, error rate, task analysis
• Evaluation of user characteristics
• Evaluation of interface
• Evaluation of process or interaction