Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.
![Page 1: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/1.jpg)
Indexing
LBSC 796/CMSC 828o, Session 9, March 29, 2004
Doug Oard
![Page 2: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/2.jpg)
Agenda
• Questions
• Finish up evaluation from last time
• Computational complexity
• Inverted indexes
• Project planning
![Page 3: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/3.jpg)
User Studies
• Goal is to account for interface issues
  – By studying the interface component
  – By studying the complete system
• Formative evaluation
  – Provide a basis for system development
• Summative evaluation
  – Designed to assess performance
![Page 4: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/4.jpg)
Quantitative User Studies
• Select independent variable(s)
  – e.g., what info to display in selection interface
• Select dependent variable(s)
  – e.g., time to find a known relevant document
• Run subjects in different orders
  – Average out learning and fatigue effects
• Compute statistical significance
  – Null hypothesis: independent variable has no effect
  – Rejected if p<0.05
![Page 5: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/5.jpg)
Variation in Automatic Measures
• System
  – What we seek to measure
• Topic
  – Sample topic space, compute expected value
• Topic+System
  – Pair by topic and compute statistical significance
• Collection
  – Repeat the experiment using several collections
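One way to "pair by topic and compute statistical significance" is an exact sign-flip (randomization) test on the per-topic score differences. The slides do not name a particular test, so this choice is an assumption, and the per-topic scores below are made-up numbers used only for illustration.

```python
from itertools import product

def paired_sign_flip_p_value(scores_a, scores_b):
    """Exact two-sided sign-flip test on per-topic score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    # Under the null hypothesis the sign of each per-topic difference is
    # arbitrary, so enumerate every sign assignment (feasible for few topics).
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
    return extreme / total

# Hypothetical average precision per topic for two systems (made-up data).
system_a = [0.45, 0.30, 0.60, 0.20, 0.38, 0.52]
system_b = [0.35, 0.25, 0.40, 0.22, 0.30, 0.40]
p_value = paired_sign_flip_p_value(system_a, system_b)
print(p_value)
```

With these six made-up topics only 4 of the 64 sign assignments are at least as extreme as the observed one, so p = 0.0625 and the null hypothesis would not be rejected at p<0.05. Pairing by topic removes the topic-to-topic variation that would otherwise swamp the system effect.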
![Page 6: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/6.jpg)
Additional Effects in User Studies
• Learning
  – Vary topic presentation order
• Fatigue
  – Vary system presentation order
• Topic+User (Expertise)
  – Ask about prior knowledge of each topic
![Page 7: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/7.jpg)
Presentation Order
![Page 8: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/8.jpg)
Document Selection Experiments
[Diagram: Topic Description, Standard Ranked List, Interactive Selection; measure shown: F with α = 0.8]
![Page 9: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/9.jpg)
Measures of Effectiveness
• Query Formulation: Uninterpolated Average Precision
  – Expected value of precision [over relevant document positions]
  – Interpreted based on query content at each iteration
  – $AP = E_j\left[\frac{j}{r(j)}\right]$, reading $r(j)$ as the rank of the $j$-th relevant document
• Document Selection: Unbalanced F-Measure
  – $F_\alpha = \frac{1}{\frac{\alpha}{P} + \frac{1-\alpha}{R}}$
  – P = precision
  – R = recall
  – α = 0.8 favors precision
    • Models expensive human translation
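The two measures on this slide can be computed in a few lines. This is a minimal sketch: the function names, the toy ranked list, and the precision/recall values are assumptions for illustration, and the F-measure uses the weighted harmonic form F = 1 / (α/P + (1−α)/R), in which α = 0.8 puts most of the weight on precision.

```python
def average_precision(ranked_relevance, num_relevant):
    """Uninterpolated average precision.

    ranked_relevance: list of booleans, one per ranked document, top first.
    num_relevant: total relevant documents for the topic (unretrieved
    relevant documents contribute a precision of zero).
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant document
    return precision_sum / num_relevant if num_relevant else 0.0

def f_alpha(precision, recall, alpha=0.8):
    """Unbalanced F-measure; alpha > 0.5 favors precision."""
    if precision == 0 or recall == 0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)

# Toy example: relevant documents at ranks 1 and 3, two relevant in total.
ap = average_precision([True, False, True, False, False], num_relevant=2)
f = f_alpha(precision=0.5, recall=0.25)
print(round(ap, 4), round(f, 4))
```

In the toy example the precision values at the two relevant documents are 1/1 and 2/3, so the average is about 0.8333; with P = 0.5 and R = 0.25 the F-measure is 1/2.4, about 0.4167.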
![Page 10: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/10.jpg)
End-to-End Experiments
[Diagram: stages Query Formulation, Automatic Retrieval, and Interactive Selection, starting from a Topic Description; measures shown: Average Precision and F with α = 0.8]
![Page 11: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/11.jpg)
End-to-End Experiment Results
[Figure: results chart, F with α = 0.8]
English queries, German documents
4 searchers, 20 minutes per topic
![Page 12: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/12.jpg)
Summary
• Qualitative user studies suggest what to build
• Design decomposes task into components
• Automated evaluation helps to refine components
• Quantitative user studies show how well it works
![Page 13: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/13.jpg)
Supporting the Search Process
[Diagram: search process stages Source Selection, Query Formulation, Search, Selection, Examination, and Delivery, passing along Query, Ranked List, and Document; the IR System box contains Acquisition (Collection) and Indexing (Index)]
![Page 14: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/14.jpg)
Some Questions for Today
• How long will it take to find a document?
  – Is there any work we can do in advance?
    • If so, how long will that take?
• How big a computer will I need?
  – How much disk space? How much RAM?
• What if more documents arrive?
  – How much of the advance work must be repeated?
  – Will searching become slower?
  – How much more disk space will be needed?
![Page 15: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/15.jpg)
A Cautionary Tale
• Searching is easy - just ask Microsoft!
  – “Find” can search my hard drive in a few minutes
    • If it only looks at the file names...
• How long would it take for:
  – A 100 GB disk?
  – The World Wide Web?
• Computers are getting faster, but…
  – How does Google give answers in 3 seconds?
![Page 16: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/16.jpg)
Find “complex” in the dictionary
[Figure: looking up “complex” in an alphabetized word list: arcade, astronomical, belligerent, complex, marsupial, mastiff, relatively, relaxation, resplendent]
![Page 17: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/17.jpg)
Computational Complexity
• Time complexity: how long will it take?
• Space complexity: how much memory is needed?
• Things you need to know to assess complexity:
  – What is the “size” of the input? (“n”)
    • What aspects of the input are we paying attention to?
  – How is the input represented?
  – How is the output represented?
  – What are the internal data structures?
  – What is the algorithm?
![Page 18: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/18.jpg)
Worst Case Complexity
[Figure: plot of 10n, n^2, and 100n for n = 10 to 40; vertical axis from 0 to 4500]
![Page 19: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/19.jpg)
[Figure: plot of 10n, n^2, 100n, and 100n+25263 for n = 50 to 350; vertical axis from 0 to 140000]
• 10n: O(n)
• 100n: O(n)
• 100n+25263: O(n)
• n^2: O(n^2)
• n^2+45662: O(n^2)
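The point of the two plots is that constant factors and additive constants only move the crossover point; for large enough n the quadratic curve always ends up on top. A small sketch that finds those crossover points for the functions named on the slide (the helper name is an assumption):

```python
def first_n_where_quadratic_exceeds(linear_coeff, linear_const=0):
    """Smallest positive integer n with n^2 > linear_coeff * n + linear_const."""
    n = 1
    while n * n <= linear_coeff * n + linear_const:
        n += 1
    return n

crossover_10n = first_n_where_quadratic_exceeds(10)            # n^2 vs 10n
crossover_100n = first_n_where_quadratic_exceeds(100)          # n^2 vs 100n
crossover_shifted = first_n_where_quadratic_exceeds(100, 25263)  # n^2 vs 100n+25263
print(crossover_10n, crossover_100n, crossover_shifted)
```

The crossovers fall at n = 11, n = 101, and n = 217. That is why n^2 looks cheaper than 100n on the first plot (n up to 40) but overtakes every linear curve within the range of the second.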
![Page 20: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/20.jpg)
“Asymptotic” Complexity
• Constant, i.e. O(1)
  – n doesn’t matter
• Sublinear, e.g. O(log n)
  – n = 65536, log2 n = 16
• Linear, i.e. O(n)
  – n = 65536, n = 65536
• Polynomial, e.g. O(n^3)
  – n = 65536, n^3 = 281,474,976,710,656
• Exponential, e.g. O(2^n)
  – n = 65536, beyond astronomical
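The numbers on this slide are easy to reproduce. The sketch below assumes the logarithm is base 2 (which gives the slide's value of 16) and reports only the bit length of 2^n rather than printing the number itself.

```python
import math

n = 65536

log_n = math.log2(n)             # sublinear growth
cube = n ** 3                    # polynomial growth
exp_bits = (2 ** n).bit_length() # size of 2^n, measured in bits

print(log_n)     # number of halvings needed to get from n down to 1
print(cube)
print(exp_bits)
```

For n = 65536 the logarithm is 16, the cube is 281,474,976,710,656, and 2^n is a number 65,537 bits long, which is what "beyond astronomical" means here.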
![Page 21: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/21.jpg)
The “Inverted File” Trick
• Organize the bag of words matrix by terms
  – You know the terms that you are looking for
• Look up terms like you search dictionaries
  – For each letter, jump directly to the right spot
    • For terms of reasonable length, this is very fast
  – For each term, store the document identifiers
    • For every document that contains that term
• At query time, use the document identifiers
  – Consult a “postings file”
![Page 22: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/22.jpg)
An Example
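| Term | Doc 1 | Doc 2 | Doc 3 | Doc 4 | Doc 5 | Doc 6 | Doc 7 | Doc 8 | Postings |
|---|---|---|---|---|---|---|---|---|---|
| aid | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 4, 8 |
| all | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2, 4, 6 |
| back | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1, 3, 7 |
| brown | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1, 3, 5, 7 |
| come | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2, 4, 6, 8 |
| dog | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 3, 5 |
| fox | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 3, 5, 7 |
| good | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 2, 4, 6, 8 |
| jump | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 3 |
| lazy | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 1, 3, 5, 7 |
| men | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 2, 4, 8 |
| now | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 2, 6, 8 |
| over | 1 | 0 | 1 | 0 | 1 | 0 | 1 | 1 | 1, 3, 5, 7, 8 |
| party | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 6, 8 |
| quick | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1, 3 |
| their | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1, 5, 7 |
| time | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 2, 4, 6 |

Inverted file keys: A (AI, AL), B (BA, BR), C, D, F, G, J, L, M, N, O, P, Q, T (TH, TI)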
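The example can be reproduced by reading the term-document matrix one document column at a time and appending the document number to the list of every term whose bit is set. This sketch assumes the matrix rows are in alphabetical term order, which is the ordering under which the bit strings agree with the postings lists shown; the variable and function names are illustrative.

```python
# Terms in alphabetical order (assumed row order of the matrix).
TERMS = ["aid", "all", "back", "brown", "come", "dog", "fox", "good", "jump",
         "lazy", "men", "now", "over", "party", "quick", "their", "time"]

# One bit string per document column, copied from the example.
DOC_COLUMNS = {
    1: "00110000010010110",
    2: "01001001001100001",
    3: "00110110110010100",
    4: "11001001001000001",
    5: "00010110010010010",
    6: "01001001000101001",
    7: "00110010010010010",
    8: "10001001001111000",
}

def build_postings(terms, doc_columns):
    """Invert a term-document bit matrix into term -> sorted document list."""
    postings = {term: [] for term in terms}
    for doc_id in sorted(doc_columns):  # ascending order keeps lists sorted
        for term, bit in zip(terms, doc_columns[doc_id]):
            if bit == "1":
                postings[term].append(doc_id)
    return postings

postings = build_postings(TERMS, DOC_COLUMNS)
for term in TERMS:
    print(term, postings[term])
```

A Python dictionary stands in for the inverted file here; the slide's letter-by-letter lookup structure (A, then AI or AL, and so on) plays the same role of mapping a term to the location of its postings.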
![Page 23: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/23.jpg)
The Finished Product
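| Inverted file key | Term | Postings |
|---|---|---|
| AI | aid | 4, 8 |
| AL | all | 2, 4, 6 |
| BA | back | 1, 3, 7 |
| BR | brown | 1, 3, 5, 7 |
| C | come | 2, 4, 6, 8 |
| D | dog | 3, 5 |
| F | fox | 3, 5, 7 |
| G | good | 2, 4, 6, 8 |
| J | jump | 3 |
| L | lazy | 1, 3, 5, 7 |
| M | men | 2, 4, 8 |
| N | now | 2, 6, 8 |
| O | over | 1, 3, 5, 7, 8 |
| P | party | 6, 8 |
| Q | quick | 1, 3 |
| TH | their | 1, 5, 7 |
| TI | time | 2, 4, 6 |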
![Page 24: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/24.jpg)
What Goes in a Postings File?
• Boolean retrieval
  – Just the document number
• Ranked Retrieval
  – Document number and term weight (TF*IDF, ...)
• Proximity operators
  – Word offsets for each occurrence of the term
    • Example: Doc 3 (t17, t36), Doc 13 (t3, t45)
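A sketch of what positional postings might look like in memory and how a proximity operator could use them. The offsets for the first term are the ones from the slide's example; the term names, the second term's postings, and the `within` helper are hypothetical.

```python
# Positional postings: term -> {document number: [word offsets]}.
positional = {
    # Offsets from the slide: Doc 3 (t17, t36), Doc 13 (t3, t45).
    # The term name "fox" is a hypothetical label for that entry.
    "fox": {3: [17, 36], 13: [3, 45]},
    # Hypothetical second term, invented for this illustration.
    "dog": {3: [19], 13: [60]},
}

def within(postings_a, postings_b, max_distance):
    """Documents where the two terms occur within max_distance words."""
    matches = []
    for doc_id in sorted(postings_a.keys() & postings_b.keys()):
        if any(abs(a - b) <= max_distance
               for a in postings_a[doc_id]
               for b in postings_b[doc_id]):
            matches.append(doc_id)
    return matches

# Boolean retrieval needs only the document numbers...
boolean_postings = sorted(positional["fox"])
# ...while a proximity query needs the offsets as well.
near = within(positional["fox"], positional["dog"], max_distance=3)
print(boolean_postings, near)
```

Both documents contain both terms, but only Doc 3 has them within three words of each other (offsets 17 and 19). Storing an offset for every occurrence is what makes positional postings so much larger than Boolean ones.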
![Page 25: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/25.jpg)
How Big Is the Postings File?
• Very compact for Boolean retrieval
  – About 10% of the size of the documents
    • If an aggressive stopword list is used!
• Not much larger for ranked retrieval
  – Perhaps 20%
• Enormous for proximity operators
  – Sometimes larger than the documents!
![Page 26: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/26.jpg)
Building an Inverted Index
• Simplest solution is a single sorted array
  – Fast lookup using binary search
  – But sorting large files on disk is very slow
  – And adding one document means starting over
• Tree structures allow easy insertion
  – But the worst case lookup time is linear
• Balanced trees provide the best of both
  – Fast lookup and easy insertion
  – But they require 45% more disk space
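The sorted-array option can be sketched with Python's standard `bisect` module. The term list is borrowed from the lecture's running example, and the `lookup` helper is an assumption. Lookup takes O(log n) comparisons, but inserting a term shifts everything after it, which is the in-memory analogue of having to rewrite a sorted file on disk.

```python
import bisect

# Sorted array of terms (from the lecture's running example).
terms = ["aid", "all", "back", "brown", "come", "dog", "fox", "good", "jump",
         "lazy", "men", "now", "over", "party", "quick", "their", "time"]

def lookup(sorted_terms, term):
    """Binary search: return the index of term, or -1 if it is absent."""
    i = bisect.bisect_left(sorted_terms, term)
    if i < len(sorted_terms) and sorted_terms[i] == term:
        return i
    return -1

before = lookup(terms, "good")
missing = lookup(terms, "complex")

# Insertion keeps the array sorted but moves every later element by one.
bisect.insort(terms, "complex")
after = lookup(terms, "good")
inserted_at = lookup(terms, "complex")
print(before, missing, inserted_at, after)
```

After the insertion, "good" has moved from index 7 to index 8. A balanced tree such as a B+ tree avoids that wholesale shifting, at the cost of extra space for partly filled nodes.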
![Page 27: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/27.jpg)
Starting a B+ Tree Inverted File
[Figure: B+ tree over the terms all, good, now, time, with index keys “aaaaa” and “now”, built from the text “Now is the time for all good …”]
![Page 28: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/28.jpg)
Adding a New Term
[Figure: the same B+ tree after inserting the new term “men” from the text “Now is the time for all good men …”; index keys shown are “aaaaa”, “men”, and “now”]
![Page 29: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/29.jpg)
How Big is the Inverted Index?
• Typically smaller than the postings file
  – Depends on number of terms, not documents
• Eventually, most terms will already be indexed
  – But the postings file will continue to grow
• Postings dominate asymptotic space complexity
  – Linear in the number of documents
![Page 30: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/30.jpg)
Index Compression
• CPUs are much faster than disks
  – A disk can transfer 1,000 bytes in ~20 ms
  – The CPU can do ~10 million instructions in that time
• Compressing the postings file is a big win
  – Trade decompression time for fewer disk reads
• Key idea: reduce redundancy
  – Trick 1: store relative offsets (some will be the same)
  – Trick 2: use an optimal coding scheme
![Page 31: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/31.jpg)
Compression Example
• Postings (one byte each = 7 bytes = 56 bits)
  – 37, 42, 43, 48, 97, 98, 243
• Difference
  – 37, 5, 1, 5, 49, 1, 145
• Optimal Huffman Code
  – 0:1, 10:5, 110:37, 1110:49, 1111:145
• Compressed (17 bits)
  – 11010010111001111
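The two tricks can be traced step by step: take differences between successive document numbers, then replace each difference with its variable-length codeword. This sketch uses the code table exactly as given on the slide and does not construct or verify the code itself; the function names are assumptions.

```python
# Codeword for each gap value, copied from the slide.
CODE = {1: "0", 5: "10", 37: "110", 49: "1110", 145: "1111"}

def gaps(postings):
    """Trick 1: store each document number relative to the previous one."""
    previous = 0
    result = []
    for doc_id in postings:
        result.append(doc_id - previous)
        previous = doc_id
    return result

def encode(postings, code):
    """Trick 2: concatenate the variable-length codeword for each gap."""
    return "".join(code[gap] for gap in gaps(postings))

def decode(bits, code):
    """Invert encode(); works because no codeword is a prefix of another."""
    inverse = {word: gap for gap, word in code.items()}
    postings, current, running_total = [], "", 0
    for bit in bits:
        current += bit
        if current in inverse:
            running_total += inverse[current]
            postings.append(running_total)
            current = ""
    return postings

original = [37, 42, 43, 48, 97, 98, 243]
compressed = encode(original, CODE)
print(gaps(original))
print(compressed, len(compressed))
print(decode(compressed, CODE))
```

The encoded string is the 17-bit sequence from the slide, against 56 bits for one byte per posting, and decoding recovers the original list. A real index would also need to store the code table, which this sketch ignores.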
![Page 32: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/32.jpg)
Indexing and Searching
• Indexing
  – Walk the inverted file, splitting if needed
  – Insert into the postings file in sorted order
  – Hours or days for large collections
• Query processing
  – Walk the inverted file
  – Read the postings file
  – Manipulate postings based on query
  – Seconds, even for enormous collections
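"Manipulate postings based on query" can be as simple as merging sorted lists. The sketch below handles a Boolean AND by walking two postings lists in step; the dictionary standing in for the inverted file and the function names are assumptions, and the postings are taken from the lecture's eight-document example.

```python
# A few postings lists from the eight-document example.
inverted_file = {
    "brown": [1, 3, 5, 7],
    "dog": [3, 5],
    "fox": [3, 5, 7],
    "quick": [1, 3],
}

def intersect(a, b):
    """Merge two sorted postings lists, keeping documents present in both."""
    i = j = 0
    result = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result

def boolean_and(index, term_a, term_b):
    """Look up both terms, then intersect their postings."""
    return intersect(index.get(term_a, []), index.get(term_b, []))

brown_and_fox = boolean_and(inverted_file, "brown", "fox")
quick_and_dog = boolean_and(inverted_file, "quick", "dog")
print(brown_and_fox, quick_and_dog)
```

Because postings are kept in sorted order at indexing time, the merge reads each list once, so query cost depends on the lengths of the postings involved rather than on the size of the whole collection.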
![Page 33: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/33.jpg)
Summary
• Slow indexing yields fast query processing
  – Key fact: most terms don’t appear in most documents
• We use extra disk space to save query time
  – Index space is in addition to document space
  – Time and space complexity must be balanced
• Disk block reads are the critical resource
  – This makes index compression a big win
![Page 34: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/34.jpg)
Project Options
• LBSC 796 MLS/MIM
  – Option 1: TREC-like IR evaluation (team of 2)
  – Option 2: Design and run a user study (team of 3)
• LBSC 796 Ph.D.
  – Research paper
• CMSC 828o
  – Program a new capability
![Page 35: Indexing LBSC 796/CMSC 828o Session 9, March 29, 2004 Doug Oard.](https://reader035.fdocuments.us/reader035/viewer/2022062504/5a4d1b587f8b9ab0599aa17d/html5/thumbnails/35.jpg)
One Minute Paper
What was the muddiest point in today’s lecture?