Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 2
March 26, 2006
http://www.ee.technion.ac.il/courses/049011
Information Retrieval
Information Retrieval Setting

- "Information need": I want information about Michael Jordan, the machine learning expert.
- The user turns the need into a query: +"Michael Jordan" -basketball
- The IR system runs the query against the document collection and returns a ranked list of retrieved documents:
  1. Michael I. Jordan's homepage
  2. NBA.com
  3. Michael Jordan on TV
- The user gives feedback: No. 1 is good, the rest are bad.
- The system returns a revised ranked list of retrieved documents:
  1. Michael I. Jordan's homepage
  2. M.I. Jordan's pubs
  3. Graphical Models
Information Retrieval vs. Data Retrieval

Information Retrieval System: a system that allows a user to retrieve documents that match her "information need" from a large corpus.
Ex: Get documents about Michael Jordan, the machine learning expert.

Data Retrieval System: a system that allows a user to retrieve all documents that match her query from a large corpus.
Ex: SELECT doc FROM corpus WHERE (doc.text CONTAINS "Michael Jordan") AND NOT (doc.text CONTAINS "basketball").
Information Retrieval vs. Data Retrieval
| | Information Retrieval | Data Retrieval |
|---|---|---|
| Data | Free text, unstructured | Database tables, structured |
| Queries | Keywords, natural language | SQL, relational algebras |
| Results | Approximate matches | Exact matches |
| Result ordering | Ordered by relevance | Unordered |
| Accessibility | Non-expert humans | Knowledgeable users or automatic processes |
Information Retrieval Systems
The data flow through an IR system:

- Text processor: turns raw docs from the corpus into tokenized docs.
- Indexer: builds the index (postings) from the tokenized docs.
- Query processor: translates the user query into a system query.
- Ranking procedure: runs the system query against the index and turns the retrieved docs into a ranked list of retrieved docs for the user.
Search Engines

A search engine has the same pipeline, with the Web in place of a fixed corpus, plus two components:

- Crawler: fetches raw docs from the Web into a repository.
- Global analyzer: processes the whole repository (e.g., link information) before indexing.

The rest is as before: text processor → indexer → index, and query processor → ranking procedure → ranked retrieved docs.
Classical IR vs. Web IR

| | Classical IR | Web IR |
|---|---|---|
| Volume | Large | Huge |
| Data quality | Clean, no dups | Noisy, dups |
| Data change rate | Infrequent | In flux |
| Data accessibility | Accessible | Partially accessible |
| Format diversity | Homogeneous | Widely diverse |
| Documents | Text | Hypertext |
| # of matches | Small | Large |
| IR techniques | Content-based | Link-based |
Outline
- Abstract formulation
- Models for relevance ranking
- Retrieval evaluation
- Query languages
- Text processing
- Indexing and searching
Abstract Formulation

Ingredients:
- D: document collection
- Q: query space
- f: D × Q → R: relevance scoring function
- For every q in Q, f induces a ranking (partial order) ≤q on D

Functions of an IR system:
- Preprocess D and create an index I
- Given q in Q, use I to produce a permutation π on D

Goals:
- Accuracy: π should be "close" to ≤q
- Compactness: index should be compact
- Response time: answers should be given quickly
Document Representation
T = { t1, …, tk }: a "token space" (a.k.a. "feature space" or "term space")
- Ex: all words in English
- Ex: phrases, URLs, …

A document: a real vector d in R^k
- di: "weight" of token ti in d
- Ex: di = normalized # of occurrences of ti in d
Classic IR (Relevance) Models
The Boolean model The Vector Space Model (VSM)
The Boolean Model
A document: a boolean vector d in {0,1}^k
- di = 1 iff ti belongs to d

A query: a boolean formula q over tokens, q: {0,1}^k → {0,1}
- Ex: "Michael Jordan" AND (NOT basketball)
- Ex: +"Michael Jordan" -basketball
Relevance scoring function: f(d,q) = q(d)
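A minimal sketch of the Boolean model (not from the lecture; the token space and example documents are hypothetical):

```python
# Boolean model: documents are 0/1 vectors over a token space,
# a query is a boolean function over those vectors, and f(d,q) = q(d).

TOKENS = ["michael", "jordan", "basketball", "learning"]

def to_vector(doc_tokens):
    """Boolean vector d in {0,1}^k: d_i = 1 iff token t_i occurs in the doc."""
    return [1 if t in doc_tokens else 0 for t in TOKENS]

def query(d):
    """The query +"Michael Jordan" -basketball, evaluated on vector d."""
    i = TOKENS.index
    return bool(d[i("michael")] and d[i("jordan")] and not d[i("basketball")])

d1 = to_vector({"michael", "jordan", "learning"})    # ML-researcher page
d2 = to_vector({"michael", "jordan", "basketball"})  # NBA page

# f(d, q) = q(d): 1 = relevant, 0 = not (no finer granularity).
scores = [int(query(d1)), int(query(d2))]
```

Note how the score is all-or-nothing, which is exactly the coarseness criticized on the next slide.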
The Boolean Model: Pros & Cons
Advantages: simplicity for users.

Disadvantages: relevance scoring is too coarse.
The Vector Space Model (VSM)
A document: a real vector d in R^k
- di = weight of ti in d (usually TF-IDF score)

A query: a real vector q in R^k
- qi = weight of ti in q
Relevance scoring function: f(d,q) = sim(d,q)
“similarity” between d and q
Popular Similarity Measures
L1 or L2 distance: d and q are first normalized to have unit norm; the relevant quantity is the distance ||d − q||.

Cosine similarity: the cosine of the angle between d and q, sim(d,q) = (d · q) / (||d|| ||q||).
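Both measures can be sketched directly on plain vectors (a hypothetical illustration; the example vectors are made up):

```python
import math

def l2_distance(d, q):
    """L2 distance after normalizing both vectors to unit norm."""
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return math.sqrt(sum((x / nd - y / nq) ** 2 for x, y in zip(d, q)))

def cosine_sim(d, q):
    """Cosine of the angle between d and q: (d . q) / (|d| |q|)."""
    dot = sum(x * y for x, y in zip(d, q))
    nd = math.sqrt(sum(x * x for x in d))
    nq = math.sqrt(sum(x * x for x in q))
    return dot / (nd * nq)

d = [1.0, 2.0, 0.0]
q = [2.0, 4.0, 0.0]  # same direction as d
```

For vectors pointing in the same direction, the cosine similarity is 1 and the normalized L2 distance is 0, so the two measures agree on what "most similar" means.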
TF-IDF Score: Motivation
Motivating principle: a term ti is relevant to a document d if:
- ti occurs many times in d relative to other terms that occur in d
- ti occurs many times in d relative to its number of occurrences in other documents

Examples:
- 10 out of 100 terms in d are "java" — likely relevant
- 10 out of 10,000 terms in d are "java" — much weaker evidence
- 10 out of 100 terms in d are "the" — not informative, since "the" is frequent in every document
TF-IDF Score: Definition
- n(d,ti) = # of occurrences of ti in d
- N = Σi n(d,ti) (# of tokens in d)
- Di = # of documents containing ti
- D = # of documents in the collection

TF(d,ti): "Term Frequency"
- Ex: TF(d,ti) = n(d,ti) / N
- Ex: TF(d,ti) = n(d,ti) / (maxj { n(d,tj) })

IDF(ti): "Inverse Document Frequency"
- Ex: IDF(ti) = log(D/Di)

TFIDF(d,ti) = TF(d,ti) × IDF(ti)
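The definitions above, using TF = n(d,t)/N and IDF = log(D/Di), on a toy corpus (the corpus is hypothetical, chosen so "the" appears everywhere):

```python
import math
from collections import Counter

corpus = [
    ["the", "java", "code", "java"],
    ["the", "java", "island"],
    ["the", "cat", "sat"],
]

def tf(doc, term):
    """TF(d, t) = n(d, t) / N, where N is the number of tokens in d."""
    return Counter(doc)[term] / len(doc)

def idf(corpus, term):
    """IDF(t) = log(D / D_i)."""
    D = len(corpus)
    Di = sum(1 for doc in corpus if term in doc)
    return math.log(D / Di)

def tfidf(corpus, doc, term):
    return tf(doc, term) * idf(corpus, term)
```

A term like "the", which occurs in every document, gets IDF = log(D/D) = 0, so its TF-IDF score is 0 no matter how often it appears — exactly the behavior the motivating examples call for.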
VSM: Pros & Cons
Advantages: better granularity in relevance scoring; good performance in practice; efficient implementations.

Disadvantages: assumes term independence.
Retrieval Evaluation

Notations:
- D: document collection
- Dq: documents in D that are "relevant" to query q (Ex: f(d,q) is above some threshold)
- Lq: list of results on query q

Recall: |Lq ∩ Dq| / |Dq|

Precision: |Lq ∩ Dq| / |Lq|
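The two measures are plain set arithmetic; here they are sketched in Python, using the doc ids from the lecture's own example:

```python
def recall(results, relevant):
    """|Lq ∩ Dq| / |Dq|: fraction of the relevant docs that were retrieved."""
    return len(set(results) & set(relevant)) / len(relevant)

def precision(results, relevant):
    """|Lq ∩ Dq| / |Lq|: fraction of the retrieved docs that are relevant."""
    return len(set(results) & set(relevant)) / len(results)

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9",
          "d511", "d129", "d187", "d25"]
```

List A retrieves 4 of the 5 relevant docs (d123, d56, d9, d25 but not d3) among 10 results, giving recall 80% and precision 40%, as on the next slide.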
Recall & Precision: Example
Relevant docs: d123, d56, d9, d25, d3

List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
- Recall(A) = 80%, Precision(A) = 40%

List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
- Recall(B) = 100%, Precision(B) = 50%
Precision@k and Recall@k
Notations:
- Dq: documents in D that are "relevant" to q
- Lq,k: top k results on the list

Recall@k: |Lq,k ∩ Dq| / |Dq|

Precision@k: |Lq,k ∩ Dq| / k
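Truncating the result list at k gives the cutoff versions of both measures (a sketch, reusing List A and the relevant set from the earlier example):

```python
def recall_at_k(results, relevant, k):
    """|Lq,k ∩ Dq| / |Dq|"""
    return len(set(results[:k]) & set(relevant)) / len(relevant)

def precision_at_k(results, relevant, k):
    """|Lq,k ∩ Dq| / k"""
    return len(set(results[:k]) & set(relevant)) / k

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9",
          "d511", "d129", "d187", "d25"]
```

At k = 3, List A's top results are d123, d84, d56, of which two are relevant: precision@3 = 2/3, recall@3 = 2/5.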
Precision@k: Example
List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5

[Chart: precision@k for k = 1..10, comparing List A and List B.]
Recall@k: Example
[Chart: recall@k for k = 1..10, comparing List A and List B.]

List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
“Interpolated” Precision
Notations:
- Dq: documents in D that are "relevant" to q
- r: a recall level (e.g., 20%)
- k(r): first k such that recall@k ≥ r

Interpolated precision at recall level r = max { precision@k : k ≥ k(r) }
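A sketch of this definition: walk down the list, and once recall has reached level r (i.e., k ≥ k(r)), take the maximum precision@k seen from that point on. List A and the relevant set are from the earlier example.

```python
def interpolated_precision(results, relevant, r):
    relevant = set(relevant)
    best, hits, reached = 0.0, 0, False
    for k, doc in enumerate(results, start=1):
        if doc in relevant:
            hits += 1
        if hits / len(relevant) >= r:   # recall@k has reached level r
            reached = True              # so k >= k(r) from here on
        if reached:
            best = max(best, hits / k)  # precision@k
    return best

relevant = {"d123", "d56", "d9", "d25", "d3"}
list_a = ["d123", "d84", "d56", "d6", "d8", "d9",
          "d511", "d129", "d187", "d25"]
```

For List A, recall reaches 20% at k = 1 (precision 100%), while recall reaches 80% only at k = 10 (precision 40%), illustrating the usual precision/recall trade-off.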
Precision vs. Recall: Example
[Chart: interpolated precision vs. recall (0–100%), comparing List A and List B.]

List A: d123, d84, d56, d6, d8, d9, d511, d129, d187, d25
List B: d81, d74, d56, d123, d511, d25, d9, d129, d3, d5
Query Languages: Keyword-Based

Single-word queries
- Ex: Michael Jordan machine learning

Context queries
- Phrases. Ex: "Michael Jordan" "machine learning"
- Proximity. Ex: "Michael Jordan" at distance of at most 10 words from "machine learning"

Boolean queries
- Ex: +"Michael Jordan" -basketball

Natural language queries
- Ex: "Get me pages about Michael Jordan, the machine learning expert."
Query Languages: Pattern Matching

Prefixes
- Ex: prefix:comput

Suffixes
- Ex: suffix:net

Regular expressions
- Ex: [0-9]+th world-wide web conference
Text Processing
Lexical analysis & tokenization
- Split text into words; downcase letters; filter out punctuation marks, digits, hyphens

Stopword elimination
- Better retrieval accuracy, more compact index
- Ex: "to be or not to be"

Stemming
- Ex: "computer", "computing", "computation" → comput

Index term selection
- Keywords vs. full text
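The steps above can be sketched as a tiny pipeline. The stopword list and the suffix-stripping "stemmer" here are toy placeholders (a real system would use a full stopword list and, e.g., the Porter stemmer):

```python
import re

STOPWORDS = {"to", "be", "or", "not", "the", "a", "of", "is", "at", "and"}

def stem(word):
    # Toy stemmer: strip a few common suffixes, longest first.
    for suffix in ("ation", "ing", "ers", "er"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def process(text):
    tokens = re.findall(r"[a-z]+", text.lower())        # lexical analysis
    tokens = [t for t in tokens if t not in STOPWORDS]  # stopword elimination
    return [stem(t) for t in tokens]                    # stemming
```

On the slide's examples: "to be or not to be" is eliminated entirely, and "computers", "computing", "computation" all collapse to the stem "comput".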
Inverted Index
d1: Michael(1) Jordan(2), the(3) author(4) of(5) "graphical(6) models(7)", is(8) a(9) professor(10) at(11) U.C.(12) Berkeley(13).

d2: The(1) famous(2) NBA(3) legend(4) Michael(5) Jordan(6) liked(7) to(8) date(9) models(10).

(Numbers in parentheses mark word positions.)
author: (d1,4)
berkeley: (d1,13)
date: (d2,9)
famous: (d2, 2)
graphical: (d1,6)
jordan: (d1,2), (d2,6)
legend: (d2,4)
like: (d2,7)
michael: (d1,1), (d2,5)
model: (d1,7), (d2,10)
nba: (d2,3)
professor: (d1,10)
uc: (d1,12)
The sorted terms form the vocabulary; each term's list of (doc, position) pairs is its postings list.
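Building such an index is a single pass over the documents. This sketch records (doc id, position) pairs per term; unlike the slide, it skips stopword removal and stemming, so e.g. "models" is not reduced to "model":

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to its postings list of (doc id, position) pairs."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.split(), start=1):
            index[term].append((doc_id, pos))
    return index

docs = {
    "d1": "michael jordan the author of graphical models"
          " is a professor at uc berkeley",
    "d2": "the famous nba legend michael jordan liked to date models",
}
index = build_index(docs)
```

Because documents are scanned in order, each postings list comes out sorted by doc id, which the search algorithms below rely on.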
Inverted Index Structure
Vocabulary file: term1, term2, … — usually fits in main memory.

Postings file: postings list 1, postings list 2, … — stored on disk.
Searching an Inverted Index
Given: t1, t2: query terms; L1, L2: corresponding posting lists.

Need: a ranked list of the docs in the intersection of L1 and L2.

Solution 1: If L1 and L2 are comparable in size, "merge" L1 and L2 to find docs in their intersection, and then order them by rank. (Running time: O(|L1| + |L2|).)

Solution 2: If L1 is considerably shorter than L2, binary-search each posting of L1 in L2 to find the intersection, and then order the results by rank. (Running time: O(|L1| × log|L2|).)
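Both strategies can be sketched on sorted lists of doc ids (ranking omitted; the example lists are hypothetical):

```python
import bisect

def merge_intersect(l1, l2):
    """Solution 1: linear merge of two sorted lists, O(|L1| + |L2|)."""
    out, i, j = [], 0, 0
    while i < len(l1) and j < len(l2):
        if l1[i] == l2[j]:
            out.append(l1[i])
            i += 1
            j += 1
        elif l1[i] < l2[j]:
            i += 1
        else:
            j += 1
    return out

def binary_search_intersect(short, long):
    """Solution 2: binary-search each posting of the short list, O(|L1| log |L2|)."""
    out = []
    for x in short:
        k = bisect.bisect_left(long, x)
        if k < len(long) and long[k] == x:
            out.append(x)
    return out
```

The crossover point between the two depends on the length ratio: roughly, Solution 2 wins when |L1| log|L2| < |L1| + |L2|.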
Search Optimization
Improvement: order the docs in each posting list by static rank (e.g., PageRank). Then the top matches can be output without scanning the whole lists.
Index Construction
1. Given a stream of documents, store (did, tid, pos) triplets in a file.
2. Sort and group the file by tid.
3. Extract posting lists.
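The three steps above, in miniature (here the triplet "file" is an in-memory list and the sort is Python's built-in sort; a real system would use an external, on-disk sort):

```python
from itertools import groupby

def build_from_stream(docs):
    # Step 1: emit (doc id, term id, position) triplets.
    triplets = [
        (doc_id, term, pos)
        for doc_id, text in docs.items()
        for pos, term in enumerate(text.split(), start=1)
    ]
    # Step 2: sort by term (then doc id and position).
    triplets.sort(key=lambda t: (t[1], t[0], t[2]))
    # Step 3: group each term's triplets into its postings list.
    return {
        term: [(d, p) for d, _, p in group]
        for term, group in groupby(triplets, key=lambda t: t[1])
    }
```

Sorting by term is what turns a document-ordered stream into term-ordered posting lists.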
Index Maintenance
Naïve updates of an inverted index can be very costly:
- They require random access.
- A single change may cause many insertions/deletions.

Batch updates use two indices:
- Main index (created in batch, large, compressed)
- "Stop-press" index (incremental, small, uncompressed)
If a page d is inserted/deleted, the "signed" postings (did, tid, pos, I/D) are added to the stop-press index.

Given a query term t, fetch its list Lt from the main index, and two lists Lt,+ (insertions) and Lt,- (deletions) from the stop-press index.

The result is (Lt ∪ Lt,+) \ Lt,-.

When the stop-press index grows too large, it is merged into the main index.
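The query-time combination is simple set arithmetic (a sketch over bare doc ids; real postings would carry positions too):

```python
def current_postings(l_main, l_plus, l_minus):
    """(Lt ∪ Lt,+) \ Lt,-: main-index postings, plus insertions,
    minus deletions recorded in the stop-press index."""
    return sorted((set(l_main) | set(l_plus)) - set(l_minus))
```

So queries see a consistent view without the main index being touched; only the periodic merge pays the cost of rewriting it.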
Index Compression
Delta compression: store each doc id in a postings list as the gap from the previous doc id.
- Saves a lot for popular terms.
- Doesn't save much for rare terms (but these don't take much space anyway).

Ex:
michael: (1000007,5), (1000009,12), (1000013,77), (1000035,88), …
is stored as
michael: (1000007,5), (2,12), (4,77), (22,88), …
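Encoding and decoding the gaps is a running difference and a running sum; here applied to the doc ids from the "michael" example:

```python
def delta_encode(doc_ids):
    """First id verbatim, then each id as a gap from its predecessor."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def delta_decode(gaps):
    """Prefix sums recover the original doc ids."""
    out, total = [], 0
    for g in gaps:
        total += g
        out.append(total)
    return out

ids = [1000007, 1000009, 1000013, 1000035]
```

The gaps (2, 4, 22) are small numbers even though the doc ids are seven digits, which is what makes the variable-length encodings on the next slide pay off.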
Variable Length Encodings
How to encode gaps succinctly?

Option 1: Fixed-length binary encoding.
- Effective when all gap lengths are equally likely.
- No savings over storing doc ids.

Option 2: Unary encoding.
- Gap x is encoded by x−1 1's followed by a 0.
- Effective when large gaps are very rare (Pr(x) = 1/2^x).

Option 3: Gamma encoding.
- Gap x is encoded by (ℓx, bx), where bx is the binary encoding of x and ℓx is the length of bx, encoded in unary.
- Encoding length: about 2 log(x).
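Options 2 and 3 can be sketched as bit strings. This gamma variant keeps the full binary string bx, matching the ~2 log(x) length above (the classic Elias gamma code additionally drops the leading 1 bit of bx):

```python
def unary(x):
    """Gap x as (x - 1) ones followed by a zero."""
    return "1" * (x - 1) + "0"

def gamma(x):
    """Unary length of the binary encoding, followed by the binary encoding."""
    b = bin(x)[2:]            # binary encoding b_x of x
    return unary(len(b)) + b  # l_x in unary, then b_x
```

For x = 5, bx = "101" (3 bits), so the code is "110" + "101": 6 bits, versus 4 bits in unary; for large x the gamma code's ~2 log(x) bits beat unary's x bits by an exponential margin.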
End of Lecture 2