Algorithms for Large Data Sets: Ziv Bar-Yossef, Lecture 6, May 7, 2006
1
Algorithms for Large Data Sets
Ziv Bar-Yossef
Lecture 6
May 7, 2006
http://www.ee.technion.ac.il/courses/049011
2
Principal Eigenvector Computation
E: an n × n matrix
|λ1| > |λ2| ≥ |λ3| ≥ … ≥ |λn|: the eigenvalues of E
Suppose λ1 > 0
v1,…,vn: corresponding eigenvectors; the eigenvectors form a basis
Suppose ||v1||2 = 1
Input: the matrix E and a unit vector u that is not in span(v2,…,vn)
Goal: compute λ1 and v1
3
The Power Method
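The algorithm itself appeared as a figure on this slide. Below is a minimal numpy sketch of standard power iteration under the assumptions of the previous slide (|λ1| > |λ2|, start vector u not in span(v2,…,vn)); the function name and stopping tolerance are illustrative:

```python
import numpy as np

def power_method(E, u, tol=1e-10, max_iters=10_000):
    """Estimate the principal eigenvalue lambda_1 and eigenvector v_1 of E."""
    w = u / np.linalg.norm(u)
    for _ in range(max_iters):
        w_next = E @ w
        w_next /= np.linalg.norm(w_next)    # re-normalize every iteration
        # w converges only up to sign (w -> +-v_1), so test both signs.
        if min(np.linalg.norm(w_next - w), np.linalg.norm(w_next + w)) < tol:
            w = w_next
            break
        w = w_next
    lam = w @ (E @ w)                       # Rayleigh quotient estimate of lambda_1
    return lam, w
```

Usage: `lam, v = power_method(E, np.random.rand(n))`. The two-sided stopping test reflects the theorem on the next slide: the iterate converges to ±v1, not necessarily to v1 itself.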
4
Why Does It Work?
Theorem: As t → ∞, the iterate w_t → ±v1.
• Convergence rate: proportional to (|λ2|/|λ1|)^t.
• The larger the "spectral gap" |λ1| − |λ2|, the faster the convergence.
5
Spectral Methods in
Information Retrieval
6
Outline
Motivation: synonymy and polysemy
Latent Semantic Indexing (LSI)
Singular Value Decomposition (SVD)
LSI via SVD
Why does LSI work?
HITS and SVD
7
Synonymy and Polysemy
Synonymy: multiple terms with (almost) the same meaning. Ex: cars, autos, vehicles. Harms recall.
Polysemy: a single term with multiple meanings. Ex: java (programming language, coffee, island). Harms precision.
8
Traditional Solutions
Query expansion
Synonymy: OR on all synonyms
  Manual/automatic use of thesauri
  Too few synonyms: recall still low
  Too many synonyms: harms precision
Polysemy: AND on the term and additional specializing terms
  Ex: +java +"programming language"
  Too-broad terms: precision still low
  Too-narrow terms: harms recall
9
Syntactic Indexing
D: document collection, |D| = n
T: term space, |T| = m
A[t,d]: "weight" of term t in document d (e.g., TF-IDF)
A^T A: pairwise document similarities
A A^T: pairwise term similarities
[Figure: A is an m × n matrix; rows correspond to terms, columns to documents.]
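To make the roles of A^T A and A A^T concrete, here is a toy example in numpy (the matrix values are invented for illustration):

```python
import numpy as np

# Toy term-document matrix A (m = 4 terms, n = 3 documents);
# the entries stand in for TF-IDF weights A[t, d].
A = np.array([
    [2.0, 0.0, 1.0],   # "car"
    [1.0, 0.0, 2.0],   # "auto"
    [0.0, 3.0, 0.0],   # "java"
    [0.0, 1.0, 0.0],   # "island"
])

doc_sim  = A.T @ A   # n x n matrix of pairwise document similarities
term_sim = A @ A.T   # m x m matrix of pairwise term similarities
```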
10
Latent Semantic Indexing (LSI) [Deerwester et al. 1990]
C: concept space, |C| = r
Documents and queries: "mixtures" of concepts
Given a query, find the most similar documents
Bridges the syntax-semantics gap
[Figure: B is an r × n matrix; rows correspond to concepts, columns to documents.]
11
Fourier Transform
Time domain: a signal is a weighted sum of base signals, e.g., signal = 3 × (first base wave) + 1.1 × (second base wave).
Frequency domain: the signal is represented by its coefficients (3 and 1.1) at the corresponding frequencies.
Compact discrete representation
Effective for noise removal
12
Latent Semantic Indexing
Documents, queries ~ signals: vectors in R^m
Concepts ~ base signals: an orthonormal basis of span(columns of A)
Semantic indexing of a document ~ Fourier transform of a signal: the representation of the document in the concept basis
Advantages:
  Space-efficient
  Better handling of synonymy and polysemy
  Removal of "noise"
13
Open Questions
How to choose the concept basis?
How to transform the syntactic index into a semantic index?
How to filter out "noisy concepts"?
14
Singular Values
A: an m × n real matrix
Definition: σ ≥ 0 is a singular value of A if there exists a pair of unit vectors u, v s.t. Av = σu and A^T u = σv.
u and v are called singular vectors.
Ex: σ = ||A||2 = max over ||x||2 = 1 of ||Ax||2. Corresponding singular vectors: the x that maximizes ||Ax||2 and y = Ax / ||A||2.
Note: A^T A v = σ²v and A A^T u = σ²u
σ² is an eigenvalue of both A^T A and A A^T
v is an eigenvector of A^T A; u is an eigenvector of A A^T
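A quick numerical check of these identities, using numpy's SVD to obtain one singular pair (a sketch; any small matrix works):

```python
import numpy as np

A = np.random.default_rng(0).random((4, 3))
U, S, Vt = np.linalg.svd(A, full_matrices=False)
sigma, u, v = S[0], U[:, 0], Vt[0]

assert np.allclose(A @ v, sigma * u)            # A v = sigma u
assert np.allclose(A.T @ u, sigma * v)          # A^T u = sigma v
assert np.allclose(A.T @ A @ v, sigma**2 * v)   # v: eigenvector of A^T A
assert np.allclose(A @ A.T @ u, sigma**2 * u)   # u: eigenvector of A A^T
assert np.isclose(sigma, np.linalg.norm(A, 2))  # sigma_1 = ||A||_2
```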
15
Singular Value Decomposition (SVD)
Theorem: For every m × n real matrix A there exists a singular value decomposition
A = U Σ V^T
σ1 ≥ … ≥ σr > 0 (r = rank(A)): the singular values of A
Σ = Diag(σ1,…,σr)
U: column-orthonormal m × r matrix (U^T U = I)
V: column-orthonormal n × r matrix (V^T V = I)
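numpy's thin SVD matches this statement directly (for a generic random matrix, r = min(m, n)); a small sanity-check sketch:

```python
import numpy as np

A = np.random.default_rng(1).random((5, 3))        # m = 5, n = 3
U, S, Vt = np.linalg.svd(A, full_matrices=False)    # thin SVD

assert np.all(S[:-1] >= S[1:]) and np.all(S > 0)    # sigma_1 >= ... >= sigma_r > 0
assert np.allclose(U.T @ U, np.eye(U.shape[1]))     # U is column-orthonormal
assert np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0]))  # V is column-orthonormal
assert np.allclose(A, U @ np.diag(S) @ Vt)          # A = U Sigma V^T
```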
16
Singular Values vs. Eigenvalues
A = U Σ V^T
σ1,…,σr: the singular values of A
σ1²,…,σr²: the non-zero eigenvalues of A^T A and A A^T
u1,…,ur: the columns of U
  Orthonormal basis for span(columns of A)
  Left singular vectors of A
  Eigenvectors of A A^T
v1,…,vr: the columns of V
  Orthonormal basis for span(rows of A)
  Right singular vectors of A
  Eigenvectors of A^T A
17
LSI as SVD
A = U Σ V^T  ⇒  U^T A = Σ V^T
u1,…,ur: the concept basis
B = Σ V^T: the LSI matrix (semantic index)
A_d: the d-th column of A; B_d: the d-th column of B
B_d = U^T A_d, so B_d[c] = u_c^T A_d
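A direct numpy rendering of this slide (a sketch; `lsi_index` is a hypothetical helper name):

```python
import numpy as np

def lsi_index(A):
    """Return the concept basis U and the semantic index B = U^T A = Sigma V^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    B = U.T @ A                              # B[c, d] = u_c^T A_d
    assert np.allclose(B, np.diag(S) @ Vt)   # same matrix, computed two ways
    return U, B
```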
18
Noisy Concepts
B = U^T A = Σ V^T
B_d[c] = σ_c · v_d[c]
If σ_c is small, then B_d[c] is small for all d
k = the largest i s.t. σ_i is "large"
For all c = k+1,…,r and all d, c is a low-weight concept in d
Main idea: filter out all concepts c = k+1,…,r
  Space-efficient: # of index terms = k (vs. r or m)
  Better retrieval: noisy concepts are filtered out across the board
19
Low-rank SVD
B = U^T A = Σ V^T
U_k = (u1,…,uk)
V_k = (v1,…,vk)
Σ_k = the upper-left k × k sub-matrix of Σ
A_k = U_k Σ_k V_k^T
B_k = Σ_k V_k^T
rank(A_k) = rank(B_k) = k
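In numpy, the truncation is just slicing the thin SVD (`low_rank_lsi` is an illustrative name). Note that B_k preserves exactly the pairwise similarities that A_k defines, which is the point of the next slide:

```python
import numpy as np

def low_rank_lsi(A, k):
    """Rank-k approximation A_k = U_k Sigma_k V_k^T and index B_k = Sigma_k V_k^T."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    Uk, Sk, Vkt = U[:, :k], np.diag(S[:k]), Vt[:k, :]
    Ak = Uk @ Sk @ Vkt
    Bk = Sk @ Vkt                    # k-dimensional document representations
    # Similarities agree: A_k^T A_k = V_k Sigma_k^2 V_k^T = B_k^T B_k
    assert np.allclose(Ak.T @ Ak, Bk.T @ Bk)
    return Ak, Bk
```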
20
Low Dimensional Embedding
Theorem: If σ_{k+1} is small, then for "most" pairs d, d′, the similarity (A_k)_d · (A_k)_d′ is close to A_d · A_d′.
A_k therefore preserves pairwise similarities among documents at least as well as A does, for retrieval purposes.
21
Why is LSI Better? [Papadimitriou et al. 1998] [Azar et al. 2001]
LSI summary:
  Documents are embedded in a low-dimensional space (m → k)
  Pairwise similarities are preserved
  More space-efficient
But why is retrieval better?
  Synonymy
  Polysemy
22
Generative Model
A corpus model M = (T, C, W, D)
  T: term space, |T| = m
  C: concept space, |C| = k
    A concept is a distribution over terms
  W: topic space
    A topic is a distribution over concepts
  D: document distribution
    A distribution over W × N
A document d is generated as follows (see the sketch below):
  Sample a topic w and a length n according to D
  Repeat n times:
    Sample a concept c from C according to w
    Sample a term t from T according to c
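A Python sketch of this sampling process (all parameter names are illustrative; D is factored here into independent topic and length distributions, which is a simplifying assumption):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_document(topic_dist, length_dist, topics, concepts):
    """Generate one document under the model M = (T, C, W, D).

    topics:   |W| x k matrix; row w is topic w's distribution over concepts.
    concepts: k x m matrix; row c is concept c's distribution over terms.
    topic_dist, length_dist: the two marginals of D (an assumed factoring).
    """
    w = rng.choice(len(topic_dist), p=topic_dist)       # sample a topic w
    n = rng.choice(len(length_dist), p=length_dist)     # sample a length n
    doc = []
    for _ in range(n):
        c = rng.choice(concepts.shape[0], p=topics[w])  # concept c ~ w
        doc.append(rng.choice(concepts.shape[1], p=concepts[c]))  # term t ~ c
    return w, doc
```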
23
Simplifying Assumptions
Every document has a single topic (W = C)
For every two concepts c, c′: ||c − c′|| ≥ 1 − ε
The probability of every term under a concept c is at most some constant δ.
24
LSI Works
A: an m × n term-document matrix representing n documents generated according to the model
Theorem [Papadimitriou et al. 1998]: With high probability, for every two documents d, d′:
  If topic(d) = topic(d′), then (A_k)_d ∥ (A_k)_d′
  If topic(d) ≠ topic(d′), then (A_k)_d ⊥ (A_k)_d′
25
Proof
For simplicity, assume ε = 0. We want to show:
  If topic(d) = topic(d′), then (A_k)_d ∥ (A_k)_d′
  If topic(d) ≠ topic(d′), then (A_k)_d ⊥ (A_k)_d′
D_c: the documents whose topic is the concept c
T_c: the terms in supp(c)
Since ||c − c′|| = 1, T_c ∩ T_c′ = Ø
A has non-zero entries only in blocks B_1,…,B_k, where B_c is the sub-matrix of A with rows in T_c and columns in D_c
A^T A is a block-diagonal matrix with blocks B_1^T B_1,…,B_k^T B_k
The (i,j)-th entry of B_c^T B_c is the term similarity between the i-th and j-th documents whose topic is the concept c
B_c^T B_c is the adjacency matrix of a bipartite (multi-)graph G_c on D_c
26
Proof (cont.)
G_c is a "random" graph
The first and second eigenvalues of B_c^T B_c are well separated
For all c, c′, the second eigenvalue of B_c^T B_c is smaller than the first eigenvalue of B_c′^T B_c′
Hence the top k eigenvalues of A^T A are the principal eigenvalues of B_c^T B_c for c = 1,…,k
Let u1,…,uk be the corresponding eigenvectors
For every document d on topic c, A_d is orthogonal to all of u1,…,uk except u_c
Hence (A_k)_d is a scalar multiple of u_c
27
Extensions [Azar et al. 2001]
A more general generative model
Also explains the improved treatment of polysemy
28
Computing SVD
Compute the singular values of A by computing the eigenvalues of A^T A
Compute U, V by computing the eigenvectors of A A^T and A^T A
Running time is not great: O(m²n + mn²); not practical for huge corpora
Sub-linear-time algorithms for estimating A_k [Frieze, Kannan, Vempala 1998]
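A naive numpy sketch of this route (`svd_via_eig` is an illustrative name); it makes the slide's point tangible but, as noted, is impractical for huge corpora:

```python
import numpy as np

def svd_via_eig(A):
    """SVD of A through the eigendecomposition of A^T A."""
    evals, V = np.linalg.eigh(A.T @ A)           # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]              # re-sort descending
    evals, V = evals[order], V[:, order]
    sigmas = np.sqrt(np.clip(evals, 0.0, None))  # sigma_i = sqrt(lambda_i)
    keep = sigmas > 1e-12                        # drop the null space
    sigmas, V = sigmas[keep], V[:, keep]
    U = (A @ V) / sigmas                         # u_i = A v_i / sigma_i
    return U, sigmas, V.T
```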
29
HITS and SVD
A: the adjacency matrix of a web (sub-)graph G
a: authority vector; h: hub vector
a is the principal eigenvector of A^T A
h is the principal eigenvector of A A^T
Therefore a and h give A_1, the rank-1 SVD of A
Generalization: using A_k, we can get k authority and hub vectors, corresponding to other topics in G (see the sketch below).
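A sketch of this generalization in numpy (`hits_via_svd` is an illustrative name); k = 1 recovers classic HITS:

```python
import numpy as np

def hits_via_svd(A, k=1):
    """Top-k authority/hub vectors from the SVD of the adjacency matrix A.

    With k = 1: a = v_1 (principal eigenvector of A^T A),
                h = u_1 (principal eigenvector of A A^T).
    """
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    authorities = Vt[:k, :].T   # columns: one authority vector per topic
    hubs = U[:, :k]             # columns: one hub vector per topic
    return authorities, hubs
```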
30
End of Lecture 6