Finding Similar Items
Course: Big Data Processing Professor: Amir Averbuch
Student: Nave Frost 16/11/2014
Based on: Mining of Massive Datasets, Jure Leskovec, Anand Rajaraman, Jeffrey D. Ullman
Word Count
• Problem: Given a set of Strings, count how many times each String appears.
• Naive Solution: Compare each String with all the other Strings.
• Solution: Hash each String and compare only Strings with the same hash value.
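The hashing solution is exactly what a Python dict does internally; a minimal sketch (not from the lecture):

```python
def word_count(strings):
    """Count occurrences of each string.

    A Python dict is a hash table: each string is hashed and compared
    only against strings in the same bucket, instead of against every
    other string in the input.
    """
    counts = {}
    for s in strings:
        counts[s] = counts.get(s, 0) + 1
    return counts

print(word_count(["a", "b", "a", "a"]))  # {'a': 3, 'b': 1}
```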
Similar Sets
Problem
Given a set of Sets, find all pairs of similar sets.
Applications
• Near-duplicate Web pages
– Plagiarism
– Mirror pages
• Collaborative filtering
Jaccard Similarity
• Definition
The Jaccard similarity of sets 𝑆 and 𝑇 is:
𝑆𝐼𝑀(𝑆, 𝑇) = |𝑆 ∩ 𝑇| / |𝑆 ∪ 𝑇|
Jaccard Similarity
• Example: 𝑆 = {𝑎, 𝑏, 𝑐}, 𝑇 = {𝑎, 𝑏, 𝑑, 𝑒}
𝑆𝐼𝑀(𝑆, 𝑇) = |𝑆 ∩ 𝑇| / |𝑆 ∪ 𝑇| = |{𝑎, 𝑏}| / |{𝑎, 𝑏, 𝑐, 𝑑, 𝑒}| = 2/5
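Jaccard similarity maps directly onto Python's set operations; a short sketch of the example:

```python
def jaccard_similarity(s, t):
    """SIM(S, T) = |S intersect T| / |S union T|."""
    return len(s & t) / len(s | t)

S = {"a", "b", "c"}
T = {"a", "b", "d", "e"}
print(jaccard_similarity(S, T))  # 0.4, i.e. 2/5
```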
Shingling
• Definition
Any substring of length 𝑘 of a document is called a 𝑘-shingle.
• Example
For 𝑘 = 2 and the String "𝑎𝑏𝑐𝑑𝑎𝑏𝑑",
the set of 2-shingles is {𝑎𝑏, 𝑏𝑐, 𝑐𝑑, 𝑑𝑎, 𝑏𝑑}.
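Extracting the shingle set is a one-liner over sliding windows; a sketch of the example above:

```python
def k_shingles(s, k):
    """Return the set of all length-k substrings of s."""
    return {s[i:i + k] for i in range(len(s) - k + 1)}

print(sorted(k_shingles("abcdabd", 2)))  # ['ab', 'bc', 'bd', 'cd', 'da']
```

Note that duplicates collapse automatically: "ab" occurs twice in the string but only once in the set.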
Shingle Size
Too small 𝑘: all documents will be similar.
Too large 𝑘: documents will be similar only to identical documents.
Shingle Size
𝑘 should be picked large enough that the probability of any given shingle appearing in any given document is low.
Shingle Size
Example
Let 𝐸 = {𝑒1, …, 𝑒𝑁} be a corpus of emails.
Assume each email contains only letters and a white-space character.
There will be 27^𝑘 possible shingles.
For each 𝑒𝑖: |𝑒𝑖| ≪ 27^5 = 14,348,907.
Hence, we would expect 𝑘 = 5 to work well.
Hashing Shingles
To reduce the size of the 𝑘-shingles,
use a hash function ℎ: 𝑘-shingles → {0, …, 2^32 − 1}
that maps strings of length 𝑘 to Integers.
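A minimal sketch using Python's built-in `hash()` masked to 32 bits as the hash function (an assumption; any hash to 4-byte integers works, and `hash()` is randomized per process):

```python
def hashed_shingles(s, k):
    """Map each k-shingle of s to a 32-bit integer bucket.

    hash() is Python's built-in string hash; masking with 0xFFFFFFFF
    keeps only the low 32 bits, i.e. a value in [0, 2**32 - 1].
    """
    return {hash(s[i:i + k]) & 0xFFFFFFFF for i in range(len(s) - k + 1)}

buckets = hashed_shingles("abcdabd", 2)
```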
Signature
• Problem
Even if we hash the 𝑘-shingles to 4 bytes each, the space needed to store a set is still roughly four times the space taken by the document.
• Goal
1. Find a Signature, i.e., a smaller representation.
2. Compare the signatures of two sets to estimate the Jaccard similarity.
Matrix Representation
Columns : Documents
Rows : Elements
Example
𝑆1 = {𝑎, 𝑑}, 𝑆2 = {𝑐}, 𝑆3 = {𝑏, 𝑑, 𝑒}, 𝑆4 = {𝑎, 𝑐, 𝑑}.

     S1  S2  S3  S4
a     1   0   0   1
b     0   0   1   0
c     0   1   0   1
d     1   0   1   1
e     0   0   1   0
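The characteristic matrix can be built directly from the sets; a short sketch:

```python
# Rows are elements, columns are sets; entry is 1 iff the element
# belongs to the set.
sets = {"S1": {"a", "d"}, "S2": {"c"}, "S3": {"b", "d", "e"},
        "S4": {"a", "c", "d"}}
elements = ["a", "b", "c", "d", "e"]

matrix = {e: [1 if e in sets[s] else 0 for s in ("S1", "S2", "S3", "S4")]
          for e in elements}
for e in elements:
    print(e, matrix[e])
# a [1, 0, 0, 1]
# b [0, 0, 1, 0]
# c [0, 1, 0, 1]
# d [1, 0, 1, 1]
# e [0, 0, 1, 0]
```

In practice the matrix is extremely sparse and would be stored as the sets themselves, not as a dense array.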
Minhashing
Pick a permutation of the rows.
The minhash value of a column is the first row, in the permuted order, in which the column has a 1.
Minhashing
Example
Permutation ℎ: 𝑏 𝑒 𝑎 𝑑 𝑐
ℎ(𝑆1) = 𝑎, ℎ(𝑆2) = 𝑐, ℎ(𝑆3) = 𝑏, ℎ(𝑆4) = 𝑎

     S1  S2  S3  S4
a     1   0   0   1
b     0   0   1   0
c     0   1   0   1
d     1   0   1   1
e     0   0   1   0
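The minhash of a column under the permutation 𝑏 𝑒 𝑎 𝑑 𝑐 can be sketched directly:

```python
def minhash(column_set, permutation):
    """First row, in permuted order, in which the column has a 1."""
    for element in permutation:
        if element in column_set:
            return element
    return None  # empty set: no row has a 1

perm = ["b", "e", "a", "d", "c"]
print(minhash({"a", "d"}, perm))       # S1 -> 'a'
print(minhash({"c"}, perm))            # S2 -> 'c'
print(minhash({"b", "d", "e"}, perm))  # S3 -> 'b'
print(minhash({"a", "c", "d"}, perm))  # S4 -> 'a'
```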
Minhashing
• Theorem: Pr_ℎ[ℎ(𝑆1) = ℎ(𝑆2)] = 𝑆𝐼𝑀(𝑆1, 𝑆2)
• Proof: Classify the rows:
𝑋: rows that have 1 in both columns.
𝑌: rows that have 1 in one of the columns and 0 in the other.
𝑍: rows that have 0 in both columns.
Denote 𝑥 = |𝑋|, 𝑦 = |𝑌|.
𝑆𝐼𝑀(𝑆1, 𝑆2) = |𝑆1 ∩ 𝑆2| / |𝑆1 ∪ 𝑆2| = 𝑥 / (𝑥 + 𝑦)
Minhashing
• Going down the permuted rows, type-𝑍 rows can be skipped. The probability that we shall meet a type-𝑋 row before we meet a type-𝑌 row is 𝑥 / (𝑥 + 𝑦); in that case ℎ(𝑆1) = ℎ(𝑆2).
• If we meet a type-𝑌 row before we meet a type-𝑋 row, then ℎ(𝑆1) ≠ ℎ(𝑆2).
Hence, Pr_ℎ[ℎ(𝑆1) = ℎ(𝑆2)] = 𝑥 / (𝑥 + 𝑦) = 𝑆𝐼𝑀(𝑆1, 𝑆2).
Locality Sensitive Hashing
Generate from the collection of all elements (signatures in our example) a small list of candidate pairs: pairs of elements whose similarity must be evaluated.
Signature Matrix
For two sets 𝑆1 and 𝑆2 such that 𝑆𝐼𝑀(𝑆1, 𝑆2) = 0.8,
the probability that ℎ(𝑆1) ≠ ℎ(𝑆2) is 0.2.
We will have 𝑛 permutations ℎ1, …, ℎ𝑛 and will build a
Signature Matrix 𝑀
such that 𝑀[𝑖, 𝑗] = ℎ𝑖(𝑆𝑗).

Example (rows 𝑎, …, 𝑒 numbered 0, …, 4):
ℎ1(𝑥) = (𝑥 + 1) mod 5, ℎ2(𝑥) = (3𝑥 + 1) mod 5

     S1  S2  S3  S4
h1    1   3   0   1
h2    0   2   0   0
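Rather than materialize permutations, the signature matrix is computed with the standard one-pass simulation: initialize every entry to infinity, then for each row that has a 1 in a column, lower that column's entries to the hash values of the row index. A sketch for the example:

```python
# Rows a..e are numbered 0..4; sets hold the row numbers of their 1s.
sets = [{0, 3}, {2}, {1, 3, 4}, {0, 2, 3}]  # S1, S2, S3, S4
hashes = [lambda x: (x + 1) % 5, lambda x: (3 * x + 1) % 5]

INF = float("inf")
sig = [[INF] * len(sets) for _ in hashes]
for row in range(5):
    for j, s in enumerate(sets):
        if row in s:                      # column j has a 1 in this row
            for i, h in enumerate(hashes):
                sig[i][j] = min(sig[i][j], h(row))

print(sig)  # [[1, 3, 0, 1], [0, 2, 0, 0]]
```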
Candidate Generation
• Pick a similarity threshold 0 < 𝑡 < 1.
• We want a pair of columns 𝑐 and 𝑑 of the signature matrix 𝑀 to be a candidate pair if and only if their signatures agree in at least a fraction 𝑡 of the rows.
Partition into Bands
• Divide matrix 𝑀 into 𝑏 bands of 𝑟 rows each.
• For each band, hash its portion of each column to a hash table with 𝑘 buckets.
• Candidate column pairs are those that hash to the same bucket for ≥ 1 band.
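A sketch of the banding scheme, using a Python dict keyed by the band's tuple in place of an explicit 𝑘-bucket hash table (a simplifying assumption):

```python
from collections import defaultdict
from itertools import combinations

def candidate_pairs(signatures, b, r):
    """signatures: list of signature columns, each of length b * r.

    For each band, the band's portion of each column is used as a
    bucket key; columns sharing a bucket in >= 1 band become a
    candidate pair.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for col, sig in enumerate(signatures):
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(col)
        for cols in buckets.values():
            candidates.update(combinations(cols, 2))
    return candidates

# Columns 0 and 1 agree on band 0 only; column 2 matches nothing.
sigs = [[1, 2, 3, 4], [1, 2, 9, 9], [5, 6, 7, 8]]
print(candidate_pairs(sigs, b=2, r=2))  # {(0, 1)}
```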
Let 𝑆1 and 𝑆2 be a pair of documents with 𝑆𝐼𝑀(𝑆1, 𝑆2) = 𝑠:
1. The probability that the signatures agree in all rows of one particular band is 𝑠^𝑟.
2. The probability that the signatures do not agree in at least one row of a particular band is 1 − 𝑠^𝑟.
3. The probability that the signatures do not agree in all rows of any of the bands is (1 − 𝑠^𝑟)^𝑏.
4. The probability that the signatures agree in all the rows of at least one band, and therefore become a candidate pair, is 1 − (1 − 𝑠^𝑟)^𝑏.
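The four steps collapse to a single formula; a sketch (the function name is mine):

```python
def candidate_probability(s, r, b):
    """Probability that a pair with similarity s becomes a candidate:
    1 - (1 - s**r)**b."""
    return 1 - (1 - s ** r) ** b

# 20 bands of 5 rows each (the setting of the example that follows):
print(round(candidate_probability(0.8, 5, 20), 5))  # 0.99964
print(round(candidate_probability(0.4, 5, 20), 3))  # 0.186
```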
Analysis of Banding
Example
• Suppose 𝑛 = 100, divided into 20 bands of 5 rows each.
• Let 𝑆1 and 𝑆2 be 80% similar.
– Probability 𝑆1, 𝑆2 are identical in one particular band: 0.8^5 = 0.328
– Probability 𝑆1, 𝑆2 are not identical in any of the 20 bands: (1 − 0.328)^20 = 0.00035
• Let 𝑆1 and 𝑆2 be 40% similar.
– Probability 𝑆1, 𝑆2 are identical in one particular band: 0.4^5 = 0.01
– Probability 𝑆1, 𝑆2 are identical in at least one of the 20 bands: ≤ 20 · 0.01 = 0.2
Analysis of Banding
(Figure: the S-curve of 1 − (1 − 𝑠^𝑟)^𝑏 as a function of the similarity 𝑠.)
Distance Measures
Definition
A distance measure 𝑑(𝑥, 𝑦) takes two points in space and produces a real number, and satisfies the following axioms:
1. 𝑑(𝑥, 𝑦) ≥ 0
2. 𝑑(𝑥, 𝑦) = 0 if and only if 𝑥 = 𝑦
3. 𝑑(𝑥, 𝑦) = 𝑑(𝑦, 𝑥)
4. 𝑑(𝑥, 𝑦) ≤ 𝑑(𝑥, 𝑧) + 𝑑(𝑧, 𝑦)
𝐿𝑟-norm
Definition: 𝑑(𝑥, 𝑦) = (Σ_{𝑖=1}^{𝑛} |𝑥𝑖 − 𝑦𝑖|^𝑟)^{1/𝑟}
Interesting Cases
• 𝑳𝟏-norm: Manhattan distance, 𝑑(𝑥, 𝑦) = Σ_{𝑖=1}^{𝑛} |𝑥𝑖 − 𝑦𝑖|
• 𝑳𝟐-norm: Euclidean distance, 𝑑(𝑥, 𝑦) = √(Σ_{𝑖=1}^{𝑛} (𝑥𝑖 − 𝑦𝑖)²)
• 𝑳∞-norm: Max distance, 𝑑(𝑥, 𝑦) = max𝑖 |𝑥𝑖 − 𝑦𝑖|
𝐿𝑟-norm
Example (the two points differ by 4 in one coordinate and by 3 in the other; the original figure is lost):
• 𝑳𝟏-norm: 𝑑(𝑥, 𝑦) = 4 + 3 = 7
• 𝑳𝟐-norm: 𝑑(𝑥, 𝑦) = √(4² + 3²) = 5
• 𝑳∞-norm: 𝑑(𝑥, 𝑦) = max(4, 3) = 4
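A sketch of the 𝐿𝑟-norm, checked against the example; the points (0, 0) and (4, 3) are my choice to realize coordinate differences of 4 and 3:

```python
def l_r_distance(x, y, r):
    """L_r-norm distance: (sum |x_i - y_i|**r) ** (1/r)."""
    return sum(abs(a - b) ** r for a, b in zip(x, y)) ** (1 / r)

x, y = (0, 0), (4, 3)
print(l_r_distance(x, y, 1))                   # L1 (Manhattan): 7.0
print(l_r_distance(x, y, 2))                   # L2 (Euclidean): 5.0
print(max(abs(a - b) for a, b in zip(x, y)))   # L-infinity (max): 4
```

The 𝐿∞-norm is the limit of the 𝐿𝑟-norm as 𝑟 → ∞, which is why it is computed as a plain max rather than through the formula.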
Jaccard Distance (Sets)
• Definition: 𝑑(𝑆𝑖, 𝑆𝑗) = 1 − 𝑆𝐼𝑀(𝑆𝑖, 𝑆𝑗)
• Example: 𝑆 = {𝑎, 𝑏, 𝑐}, 𝑇 = {𝑎, 𝑏, 𝑑, 𝑒}
𝑑(𝑆, 𝑇) = 1 − 𝑆𝐼𝑀(𝑆, 𝑇) = 1 − 2/5 = 3/5
Cosine Distance (Vectors)
• Definition: 𝑑(𝑣𝑖, 𝑣𝑗) = the angle between the vectors
• Example: (the original example was a figure of two vectors and the angle between them)
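The angle comes out of the dot product; a minimal sketch, with the two example vectors being my own choice since the slide's figure is lost:

```python
import math

def cosine_distance_degrees(u, v):
    """Angle between vectors u and v, in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # clamp to [-1, 1] to guard against floating-point drift
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

print(cosine_distance_degrees((1, 0), (0, 1)))  # 90.0 (orthogonal)
print(cosine_distance_degrees((1, 1), (2, 2)))  # ~0 (same direction)
```

Note that scaling a vector does not change the distance, which is why cosine distance suits data where only direction matters.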
Edit Distance (Strings)
• Definition: 𝑑(𝑆𝑡𝑟𝑖, 𝑆𝑡𝑟𝑗) = the number of inserts and deletes needed to change one string into the other
• Example: 𝑑("kitten", "sitting") = 5
Delete k at 0; Insert s at 0; Delete e at 4; Insert i at 4; Insert g at 6.
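Because only inserts and deletes are allowed (no substitutions), this edit distance equals |𝑎| + |𝑏| − 2·LCS(𝑎, 𝑏), where LCS is the longest common subsequence; a sketch:

```python
def edit_distance(a, b):
    """Insert/delete-only edit distance: |a| + |b| - 2 * LCS(a, b)."""
    # dp[i][j] = length of the longest common subsequence of a[:i], b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return len(a) + len(b) - 2 * dp[len(a)][len(b)]

print(edit_distance("kitten", "sitting"))  # 5
```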
Hamming Distance (Bit Vectors)
• Definition: 𝑑(𝑣𝑖, 𝑣𝑗) = the number of positions in which they differ
• Example:
𝑣1 = 1 1 0 1 0 0 1
𝑣2 = 1 0 0 1 0 1 1
𝑑(𝑣1, 𝑣2) = 2
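A one-line sketch of the definition, checked on the example vectors:

```python
def hamming_distance(v1, v2):
    """Number of positions in which two equal-length vectors differ."""
    return sum(a != b for a, b in zip(v1, v2))

v1 = [1, 1, 0, 1, 0, 0, 1]
v2 = [1, 0, 0, 1, 0, 1, 1]
print(hamming_distance(v1, v2))  # 2
```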
Locality-Sensitive Functions
Definition:
Let 𝑑1 < 𝑑2 be two distances according to some distance measure 𝑑.
A family of functions 𝐻 is said to be (𝑑1, 𝑑2, 𝑝1, 𝑝2)-sensitive if for every 𝑓 ∈ 𝐻:
1. If 𝑑(𝑥, 𝑦) ≤ 𝑑1, then Pr[𝑓(𝑥) = 𝑓(𝑦)] ≥ 𝑝1.
2. If 𝑑(𝑥, 𝑦) ≥ 𝑑2, then Pr[𝑓(𝑥) = 𝑓(𝑦)] ≤ 𝑝2.
Locality-Sensitive Functions
Example:
The family of minhash functions is a (𝑑1, 𝑑2, 1 − 𝑑1, 1 − 𝑑2)-sensitive family for
any 𝑑1 and 𝑑2, where 0 ≤ 𝑑1 < 𝑑2 ≤ 1.
Recall that:
Pr_ℎ[ℎ(𝑆1) = ℎ(𝑆2)] = 𝑆𝐼𝑀(𝑆1, 𝑆2) = 1 − 𝑑(𝑆1, 𝑆2)
where 𝑑 is the Jaccard distance.
Improvements
Goal:
Given 𝐻: (𝑑1, 𝑑2, 𝑝1, 𝑝2)-sensitive,
generate 𝐻′: (𝑑1, 𝑑2, 𝑝1′, 𝑝2′)-sensitive
with 𝑝1′ ≈ 1 and 𝑝2′ ≈ 0.
AND Construction
• Theorem: Given 𝐻 is (𝑑1, 𝑑2, 𝑝1, 𝑝2)-sensitive, we can generate 𝐻′ that is (𝑑1, 𝑑2, 𝑝1^𝑟, 𝑝2^𝑟)-sensitive.
• Proof: Each ℎ ∈ 𝐻′ consists of 𝑟 functions from 𝐻. For ℎ = {ℎ1, …, ℎ𝑟} in 𝐻′, ℎ(𝑥) = ℎ(𝑦) if and only if ℎ𝑖(𝑥) = ℎ𝑖(𝑦) for all 𝑖. Since the ℎ𝑖 are chosen independently, the probabilities multiply.
OR Construction
• Theorem: Given 𝐻 is (𝑑1, 𝑑2, 𝑝1, 𝑝2)-sensitive, we can generate 𝐻′ that is (𝑑1, 𝑑2, 1 − (1 − 𝑝1)^𝑏, 1 − (1 − 𝑝2)^𝑏)-sensitive.
• Proof: Each ℎ ∈ 𝐻′ consists of 𝑏 functions from 𝐻. For ℎ = {ℎ1, …, ℎ𝑏} in 𝐻′, ℎ(𝑥) = ℎ(𝑦) if and only if ℎ𝑖(𝑥) = ℎ𝑖(𝑦) for some 𝑖.
Composing Constructions
We can cascade AND and OR constructions in any order to make 𝑝2 close to 0 and 𝑝1 close to 1.
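The two constructions act on probabilities as simple functions that can be composed in any order; a sketch (the function names are mine):

```python
def and_amplify(p, r):
    """AND-construction: all r functions must agree."""
    return p ** r

def or_amplify(p, b):
    """OR-construction: at least one of b functions agrees."""
    return 1 - (1 - p) ** b

# AND with r = 3, then OR with b = 5:
print(round(or_amplify(and_amplify(0.2, 3), 5), 3))  # 0.039
print(round(or_amplify(and_amplify(0.8, 3), 5), 3))  # 0.972
```

Low probabilities are pushed toward 0 and high probabilities toward 1, which is exactly the sharpening of the S-curve that cascading aims for.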
Composing Constructions
Example:
Start with 𝐻; apply an AND-construction with 𝑟 = 3 to get 𝐻1; apply an OR-construction with 𝑏 = 5 to get 𝐻2.
Each member of 𝐻2 is built from 15 members of 𝐻.

 p    | p³    | 1 − (1 − p³)⁵
 0.2  | 0.008 | 0.039
 0.3  | 0.027 | 0.127
 0.4  | 0.064 | 0.282
 0.5  | 0.125 | 0.4874
 0.6  | 0.216 | 0.704
 0.7  | 0.343 | 0.878
 0.8  | 0.512 | 0.972
 0.9  | 0.729 | 0.999
Composing Constructions
Example:
Start with 𝐻; apply an OR-construction with 𝑏 = 3 to get 𝐻1; apply an AND-construction with 𝑟 = 5 to get 𝐻2.
Each member of 𝐻2 is built from 15 members of 𝐻.

 p    | 1 − (1 − p)³ | (1 − (1 − p)³)⁵
 0.2  | 0.488        | 0.028
 0.3  | 0.657        | 0.122
 0.4  | 0.784        | 0.296
 0.5  | 0.875        | 0.513
 0.6  | 0.936        | 0.718
 0.7  | 0.973        | 0.872
 0.8  | 0.992        | 0.961
 0.9  | 0.999        | 0.995
LSH For Hamming Distance
ℎ(𝑥, 𝑦): the Hamming distance between vectors 𝑥 and 𝑦 in 𝑑-dimensional space.
Define: 𝑓𝑖(𝑥) = 𝑥[𝑖]
Hence, 𝑓𝑖(𝑥) = 𝑓𝑖(𝑦) if and only if 𝑥[𝑖] = 𝑦[𝑖].
Pr[𝑓𝑖(𝑥) = 𝑓𝑖(𝑦)] = 1 − ℎ(𝑥, 𝑦)/𝑑
{𝑓1, …, 𝑓𝑑} is a (𝑑1, 𝑑2, 1 − 𝑑1/𝑑, 1 − 𝑑2/𝑑)-sensitive family.
LSH For Cosine Distance
Random Hyperplanes
To pick a random hyperplane, we pick a random vector 𝑣.
The hyperplane is then the set of points whose dot product with 𝑣 is 0.
𝑓v 𝑥 = 𝑆𝑖𝑔𝑛(𝑣 ∙ 𝑥)
LSH For Cosine Distance
𝑓𝑣(𝑥) = 𝑓𝑣(𝑦)
⟺ 𝑆𝑖𝑔𝑛(𝑣 ∙ 𝑥) = 𝑆𝑖𝑔𝑛(𝑣 ∙ 𝑦)
⟺ 𝑥 and 𝑦 are on the same side of the hyperplane
The family is (𝑑1, 𝑑2, (180 − 𝑑1)/180, (180 − 𝑑2)/180)-sensitive.
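A sketch of the random-hyperplane family: each random vector 𝑣 contributes one bit, 𝑆𝑖𝑔𝑛(𝑣 ∙ 𝑥). The slides do not say how 𝑣 is drawn; Gaussian components are one common choice, assumed here:

```python
import random

def make_sketch(n_planes, dim, seed=0):
    """Return a function mapping a vector to n_planes sign bits."""
    rng = random.Random(seed)
    # one random normal vector v per hyperplane (Gaussian components)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_planes)]

    def sketch(x):
        # bit i is True iff x lies on the non-negative side of plane i
        return tuple(sum(vi * xi for vi, xi in zip(v, x)) >= 0
                     for v in planes)
    return sketch

sketch = make_sketch(16, 2)
# parallel vectors fall on the same side of every hyperplane:
print(sketch((1.0, 1.0)) == sketch((3.0, 3.0)))  # True
```

Vectors at a small angle disagree on few bits, so the fraction of differing bits estimates the angle, matching the (180 − 𝑑)/180 collision probability above.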
Thanks
Any Questions?