Cache-Conscious Performance Optimization for
Similarity Search
Maha Alabduljalil, Xun Tang, Tao Yang
Department of Computer Science
University of California at Santa Barbara
36th ACM SIGIR International Conference on Information Retrieval (SIGIR 2013)
All Pairs Similarity Search (APSS)
• Definition: finding all pairs of objects whose similarity is above a given threshold:
  Sim(di, dj) = cos(di, dj) ≥ τ
• Application examples:
  • Collaborative filtering.
  • Spam and near-duplicate detection.
  • Image search.
  • Query suggestions.
• Motivation: APSS is still time-consuming for large datasets.
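To make the definition concrete, here is a minimal brute-force APSS in Python (illustrative only; it is O(n²) per pair of partitions, which is exactly what the filtering and partitioning techniques on the following slides attack):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {feature: weight} dicts."""
    if len(u) > len(v):          # iterate over the smaller vector
        u, v = v, u
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0

def all_pairs(docs, tau):
    """Return all pairs (i, j) with cos(di, dj) >= tau (brute force)."""
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            if cosine(docs[i], docs[j]) >= tau:
                pairs.append((i, j))
    return pairs
```

For example, with three documents where the first two are identical and the third shares no features, only the pair (0, 1) clears a threshold of 0.9.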
Previous Work
• Approaches to speed up APSS:
  Exact APSS:
  – Dynamic computation filtering. [Bayardo et al., WWW'07]
  – Inverted indexing. [Arasu et al., VLDB'06]
  – Parallelization with MapReduce. [Lin, SIGIR'09]
  – Partition-based similarity comparison. [Alabduljalil et al., WSDM'13]
  Approximate APSS via LSH: trades off precision against recall and adds redundant computation.
• Approaches that utilize the memory hierarchy:
  – General query processing. [Manegold, VLDB'02]
  – Other computing problems.
Baseline: Partition-based Similarity Search (PSS) [WSDM'13]
[Figure: partitioning with dissimilarity detection, followed by similarity comparison with parallel tasks.]
PSS Task
Memory areas: S = vectors owned, B = other vectors, C = temporary.
Task steps:
  Read the assigned partition into area S.
  Repeat:
    Read some vectors vi from other partitions into B.
    Compare the vectors vi with S.
    Output similar vector pairs.
  Until all other potentially similar vectors have been compared.
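The steps above can be sketched as follows (a hypothetical rendering of the loop, not the authors' Hadoop implementation; vectors are assumed L2-normalized so a dot product equals cosine similarity):

```python
def pss_task(S, other_partitions, tau, block_size=100):
    """One PSS task: compare the owned partition S against streamed-in vectors.
    Vectors are {feature: weight} dicts, assumed unit-length."""
    results = []
    for partition in other_partitions:                 # "Repeat ... Until" loop
        for start in range(0, len(partition), block_size):
            B = partition[start:start + block_size]    # read some vectors into B
            for j, dj in enumerate(B):                 # compare vi with S
                for i, di in enumerate(S):
                    sim = sum(w * dj.get(t, 0.0) for t, w in di.items())
                    if sim >= tau:
                        results.append((i, start + j, sim))  # output similar pairs
    return results
```

The `block_size` parameter stands in for how many "other" vectors are read into B at a time; the cache-conscious variants below tune exactly this kind of knob.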
Focus and Contribution
• Contribution: analyze the memory-hierarchy behavior of PSS tasks; new data layout/traversal techniques for speedup:
  ① Splitting data blocks to fit the cache.
  ② Coalescing: read a block of vectors from other partitions and process them together.
• Algorithms:
  Baseline: PSS [WSDM'13]
  Cache-conscious designs: PSS1 & PSS2
PROBLEM 1: PSS area S is too big to fit in cache
[Figure: memory areas S (inverted index of owned vectors), B (other vectors), and C (accumulator for S); S is too long to fit in cache.]
PSS1: Cache-conscious data splitting
[Figure: after splitting, S is divided into splits S1, S2, …, Sq; each split Si is compared against B using its own accumulator C. What split size?]
PSS1 Task
Read S and divide it into many splits.
Read other vectors into B.
For each split Sx:
  Compare(Sx, B)
Output similarity scores.

Compare(Sx, B):
  for di in Sx:
    for dj in B:
      Sim(di,dj) += wi,t * wj,t
      if sim(di,dj) + maxw_di * sum_dj < τ then skip dj
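A Python sketch of Compare(Sx, B) with the dynamic filter (illustrative; reading the slide's bound as "partial score plus maxw of di times the remaining weight sum of dj", which safely upper-bounds the final score when weights are non-negative):

```python
def compare(Sx, B, tau):
    """Compare one split Sx against buffer B, pruning pairs whose
    partial score can no longer reach tau."""
    scores = {}
    for i, di in enumerate(Sx):
        maxw_di = max(di.values())
        for j, dj in enumerate(B):
            sim = 0.0
            remaining = sum(dj.values())     # weight mass of dj not yet visited
            for t, wjt in dj.items():
                remaining -= wjt
                if t in di:
                    sim += di[t] * wjt
                # even if di matched every remaining feature at maxw_di,
                # the pair cannot reach tau: give up on dj
                if sim + maxw_di * remaining < tau:
                    sim = None
                    break
            if sim is not None and sim >= tau:
                scores[(i, j)] = sim
    return scores
```

The pruning check is what lets the inner loop abandon dissimilar pairs early instead of accumulating their full dot product.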
Modeling Memory/Cache Access of PSS1
[Figure: memory areas Si, B, and C accessed by the inner loop:
  Sim(di,dj) += wi,t * wj,t
  if sim(di,dj) + maxw_di * sum_dj < τ then skip]
Total number of data accesses:
  D0 = D0(Si) + D0(B) + D0(C)
Cache misses and data access time
Memory and cache access counts:
  D0: total memory data accesses.
  D1: missed accesses at L1.
  D2: missed accesses at L2.
  D3: missed accesses at L3.
Memory and cache access time:
  δi: access time at cache level i; δmem: access time in memory.
Total data access time
  = (D0 − D1)δ1 + (D1 − D2)δ2 + (D2 − D3)δ3 + D3 δmem
Total data access time, term by term:
  Data found in L1, (D0 − D1)δ1: ~2 cycles per access.
  Data found in L2, (D1 − D2)δ2: 6–10 cycles.
  Data found in L3, (D2 − D3)δ3: 30–40 cycles.
  Data found in memory, D3 δmem: 100–300 cycles.
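The model is simple enough to turn into a one-line calculator; the default latencies below are midpoints of the per-level cycle ranges quoted above, and are assumptions rather than measured values:

```python
def data_access_time(D0, D1, D2, D3, d1=2, d2=8, d3=35, dmem=200):
    """Total data access time in cycles per the slide's model:
    (D0-D1)*delta1 + (D1-D2)*delta2 + (D2-D3)*delta3 + D3*delta_mem.
    Di = number of accesses that miss at cache level i."""
    return (D0 - D1) * d1 + (D1 - D2) * d2 + (D2 - D3) * d3 + D3 * dmem
```

For instance, 100 accesses with 10 L1 misses, 5 L2 misses, and 1 L3 miss cost 90·2 + 5·8 + 4·35 + 1·200 = 560 cycles; the single memory access alone accounts for over a third of the total, which is why keeping S in cache matters.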
Actual vs. Predicted
Avg. task time ≈ #features * (lookup + multiply + add) + access_mem
RECALL: Split size s
[Figure: splits S1, S2, …, Sq of size s, each compared against B with accumulator C.]
Ratio of Data Access to Computation
Avg. task time ≈ #features * (lookup + add + multiply) + access_mem
[Figure: data access time vs. computation time as a function of split size s.]
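To see why this ratio bends at the cache boundary, consider a toy model (all constants are assumed, not taken from the paper) in which accesses become drastically more expensive once a split no longer fits in cache:

```python
def access_to_compute_ratio(split_bytes, cache_bytes, hit_cycles=2,
                            mem_cycles=200, compute_cycles=3):
    """Toy model: while a split (plus accumulator) fits in cache, every
    access costs hit_cycles; once it spills, the fraction of accesses
    that fall outside the cached portion pays mem_cycles instead.
    Returns data-access cycles per compute cycle, per feature visited."""
    if split_bytes <= cache_bytes:
        access = hit_cycles
    else:
        spill = 1.0 - cache_bytes / split_bytes   # fraction of misses
        access = (1.0 - spill) * hit_cycles + spill * mem_cycles
    return access / compute_cycles
```

Under these numbers, the ratio stays below 1 for cache-resident splits and jumps by more than an order of magnitude once the split is twice the cache size, matching the qualitative shape of the curve on the slide.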
PSS2: Vector coalescing
• Issues:
  • PSS1 focuses on splitting S to fit into the cache.
  • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C.
• Solution: coalesce multiple vectors in B and process them together.
PSS2: Example of improved locality
[Figure: striped areas of Si, B, and C stay in cache while a coalesced group of B vectors is processed.]
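A sketch of the coalescing idea (illustrative; the names and structure are mine, not the PSS2 code): take groups of b vectors from B, build a small inverted index over each group, and traverse each di of Si once per group, so di's features are reused against all b vectors while the group is cache-resident:

```python
from collections import defaultdict

def compare_coalesced(Si, B, b, tau):
    """Compare split Si against B in coalesced groups of b vectors."""
    results = []
    for start in range(0, len(B), b):
        group = B[start:start + b]
        # size-controlled inverted index over just this group
        index = defaultdict(list)              # feature -> [(j, weight)]
        for j, dj in enumerate(group):
            for t, w in dj.items():
                index[t].append((j, w))
        for i, di in enumerate(Si):
            acc = [0.0] * len(group)           # accumulator C for this di
            for t, wit in di.items():
                for j, wjt in index.get(t, ()):
                    acc[j] += wit * wjt        # shared feature t visited once
            for j, s in enumerate(acc):
                if s >= tau:
                    results.append((i, start + j, s))
    return results
```

Because a feature shared by several vectors in the group is looked up once rather than once per vector, temporal locality in B and C improves as the average number of shared features grows.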
Evaluation
• Implementation: Hadoop MapReduce.
• Objectives:
  • Effectiveness of PSS1 and PSS2 over PSS.
  • Benefits of modeling.
• Datasets: Twitter, ClueWeb, Enron emails, Yahoo! Music, Google News.
• Preprocessing:
  • Stopword removal + df-cut.
  • Static partitioning for dissimilarity detection.
Improvement Ratio of PSS1, PSS2 over PSS
[Figure: improvement ratios across datasets; up to 2.7x.]
RECALL: Coalescing size b
[Figure: a coalesced group of b vectors from B processed against split Si with accumulator C; avg. # of sharing = 2.]
Average number of shared features
[Figure: average number of shared features vs. coalescing size b.]
Overall performance
[Figure: overall performance, ClueWeb.]
Impact of split size s in PSS1
[Figure: ClueWeb and Emails datasets.]
RECALL: split size s & coalescing size b
[Figure: split Si of size s compared with a coalesced group of b vectors from B, using accumulator C.]
Effect of s & b on PSS2 performance (Twitter)
[Figure: task time over combinations of s and b; fastest setting highlighted.]
Conclusions
• Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1).
• Coalescing vectors with size-controlled inverted indexing improves the temporal locality of visited data (PSS2).
• Cost modeling of memory-hierarchy access guides the choice of parameter settings.
• Experiments show the cache-conscious designs run up to 2.74x as fast as the cache-oblivious baseline.