Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang...

Cache-Conscious Performance Optimization for Similarity Search Maha Alabduljalil, Xun Tang, Tao Yang Department of Computer Science University of California at Santa Barbara 36 th ACM International Conference on Information Retrieval


Cache-Conscious Performance Optimization for Similarity Search

Maha Alabduljalil, Xun Tang, Tao Yang
Department of Computer Science
University of California at Santa Barbara

36th ACM International Conference on Information Retrieval


All Pairs Similarity Search (APSS)

• Definition: Finding all pairs of objects whose similarity is at or above a given threshold τ:

$\mathrm{Sim}(d_i, d_j) = \cos(d_i, d_j) \ge \tau$

• Application examples:
  • Collaborative filtering.
  • Spam and near-duplicate detection.
  • Image search.
  • Query suggestions.

• Motivation: APSS is still time-consuming for large datasets.
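The definition above can be illustrated with a brute-force sketch. It is not the method of the talk, only the quadratic baseline that APSS techniques try to beat. The sparse `{feature: weight}` representation and the names `cosine`, `all_pairs`, and `docs` are assumptions made for this example.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def all_pairs(vectors, tau):
    """Return the (i, j) index pairs, i < j, with cosine similarity >= tau."""
    return [(i, j)
            for (i, u), (j, v) in combinations(enumerate(vectors), 2)
            if cosine(u, v) >= tau]

# Hypothetical toy data: the first two documents are near duplicates.
docs = [
    {"cache": 1.0, "search": 1.0},
    {"cache": 1.0, "search": 1.0, "index": 0.1},
    {"music": 1.0},
]
similar = all_pairs(docs, tau=0.9)
```

Every pair is compared here, so the cost grows quadratically with the number of vectors, which is the motivation stated on the slide.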


Previous Work

• Approaches to speed up APSS:
  Exact APSS:
  – Dynamic computation filtering [Bayardo et al. WWW’07]
  – Inverted indexing [Arasu et al. VLDB’06]
  – Parallelization with MapReduce [Lin SIGIR’09]
  – Partition-based similarity comparison [Maha WSDM’13]
  Approximate APSS via LSH: trades off precision against recall and adds redundant computation.

• Approaches that utilize the memory hierarchy:
  General query processing [Manegold VLDB’02]
  Other computing problems.


Baseline: Partition-based Similarity Search (PSS) [WSDM’13]

• Partitioning with dissimilarity detection.
• Similarity comparison with parallel tasks.


PSS Task

Memory areas: S = vectors owned, B = other vectors, C = temporary.

Task steps:

Read assigned partition into area S.
Repeat
    Read some vectors vi from other partitions
    Compare vi with S
    Output similar vector pairs
Until all other potentially similar vectors are compared.
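The task steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' Hadoop implementation: vectors are `{feature: weight}` dicts assumed to be unit-normalized (so the dot product equals the cosine), and `pss_task`, `dot`, and `batch_size` are hypothetical names.

```python
def dot(u, v):
    """Dot product of two sparse vectors stored as {feature: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u
    return sum(w * v[t] for t, w in u.items() if t in v)

def pss_task(owned, other_partitions, tau, batch_size=2):
    """One PSS task: compare the owned partition with vectors of other partitions."""
    S = dict(owned)                                   # area S: vectors owned by this task
    results = []
    for partition in other_partitions:
        items = list(partition.items())
        for start in range(0, len(items), batch_size):
            B = items[start:start + batch_size]       # area B: some other vectors
            for j, vj in B:
                # area C: temporary scores of vj against every vector in S
                C = {i: dot(vi, vj) for i, vi in S.items()}
                results.extend((i, j, s) for i, s in C.items() if s >= tau)
    return results

# Hypothetical toy partitions with unit-length vectors.
owned = {"a": {"x": 1.0}, "b": {"y": 1.0}}
others = [{"c": {"x": 0.8, "y": 0.6}}, {"d": {"y": 1.0}}]
pairs = [(i, j) for i, j, _ in pss_task(owned, others, tau=0.7)]
```

The sketch keeps all of S resident for the whole task, which is the property the later slides identify as a problem when S does not fit in cache.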


Focus and Contribution

• Contribution:
  Analyze memory hierarchy behavior in PSS tasks.
  New data layout/traversal techniques for speedup:
  ① Splitting data blocks to fit cache.
  ② Coalescing: read a block of vectors from other partitions and process them together.

• Algorithms:
  Baseline: PSS [WSDM’13]
  Cache-conscious designs: PSS1 & PSS2


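PROBLEM 1: PSS area S is too big to fit in cache

[Figure: memory layout of a PSS task. Area S holds the inverted index of the owned vectors, area B holds other vectors, and area C holds the accumulator for S. Annotation: "Too long to fit in cache!"]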


PSS1: Cache-conscious data splitting

[Figure: after splitting, area S is divided into splits S1, S2, …, Sq, shown alongside area B and area C, where C holds the accumulator for Si. Open question: what split size?]


PSS1 Task

Read S and divide into many splits
Read other vectors into B
For each split Sx
    Compare(Sx, B)
Output similarity scores

Compare(Sx, B)

…
for di in Sx
    for dj in B
        Sim(di, dj) += wi,t * wj,t
        if (Sim(di, dj) + maxw(di) * sum(dj) < τ) then …
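The PSS1 loop structure can be sketched as follows. Several points are assumptions made for illustration: weights are non-negative, `sum(dj)` in the pruning test is read as the sum of the weights of `dj` not yet visited, and the action after `then` (elided on the slide) is taken to be "stop comparing this pair". The names `split_partition`, `compare`, and `pss1_task` are hypothetical, and Python itself does not expose cache behavior, so only the traversal order is shown.

```python
def split_partition(S, s):
    """Divide the owned vectors S, a list of (id, vector), into splits of at most s vectors."""
    return [S[k:k + s] for k in range(0, len(S), s)]

def compare(Sx, B, tau):
    """Compare one split Sx with the vectors in B."""
    out = []
    for i, di in Sx:
        maxw_di = max(di.values(), default=0.0)
        for j, dj in B:
            score = 0.0
            remaining = sum(dj.values())            # weights of dj not yet visited
            pruned = False
            for t, w_jt in dj.items():
                score += di.get(t, 0.0) * w_jt      # Sim(di, dj) += w_i,t * w_j,t
                remaining = max(0.0, remaining - w_jt)
                # score + maxw(di) * remaining is an upper bound on the final score
                if score + maxw_di * remaining < tau:
                    pruned = True
                    break
            if not pruned and score >= tau:
                out.append((i, j, score))
    return out

def pss1_task(S, B, tau, s):
    """Process S one split at a time so that each split can stay cache resident."""
    results = []
    for Sx in split_partition(S, s):
        results.extend(compare(Sx, B, tau))
    return results

# Hypothetical toy data with unit-length vectors.
S = [("a", {"x": 1.0}), ("b", {"y": 1.0}), ("c", {"x": 0.6, "y": 0.8})]
B = [("p", {"x": 0.8, "y": 0.6})]
hits = pss1_task(S, B, tau=0.7, s=2)
```

The split size `s` changes only the order in which pairs are visited, not the result, which is why it can be tuned purely for cache fit.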


Modeling Memory/Cache Access of PSS1

Memory areas touched by the inner loop: area Si, area B, area C.

Sim(di, dj) += wi,t * wj,t
if (Sim(di, dj) + maxw(di) * sum(dj) < τ) then …

Total number of data accesses:

$D_0 = D_0(S_i) + D_0(B) + D_0(C)$


Cache misses and data access time

Memory and cache access counts:
• $D_0$: total memory data accesses.
• $D_1$: missed accesses at L1.
• $D_2$: missed accesses at L2.
• $D_3$: missed accesses at L3.

Memory and cache access time:
• $\delta_i$: access time at cache level $i$.
• $\delta_{\mathrm{mem}}$: access time in memory.

Total data access time:

$(D_0 - D_1)\delta_1 + (D_1 - D_2)\delta_2 + (D_2 - D_3)\delta_3 + D_3\,\delta_{\mathrm{mem}}$
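The cost model is easy to evaluate numerically. The sketch below is a direct transcription of the formula; the function name `total_access_time` and the access counts in the example are hypothetical, and the latencies are picked from inside the per-level cycle ranges quoted elsewhere in this deck.

```python
def total_access_time(d0, d1, d2, d3, delta1, delta2, delta3, delta_mem):
    """(D0-D1)*delta1 + (D1-D2)*delta2 + (D2-D3)*delta3 + D3*delta_mem, in cycles."""
    if not d0 >= d1 >= d2 >= d3 >= 0:
        raise ValueError("miss counts must satisfy D0 >= D1 >= D2 >= D3 >= 0")
    return ((d0 - d1) * delta1      # accesses served by L1
            + (d1 - d2) * delta2    # L1 misses served by L2
            + (d2 - d3) * delta3    # L2 misses served by L3
            + d3 * delta_mem)       # L3 misses served by main memory

# Hypothetical counts: 1000 accesses, 100 miss L1, 20 miss L2, 5 miss L3.
cycles = total_access_time(1000, 100, 20, 5,
                           delta1=2, delta2=8, delta3=35, delta_mem=200)
```

With these made-up numbers the 5 accesses that reach memory cost 1000 cycles, more than half of what the 900 L1 hits cost, which is why reducing $D_3$ matters so much.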


Total data access time

$(D_0 - D_1)\delta_1 + (D_1 - D_2)\delta_2 + (D_2 - D_3)\delta_3 + D_3\,\delta_{\mathrm{mem}}$

Cost per access, by where the data is found:
• Data found in L1: ~2 cycles.
• Data found in L2: 6-10 cycles.
• Data found in L3: 30-40 cycles.
• Data found in memory: 100-300 cycles.


Actual vs. Predicted

$\text{Avg. task time} \approx \#\text{features} \times (\text{lookup} + \text{multiply} + \text{add}) + \text{access}_{\mathrm{mem}}$


RECALL: Split size s

[Figure: area S divided into splits S1, S2, …, Sq, each of split size s, shown with area B and the accumulator for Si in area C.]


Ratio of Data Access to Computation

$\text{Avg. task time} \approx \#\text{features} \times (\text{lookup} + \text{add} + \text{multiply}) + \text{access}_{\mathrm{mem}}$

[Figure: data access and computation curves plotted against split size s.]


PSS2: Vector coalescing

• Issues:
  • PSS1 focused on splitting S to fit into cache.
  • PSS1 does not consider cache reuse to improve temporal locality in memory areas B and C.

• Solution: coalescing multiple vectors in B.
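One way to picture coalescing is sketched below: `b` vectors of B are grouped into a small inverted index, so that a posting list of the split is read once and reused for every vector in the block that shares the feature. This is an illustration under assumptions, not the authors' code; the names `build_inverted_index` and `compare_coalesced` are hypothetical, and the pruning test is left out to keep the traversal visible.

```python
from collections import defaultdict

def build_inverted_index(vectors):
    """Map each feature t to its postings, a list of (vector id, weight)."""
    index = defaultdict(list)
    for vid, vec in vectors:
        for t, w in vec.items():
            index[t].append((vid, w))
    return index

def compare_coalesced(Sx, B, tau, b):
    """Compare split Sx with B, processing b vectors of B at a time."""
    s_index = build_inverted_index(Sx)
    out = []
    for start in range(0, len(B), b):
        block = B[start:start + b]
        b_index = build_inverted_index(block)   # index over only b vectors
        C = defaultdict(float)                  # area C: accumulator for (Sx, block) pairs
        for t, b_postings in b_index.items():
            s_postings = s_index.get(t)
            if not s_postings:
                continue
            # The posting list of t in Sx is reused for every block vector containing t.
            for i, w_it in s_postings:
                for j, w_jt in b_postings:
                    C[(i, j)] += w_it * w_jt
        out.extend((i, j, score) for (i, j), score in C.items() if score >= tau)
    return out

# Hypothetical toy data with unit-length vectors.
Sx = [("a", {"x": 1.0}), ("b", {"y": 1.0}), ("c", {"x": 0.6, "y": 0.8})]
B = [("p", {"x": 0.8, "y": 0.6}), ("q", {"y": 1.0})]
found = sorted((i, j) for i, j, _ in compare_coalesced(Sx, B, tau=0.7, b=2))
```

In this toy block the feature `y` is shared by both `p` and `q`, so its posting list in `Sx` is traversed once per block rather than once per vector.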


PSS2: Example for improved locality

[Figure: areas Si, C and B; the striped areas are in cache.]


Evaluation

• Implementation: Hadoop MapReduce.

• Objectives:
  • Effectiveness of PSS1, PSS2 over PSS.
  • Benefits of modeling.

• Datasets:
  • Twitter, Clueweb, Enron emails, YahooMusic, Google news.

• Preprocessing:
  • Stopword removal + df-cut.
  • Static partitioning for dissimilarity detection.


Improvement Ratio of PSS1, PSS2 over PSS

[Figure: improvement ratio of PSS1 and PSS2 over PSS; annotated value: 2.7x.]


RECALL: coalescing size b

[Figure: areas Si, C and B, with b vectors of B coalesced. Annotation: avg. # of sharing = 2.]


Average number of shared features


Overall performance

Clueweb


Impact of split size s in PSS1

[Figure panels: Clueweb, Twitter, Emails.]


RECALL: split size s & coalescing size b

[Figure: areas Si, C and B, annotated with split size s and coalescing size b.]


Effect of s & b on PSS2 performance (Twitter)

[Figure: PSS2 performance on Twitter for combinations of s and b, with the fastest setting marked.]


Conclusions

• Splitting hosted partitions to fit into cache reduces slow memory data accesses (PSS1).

• Coalescing vectors with size-controlled inverted indexing can improve the temporal locality of visited data (PSS2).

• Cost modeling of memory hierarchy access guides parameter setting.

• Experiments show the cache-conscious design can be up to 2.74x as fast as the cache-oblivious baseline.