
Billion-scale Approximate Nearest Neighbor Search

ACM Multimedia 2020 Tutorial on Effective and Efficient: Toward Open-world Instance Re-identification

Yusuke Matsui, The University of Tokyo

Search ๐’™1, ๐’™2, โ€ฆ , ๐’™๐‘

๐’™๐‘› โˆˆ โ„๐ท

2

โžข๐‘ ๐ท-dim database vectors: ๐’™๐‘› ๐‘›=1๐‘

Nearest Neighbor Search; NN

0.233.150.651.43

Search

0.203.250.721.68

๐’’ โˆˆ โ„๐ท ๐’™74

argmin๐‘›โˆˆ 1,2,โ€ฆ,๐‘

๐’’ โˆ’ ๐’™๐‘› 22

Result

๐’™1, ๐’™2, โ€ฆ , ๐’™๐‘

๐’™๐‘› โˆˆ โ„๐ท

3

โžข๐‘ ๐ท-dim database vectors: ๐’™๐‘› ๐‘›=1๐‘

โžขGiven a query ๐’’, find the closest vector from the databaseโžขOne of the fundamental problems in computer scienceโžขSolution: linear scan, ๐‘‚ ๐‘๐ท , slow

Nearest Neighbor Search; NN

0.233.150.651.43

Search

0.203.250.721.68

๐’’ โˆˆ โ„๐ท ๐’™74

argmin๐‘›โˆˆ 1,2,โ€ฆ,๐‘

๐’’ โˆ’ ๐’™๐‘› 22

Result

๐’™1, ๐’™2, โ€ฆ , ๐’™๐‘

๐’™๐‘› โˆˆ โ„๐ท

Approximate Nearest Neighbor Search (ANN)

➢ Faster search
➢ Results don't necessarily have to be the exact neighbors
➢ Trade-off: runtime, accuracy, and memory consumption
➢ A sense of scale: billion-scale data on memory, e.g., $N = 10^6$ to $10^9$ vectors with $D \approx 100$ on 32 GB RAM, answering a query in about 10 ms

NN/ANN for CV

➢ Image retrieval
   https://about.mercari.com/press/news/article/20190318_image_search/
➢ Clustering
   https://jp.mathworks.com/help/vision/ug/image-classification-with-bag-of-visual-words.html
➢ kNN recognition
   https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm
➢ Originally: fast construction of bag-of-features; one of the benchmarks is still SIFT
➢ Person re-identification

Cheat-sheet for ANN in Python (as of 2020; all libraries can be installed by conda or pip)

Note: assuming $D \approx 100$. The size of the problem is determined by $DN$; if $100 \ll D$, run PCA to reduce $D$ to 100.

Start → Have GPU(s)?

Yes:
➢ faiss-gpu: linear scan (GpuIndexFlatL2) — exact nearest neighbor search
   ✓ If topk > 2048 → fall back to the CPU linear scan
   ✓ If slow, or out of memory → faiss-gpu: ivfpq (GpuIndexIVFPQ)
      - If out of GPU memory, make $M$ smaller
      - (1) If still out of GPU memory, or (2) need more accurate results → faiss-cpu: hnsw + ivfpq

No — decide by $N$:
➢ About $10^3 < N < 10^6$: faiss-cpu: linear scan (IndexFlatL2) — exact nearest neighbor search
➢ About $10^6 < N < 10^9$: nmslib (hnsw)
   ✓ Alternative: faiss.IndexHNSWFlat in faiss-cpu — same algorithm in different libraries
   ✓ Require fast data addition → falconn
   ✓ Would like to run from several processes → annoy
   ✓ Would like to run subset-search → rii
   ✓ If slow, or out of memory → faiss-cpu: hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ)
➢ About $10^9 < N$: faiss-cpu: hnsw + ivfpq (IndexHNSWFlat + IndexIVFPQ)
   ✓ If out of memory → adjust the PQ parameters: make $M$ smaller
   ✓ Would like to adjust the performance → adjust the IVF parameters: make nprobe larger ➡ higher accuracy but slower

Part 1: Nearest Neighbor Search
Part 2: Approximate Nearest Neighbor Search

Nearest Neighbor Search

➢ Should try this first of all
➢ We introduce a naïve implementation and a fast implementation
   ✓ The fast one uses the Faiss library from FAIR (you'll see it many times today; CPU & GPU)
➢ Experience the drastic difference between the two implementations

Task: Given $q \in \mathcal{Q}$ and $x \in \mathcal{X}$, compute $\|q - x\|_2^2$

$M$ $D$-dim query vectors $\mathcal{Q} = \{q_1, q_2, \ldots, q_M\}$; $N$ $D$-dim database vectors $\mathcal{X} = \{x_1, x_2, \ldots, x_N\}$; $M \ll N$

Naïve implementation — parallelize the query side (the minimum is then selected with a heap; omitted here):

parfor q in Q:          # parallel loop over queries
    for x in X:
        l2sqr(q, x)

def l2sqr(q, x):
    diff = 0.0
    for d in range(D):
        diff += (q[d] - x[d]) ** 2
    return diff

Faiss implementation:

if M < 20:
    compute $\|q - x\|_2^2$ directly by SIMD
else:
    compute $\|q - x\|_2^2 = \|q\|_2^2 - 2 q^\top x + \|x\|_2^2$ by BLAS
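For comparison, the fast path can be exercised directly through faiss's exact index; a minimal sketch (faiss.IndexFlatL2 is the real API that appears later in this tutorial; the data here are random placeholders):

import faiss
import numpy as np

M, N, D = 10, 100000, 128
X = np.random.rand(N, D).astype(np.float32)
Q = np.random.rand(M, D).astype(np.float32)

index = faiss.IndexFlatL2(D)      # exact linear scan (brute-force NN)
index.add(X)
dists, ids = index.search(Q, 1)   # the nearest neighbor of each query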

Squared L2 distance $\|x - y\|_2^2$ by SIMD: faiss's fvec_L2sqr (variables renamed for the sake of explanation; walk-through example with $D = 31$, float = 32 bit)

Reference — the naïve version:

def l2sqr(x, y):
    diff = 0.0
    for d in range(D):
        diff += (x[d] - y[d]) ** 2
    return diff

float fvec_L2sqr (const float * x,
                  const float * y,
                  size_t d)
{
    __m256 msum1 = _mm256_setzero_ps();

    while (d >= 8) {    // 256-bit SIMD register: process eight floats at once
        __m256 mx = _mm256_loadu_ps (x); x += 8;
        __m256 my = _mm256_loadu_ps (y); y += 8;
        const __m256 a_m_b1 = mx - my;    // eight subtractions in one instruction
        msum1 += a_m_b1 * a_m_b1;         // eight squared terms accumulated into msum1
        d -= 8;
    }

    __m128 msum2 = _mm256_extractf128_ps(msum1, 1);   // fold the 256-bit accumulator
    msum2 += _mm256_extractf128_ps(msum1, 0);         // into a 128-bit register

    if (d >= 4) {       // 128-bit SIMD register: process four floats at once
        __m128 mx = _mm_loadu_ps (x); x += 4;
        __m128 my = _mm_loadu_ps (y); y += 4;
        const __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
        d -= 4;
    }

    if (d > 0) {        // the rest (fewer than four floats): zero-padded masked load
        __m128 mx = masked_read (d, x);
        __m128 my = masked_read (d, y);
        __m128 a_m_b1 = mx - my;
        msum2 += a_m_b1 * a_m_b1;
    }

    msum2 = _mm_hadd_ps (msum2, msum2);   // two horizontal adds reduce the four
    msum2 = _mm_hadd_ps (msum2, msum2);   // partial sums to the final result
    return _mm_cvtss_f32 (msum2);
}

➢ The SIMD codes of faiss are simple and easy to read
➢ Being able to read SIMD code comes in handy sometimes, e.g., to see why this implementation is super fast
➢ Another example of a SIMD L2sqr, from hnswlib:
   https://github.com/nmslib/hnswlib/blob/master/hnswlib/space_l2.h

Recap of the faiss implementation: if $M < 20$, compute $\|q - x\|_2^2$ directly by SIMD; otherwise compute $\|q - x\|_2^2 = \|q\|_2^2 - 2 q^\top x + \|x\|_2^2$ by BLAS.

Compute ๐’’ โˆ’ ๐’™ 22 = ๐’’ 2

2 โˆ’ 2๐’’โŠค๐’™ + ๐’™ 22 with BLAS

# Compute tablesq_norms = norms(Q) # ๐’’1 2

2, ๐’’2 22, โ€ฆ , ๐’’๐‘€ 2

2

x_norms = norms(X) # ๐’™1 22, ๐’™2 2

2, โ€ฆ , ๐’™๐‘ 22

ip = sgemm_(Q, X, โ€ฆ) # ๐‘„โŠค๐‘‹

# Scan and sumparfor (m = 0; m < M; ++m):

for (n = 0; n < N; ++n):dist = q_norms[m] + x_norms[n] โ€“ ip[m][n]

Stack ๐‘€ ๐ท-dim query vectors to a ๐ท ร—๐‘€ matrix: ๐‘„ = ๐’’1, ๐’’2, โ€ฆ , ๐’’๐‘€ โˆˆ โ„๐ทร—๐‘€

Stack ๐‘ ๐ท-dim database vectors to a ๐ท ร— ๐‘ matrix: ๐‘‹ = ๐’™1, ๐’™2, โ€ฆ , ๐’™๐‘ โˆˆ โ„๐ทร—๐‘

SIMD-accelerated function

โžข Matrix multiplication by BLASโžข Dominant if ๐‘„ and ๐‘‹ are largeโžข The difference of the background matters:โœ“ Intel MKL is 30% faster than OpenBLAS

29๐’’๐‘š 22 ๐’™๐‘› 2

2 ๐‘„โŠค๐‘‹ ๐‘š๐‘›๐’’๐‘š โˆ’ ๐’™๐‘› 22

NN on GPU (faiss-gpu) is ~10x faster than NN on CPU (faiss-cpu)

➢ NN-GPU always computes $\|q\|_2^2 - 2 q^\top x + \|x\|_2^2$
➢ k-means for 1M vectors ($D = 256$, $K = 20000$):
   ✓ 11 min on CPU
   ✓ 55 sec on 1 Pascal-class P100 GPU (float32 math)
   ✓ 34 sec on 1 Pascal-class P100 GPU (float16 math)
   ✓ 21 sec on 4 Pascal-class P100 GPUs (float32 math)
   ✓ 16 sec on 4 Pascal-class P100 GPUs (float16 math)
➢ If a GPU is available and its memory is enough, try GPU-NN
➢ The behavior is a little bit different from the CPU version (e.g., a restriction on top-k)

Benchmark: https://github.com/facebookresearch/faiss/wiki/Low-level-benchmarks
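A minimal sketch of brute-force NN on one GPU (assuming faiss-gpu is installed; GpuIndexFlatL2 is the class named in the cheat-sheet, the data are placeholders):

import faiss
import numpy as np

D = 128
X = np.random.rand(1000000, D).astype(np.float32)
Q = np.random.rand(100, D).astype(np.float32)

res = faiss.StandardGpuResources()     # allocate GPU resources
index = faiss.GpuIndexFlatL2(res, D)   # linear scan on the GPU
index.add(X)
dists, ids = index.search(Q, 10)       # note the top-k restriction (k <= 2048)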

Reference
➢ Switching the implementation of L2sqr in faiss: https://github.com/facebookresearch/faiss/wiki/Implementation-notes#matrix-multiplication-to-do-many-l2-distance-computations
➢ Introduction to SIMD: a lecture by Markus Püschel (ETH), "How to Write Fast Numerical Code - Spring 2019", especially "SIMD vector instructions"
   ✓ https://acl.inf.ethz.ch/teaching/fastcode/2019/
   ✓ https://acl.inf.ethz.ch/teaching/fastcode/2019/slides/07-simd.pdf
➢ SIMD codes of faiss: https://github.com/facebookresearch/faiss/blob/master/utils/distances_simd.cpp
➢ L2sqr benchmark including AVX512: https://gist.github.com/matsui528/583925f88fcb08240319030202588c74

Part 1: Nearest Neighbor Search
Part 2: Approximate Nearest Neighbor Search

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

33

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

34

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

Locality Sensitive Hashing (LSH)

➢ LSH = hash functions + hash tables
➢ Map similar items to the same symbol with a high probability

Record: hash each database vector with several hash functions and store its ID in the corresponding buckets; e.g., $x_{13}$ is hashed by Hash 1, Hash 2, ... and lands in one bucket of each table.

Search: hash the query $q$ with the same functions, collect the IDs stored in its buckets, and compare $q$ with those candidates $x_4, x_5, x_{21}, \ldots$ by the Euclidean distance.

E.g., random projection [Datar+, SCG 04]:
$H(x) = [h_1(x), \ldots, h_M(x)]^\top$, where $h_m(x) = \left\lfloor \frac{a^\top x + b}{W} \right\rfloor$

☺:
➢ Math-friendly
➢ Popular in the theory area (FOCS, STOC, ...)
☹:
➢ Large memory cost
   ✓ Need several tables to boost the accuracy
   ✓ Need to store the original data $\{x_n\}_{n=1}^{N}$ on memory
➢ Data-dependent methods such as PQ are better for real-world data
➢ Thus, in recent CV papers, LSH has been treated as a classic method
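To make the scheme concrete, a toy sketch of one random-projection hash table (my own illustration of the [Datar+, SCG 04] function; a real system would use several tables, or FALCONN below):

import numpy as np

D, M, W = 128, 8, 4.0
rng = np.random.default_rng(0)
A = rng.standard_normal((M, D))    # rows a_m ~ N(0, I): p-stable for L2
b = rng.uniform(0, W, size=M)

def H(x):
    # H(x) = [h_1(x), ..., h_M(x)], h_m(x) = floor((a_m^T x + b_m) / W)
    return tuple(np.floor((A @ x + b) / W).astype(int))

X = rng.random((1000, D))
buckets = {}
for n, x in enumerate(X):                  # record
    buckets.setdefault(H(x), []).append(n)

q = rng.random(D)
candidates = buckets.get(H(q), [])         # search: then rank by Euclidean distance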

In fact:
➢ Consider the next candidate buckets as well ➡ practical memory consumption (Multi-Probe [Lv+, VLDB 07])
➢ A library based on this idea: FALCONN

FALCONN ★852
https://github.com/falconn-lib/falconn

$> pip install FALCONN

table = falconn.LSHIndex(params_cp)
table.setup(X - center)                  # database vectors, centered
query_object = table.construct_query_object()
# query parameter config here
query_object.find_nearest_neighbor(Q - center, topk)

☺ Faster data addition (than annoy, nmslib, ivfpq)
☺ Useful for on-the-fly addition
☹ Parameter configuration seems a bit non-intuitive

Reference
➢ Good summary of this field: CVPR 2014 Tutorial on Large-Scale Visual Recognition, Part I: Efficient Matching, H. Jégou [https://sites.google.com/site/lsvrtutorialcvpr14/home/efficient-matching]
➢ Practical Q&A: FAQ in the FALCONN wiki [https://github.com/FALCONN-LIB/FALCONN/wiki/FAQ]
➢ Hash functions: M. Datar et al., "Locality-Sensitive Hashing Scheme Based on p-Stable Distributions", SCG 2004
➢ Multi-Probe: Q. Lv et al., "Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search", VLDB 2007
➢ Survey: A. Andoni and P. Indyk, "Near-Optimal Hashing Algorithms for Approximate Nearest Neighbor in High Dimensions", Comm. ACM 2008

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

41

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

FLANN: Fast Library for Approximate Nearest Neighbors
https://github.com/mariusmuja/flann

➢ Automatically selects "Randomized KD Tree" or "k-means Tree" (images in the original slides are from [Muja and Lowe, TPAMI 2014])
☺ Good code base; implemented in OpenCV and PCL
☺ Very popular in the late 00's and early 10's
☹ Large memory consumption; the original data need to be stored
☹ Not actively maintained now

Annoy: "2-means tree" + "multiple trees" + "shared priority queue"
(All images in the original slides are cited from the author's blog post: https://erikbern.com/2015/10/01/nearest-neighbors-and-vector-models-part-2-how-to-search-in-high-dimensional-spaces.html)

Record: select two points randomly, divide up the space, and repeat hierarchically.

Search:
➢ Focus on the cell the query lives in
➢ Compare the distances
➢ The tree can be traversed with a logarithmic number of comparisons

Feature 1: if we need more data points, use a priority queue
Feature 2: boost the accuracy with multiple trees sharing one priority queue

Annoy ★7.1K
https://github.com/erikbern/annoy

$> pip install annoy

t = AnnoyIndex(D)
for n, x in enumerate(X):
    t.add_item(n, x)
t.build(n_trees)

t.get_nns_by_vector(q, topk)

☺ Developed at Spotify. Well-maintained. Stable
☺ Simple interface with only a few parameters
☺ Baseline for million-scale data
☺ Supports mmap, i.e., the index can be accessed from several processes
☹ Large memory consumption
☹ Runtime itself is slower than HNSW

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

46

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

Graph traversal

➢ Very popular in recent years
➢ Around 2017, it turned out that graph-traversal-based methods work well for million-scale data
➢ Pioneers:
   ✓ Navigable Small World Graphs (NSW)
   ✓ Hierarchical NSW (HNSW)
➢ Implementations: nmslib, hnswlib, faiss

Record (images are from [Malkov+, Information Systems, 2013]):
➢ Each node is a database vector
➢ Given a new database vector (e.g., $x_{91}$ joining the graph of $x_1, \ldots, x_{90}$), create new edges to its neighbors
➢ Links created early on can be long
➢ Such long links encourage large hops, making the search converge fast

Search (images are from [Malkov+, Information Systems, 2013]):
➢ Given a query vector
➢ Start from a random point
➢ From the connected nodes, find the one closest to the query
➢ Traverse in a greedy manner
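A minimal sketch of this greedy traversal (my own illustration; graph construction, HNSW's hierarchy, and the candidate heap are all omitted):

import numpy as np

def greedy_search(graph, X, q, start):
    # graph: dict node_id -> list of neighbor ids; X: (N, D) vectors; q: (D,)
    cur = start
    cur_d = ((X[cur] - q) ** 2).sum()
    while True:
        # from the connected nodes, find the one closest to the query
        best, best_d = cur, cur_d
        for v in graph[cur]:
            d = ((X[v] - q) ** 2).sum()
            if d < best_d:
                best, best_d = v, d
        if best == cur:      # no neighbor is closer: a local minimum is reached
            return cur, cur_d
        cur, cur_d = best, best_d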

Extension: Hierarchical NSW (HNSW) [Malkov and Yashunin, TPAMI, 2019]

➢ Construct the graph hierarchically
➢ This structure works pretty well for real-world data
➢ Search on a coarse graph → move to the same node on a finer graph → repeat

NMSLIB (Non-Metric Space Library) ★2k
https://github.com/nmslib/nmslib

$> pip install nmslib

index = nmslib.init(method='hnsw')
index.addDataPointBatch(X)
index.createIndex(params1)

index.setQueryTimeParams(params2)
index.knnQuery(q, topk)

☺ "hnsw" is the best method as of 2020 for million-scale data
☺ Simple interface
☺ If memory consumption is not a problem, try this
☹ Large memory consumption
☹ Data addition is not fast

Other implementations of HNSW

Hnswlib: https://github.com/nmslib/hnswlib
➢ Spin-off library from nmslib
➢ Includes only hnsw
➢ Simpler; may be useful if you want to extend hnsw

Faiss: https://github.com/facebookresearch/faiss
➢ Library for PQ-based methods (introduced later); it also includes hnsw

Other graph-based approaches:
➢ From Alibaba: C. Fu et al., "Fast Approximate Nearest Neighbor Search with the Navigating Spreading-out Graph", VLDB 19. https://github.com/ZJULearning/nsg
➢ From Microsoft Research Asia; used inside Bing: J. Wang and S. Lin, "Query-Driven Iterated Neighborhood Graph Search for Large Scale Indexing", ACMMM 12 (this seems to be the backbone paper). https://github.com/microsoft/SPTAG
➢ From Yahoo Japan; competing with NMSLIB for the first place of the benchmark: M. Iwasaki and D. Miyazaki, "Optimization of Indexing Based on k-Nearest Neighbor Graph for Proximity Search in High-dimensional Data", arXiv 18. https://github.com/yahoojapan/NGT

Reference
➢ The original paper of Navigable Small World Graphs: Y. Malkov et al., "Approximate Nearest Neighbor Algorithm Based on Navigable Small World Graphs", Information Systems 2013
➢ The original paper of Hierarchical Navigable Small World Graphs: Y. Malkov and D. Yashunin, "Efficient and Robust Approximate Nearest Neighbor Search Using Hierarchical Navigable Small World Graphs", IEEE TPAMI 2019

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

65

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

Basic idea of data compression

➢ Need $4ND$ bytes to represent $N$ real-valued $D$-dim vectors using floats
➢ If $N$ or $D$ is too large, we cannot keep the data on memory
   ✓ E.g., 512 GB for $D = 128$, $N = 10^9$
➢ Convert each vector to a short code
➢ Short codes are designed to be memory-efficient
   ✓ E.g., 4 GB for the above example, with 32-bit codes
➢ Run the search over the short codes

What kind of conversion is preferred?
1. The "distance" between two codes can be calculated (e.g., Hamming distance)
2. The distance can be computed quickly
3. That distance approximates the distance between the original vectors (e.g., $L_2$)
4. A sufficiently small code length can achieve the above three criteria

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

68

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

โžข Convert ๐’™ to a ๐ต-bit binary vector:๐‘“ ๐’™ = ๐’ƒ โˆˆ 0, 1 ๐ต

โžข Hamming distance ๐‘‘๐ป ๐’ƒ1, ๐’ƒ2 = ๐’ƒ1โŠ•๐’ƒ2 โˆผ ๐‘‘ ๐’™1, ๐’™2

โžข A lot of methods:โœ“ J. Wang et al., โ€œLearning to Hash for Indexing Big Data - A

Surveyโ€, Proc. IEEE 2015โœ“ J. Wang et al., โ€œA Survey on Learning to Hashโ€, TPAMI 2018

โžข Not the main scope of this tutorial;PQ is usually more accurate
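For reference, the Hamming distance itself is a one-liner; a toy sketch (in practice the codes are packed into 64-bit words and compared with hardware popcount):

import numpy as np

B = 64
b1 = np.random.randint(0, 2, B, dtype=np.uint8)
b2 = np.random.randint(0, 2, B, dtype=np.uint8)

d_h = int(np.count_nonzero(b1 ^ b2))   # popcount(b1 XOR b2)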

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

69

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

Product Quantization (PQ) [Jégou, TPAMI 2011]

➢ Split a $D$-dim vector into $M$ sub-vectors, and quantize each sub-vector
➢ Each sub-space has its own codebook of 256 codewords (ID: 1, ID: 2, ..., ID: 256), trained beforehand by k-means on training data
➢ E.g., the vector $x = [0.34, 0.22, 0.68, 1.02, 0.03, 0.71]^\top$ is split into $M = 3$ sub-vectors; each sub-vector is replaced by the ID of its nearest codeword, giving the PQ code $\bar{x} = [\mathrm{ID{:}}\,2, \mathrm{ID{:}}\,123, \mathrm{ID{:}}\,87]$
➢ Simple
➢ Memory-efficient
➢ The distance can be estimated

Bar notation for PQ codes in this tutorial: $x \in \mathbb{R}^D \mapsto \bar{x} \in \{1, \ldots, 256\}^M$
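A toy sketch of PQ training and encoding under these definitions (my own illustration using scipy's k-means, with $M = 8$ sub-spaces and 256 centroids each; data are placeholders):

import numpy as np
from scipy.cluster.vq import kmeans2, vq

D, M, Ks = 128, 8, 256
Ds = D // M
Xt = np.random.rand(10000, D).astype(np.float32)   # training data

# One codebook (Ks centroids) per sub-space, trained by k-means
codebooks = [kmeans2(Xt[:, m * Ds:(m + 1) * Ds], Ks, minit='points')[0]
             for m in range(M)]

def pq_encode(x):
    # x: (D,) -> PQ code (M,): the id of the nearest codeword in each sub-space
    return np.array([vq(x[None, m * Ds:(m + 1) * Ds], codebooks[m])[0][0]
                     for m in range(M)], dtype=np.uint8)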

Product Quantization: Memory-efficient

➢ Original vector: e.g., $D = 128$ floats (32 bit each) → $128 \times 32 = 4096$ [bit]
➢ PQ code: e.g., $M = 8$ uchars (8 bit each) → $8 \times 8 = 64$ [bit]
➢ 1/64 of the original memory consumption

Product Quantization: Distance estimation

➢ Given a query $q \in \mathbb{R}^D$ and database vectors $x_1, x_2, \ldots, x_N$, product-quantize the database: $\bar{x}_n \in \{1, \ldots, 256\}^M$
➢ $d(q, x)^2$ can be efficiently approximated by the asymmetric distance $d_A(q, \bar{x})^2$
➢ Lookup trick: look up pre-computed distance tables
➢ Linear scan by $d_A$
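A sketch of the lookup trick under the notation above (my own illustration; one $M \times 256$ table is built per query, then each database code costs only $M$ table lookups):

import numpy as np

def adc(q, codes, codebooks, M, Ds):
    # q: (D,) query; codes: (N, M) PQ codes; codebooks: list of (256, Ds) arrays
    # 1) Pre-compute the distance table: table[m][k] = || q_m - c_k^(m) ||^2
    table = np.stack([((codebooks[m] - q[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
                      for m in range(M)])              # (M, 256)
    # 2) d_A(q, xbar)^2 = sum_m table[m][xbar_m]: M lookups per code
    return table[np.arange(M), codes].sum(axis=1)      # (N,)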

➢ PQ takes only tens of lines in Python — and these are not pseudo codes
➢ Pure Python library: nanopq. https://github.com/matsui528/nanopq
➢ pip install nanopq
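Typical nanopq usage looks like the following (a sketch based on the library's README; data are random placeholders):

import nanopq
import numpy as np

Xt = np.random.rand(10000, 128).astype(np.float32)   # training data
X = np.random.rand(100000, 128).astype(np.float32)   # database
q = np.random.rand(128).astype(np.float32)           # query

pq = nanopq.PQ(M=8)                  # 8 sub-spaces, 256 centroids each
pq.fit(Xt)                           # train codebooks by k-means
X_code = pq.encode(X)                # (N, 8) uint8 PQ codes
dists = pq.dtable(q).adist(X_code)   # asymmetric distances to all codes
print(np.argmin(dists))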

Deep PQ

➢ T. Yu et al., "Product Quantization Network for Fast Image Retrieval", ECCV 18 / IJCV 20
➢ L. Yu et al., "Generative Adversarial Product Quantisation", ACMMM 18
➢ B. Klein et al., "End-to-End Supervised Product Quantization for Image Search and Retrieval", CVPR 19

➢ Supervised search (unlike the original PQ)
➢ Base CNN + PQ-like layer + some loss
➢ Needs class information

(Figure from T. Yu et al., "Product Quantization Network for Fast Image Retrieval", ECCV 18)

More extensive surveys for PQ

➢ https://github.com/facebookresearch/faiss/wiki#research-foundations-of-faiss
➢ http://yusukematsui.me/project/survey_pq/survey_pq_jp.html
➢ Y. Matsui, Y. Uchida, H. Jégou, S. Satoh, "A Survey of Product Quantization", ITE 2018

Hamming-based vs Look-up-based

|                | Hamming-based                           | Look-up-based                                |
|----------------|-----------------------------------------|----------------------------------------------|
| Example code   | [0, 1, 0, 1, 0, 0]                      | [ID: 2, ID: 123, ID: 87]                     |
| Representation | Binary code: {0, 1}^B                   | PQ code: {1, ..., 256}^M                     |
| Distance       | Hamming distance                        | Asymmetric distance                          |
| Approximation  | ☺                                       | ☺☺                                           |
| Runtime        | ☺☺                                      | ☺                                            |
| Pros           | No auxiliary structure                  | Can reconstruct the original vector          |
| Cons           | Cannot reconstruct the original vector  | Requires an auxiliary structure (codebook)   |

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

88

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

Inverted index + PQ: Recap the notation

➢ Product quantization maps $x \in \mathbb{R}^D$ to $\bar{x} \in \{1, \ldots, 256\}^M$ (bar notation = PQ code)
➢ Suppose $q, x \in \mathbb{R}^D$, where $x$ is quantized to $\bar{x}$
➢ $d(q, x)^2$ can be efficiently approximated using just the PQ code $\bar{x}$ (not the original vector $x$): $d(q, x)^2 \sim d_A(q, \bar{x})^2$

Inverted index + PQ: Record

Prepare a coarse quantizer:
✓ Split the space into $K$ cells with representatives $c_1, \ldots, c_K$ (one posting list per $k = 1, \ldots, K$)
✓ $\{c_k\}_{k=1}^{K}$ are created by running k-means on training data

Record a database vector $x_1$ (e.g., $x_1 = [1.02, 0.73, 0.56, 1.37, 1.37, 0.72]^\top$):
➢ Find the nearest coarse centroid; suppose $c_2$ is closest to $x_1$
➢ Compute the residual between $x_1$ and $c_2$: $r_1 = x_1 - c_2$
➢ Quantize $r_1$ to $\bar{r}_1$ by PQ
➢ Record it with the ID "1" in the posting list of $k = 2$; i.e., record the pair $(i, \bar{r}_i)$

➢ For all database vectors, record [ID + PQ(residual)] in the posting lists of the coarse quantizer; e.g., the list of one cell holds (245, [ID: 42, ID: 37, ID: 9]), (12, [ID: 25, ID: 47, ID: 32]), (1932, [ID: 38, ID: 49, ID: 72]), ...

Inverted index + PQ: Search

Find the nearest vector to a query $q$ (e.g., $q = [0.54, 2.35, 0.82, 0.42, 0.14, 0.32]^\top$):
➢ $c_2$ is the closest coarse centroid to $q$
➢ Compute the residual: $r_q = q - c_2$
➢ For all $(i, \bar{r}_i)$ in the posting list of $k = 2$, compare $\bar{r}_i$ with $r_q$:
   $d(q, x_i)^2 = d(q - c_2, x_i - c_2)^2 = d(r_q, r_i)^2 \sim d_A(r_q, \bar{r}_i)^2$
➢ Find the smallest one (several strategies exist)
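Putting record and search together, a compact sketch of the whole IVF + PQ search path (my own illustration; the codebooks and the asymmetric-distance table follow the PQ sketches above, and nprobe cells are visited as in faiss):

import numpy as np

def ivfadc_search(q, coarse_centers, postings, codebooks, M, Ds, nprobe=1):
    # coarse_centers: (K, D); postings[k] = (ids, codes) with codes: (n_k, M)
    cell_d = ((coarse_centers - q) ** 2).sum(axis=1)
    best_id, best_d = None, np.inf
    for k in np.argsort(cell_d)[:nprobe]:      # visit the nprobe nearest cells
        r_q = q - coarse_centers[k]            # residual of the query
        ids, codes = postings[k]
        if len(ids) == 0:
            continue
        # asymmetric distance of r_q to every PQ-encoded residual in the list
        table = np.stack([((codebooks[m] - r_q[m * Ds:(m + 1) * Ds]) ** 2).sum(axis=1)
                          for m in range(M)])
        d = table[np.arange(M), codes].sum(axis=1)
        i = int(np.argmin(d))
        if d[i] < best_d:
            best_id, best_d = ids[i], d[i]
    return best_id, best_d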

Faiss ★10K
https://github.com/facebookresearch/faiss

$> conda install faiss-cpu -c pytorch
$> conda install faiss-gpu -c pytorch

➢ From the original authors of PQ and a GPU expert, at FAIR
➢ CPU version: all PQ-based methods
➢ GPU version: some PQ-based methods
➢ Bonus:
   ✓ NN (not ANN) is also implemented, and is quite fast
   ✓ k-means (CPU/GPU), fast. Benchmark: https://github.com/DwangoMediaVillage/pqkmeans/blob/master/tutorial/4_comparison_to_faiss.ipynb

# Select a coarse quantizer: simple linear scan
quantizer = faiss.IndexFlatL2(D)
# nbits is usually 8 (256 centroids per sub-space)
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)

index.train(Xt)          # train
index.add(X)             # record data
index.nprobe = nprobe    # search parameter
dist, ids = index.search(Q, topk)   # search

๐‘

109

106

bill

ion

-sca

lem

illio

n-s

cale Locality Sensitive Hashing (LSH)

Tree / Space Partitioning

Graph traversal

0.340.220.680.71

0

1

0

0

ID: 2

ID: 123

0.340.220.680.71

Space partition Data compression

โžข k-meansโžข PQ/OPQโžข Graph traversalโžข etcโ€ฆ

โžข Raw dataโžข Scalar quantizationโžข PQ/OPQโžข etcโ€ฆ

Look-up-based

Hamming-based

Linear-scan by Asymmetric Distance

โ€ฆ

Linear-scan by Hamming distance

103

Inverted index + data compression

For raw data: Acc. โ˜บ, Memory: For compressed data: Acc. , Memory: โ˜บ

# Switch the coarse quantizer from linear scan to HNSW
quantizer = faiss.IndexHNSWFlat(D, hnsw_m)
index = faiss.IndexIVFPQ(quantizer, D, nlist, M, nbits)   # nbits is usually 8

➢ Switch the coarse quantizer from linear scan to HNSW
➢ The best approach for billion-scale data as of 2020
➢ The backbone of [Douze+, CVPR 2018] and [Baranchuk+, ECCV 2018]

☺ From the original authors of PQ. Extremely efficient (theory & implementation)
☺ Used in real-world products (Mercari, etc.)
☺ For billion-scale data, Faiss is the best option
☺ In particular, large-batch search is fast, i.e., when the number of queries is large
☹ Lack of documentation (especially for the Python bindings)
☹ Hard for a novice user to select a suitable algorithm
☹ As of 2020, anaconda is required; pip is not officially supported

Reference
➢ Faiss wiki: https://github.com/facebookresearch/faiss/wiki
➢ Faiss tips: https://github.com/matsui528/faiss_tips
➢ Julia implementation of look-up-based methods: https://github.com/una-dinosauria/Rayuela.jl
➢ PQ paper: H. Jégou et al., "Product Quantization for Nearest Neighbor Search", TPAMI 2011
➢ IVFADC + HNSW (1): M. Douze et al., "Link and Code: Fast Indexing with Graphs and Compact Regression Codes", CVPR 2018
➢ IVFADC + HNSW (2): D. Baranchuk et al., "Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors", ECCV 2018

Cheat-sheet for ANN in Python (repeated; see the full version earlier in this tutorial)

Benchmark 1: ann-benchmarks
➢ https://github.com/erikbern/ann-benchmarks
➢ Comprehensive and thorough benchmarks for various libraries. Docker-based
➢ In its recall-vs-throughput plots, the closer to the top-right, the better
➢ As of June 2020, NMSLIB and NGT are competing with each other for the first place

Benchmark 2: annbench
➢ https://github.com/matsui528/annbench
➢ Lightweight, easy to use

# Install libraries
pip install -r requirements.txt

# Download a dataset on ./dataset
python download.py dataset=siftsmall

# Evaluate algorithms. Results are on ./output
python run.py dataset=siftsmall algo=annoy

# Visualize
python plot.py

# Multi-run by Hydra
python run.py --multirun dataset=siftsmall,sift1m algo=linear,annoy,ivfpq,hnsw

Subset-search (Y. Matsui+, "Reconfigurable Inverted Index", ACMMM 18)

Search over a "subset" of $\{x_n\}_{n=1}^{N}$:
(1) Tag-based search, e.g., tag == "zebra", gives the target IDs: [125, 223, 365, ...]
(2) Image search with a query $q$, restricted to those target IDs

| ID  | Img | Tag        |
|-----|-----|------------|
| 1   | ... | "cat"      |
| 2   | ... | "bird"     |
| ... | ... | ...        |
| 125 | ... | "zebra"    |
| 126 | ... | "elephant" |
| ... | ... | ...        |

Trillion-scale search: $N = 10^{12}$ (1T)

Sense of scale:
➢ K ($= 10^3$): just a second on a local machine
➢ M ($= 10^6$): all data can be on memory; try several approaches
➢ G ($= 10^9$): need to compress the data by PQ; only two datasets are available (SIFT1B, Deep1B)
➢ T ($= 10^{12}$): cannot even imagine — a sparse matrix of 15 exa elements?

https://github.com/facebookresearch/faiss/wiki/Indexing-1T-vectors
➢ Discussed only in the Faiss wiki
➢ Distributed search, mmap, etc.

Nearest neighbor search engines: something like ANN + SQL
➢ Vearch: https://github.com/vearch/vearch
➢ Milvus: https://github.com/milvus-io/milvus
➢ Elasticsearch KNN: https://github.com/opendistro-for-elasticsearch/k-NN
➢ Vald: https://github.com/vdaas/vald
➢ The algorithm inside is faiss, nmslib, or NGT

Problems of ANN

➢ No mathematical background
   ✓ Only actual measurements matter: recall and runtime
   ✓ The ANN problem was mathematically defined 10+ years ago (LSH), but recently no one cares about the definition
➢ Thus, when the score is high, the reason is not clear:
   ✓ The method is good?
   ✓ The implementation is good?
   ✓ It just happens to work well for the target dataset?
   ✓ E.g., the difference of math library (OpenBLAS vs Intel MKL) matters
➢ If one could explain "why this approach works well for this dataset", it would be a great contribution to the field
➢ Not enough datasets. Currently, only two datasets are available for billion-scale data: SIFT1B and Deep1B
