RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED...

74
RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015 = +

Transcript of RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED...

Page 1: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

RANDOM PROJECTIONSFOR SEARCH AND MACHINE LEARNING

Stefan Savev

Berlin Buzzwords

June 2015

= +

Page 2: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

KEYWORD-BASED SEARCH

Document Data

300 unique words per document

300 000 words in vocabulary

Data sparsity: 99.9%

Words are strong query features

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 2

Page 3: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

IMAGE SEARCH BY EXAMPLE

Image Data

800 pixels

200 black pixels

Data sparsity 80%

Pixels are weak query features

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 3

Page 4: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

TWO KINDS OF SEARCH

Data Sparse Dense

Query Size Short Long

Use Cases Keyword Search Image Search, Semantic

Search,

Recommendations

SEARCH

METHOD

INVERTED

INDEX (LUCENE)

RANDOM

PROJECTIONS

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 4

Page 5: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

OUTLINE

1. High dimensional data (images, text, clicks)

2. Random Projections: Why? What? How?

3. Image Search Benchmark

5STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015

Page 6: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HIGH-DIMENSIONAL DATA

Images

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 6

Page 7: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HIGH-DIMENSIONAL DATA

Images

Pixels

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 7

Page 8: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HIGH-DIMENSIONAL DATA

Images

1 2 3 4

5 6 7 8

9 10 11 12

13 14 15 16

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16

Dimensions

Pixels

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 8

Page 9: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

MATRIX REPRESENTATION OF DATASET

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 9

Page 10: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

MATRIX REPRESENTATION OF DATASET

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 10

Page 11: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HIGH-DIMENSIONAL DATA

Text

A beer can do everyone a

favor.

a

ab

stract

be

bee …

beer

…ca

n …

do …

eve

ryone

2 0 0 0 1 1

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 11

Dimensions

Page 12: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HIGH-DIMENSIONAL DATA

Clicks

… … … …

1 0 0

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 12

Page 13: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DENSE VS SPARSE MATRIXIMAGES

1000 PIXELS

TEXT/CLICKS

300000 WORDS

CAN SEARCH WITH LUCENECANNOT SEARCH

WITH LUCENE

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 13

img_1

img_3

img_n

doc_1

doc_3

doc_n

Page 14: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DENSE VS SPARSE MATRIXIMAGES

1000 PIXELS

TEXT/CLICKS

300000 WORDS

CAN SEARCH WITH LUCENECANNOT SEARCH

WITH LUCENE

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 14

img_1

img_3

img_n

doc_1

doc_3

doc_n

TEXT/CLICKS

500 DIMENSIONS

WITH LUCENE

Page 15: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DENSE VS SPARSE MATRIXIMAGES

1000 PIXELS

TEXT/CLICKS

300000 WORDS

CAN SEARCH WITH LUCENECANNOT SEARCH

WITH LUCENE

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 15

img_1

img_3

img_n

doc_1

doc_3

doc_n

TEXT/CLICKS

500 DIMENSIONS

WITH RANDOM

PROJECTIONS

WITH RANDOM

PROJECTIONSWITH RANDOM

PROJECTIONS

WITH LUCENE

Page 16: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DIMENSIONALITY REDUCTION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 16

300000 dimensions

doc_1

doc_3

doc_n

500 dimensions

beer wine beer + wine

SVD

Page 17: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DIMENSIONALITY REDUCTION = PATTERN DISCOVERY

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 17

= +

Page 18: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

APPLICATIONS

Random Projections

Random Forest

SVM

BoostingSVD

Search

(Nearest

Neighbor)

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 18

Page 19: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

RANDOM PROJECTIONS IN INDUSTRY

Spotify for music recommendations https://github.com/spotify/annoy

Etsy for user/product recommendations “Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce”, Diane Hu, Rob Hall, Josh Attenberg

Referred to as Locality Sensitive Hashing

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 19

Page 20: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

OUTLINE

1. High dimensional data (images, text, clicks)

2. Random Projections: Why? What? How?

3. Image Search Benchmark

20STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015

Page 21: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

HOW DOES IT WORK?

= +

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 21

Page 22: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ILLUSTRATION ON ARTIFICIAL DATASET

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 22

id_1

id_3

id_n

Page 23: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ILLUSTRATION ON ARTIFICIAL DATASET

Nearest neighbor problem: Find the closest point

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 23

Page 24: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

Approximation: Neighbors are in a circle with

small radius

ILLUSTRATION ON ARTIFICIAL DATASET

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 24

Page 25: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

Core idea: random grids

Dynamic grids (variably sized)

Random = cheap

ILLUSTRATION ON ARTIFICIAL DATASET

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 25

Page 26: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PARTITION BY RANDOM SPLITS

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 26

Hyperplane splits

the dataset

Page 27: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION = ONE DIMENSIONAL VIEW OF DATASET

Random Direction

(1 Dimension)

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 27

A

Page 28: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

Random Direction

(1 Dimension)

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 28

A

PLATO’S ALEGORY OF THE CAVE

Page 29: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PARTITIONING BY PROJECTING ON RANDOM LINES

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 29

Page 30: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PARTITIONING BY PROJECTING ON RANDOM LINES

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 30

Page 31: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 31

Page 32: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 32

Page 33: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 33

Page 34: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 34

Page 35: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 35

When two points are close,

they are likely to end up

in the same partition,

even under

random partitioning

Page 36: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ALGORITHM VISUALIZATION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 36

When two points are close,

they are LIKELY to end

up in the same partition,

even under

random partitioning

Page 37: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

MULTIPLE TREES

Random tree partitioning is noisy. Build more trees.

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 37

Page 38: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ADDING MORE TREES

Tree 1 Trees 1 and 2 Trees 1, 2 and 3

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 38

Page 39: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ADDING MORE TREES

Trees 1 - 50 Trees 1 - 100STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 39

Page 40: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

CONSTRAINING THE NEAREST NEIGHBOR REGION

THRESHOLD = In how many trees a “NN candidate” appears

Threshold = 1 Threshold = 60Threshold = 20

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 40

Page 41: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

“WE’VE GOT HIM”

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 41

Page 42: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

OUTLINE

1. High dimensional data (images, text, clicks)

2. Random Projections: Why? What? How?

3. Image Search Benchmark

42STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015

Page 43: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 43

IMAGE SEARCH FOR PREDICTION

Page 44: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 44

Results

Image Label

5

3

3

IMAGE SEARCH FOR PREDICTION

Page 45: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STAGE 1: INDEXING

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 45

42 000 examples with labels

784 pixels per image

Page 46: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STAGE 1: INDEXING

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 46

42 000 examples with labels

784 pixels per image

Page 47: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STAGE 1: INDEXING

index = Index(...)

for image in training_examples:

labels[image.id] = image.label

index.add_item(image.id,

image.vector)

index.build(...)

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 47

42 000 examples with labels

784 pixels per image

Page 48: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STAGE 2: SEARCH TO PREDICT THE LABEL

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 48

index.load('.../file.index')

...

image_vec = read_image(...)

nn = 100 #number of nearest neighbors

results = index.get_nns_by_vector(image_vec,nn)

top_result = results[0]

predicted_label = labels[top_result.id]

Results

Image Label

5

3

3

28 000 test examples

Page 49: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

TOOLS

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 49

stefansavev.com/randomtrees

Page 50: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

BENCHMARK IMAGE SEARCH

DIMENSIONS # TREES # NN

CANDIDATES

ACCURACY

ON TEST

SEARCH TIME

[ms/query]

784 10 1000-1010 97.0 4.85

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 50

Reference: https://github.com/stefansavev/random-projections-talk/

BOTTLENECK

Page 51: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

BENCHMARK IMAGE SEARCH

DIMENSIONS # TREES # NN

CANDIDATES

ACCURACY

ON TEST

SEARCH TIME

[ms/query]

784 10 1000-1010 97.0 4.85

100,

after SVD

10 1000-1010 97.5 0.7

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 51

Reference: https://github.com/stefansavev/random-projections-talk/

Page 52: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

BENCHMARK IMAGE SEARCH

DIMENSIONS # TREES # NN

CANDIDATES

ACCURACY

ON TEST

SEARCH TIME

[ms/query]

784 10 1000-1010 97.0 4.85

100,

after SVD

10 1000-1010 97.5 0.7

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 52

Reference: https://github.com/stefansavev/random-projections-talk/

BOTTLENECK

Page 53: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

BENCHMARK IMAGE SEARCH

DIMENSIONS # TREES # NN

CANDIDATES

ACCURACY

ON TEST

SEARCH TIME

[ms/query]

784 10 1000-1010 97.0 4.85

100,

after SVD

10 1000-1010 97.5 0.7

100,

after SVD

40 180 (mean) 97.5 0.36

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 53

Reference: https://github.com/stefansavev/random-projections-talk/

stefansavev.com/

randomtrees

Page 54: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 54

+

Data point id

32

56

89

Page 55: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 55

+

label frequency

“0” 4

“1” 51

… …

“7” 38

… …

Histogram

Page 56: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

CONCLUSION

Search in Dense Data

CHEAP method to “explore” data; Makes algorithms FASTER (e.g. SVD, SVM)

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 56

= +

Page 57: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 57

Stefan Savev

Email: [email protected]

Blog: stefansavev.com

THANK YOU!

Page 58: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

EXTRA SLIDES

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 58

Page 59: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

REFERENCES

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 59

Page 60: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

ADVANTAGES/DISADVANTAGESDIMENSIONALITY REDUCTION

Advantages

“Semantic Similarity”

“Concentrate the signal”

Disadvantages

No exact matches possible

Adds noise

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 60

Page 61: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

RANDOM PROJECTIONS AS HASH FUNCTIONS

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 61

0 0 1 0

0 1 1 0

0 0 1 0

0 0 1 0

-1 1 1 -1

1 -1 1 1

1 1 -1 -1

-1 1 -1 1

0 0 1 0

0 -1 1 0

0 0 -1 0

0 0 -1 0

1 +1 - 1 – 1 – 1 = -1

* =

Page 62: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

EXAMPLE-BASED IMAGE SEARCH

MNIST image dataset of digits

Images are digits from 0 to 9

42 000 images for training

28 000 images for test

28 x 28 pixels with values from 0 to 255

784 (=28*28) features

Dataset is already preprocessed

Goal is to predict the label of an image

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 62

Page 63: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

SOME IMPLEMENTATION TRICKS

sparse random projections combine not all dimension but a small number of them

apply dimensionality reduction first

pick better random projections the projected histogram is wide

project on multiple lines simultaneously Hadamard matrix trick

reuse random projections via recombination

cut out a “ball” from the densest region

PCA may help

use the neighbors of neighbors of a point as additional similarity candidates

advanced search inside the trees with backtracking (KD-tree style)

generate multiple versions of query/training data (i.e. for image data translate or rotate the image)

[Link to Blog]

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 63

Page 64: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION

Dot product (cosine, similarity)

View of the dataset

Overlap with Pattern

“Geometry” of Search

Foundation for Search and Machine Learning

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 64

Page 65: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

WHY RANDOM?

Cheap (replace optimization with randomization); Simply build more trees

If we build the perfect tree, it’s just one tree. We need more and DIVERSE trees

In high dimensions all algorithms will have problems, so do the cheapest

With a few points/dimensions random is not the best choice

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 65

Page 66: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DIMENSIONALITY REDUCTION

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 66

300000 WORDS

doc_1

doc_3

doc_n

500 DIMENSIONS

beer

wine

beer + wine

SVD

MIXING THE

COLUMNS

OF THE ORIGINAL

MATRIX TO MAXIMIZE

SIGNAL TO NOISE

Page 67: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION = OVERLAP WITH A PATTERN

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 67

,OVERLAP

=SUM 1, more red

-1, more green=

Page 68: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION = OVERLAP WITH A PATTERN

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 68

,OVERLAP =SUM

Page 69: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION = OVERLAP WITH A PATTERN

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 69

, =OVERLAP( ) SUM

Page 70: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

1, if more red pixels than white

-1, if more white pixels than red

PROJECTION = OVERLAP WITH A PATTERN

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 70

, =OVERLAP( ) SUM

=

Page 71: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PROJECTION = OVERLAP WITH A PATTERN

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 71

, =OVERLAP( ) SUM

1, if more red pixels than white

-1, if more white pixels than red=

Page 72: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

DATA POINT – SVD FEATURE – RANDOM FEATURE

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 72

Page 73: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

PLATO’S ALEGORY OF THE CAVE

He then explains how the philosopher is like a prisoner who is freed from the cave and comes to understand that the shadows on the wall do not make up reality at all, as he can perceive the true form of reality rather than the mere shadows seen by the prisoners

Source: http://en.wikipedia.org/wiki/Allegory_of_the_Cave

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 73

Page 74: RANDOM PROJECTIONS Stefan Savev FOR SEARCH AND … · Berlin Buzzwords June 2015 = + KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data

SECRET SAUCE

How to make it faster?

How to make it more accurate?

When does it work?

Is there something better?

STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 74

http://stefansavev.com/random-projections-secretsouce.html