Terabyte-scale image similarity search with Hadoop

Denis Shestakov. Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014

Description

Talk given at Hadoop Summit Europe 2014, Amsterdam, Netherlands on 02.04.2014.

Talk abstract: In this talk I focus on a specific Hadoop application, image similarity search, and present our experience in designing, building and testing a Hadoop-based image similarity search system that scales to terabyte-sized image collections. I start by reviewing how to adapt image retrieval techniques to the MapReduce model. Second, I describe the image indexing and searching workloads and show how these workflows are rather atypical for Hadoop; e.g., I explain how to tune Hadoop to fit such computational tasks and specify the parameters and values that deliver the best performance. Next I present the Hadoop cluster heterogeneity problem and describe a solution to it based on a platform-aware Hadoop configuration. Then I introduce the tools, provided by the standard Apache Hadoop framework, that are useful for a large class of workloads similar to ours, where a large auxiliary data structure is required for processing the dataset. Finally, I give an overview of a series of experiments conducted on a four-terabyte image dataset (the largest reported in the academic literature). The findings are shared as best practices and recommendations for practitioners working with huge multimedia collections.

Speaker: Dr. Denis Shestakov is an experienced researcher in the area of big data engineering and, more recently, a practitioner working as a Hadoop/MapReduce consultant. Denis has been involved in various big data projects in web analytics and search, multimedia search and bioinformatics. See his profile on LinkedIn: http://fi.linkedin.com/in/dshestakov/

Transcript of Terabyte-scale image similarity search with Hadoop

Page 1: Terabyte-scale image similarity search with Hadoop

Terabyte-scale image similarity search with Hadoop

Denis Shestakov

Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014

Page 2: Terabyte-scale image similarity search with Hadoop

About me

● Big Data researcher/engineer
  ○ recent projects: large-scale image retrieval
  ○ before: web crawling

● Hadoop/MapReduce contractor
  ○ design/development/tuning of Hadoop applications

Denis Shestakov, denshe at gmail.com

linkedin: linkedin.com/in/dshestakov

Page 3: Terabyte-scale image similarity search with Hadoop

Talk Outline

● Intro to image search
● Image retrieval with MapReduce
● Image indexing/searching workloads
● Hadoop tools for large joins
● Smart Hadoop configuration
● Misc & conclusions


Page 4: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

● Finding images given a text query
  ○ e.g., "dog" → [images of dogs]


Page 5: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

● Finding images given an image
  ○ by content similarity


Page 6: Terabyte-scale image similarity search with Hadoop

Image Search Applications

● Regular image search
  ○ Google Images, Bing Images, TinEye, etc.
● Product search (by image)
● Object recognition
  ○ face, logo, vehicle, etc.
● Computer vision
● Augmented reality
● Medical imaging
● Astrophysics


Page 7: Terabyte-scale image similarity search with Hadoop

Intro to Image Search


Page 8: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?
● Images resized to a smaller size
● Then transformed to the chosen feature-description representation
  ○ image → set of feature descriptors (= high-dimensional vectors)
  ○ Many transformations exist
    ■ SIFT (Scale-Invariant Feature Transform) used by us


Page 9: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?

Typically several hundred feature descriptors per image:


image_id | SIFT descriptor
10011    | 21, 143, 5, …, 201, 186
10011    | 121, 14, 75, …, 20, 109
10011    | 37, 40, 0, …, 213, 96
...      | ...
10011    | 81, 235, 67, …, 102, 63

Page 10: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?
● Compare feature descriptors of a query image (e.g., by calculating Euclidean distance) with the descriptors of images in the collection to search
● Images with the ‘closest’ descriptors are similar to the query image
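As an illustration (not code from the talk), here is a minimal Java sketch of this comparison: descriptors are treated as plain int arrays and matched brute-force by Euclidean distance. All names are illustrative.

```java
// Minimal sketch (illustrative only): brute-force matching of one query
// descriptor against a collection of descriptors by Euclidean distance.
public class BruteForceMatch {

    // Squared Euclidean distance between two descriptors of equal length.
    static double squaredDistance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the 'closest' descriptor in the collection.
    static int nearest(int[] query, int[][] collection) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < collection.length; i++) {
            double d = squaredDistance(query, collection[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

This brute-force scan is exactly what becomes too expensive as the collection grows, which is what the index structures on the next slides avoid.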


Page 11: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

Why MapReduce?
● Direct comparisons of descriptors are costly even for very small collections
● Lots of approaches to ‘organize’ feature descriptors for fast search
  ○ Build an index
  ○ Index all the descriptors
  ○ At search, check query descriptors only against certain groups of descriptors


Page 12: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Why MapReduce?
● Existing approaches are poorly scalable
  ○ up to ~10-20 mln images
● But multimedia grows exponentially
● Scaling is required…


Page 13: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Use case:
● Copyright violation detection in a large image databank
  ○ >100 mln images
● Searching for a batch of images
  ○ Thousands of images in one query
  ○ Focus on throughput, not on response time for an individual image
● SIFT features


Page 14: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Indexing images
● Generating the index tree
● Clustering images into a large set of clusters (max cluster size = 5000); see the mapper sketch below
  ○ Mapper input:
    ■ unsorted SIFT descriptors
    ■ index tree (loaded by every mapper)
  ○ Mapper output:
    ■ (cluster_id, SIFT)
  ○ Reducer output:
    ■ SIFTs sorted by cluster_id
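A minimal sketch of what such an indexing mapper could look like (my illustration, not the talk's code): it assumes descriptors arrive as text lines of the form image_id<TAB>v1,v2,..., simplifies the index tree to a flat array of centroids, and elides how that array is loaded.

```java
// Sketch of an indexing mapper: assign each SIFT descriptor to its nearest
// cluster and emit (cluster_id, SIFT record). Input format and the flat
// centroid array are simplifying assumptions.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private int[][] centroids;                 // stands in for the index tree
    private final IntWritable clusterId = new IntWritable();

    @Override
    protected void setup(Context context) {
        // The real job ships the index tree (1.8GB for the 4TB dataset) to
        // every mapper; loading it is elided in this sketch.
        centroids = new int[0][];
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String[] dims = fields[1].split(",");
        int[] sift = new int[dims.length];
        for (int i = 0; i < dims.length; i++) {
            sift[i] = Integer.parseInt(dims[i].trim());
        }
        clusterId.set(nearestCentroid(sift));  // assign the descriptor to a cluster
        context.write(clusterId, value);       // emit (cluster_id, SIFT record)
    }

    // Brute-force nearest-centroid scan, as in the earlier distance sketch.
    private int nearestCentroid(int[] sift) {
        int best = 0;
        long bestDist = Long.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            long dist = 0;
            for (int i = 0; i < sift.length; i++) {
                long d = sift[i] - centroids[c][i];
                dist += d * d;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```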


Page 15: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Searching
● Generating the lookup table
  ○ indexing query SIFTs
● Finding best matches for query SIFTs (see the reducer sketch below)
  ○ Mapper input:
    ■ sorted SIFT descriptors
    ■ lookup table (loaded by every mapper)
  ○ Mapper output:
    ■ (query-sift-id, knn of image-ids)
  ○ Reducer output:
    ■ best votes (image-ids) for query-image-id
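A sketch of the vote-counting reduce side under simplifying assumptions (mine, not the talk's code): the map output is re-keyed by query image id and each value is a single candidate image id voted for by one of the query's descriptors.

```java
// Sketch: count, per query image, how often each collection image was voted
// for by the query's descriptors, and emit the best-voted image.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VoteCountingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text queryImageId, Iterable<Text> candidateImageIds,
                          Context context) throws IOException, InterruptedException {
        // Tally votes for each candidate collection image.
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (Text candidate : candidateImageIds) {
            String id = candidate.toString();
            Integer n = votes.get(id);
            votes.put(id, n == null ? 1 : n + 1);
        }
        // Emit the image with the most votes as the best match for this query image.
        String best = null;
        int bestVotes = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestVotes = e.getValue();
                best = e.getKey();
            }
        }
        context.write(queryImageId, new Text(best + "\t" + bestVotes));
    }
}
```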


Page 16: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

In a nutshell:
● Indexing phase
  ○ Clustering SIFTs with one-pass k-means
● Searching phase
  ○ Map-side join of clustered SIFTs and the lookup table (query SIFTs)


Page 17: Terabyte-scale image similarity search with Hadoop

Image search workloads

Time to discuss Hadoop specifics:
● Standard Apache Hadoop distribution, ver. 1.0.1
  ○ (!) No changes in Hadoop internals
    ■ Easy to migrate
● Around 100 nodes from Grid5000
  ○ 8/24 cores, 24/32/48GB RAM per node
  ○ capacity/performance varied


Page 18: Terabyte-scale image similarity search with Hadoop

Image search workloads

Dataset:
● 110 mln images
  ○ ~30 billion SIFT descriptors
  ○ 4TB
  ○ Largest reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Also worked with a 1TB subset
  ○ Used as a distractor dataset


Page 19: Terabyte-scale image similarity search with Hadoop

Image search workloads

Queries:
● Query batches
  ○ Up to 250k query images in one batch
  ○ Batch includes original images and their distorted variants
    ■ Some variants are very hard to find
      ● e.g., print-crumple-scan
● Check if original images are returned as top votes
  ○ (out of scope) state-of-the-art search quality


Page 20: Terabyte-scale image similarity search with Hadoop

Image search workloads

Indexing workload characteristics
● computationally intensive (map phase)
● data intensive (at map & reduce phases)
● large auxiliary data structure (i.e., the index tree)
  ○ grows as the dataset grows
  ○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling


Page 21: Terabyte-scale image similarity search with Hadoop

Image search workloads

Indexing workload


Page 22: Terabyte-scale image similarity search with Hadoop

Image search workloads

Searching workload
● large auxiliary data structure (e.g., the lookup table)


Page 23: Terabyte-scale image similarity search with Hadoop

Image search workloads


● Basic settings (see the property sketch below):
  ○ 512MB HDFS block size
  ○ 3 replicas
  ○ 8 map slots
  ○ 2 reduce slots
● 4TB dataset:
  ○ 4 map slots
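For reference, a sketch of how these values map onto Hadoop 1.x configuration properties (assumed property placement, not the cluster's actual files); the per-node slot counts belong in each tasktracker's mapred-site.xml.

```java
// Sketch: the slide's settings expressed as Hadoop 1.x properties. Job-side
// values are set in code here purely for illustration.
import org.apache.hadoop.conf.Configuration;

public class ClusterSettings {
    public static Configuration baseSettings() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 512L * 1024 * 1024);  // 512MB HDFS blocks
        conf.setInt("dfs.replication", 3);                   // 3 replicas
        // Per-tasktracker slot counts (mapred-site.xml on each node):
        //   mapred.tasktracker.map.tasks.maximum    = 8  (4 for the 4TB runs)
        //   mapred.tasktracker.reduce.tasks.maximum = 2
        return conf;
    }
}
```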

Page 24: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● Some workloads require all mappers to load a large-size data structure
  ○ Like the image indexing/searching workloads
● Spreading the data file across all nodes
  ○ Hadoop DistributedCache (see the sketch below)
● Not efficient if the structure is gigabytes in size
  ○ Partial solution: increase HDFS block size → decrease #mappers
● Another approach: multithreaded mappers
  ○ Not well documented
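A minimal sketch of the DistributedCache route (assumed usage, not the talk's code): register the auxiliary file, e.g. the index tree, on the job side and pick up the local copy in the mapper's setup().

```java
// Sketch: ship a large auxiliary file to every mapper with Hadoop 1.x
// DistributedCache, then locate the node-local copy inside the task.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    // Job side: register the HDFS file; the framework copies it to each node's local disk.
    public static void addIndexTree(Job job, String hdfsPath) throws Exception {
        DistributedCache.addCacheFile(new URI(hdfsPath), job.getConfiguration());
    }

    // Task side (e.g., inside a mapper's setup()): find the local copy and load it.
    public static Path localIndexTree(Configuration conf) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        return cached[0];   // assumes the index tree is the only cached file
    }
}
```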


Page 25: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● A multithreaded mapper spawns a configured number of threads; each thread runs the map function on a share of the task's input (see the sketch below)
● Mapper threads share the RAM
● Downsides:
  ○ synchronization when reading input
  ○ synchronization when writing output
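A sketch of how a job could be wired to use MultithreadedMapper (assumed setup; IndexingMapper is the hypothetical mapper sketched earlier):

```java
// Sketch: run the application mapper with two threads per map task so the
// in-RAM auxiliary structure is loaded once per task, not once per thread.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedSetup {
    public static void configure(Job job) {
        job.setMapperClass(MultithreadedMapper.class);
        // The actual work is done by the application mapper class.
        MultithreadedMapper.setMapperClass(job, IndexingMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 2);  // two threads per map task
    }
}
```

The point is that the threads live inside one JVM, so a gigabyte-sized structure such as the index tree is held in memory once per map task rather than once per map slot.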


Page 26: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

Indexing 4TB with 4 mapper slots, each running two threads
● index tree size: 1.8GB

Indexing time on 100 nodes
● 8h27min → 6h8min


Page 27: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● In some workloads mappers require only a part of the auxiliary data structure
  ○ i.e., the part relevant to the data block being processed
  ○ Image searching workflow
● Approach: Hadoop MapFile (see the sketch below)
  ○ Very efficient
    ■ Big batches, >10000 query images
    ■ ~2 times faster on batches of around 25000 images
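A sketch of the MapFile idea (assumed usage and key/value types, not the talk's code): the lookup table is stored as a sorted, indexed MapFile so a mapper can seek to just the entries it needs instead of loading the whole table.

```java
// Sketch: read one entry of a lookup table stored as a Hadoop MapFile.
// MapFile keeps a sorted data file plus an index, so get() seeks rather than
// scanning the whole table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class LookupTableReader {
    public static Text lookup(Configuration conf, String mapFileDir, int clusterId)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
        try {
            Text value = new Text();
            // Returns the value for this cluster id, or null if absent.
            return (Text) reader.get(new IntWritable(clusterId), value);
        } finally {
            reader.close();
        }
    }
}
```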


Page 28: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration

Here is the problem:
● Apache Hadoop, v.1.0.1
● Capacity/performance of nodes varied
  ○ 8/24 cores, 24-48GB RAM, etc.
● One config file (#mappers, #reducers, max map/reduce memory, ...) for all nodes
● An issue for memory-intensive workloads!


Page 29: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration

Solution (hack):
● deploy Hadoop on all nodes with settings addressing the least-equipped nodes
● create sub-cluster configuration files adjusted to the better-equipped nodes
  ○ substitute the original config file with the new one on the better-equipped nodes
● restart tasktrackers with the new configuration files on the better-equipped nodes

Call it smart deployment
● Or is it known under another name?


Page 30: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration


Indexing 1TB on 106 nodes: 75min → 65min

Page 31: Terabyte-scale image similarity search with Hadoop

Conclusions

● Several directions for further optimization
● Presented techniques are applicable to video and audio datasets
  ○ Given a transformation into feature vectors
  ○ Only small changes expected (e.g., a new Writable)
● Hadoop smart deployment trick
● (Wanted) Best practices for Hadoop job history log analysis


Page 32: Terabyte-scale image similarity search with Hadoop

Supporting publications

Things to share


Hadoop job history logs available on request:
● Describe indexing/searching the 4TB dataset
● Insights on better analysis/visualization are welcome
● Get the cbmi13 example set at http://goo.gl/e06wE

Page 33: Terabyte-scale image similarity search with Hadoop

Supporting publications

Supporting Materials


Check full texts of our publications:
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.

Page 34: Terabyte-scale image similarity search with Hadoop

Acknowledgements


● My colleagues at INRIA Rennes

● Aalto University

● Grid5000 infrastructure

Page 35: Terabyte-scale image similarity search with Hadoop

That’s it!


Thanks!