Terabyte-scale image similarity search with Hadoop

Denis Shestakov. Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014

Description

Talk given at Hadoop Summit Europe 2014, Amsterdam, Netherlands on 02.04.2014.

Talk abstract: In this talk I focus on a specific Hadoop application, image similarity search, and present our experience in designing, building and testing a Hadoop-based image similarity search system that scales to terabyte-sized image collections. I start by reviewing how to adapt image retrieval techniques to the MapReduce model. Second, I describe the image indexing and searching workloads and show how these workflows are rather atypical for Hadoop; e.g., I explain how to tune Hadoop to fit such computational tasks and specify the parameters and values that deliver the best performance. Next I present the Hadoop cluster heterogeneity problem and describe a solution to it based on a platform-aware Hadoop configuration. Then I introduce the tools, provided by the standard Apache Hadoop framework, that are useful for a large class of workloads similar to ours, where a large auxiliary data structure is required for processing the dataset. Finally, I give an overview of a series of experiments conducted on a four-terabyte image dataset (the largest reported in the academic literature). The findings are shared as best practices and recommendations for practitioners working with huge multimedia collections.

Speaker: Dr. Denis Shestakov is an experienced researcher in the area of big data engineering and, more recently, a practitioner working as a Hadoop/MapReduce consultant. Denis has been involved in various big data projects in web analytics and search, multimedia search and bioinformatics. See his profile on LinkedIn: http://fi.linkedin.com/in/dshestakov/

Transcript of Terabyte-scale image similarity search with Hadoop

Page 1: Terabyte-scale image similarity search with Hadoop

Terabyte-scale image similarity search with Hadoop

Denis Shestakov

Hadoop Summit Europe 2014, Amsterdam, Netherlands, 02.04.2014

Page 2: Terabyte-scale image similarity search with Hadoop

About me

● Big Data researcher/engineer
  ○ recent projects: large-scale image retrieval
  ○ before: web crawling

● Hadoop/MapReduce contractor
  ○ design/development/tuning of Hadoop applications

Denis Shestakov, denshe at gmail.com

linkedin: linkedin.com/in/dshestakov

Page 3: Terabyte-scale image similarity search with Hadoop

Talk Outline

● Intro to image search
● Image retrieval with MapReduce
● Image indexing/searching workloads
● Hadoop tools for large joins
● Smart Hadoop configuration
● Misc & conclusions


Page 4: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

● Finding images given a text query
  ○ e.g., "dog" → [images of dogs]


Page 5: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

● Finding images given an image
  ○ by content similarity


Page 6: Terabyte-scale image similarity search with Hadoop

Image Search Applications

● Regular image search
  ○ Google Images, Bing Images, TinEye, etc.
● Product search (by image)
● Object recognition
  ○ face, logo, vehicle, etc.
● Computer vision
● Augmented reality
● Medical imaging
● Astrophysics


Page 7: Terabyte-scale image similarity search with Hadoop

Intro to Image Search


Page 8: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?
● Images resized to a smaller size
● Then transformed to the chosen feature-description representation
  ○ image → set of feature descriptors (= high-dimensional vectors)
  ○ Many transformations exist
    ■ SIFT (Scale-Invariant Feature Transform) used by us


Page 9: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?

Typically several hundred feature descriptors per image:


image_id | SIFT descriptor
10011    | 21, 143, 5, …, 201, 186
10011    | 121, 14, 75, …, 20, 109
10011    | 37, 40, 0, …, 213, 96
...      | ...
10011    | 81, 235, 67, …, 102, 63

Page 10: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

How does it work?
● Compare feature descriptors of a query image (e.g., by calculating Euclidean distance) with the descriptors of images in the collection to search
● Images with the ‘closest’ descriptors are similar to the query image
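As an illustration (not code from the talk), here is a minimal Java sketch of this comparison: descriptors are treated as plain int arrays and matched brute-force by Euclidean distance. All names are illustrative.

```java
// Minimal sketch (illustrative only): brute-force matching of one query
// descriptor against a collection of descriptors by Euclidean distance.
public class BruteForceMatch {

    // Squared Euclidean distance between two descriptors of equal length.
    static double squaredDistance(int[] a, int[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Returns the index of the 'closest' descriptor in the collection.
    static int nearest(int[] query, int[][] collection) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int i = 0; i < collection.length; i++) {
            double d = squaredDistance(query, collection[i]);
            if (d < bestDist) {
                bestDist = d;
                best = i;
            }
        }
        return best;
    }
}
```

This brute-force scan is exactly what becomes too expensive as the collection grows, which is what the index structures on the next slides avoid.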


Page 11: Terabyte-scale image similarity search with Hadoop

Intro to Image Search

Why MapReduce?
● Direct comparisons of descriptors are costly even for very small collections
● Lots of approaches to ‘organize’ feature descriptors for fast search
  ○ Build an index
  ○ Index all the descriptors
  ○ At search, check query descriptors only against certain groups of descriptors


Page 12: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Why MapReduce?
● Existing approaches are poorly scalable
  ○ up to ~10-20 mln images
● But multimedia grows exponentially
● Scaling is required…


Page 13: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Use case:
● Copyright violation detection in a large image databank
  ○ >100 mln images
● Searching for a batch of images
  ○ Thousands of images in one query
  ○ Focus on throughput, not on response time for an individual image
● SIFT features


Page 14: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Indexing images
● Generating the index tree
● Clustering images into a large set of clusters (max cluster size = 5000); see the mapper sketch below
  ○ Mapper input:
    ■ unsorted SIFT descriptors
    ■ index tree (loaded by every mapper)
  ○ Mapper output:
    ■ (cluster_id, SIFT)
  ○ Reducer output:
    ■ SIFTs sorted by cluster_id
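A minimal sketch of what such an indexing mapper could look like (my illustration, not the talk's code): it assumes descriptors arrive as text lines of the form image_id<TAB>v1,v2,..., simplifies the index tree to a flat array of centroids, and elides how that array is loaded.

```java
// Sketch of an indexing mapper: assign each SIFT descriptor to its nearest
// cluster and emit (cluster_id, SIFT record). Input format and the flat
// centroid array are simplifying assumptions.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IndexingMapper extends Mapper<LongWritable, Text, IntWritable, Text> {

    private int[][] centroids;                 // stands in for the index tree
    private final IntWritable clusterId = new IntWritable();

    @Override
    protected void setup(Context context) {
        // The real job ships the index tree (1.8GB for the 4TB dataset) to
        // every mapper; loading it is elided in this sketch.
        centroids = new int[0][];
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        String[] dims = fields[1].split(",");
        int[] sift = new int[dims.length];
        for (int i = 0; i < dims.length; i++) {
            sift[i] = Integer.parseInt(dims[i].trim());
        }
        clusterId.set(nearestCentroid(sift));  // assign the descriptor to a cluster
        context.write(clusterId, value);       // emit (cluster_id, SIFT record)
    }

    // Brute-force nearest-centroid scan, as in the earlier distance sketch.
    private int nearestCentroid(int[] sift) {
        int best = 0;
        long bestDist = Long.MAX_VALUE;
        for (int c = 0; c < centroids.length; c++) {
            long dist = 0;
            for (int i = 0; i < sift.length; i++) {
                long d = sift[i] - centroids[c][i];
                dist += d * d;
            }
            if (dist < bestDist) { bestDist = dist; best = c; }
        }
        return best;
    }
}
```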


Page 15: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

Searching
● Generating the lookup table
  ○ indexing query SIFTs
● Finding best matches for query SIFTs (see the reducer sketch below)
  ○ Mapper input:
    ■ sorted SIFT descriptors
    ■ lookup table (loaded by every mapper)
  ○ Mapper output:
    ■ (query-sift-id, knn of image-ids)
  ○ Reducer output:
    ■ best votes (image-ids) for query-image-id
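A sketch of the vote-counting reduce side under simplifying assumptions (mine, not the talk's code): the map output is re-keyed by query image id and each value is a single candidate image id voted for by one of the query's descriptors.

```java
// Sketch: count, per query image, how often each collection image was voted
// for by the query's descriptors, and emit the best-voted image.
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class VoteCountingReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text queryImageId, Iterable<Text> candidateImageIds,
                          Context context) throws IOException, InterruptedException {
        // Tally votes for each candidate collection image.
        Map<String, Integer> votes = new HashMap<String, Integer>();
        for (Text candidate : candidateImageIds) {
            String id = candidate.toString();
            Integer n = votes.get(id);
            votes.put(id, n == null ? 1 : n + 1);
        }
        // Emit the image with the most votes as the best match for this query image.
        String best = null;
        int bestVotes = -1;
        for (Map.Entry<String, Integer> e : votes.entrySet()) {
            if (e.getValue() > bestVotes) {
                bestVotes = e.getValue();
                best = e.getKey();
            }
        }
        context.write(queryImageId, new Text(best + "\t" + bestVotes));
    }
}
```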


Page 16: Terabyte-scale image similarity search with Hadoop

Image Retrieval with MapReduce

In a nutshell:
● Indexing phase
  ○ Clustering SIFTs with one-pass k-means
● Searching phase
  ○ Map-side join of clustered SIFTs and the lookup table (query SIFTs)


Page 17: Terabyte-scale image similarity search with Hadoop

Image search workloads

Time to discuss Hadoop specifics:
● Standard Apache Hadoop distribution, ver. 1.0.1
  ○ (!) No changes in Hadoop internals
    ■ Easy to migrate
● Around 100 nodes from Grid5000
  ○ 8/24 cores, 24/32/48GB RAM per node
  ○ capacity/performance varied


Page 18: Terabyte-scale image similarity search with Hadoop

Image search workloads

Dataset:
● 110 mln images
  ○ ~30 billion SIFT descriptors
  ○ 4TB
  ○ Largest reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Also worked with a 1TB subset
  ○ Used as a distractor dataset


Page 19: Terabyte-scale image similarity search with Hadoop

Image search workloads

Queries:
● Query batches
  ○ Up to 250k query images in one batch
  ○ Batch includes original images and their distorted variants
    ■ Some variants are very hard to find
      ● e.g., print-crumple-scan
● Check if original images are returned as top votes
  ○ (out of scope) state-of-the-art search quality


Page 20: Terabyte-scale image similarity search with Hadoop

Image search workloads

Indexing workload characteristics
● computationally intensive (map phase)
● data intensive (at map & reduce phases)
● large auxiliary data structure (i.e., the index tree)
  ○ grows as the dataset grows
  ○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling


Page 21: Terabyte-scale image similarity search with Hadoop

Image search workloads

Indexing workload


Page 22: Terabyte-scale image similarity search with Hadoop

Image search workloads

Searching workload
● large auxiliary data structure (e.g., the lookup table)


Page 23: Terabyte-scale image similarity search with Hadoop

Image search workloads


● Basic settings (see the property sketch below):
  ○ 512MB HDFS block size
  ○ 3 replicas
  ○ 8 map slots
  ○ 2 reduce slots
● 4TB dataset:
  ○ 4 map slots
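For reference, a sketch of how these values map onto Hadoop 1.x configuration properties (assumed property placement, not the cluster's actual files); the per-node slot counts belong in each tasktracker's mapred-site.xml.

```java
// Sketch: the slide's settings expressed as Hadoop 1.x properties. Job-side
// values are set in code here purely for illustration.
import org.apache.hadoop.conf.Configuration;

public class ClusterSettings {
    public static Configuration baseSettings() {
        Configuration conf = new Configuration();
        conf.setLong("dfs.block.size", 512L * 1024 * 1024);  // 512MB HDFS blocks
        conf.setInt("dfs.replication", 3);                   // 3 replicas
        // Per-tasktracker slot counts (mapred-site.xml on each node):
        //   mapred.tasktracker.map.tasks.maximum    = 8  (4 for the 4TB runs)
        //   mapred.tasktracker.reduce.tasks.maximum = 2
        return conf;
    }
}
```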

Page 24: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● Some workloads require all mappers to load a large-size data structure
  ○ Like the image indexing/searching workloads
● Spreading the data file across all nodes
  ○ Hadoop DistributedCache (see the sketch below)
● Not efficient if the structure is gigabytes in size
  ○ Partial solution: increase HDFS block size → decrease #mappers
● Another approach: multithreaded mappers
  ○ Not well documented
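A minimal sketch of the DistributedCache route (assumed usage, not the talk's code): register the auxiliary file, e.g. the index tree, on the job side and pick up the local copy in the mapper's setup().

```java
// Sketch: ship a large auxiliary file to every mapper with Hadoop 1.x
// DistributedCache, then locate the node-local copy inside the task.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;

public class CacheSetup {
    // Job side: register the HDFS file; the framework copies it to each node's local disk.
    public static void addIndexTree(Job job, String hdfsPath) throws Exception {
        DistributedCache.addCacheFile(new URI(hdfsPath), job.getConfiguration());
    }

    // Task side (e.g., inside a mapper's setup()): find the local copy and load it.
    public static Path localIndexTree(Configuration conf) throws Exception {
        Path[] cached = DistributedCache.getLocalCacheFiles(conf);
        return cached[0];   // assumes the index tree is the only cached file
    }
}
```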


Page 25: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● A multithreaded mapper spawns a configured number of threads; each thread runs the map function on a share of the task's input (see the sketch below)
● Mapper threads share the RAM
● Downsides:
  ○ synchronization when reading input
  ○ synchronization when writing output
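A sketch of how a job could be wired to use MultithreadedMapper (assumed setup; IndexingMapper is the hypothetical mapper sketched earlier):

```java
// Sketch: run the application mapper with two threads per map task so the
// in-RAM auxiliary structure is loaded once per task, not once per thread.
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.map.MultithreadedMapper;

public class MultithreadedSetup {
    public static void configure(Job job) {
        job.setMapperClass(MultithreadedMapper.class);
        // The actual work is done by the application mapper class.
        MultithreadedMapper.setMapperClass(job, IndexingMapper.class);
        MultithreadedMapper.setNumberOfThreads(job, 2);  // two threads per map task
    }
}
```

The point is that the threads live inside one JVM, so a gigabyte-sized structure such as the index tree is held in memory once per map task rather than once per map slot.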


Page 26: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

Indexing 4TB with 4 mapper slots, each running two threads
● index tree size: 1.8GB

Indexing time on 100 nodes
● 8h27min → 6h8min


Page 27: Terabyte-scale image similarity search with Hadoop

Hadoop tools for large joins

● In some workloads mappers require only a part of the auxiliary data structure
  ○ i.e., the part relevant to the data block being processed
  ○ Image searching workflow
● Approach: Hadoop MapFile (see the sketch below)
  ○ Very efficient
    ■ Big batches, >10000 query images
    ■ ~2 times faster on batches of around 25000 images
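A sketch of the MapFile idea (assumed usage and key/value types, not the talk's code): the lookup table is stored as a sorted, indexed MapFile so a mapper can seek to just the entries it needs instead of loading the whole table.

```java
// Sketch: read one entry of a lookup table stored as a Hadoop MapFile.
// MapFile keeps a sorted data file plus an index, so get() seeks rather than
// scanning the whole table.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;

public class LookupTableReader {
    public static Text lookup(Configuration conf, String mapFileDir, int clusterId)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        MapFile.Reader reader = new MapFile.Reader(fs, mapFileDir, conf);
        try {
            Text value = new Text();
            // Returns the value for this cluster id, or null if absent.
            return (Text) reader.get(new IntWritable(clusterId), value);
        } finally {
            reader.close();
        }
    }
}
```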


Page 28: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration

Here is the problem:
● Apache Hadoop, v.1.0.1
● Capacity/performance of nodes varied
  ○ 8/24 cores, 24-48GB RAM, etc.
● One config file (#mappers, #reducers, max map/reduce memory, ...) for all nodes
● An issue for memory-intensive workloads!


Page 29: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration

Solution (hack):
● deploy Hadoop on all nodes with settings addressing the least-equipped nodes
● create sub-cluster configuration files adjusted to the better-equipped nodes
  ○ substitute the original config file with the new one on the better-equipped nodes
● restart tasktrackers with the new configuration files on the better-equipped nodes

Call it smart deployment
● Or is it known under another name?


Page 30: Terabyte-scale image similarity search with Hadoop

Smart Hadoop configuration


Indexing 1TB on 106 nodes: 75min → 65min

Page 31: Terabyte-scale image similarity search with Hadoop

Conclusions

● Several directions for further optimization
● Presented techniques are applicable to video and audio datasets
  ○ Given a transformation into feature vectors
  ○ Only small changes expected (e.g., a new Writable)
● Hadoop smart deployment trick
● (Wanted) Best practices for Hadoop job history log analysis


Page 32: Terabyte-scale image similarity search with Hadoop

Supporting publications

Things to share


Hadoop job history logs available on request:
● Describe indexing/searching the 4TB dataset
● Insights on better analysis/visualization are welcome
● Get the cbmi13 example set at http://goo.gl/e06wE

Page 33: Terabyte-scale image similarity search with Hadoop

Supporting publications

Supporting Materials


Check full texts of our publications:
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.
● D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.
● D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.

Page 34: Terabyte-scale image similarity search with Hadoop

Acknowledgements


● My colleagues at INRIA Rennes

● Aalto University

● Grid5000 infrastructure

Page 35: Terabyte-scale image similarity search with Hadoop

That’s it!


Thanks!