Terabyte-scale image similarity search: experience and best practice


Description

Slides for the talk given at IEEE BigData 2013, Santa Clara, USA, on 07.10.2013. The full-text paper is available at http://goo.gl/WTJoxm. To cite, please refer to http://dx.doi.org/10.1109/BigData.2013.6691637

Transcript of Terabyte-scale image similarity search: experience and best practice

Page 1: Terabyte-scale image similarity search: experience and best practice

Terabyte-scale image similarity search: experience and best practice

Denis Shestakov
denis.shestakov at aalto.fi
LinkedIn: linkedin.com/in/dshestakov
Mendeley: mendeley.com/profiles/denis-shestakov

Diana Moise 2, Denis Shestakov 1,2, Gylfi Gudmundsson 2, Laurent Amsaleg 3

1 Department of Media Technology, School of Science, Aalto University, Finland
2 Inria Rennes – Bretagne Atlantique, France
3 IRISA - CNRS, France

Page 2

Terabyte-scale image search in Europe?

Page 3

Overview

1. Background: image retrieval, our focus, environment, etc.

2. Applying Hadoop to multimedia retrieval tasks

3. Addressing Hadoop cluster heterogeneity issue

4. Studying workloads that require a large auxiliary data structure for processing

5. Experimenting with a very large image dataset

Page 4

Image search?

Content-based image search:
● Find matches with similar content

Page 5

Image search applications?

● regular image search
● object recognition
  ○ face, logo, etc.
● for systems like Google Goggles
● augmented reality applications
● medical imaging
● analysis of astrophysics data

Page 6

Our use case

● Copyright violation detection
● Our scenario:
  ○ Searching for a batch of images
    ■ Querying for thousands of images in one run
    ■ Focus on throughput, not on response time for an individual image
● Note: the indexed dataset can be searched on a single machine with adequate disk capacity if necessary

Page 7

Image search with Hadoop

● Index & search a huge image collection using the MapReduce-based eCP algorithm
  ○ See our work at ICMR'13: Indexing and searching 100M images with MapReduce [18]
  ○ See Section III for a quick overview
● Use the Grid5000 platform
  ○ Distributed infrastructure available to French researchers & their partners
● Use the Hadoop framework

Page 8

Experimental setup: cluster

● Grid5000 platform:
  ○ Nodes in the rennes site of Grid5000
    ■ Up to 110 nodes available
    ■ Node capacity/performance varied
      ● Heterogeneous, coming from three clusters
      ● From 8 to 24 cores per node
      ● From 24GB to 48GB RAM per node

Page 9

Experimental setup: framework

● Standard Apache Hadoop distribution, ver. 1.0.1
  ○ (!) No changes in Hadoop internals
    ■ Pros: easy for others to migrate, try, and compare
    ■ Cons: not top performance
  ○ Tools provided by the Hadoop framework
    ■ Hadoop SequenceFiles
    ■ DistributedCache
    ■ multithreaded mappers
    ■ MapFiles

Page 10

Experimental setup: dataset

● 110 mln images (~30 billion SIFT descriptors)
  ○ Collected from the Web and provided by one of the partners in the Quaero project
    ■ Largest reported in the literature
  ○ Images resized to 150px on the largest side
  ○ Worked with
    ■ the whole set (~4TB)
    ■ a subset of 20 mln images (~1TB)
  ○ Used as a distractor dataset

Page 11

Experimental setup: querying

● For evaluation of indexing quality:
  ○ Added to the distractor dataset:
    ■ INRIA Copydays (127 images)
  ○ Queried for:
    ■ Copydays batch (~3000 images = 127 original images and their associated variants, incl. strong distortions, e.g. print-crumple-scan)
    ■ 12k batch (~12000 images = 245 random images from the dataset and their variants)
    ■ 25k batch
  ○ Checked if the original images were returned as the top-voted search results
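The check in the last bullet can be sketched in a few lines of Python (a toy illustration with made-up result lists; `top_voted_accuracy` is a hypothetical helper, not the authors' evaluation code):

```python
def top_voted_accuracy(results, variant_to_original):
    """Fraction of query variants whose original image is the top-voted result.

    results: maps each query (variant) image id to a list of result image
             ids, ordered by vote count (best first).
    variant_to_original: maps each variant id to its original's id.
    """
    hits = sum(
        1 for variant, ranked in results.items()
        if ranked and ranked[0] == variant_to_original[variant]
    )
    return hits / len(results)

# Toy check: two of three distorted queries return their original first.
results = {
    "v1": ["orig1", "x"],   # hit
    "v2": ["x", "orig2"],   # miss: original ranked second
    "v3": ["orig3"],        # hit
}
originals = {"v1": "orig1", "v2": "orig2", "v3": "orig3"}
accuracy = top_voted_accuracy(results, originals)   # 2/3
```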

Page 12

Image search with Hadoop

Distributed index creation
● Clustering images into a large set of clusters (max cluster size = 5000)
● Mapper input:
  ○ unsorted SIFT descriptors
  ○ index tree (loaded by every mapper)
● Mapper output:
  ○ (cluster_id, SIFT)
● Reducer output:
  ○ SIFTs sorted by cluster_id

Page 13

Image search with Hadoop

Indexing workload characteristics
● computationally intensive (map phase)
● data-intensive (in map & reduce phases)
● large auxiliary data structure (i.e., the index tree)
  ○ grows as the dataset grows
  ○ e.g., 1.8GB for 110M images (4TB)
● map input < map output
● network is heavily utilized during shuffling

Page 14

Image search with Hadoop

Page 15

Image search with Hadoop

Searching workflow
● large auxiliary data structure (e.g., lookup table)

Page 16

Index search with Hadoop: results

● Basic settings:
  ○ 512MB chunk size
  ○ 3 replicas
  ○ 8 map slots
  ○ 2 reduce slots
● 4TB dataset:
  ○ 4 map slots

Page 17

Hadoop on heterogeneous clusters

● Capacity/performance of nodes in our cluster varied
  ○ Nodes come from three clusters
  ○ From 8 to 24 cores per node
  ○ From 24GB to 48GB RAM per node
  ○ Different CPU speeds
● Hadoop assumes one configuration (#mappers, #reducers, max map/reduce memory, ...) for all nodes
● Not good for Hadoop clusters like ours

Page 18

Hadoop on heterogeneous clusters

● Our solution (hack):
  ○ deploy Hadoop on all nodes with settings addressing the least-equipped nodes
  ○ create sub-cluster configuration files adjusted to the better-equipped nodes
  ○ restart tasktrackers with the new configuration files on the better-equipped nodes
● We call it ‘smart deployment’
● Considerations:
  ○ Perhaps the rack-awareness feature of Hadoop should be complemented with smart-deployment functionality
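The per-sub-cluster configuration step could be scripted roughly as below. The property names are standard Hadoop 1.x TaskTracker settings; the slot-sizing formulas are illustrative assumptions, not the authors' actual values:

```python
def tasktracker_conf(cores, ram_gb, mapper_mem_gb=2):
    """Derive TaskTracker slot settings from a node's specs.

    The baseline deployment targets the least-equipped nodes; nodes with
    more cores/RAM get a second config like this one and restart their
    TaskTracker with it. (Hedge: the formulas are illustrative.)
    """
    map_slots = min(cores, ram_gb // mapper_mem_gb)
    return {
        "mapred.tasktracker.map.tasks.maximum": map_slots,
        "mapred.tasktracker.reduce.tasks.maximum": max(1, map_slots // 4),
    }

small = tasktracker_conf(cores=8, ram_gb=24)    # least-equipped nodes
big = tasktracker_conf(cores=24, ram_gb=48)     # better-equipped nodes
```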

Page 19

Hadoop on heterogeneous clusters

● Results
  ○ indexing 1TB on 106 nodes: 75 min → 65 min

Page 20

Large auxiliary data structure

● Some workloads require all mappers to load a large data structure
  ○ E.g., both the image indexing and searching workloads
● Spreading the data file across all nodes:
  ○ Hadoop DistributedCache
● Not efficient if the structure is gigabytes in size
● Partial solution: increase HDFS block size → decrease #mappers
● Another solution: multithreaded mappers provided by Hadoop
  ○ A poorly documented feature!

Page 21

Large auxiliary data structure

● A multithreaded mapper spawns a configured number of threads; each thread executes a map task
● Mapper threads share the RAM
● Downsides:
  ○ synchronization when reading input
  ○ synchronization when writing output

Page 22

Large auxiliary data structure

● Let’s test it!
● Indexing 4TB with 4 map slots, each running 2 threads
  ○ index tree size: 1.8GB
● Indexing time: 8h27min → 6h8min

Page 23

Large auxiliary data structure

● In some applications, mappers need only the part of the auxiliary data structure relevant to the data block being processed
● Solution: Hadoop MapFile
  ○ See Section 5.C.2
  ○ Searching for 3-25k image batches
  ○ Though the results are rather inconclusive
● Stay tuned!
  ○ A proper study of MapFile is now in progress

Page 24

Open questions

● A practical one:
  ○ What are best practices for the analysis of Hadoop job execution logs?
● Analysis of Hadoop job logs happened to be very useful in our project
  ○ Done with our own Python/Perl scripts
● It is extremely useful for understanding and then tuning Hadoop jobs on large Hadoop clusters
● Any good existing libraries/tools?
  ○ E.g., the Starfish Hadoop Log Analyzer (Duke Univ.)
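As an example of the kind of ad-hoc scripting involved, the snippet below extracts per-task durations from job-history-style records (the line format here is simplified and hypothetical; real Hadoop 1.x history files use a denser key=value format with many counters):

```python
import re

# Hypothetical, simplified history lines: task id, start/finish in ms.
log = """\
MapAttempt TASKID="task_0001_m_000001" START_TIME="1000" FINISH_TIME="4000"
MapAttempt TASKID="task_0001_m_000002" START_TIME="1500" FINISH_TIME="9500"
"""

pattern = re.compile(r'TASKID="([^"]+)" START_TIME="(\d+)" FINISH_TIME="(\d+)"')
durations = {
    tid: (int(end) - int(start)) / 1000.0       # seconds per task
    for tid, start, end in pattern.findall(log)
}
slowest = max(durations, key=durations.get)     # spot straggler tasks
```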

Page 25

Open questions

E.g., search (12k batch over 1TB) job execution on 100 nodes

Page 26

Observations & implications

● HDFS block size limits scalability
  ○ 1TB dataset => 1186 blocks of 1024MB size
  ○ Assuming 8-core nodes and the reported searching method: no scaling after 149 nodes (i.e. 8x149=1192)
  ○ Solutions:
    ■ Smaller HDFS blocks, e.g., scaling up to 280 nodes with 512MB blocks
    ■ Re-visit the search process: e.g., partial loading of the lookup table
● Big data is here, but not the resources to process it
  ○ E.g., indexing & searching >10TB was not possible given the resources we had
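The scaling ceiling above is simple arithmetic: once there are fewer HDFS blocks than available map slots, additional nodes get no work. A quick check against the slide's figures:

```python
import math

def max_useful_nodes(n_blocks, map_slots_per_node):
    """Node count beyond which some map slots receive no block to process."""
    return math.ceil(n_blocks / map_slots_per_node)

# 1TB dataset => 1186 blocks of 1024MB; 8 map slots per node:
ceiling = max_useful_nodes(1186, 8)   # 149 nodes, matching the slide
# Halving the block size roughly doubles the block count and hence the
# ceiling (the slide reports scaling up to ~280 nodes with 512MB blocks).
```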

Page 27

Things to share

● Our methods/system can be applied to audio datasets
  ○ No major changes expected
  ○ Contact me/Diana if interested
● Code for the MapReduce-eCP algorithm is available on request
  ○ Should run smoothly on your Hadoop cluster
  ○ Interested in comparisons
● Hadoop job history logs behind our experiments are available on request
  ○ They describe indexing/searching our dataset with details on map/reduce task execution
  ○ Insights on better analysis/visualization are welcome
  ○ E.g., job logs supporting our CBMI'13 work: http://goo.gl/e06wE

Page 28

Acknowledgements

● Aalto University http://www.aalto.fi

● Quaero project http://www.quaero.org

● Grid5000 infrastructure & its Rennes maintenance team http://www.grid5000.fr

Page 29

Supporting publications

[18] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Indexing and searching 100M images with Map-Reduce. In Proc. ACM ICMR'13, 2013.

[20] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing with Hadoop. In Proc. CBMI'13, 2013.

[this-bigdata13] D. Moise, D. Shestakov, G. Gudmundsson, L. Amsaleg. Terabyte-scale image similarity search: experience and best practice. In Proc. IEEE BigData'13, 2013.

[submitted] D. Shestakov, D. Moise, G. Gudmundsson, L. Amsaleg. Scalable high-dimensional indexing and searching with Hadoop.

Page 30

Thank you!