Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)


This presentation covers how Flickr recently brought scalable computer vision and deep learning to its multi-billion-photo collection using Apache Hadoop and Storm. Video of the presentation is available here: https://www.youtube.com/watch?v=OrhAbZGkW8k From the May 14, 2014 San Francisco Hadoop User Group meeting. For more information on the SFHUG see http://www.meetup.com/hadoopsf/ If you'd like to work on interesting challenges like this at Flickr in San Francisco, please see http://www.flickr.com/jobs

Transcript of Flickr: Computer vision at scale with Hadoop and Storm (Huy Nguyen)

Presented by: Huy Nguyen @huyng

Computer vision at scale with Hadoop and Storm

Copyright © 2014 Yahoo, Inc.

Outline

▪ Computer vision at 10,000 feet

▪ Hadoop training system

▪ Storm continuous stream processing

▪ Hadoop batch processing

▪ Q&A

How can we help Flickr users better organize, search, and browse their photo collections?

photo credit: Quinn Dombrowski

What is AutoTagging?

Any photo you upload to Flickr is now automatically classified using computer vision software

Class      Probability
Flowers    0.98
Outdoors   0.95
Cat        0.001
Grass      0.6

Classifiers

Demo

Auto Tags Feeding Search

Our Goals

▪ Train thousands of classifiers

▪ Classify millions of photos per day

▪ Compute auto tags on backlog of billions of photos

▪ Enable reprocessing bi-weekly and every quarter

▪ Do it all within 90 days of being acquired!

Our system for training classifiers

▪ High level overview of image classification

▪ Hadoop workflow

Classification in a nutshell

How do we define a classifier?

A classifier is simply the line z = w·x − b which divides positive and negative examples with the largest margin. It is entirely defined by w and b.

Training is simply adjusting w and b to find the line with the largest margin dividing the negative and positive example data.

Classification is done by computing z, given some data point x.
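
To make the decision rule concrete, here is a minimal sketch in Python with NumPy; the weight values are illustrative, not from a real model:

import numpy as np

# Illustrative values; in practice w and b come out of training.
w = np.array([0.7, -1.2, 0.4])   # weight vector defining the line
b = 0.5                          # bias term

def classify(x):
    """Return True if x falls on the positive side of the hyperplane."""
    z = np.dot(w, x) - b
    return z > 0

print(classify(np.array([1.0, 0.1, 2.0])))   # z = 0.88, prints True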

Image classification in a nutshell

To classify images we need to convert raw pixels into a “feature” to plug into x

raw pixels → compute feature → x

How are image classifiers made?

Class name: flower

Positive examples (approx. 10K)

Negative examples (approx. 100K)

compute features → train → classifier (w, b)

Repeat a few thousand times, once for each class

Hadoop Classifier Training

HDFS

Pixel Data + Label Data

Pixel Data (photo_id, pixel_data):
1 <base64>
2 <base64>
3 <base64>
4 <base64>
..
n <base64>

Label Data (class, photo_id, label):
dog 1 TRUE
dog 2 FALSE
dog 3 TRUE
cat 1 FALSE
cat 2 TRUE
cat 3 FALSE
..

Hadoop Classifier Training - JOIN

Pixels and Labels flow through map → reduce to produce joined records:

1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat, TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
..

▪ Mappers emit either (photo_id, pixels) or (photo_id, class, label)
▪ Reducers take advantage of the sort to emit (photo_id, pixels, json(label data)); see the sketch below
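
A minimal sketch of that join as a Hadoop Streaming job in Python. The record-type tags ("P"/"L") and the file formats are assumptions for illustration, not Flickr's actual code:

# join_mapper.py: key every record by photo_id, tagging its type
import sys

for line in sys.stdin:
    parts = line.rstrip("\n").split("\t")
    if len(parts) == 2:                      # pixel record: photo_id, base64
        photo_id, pixels = parts
        print("%s\tP\t%s" % (photo_id, pixels))
    else:                                    # label record: class, photo_id, label
        cls, photo_id, label = parts
        print("%s\tL\t%s\t%s" % (photo_id, cls, label))

# join_reducer.py: input arrives sorted by photo_id, so each photo's pixel
# record and label records are adjacent and can be merged in a single pass
import sys, json
from itertools import groupby

def keyed(stream):
    for line in stream:
        fields = line.rstrip("\n").split("\t")
        yield fields[0], fields[1:]

for photo_id, records in groupby(keyed(sys.stdin), key=lambda kv: kv[0]):
    pixels, labels = None, []
    for _, rest in records:
        if rest[0] == "P":
            pixels = rest[1]
        else:
            labels.append([rest[1], rest[2] == "TRUE"])
    if pixels is not None:
        print("%s\t%s\t%s" % (photo_id, pixels, json.dumps(labels)))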

Hadoop Classifier Training - TRAIN

1 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
2 <pixelsb64> [(cat, TRUE), (dog, FALSE)]
3 <pixelsb64> [(cat, FALSE), (dog, TRUE)]
..

map (compute feature) → reduce (train classifier) → dog clf., cat clf.

▪ Mappers compute features and emit a (class, feat64, label) tuple for each class label associated with the image
▪ Reducers take advantage of the sort to train one classifier per class, emitting (class, w, b); see the sketch below
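
A sketch of the training step in the same Streaming style. The feature extractor and trainer here are toy stand-ins (a hash of the pixels and scikit-learn's LinearSVC); the real system uses its own feature extractor and classifier trainer:

# train_mapper.py: join output in, one (class, feat64, label) record out per label
import sys, json, base64, hashlib

def compute_feature(pixels_b64):
    # Toy stand-in for the real feature extractor.
    return base64.b64encode(hashlib.md5(base64.b64decode(pixels_b64)).digest()).decode()

for line in sys.stdin:
    photo_id, pixels_b64, labels_json = line.rstrip("\n").split("\t")
    feat64 = compute_feature(pixels_b64)
    for cls, label in json.loads(labels_json):
        print("%s\t%s\t%s" % (cls, feat64, label))

# train_reducer.py: records arrive sorted by class, so each reducer gathers
# one class's examples, fits a linear model, and emits (class, w, b)
import sys, json, base64
from itertools import groupby
import numpy as np
from sklearn.svm import LinearSVC    # illustrative trainer only

def rows(stream):
    for line in stream:
        cls, feat64, label = line.rstrip("\n").split("\t")
        x = np.frombuffer(base64.b64decode(feat64), dtype=np.uint8)
        yield cls, x.astype(np.float32), label == "True"

for cls, group in groupby(rows(sys.stdin), key=lambda r: r[0]):
    X, y = zip(*[(x, lab) for _, x, lab in group])
    model = LinearSVC().fit(np.vstack(X), list(y))
    # note: scikit-learn scores with w.x + intercept, so b = -intercept
    # in the slides' z = w.x - b convention
    print("%s\t%s\t%s" % (cls, json.dumps(model.coef_[0].tolist()),
                          model.intercept_[0]))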

Hadoop Classifier Training

▪ Workflow coordinated with Oozie

Hadoop Classifier Training

▪ 10,000 mappers
▪ 1,000 reducers
▪ A few thousand classifiers in less than 6 hours
▪ Replaced an OpenMPI-based system:
▪ Easier deployment
▪ Access to more hardware
▪ Much simpler code

Areas for Improvement

▪ Reducers hold all training data in memory
▪ A potential bottleneck if we want to expand the training set for any given set of classifiers
▪ Known path to avoiding this bottleneck:
▪ Incrementally update models as new data arrives

Our system for tagging the Flickr corpus

▪ Batch processing with Hadoop

▪ Stream processing with Storm

Batch processing in Hadoop

www → HDFS → Autotags Mappers → HDFS → Indexing Reducer → Search Engine Indexer

Hadoop Auto tagging

HDFS

Pixel Data (photo_id, pixel_data):
1 <base64>
2 <base64>
3 <base64>
4 <base64>
..
n <base64>

Autotag Results (photo_id, results):
1 { "cat": 0.98, "dog": 0.3 }
2 { "cat": 0.97, "dog": 0.2 }
3 { "cat": 0.2, "dog": 0.3 }
4 { "cat": 0.5, "dog": 0.9 }
..
n { "cat": 0.90, "dog": 0.23 }
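
A sketch of the map-only tagging pass, again with toy stand-ins. The key point from the slides is preserved: models are loaded once per mapper process, and each photo is scored against every classifier:

# autotag_mapper.py: map-only job, one JSON score line per photo
import sys, json, base64, hashlib
import numpy as np

def compute_feature(pixels_b64):
    # Toy stand-in for the real feature extractor (must match training).
    d = hashlib.md5(base64.b64decode(pixels_b64)).digest()
    return np.frombuffer(d, dtype=np.uint8).astype(np.float32)

def load_classifiers():
    # Stand-in: in reality the (w, b) pairs come from the training job's output.
    return {"cat": (np.ones(16, np.float32), 2000.0),
            "dog": (np.ones(16, np.float32), 1000.0)}

CLASSIFIERS = load_classifiers()     # loaded once per mapper, not per photo

for line in sys.stdin:
    photo_id, pixels_b64 = line.rstrip("\n").split("\t")
    x = compute_feature(pixels_b64)
    # squash the raw margin z = w.x - b into (0, 1) to mimic the
    # probability-style output shown above (an assumption)
    scores = {cls: float(1.0 / (1.0 + np.exp(-(np.dot(w, x) - b))))
              for cls, (w, b) in CLASSIFIERS.items()}
    print("%s\t%s" % (photo_id, json.dumps(scores)))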

Hadoop classification performance

▪ 10,000 mappers
▪ Processed the entire corpus in about 1 week

Main challenge: packaging our code

▪ Heterogeneous Python & C++ codebase

▪ How do we get all of the shared-library dependencies into a heterogeneous system distributed across many machines?

Main challenge: packaging our code

▪ Tried PyInstaller; it didn't work

▪ Rolled our own solution using strace (see the sketch below):

/autotags_package
  /lib
  /lib/python2.7/site-packages
  /bin
  /app
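
The slides don't show the strace recipe itself, so this is only a sketch of the idea: run the tagger once under strace, harvest every shared library it successfully opens, and copy those into the package. The paths and the ./autotags command are hypothetical:

# trace_deps.py: collect the shared libraries a binary actually loads
import os, re, shutil, subprocess

# Run one representative invocation under strace, logging open() calls.
cmd = ["strace", "-f", "-e", "trace=open,openat", "./autotags", "test.jpg"]
trace = subprocess.run(cmd, stderr=subprocess.PIPE, text=True).stderr

dest = "autotags_package/lib"
os.makedirs(dest, exist_ok=True)

# Keep only .so files whose open() succeeded (returned a file descriptor).
for m in re.finditer(r'open(?:at)?\(.*?"([^"]+\.so[^"]*)".*?=\s*\d', trace):
    shutil.copy(m.group(1), dest)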

Stream processing (our first attempt)

www → Redis → PHP Workers → fork "autotags a.jpg" → search db

Why we started looking into Storm

Stage                Time
Classifier Load      300 ms
Feature Computation  300 ms
Classification       10 ms

Classifier loading accounted for 49% of total processing time

Solutions we evaluated at the time

▪ Mem-mapping the model

▪ Batching up operations

▪ Daemonize our autotagging code
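
For a sense of the first option, memory-mapping the model could look like this sketch (file name and shape are illustrative): the weights are paged in on demand and shared through the OS page cache, so repeated worker startups stop paying the 300 ms load.

import numpy as np

# Map the weight matrix straight from disk instead of deserializing it;
# after the first worker touches it, later loads hit the page cache.
w = np.memmap("models/autotags.weights", dtype=np.float32,
              mode="r", shape=(10000, 4096))

As the next slides show, the team ultimately went the daemonize route, with long-lived Storm bolts holding the loaded classifiers.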

Enter Storm

Storm computational framework

▪ Spouts: the source of your data stream. Spouts "emit" tuples.

▪ Tuples: an atomic unit of work.

▪ Bolts: small programs that process your tuple stream.

▪ Topology: a graph of bolts and spouts.

Defining a topology

TopologyBuilder builder = new TopologyBuilder();

builder.setSpout("images", new ImageSpout(), 10);

builder.setBolt("autotag", new AutoTagBolt(), 3)
    .shuffleGrouping("images");

builder.setBolt("dbwriter", new DBWriterBolt(), 2)
    .shuffleGrouping("autotag");
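
The third argument to setSpout and setBolt is Storm's parallelism hint: how many executors run that component (here 10 image spouts, 3 autotag bolts, and 2 DB writers). shuffleGrouping distributes tuples randomly and evenly among a bolt's executors.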

Defining a bolt

class Autotag(storm.BasicBolt):

    def __init__(self):
        # assumed: the slide omits constructing the classifier object
        self.classifier = Classifier()
        self.classifier.load()

    def process(self, tuple):
        <process tuple here>
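
Because a bolt is a long-lived process, loading the classifier once in __init__ and reusing it in process() is what eliminates the 300 ms classifier-load cost that dominated the earlier per-photo pipeline.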

Our final solution

Topology: www → Redis → Autotags Spout → Autotags Bolt → { Search Engine Indexer Bolt, DB Writer Bolt }

Storm architecture

▪ Nimbus (1 node)
▪ Zookeepers (3 nodes)
▪ Supervisors (10 nodes): spouts and bolts live here

Fault tolerant against node failures

Nimbus server dies
▪ no new jobs can be submitted
▪ current jobs stay running
▪ simply restart

Supervisor node dies
▪ all unprocessed jobs for that node get reassigned
▪ supervisor restarted

Zookeeper node dies
▪ restart node; no data loss as long as not too many die at once

storm commands

storm jar topology-jar-path class

storm list

storm kill topology-name

storm deactivate topology-name

storm activate topology-name

Is Storm just another queuing system?

Yes, and more

▪ A framework for separating application design from scaling concerns

▪ Has a great deployment story
▪ atomic start, kill, deactivate for free
▪ no more rsync

▪ No single point of failure
▪ Wide industry adoption
▪ Is programming-language agnostic

Try Storm out!

https://github.com/apache/incubator-storm/tree/master/examples/storm-starter

https://github.com/nathanmarz/storm-deploy

Thank you

Presentation copyright © 2014 Yahoo, Inc.