Image Classification and Retrieval on Spark

Post on 15-Apr-2017

SPARK MBUTO

Design & Engineering Machine Learning Pipelines

Gianvito Siciliano

Use Case: Image Classification and Retrieval

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

OUTLINE

1. Spark ‘Mbuto intro

• Abstractions

• Basic Examples

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

SPARK MBUTO

• A Spark proof of concept (PoC) for easily creating, running and testing pipelines and workflows

• Pipelines are made of sequential steps in a SparkJobApp

• Each step is a SparkJob

• All jobs share the same Spark/SQL context

• Jobs are run consecutively by the JobRunner
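A minimal sketch of these abstractions in plain Python, to make the control flow concrete. The class names SparkJob and JobRunner follow the slides; everything else is an assumption for illustration: the dict standing in for the shared Spark/SQL context, the method bodies, and the two example jobs (LoadNumbers, SumNumbers). It is not the project's actual code.

```python
from abc import ABC, abstractmethod


class SparkJob(ABC):
    """One pipeline step. `context` stands in for the shared Spark/SQL context."""

    @abstractmethod
    def execute(self, context: dict) -> None:
        ...


class LoadNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        context["numbers"] = [1, 2, 3, 4]


class SumNumbers(SparkJob):  # hypothetical example job
    def execute(self, context):
        # Reads what the previous job left in the shared context.
        context["total"] = sum(context["numbers"])


class JobRunner:
    """Runs the jobs one after another against the same context."""

    def __init__(self, context):
        self.context = context

    def run(self, jobs):
        for job in jobs:
            job.execute(self.context)
        return self.context


def main():
    """Plays the role of the SparkJobApp entry point."""
    runner = JobRunner(context={})
    return runner.run([LoadNumbers(), SumNumbers()])


result = main()
```

Because every job receives the same context object, a later step can pick up whatever an earlier step stored, which is what lets the app read as a flat list of jobs.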

SPARKJOB

JOBRUNNER

SPARKJOBAPP

PIPELINE

[Diagram: App.main → JobRunner.run → Job.execute → next job → Job.execute]

JOB READY TO USE

READABLE APP

PERFORMANCE LOOKUP

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

• Classification

• Retrieval

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

IMAGE CLASSIFICATION

• Multiclass image classification:

1. Choose model (NN, SVM, TREE…)

2. Train/test model (with labeled images)

3. Predict the label of new images

4. Tune the model

IMAGE RETRIEVAL

• Image retrieval:

1. Choose metric (Euclidean, cosine…)

2. Build dictionary

3. Train/test the model

4. Query and search

5. Tune the model
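A small sketch of why the choice of metric matters for "Query and search". The two metrics are the ones named on the slide; the brute-force `search` helper, the image ids and the toy 3-D vectors are assumptions for illustration, and vectors are assumed to be non-zero.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity; assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def search(index, query, metric, k=2):
    """Return the ids of the k indexed vectors closest to the query."""
    ranked = sorted(index, key=lambda image_id: metric(index[image_id], query))
    return ranked[:k]


# Hypothetical dictionary: image id -> feature vector.
index = {
    "img_a": [1.0, 0.0, 0.0],
    "img_b": [10.0, 1.0, 0.0],
    "img_c": [0.0, 2.0, 2.0],
}
query = [2.0, 0.0, 0.0]

by_euclid = search(index, query, euclidean)
by_cosine = search(index, query, cosine_distance)
```

With these toy vectors the two metrics agree on the best match but disagree on the second: Euclidean distance prefers img_c (closer in magnitude), cosine distance prefers img_b (closer in direction).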

WHAT CHANGES?

• Pipelines architecture

• Classification logic

• How to update the model?

CLASSIFICATION PIPELINE

[Diagram: DATA → TRAIN CLASSIFIER → MODEL; NEW DATA → MODEL → PREDICTION]

RETRIEVAL PIPELINE

[Diagram: DATA → TRAIN CLASSIFIER → MODEL; QUERY → MODEL → PREDICTION]

OUTLINE

1. Spark ‘Mbuto intro

2. ML problems overview

3. Classification & retrieval logic

4. Classification Models

5. Image Pipeline

CLASSIFICATION & RETRIEVAL

• Keypoint extraction from each image

• Clustering on the keypoints universe

• Represent each image with a weighted cluster vector

• Train & Test the model

• Query the model (finding the most similar images)

Features Engineering

Build the Dictionary

Build the classifier

Query the model
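The first three steps (keypoints, clustering, cluster vector) can be sketched in plain Python as a bag-of-visual-words construction. This is only an illustration under stated assumptions: toy 2-D points stand in for real SIFT descriptors, the small `kmeans` function is a hand-written Lloyd iteration rather than Spark's KMeans, and the initial centroids are fixed by hand so the result is deterministic.

```python
import math


def nearest(centroids, point):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda i: math.dist(centroids[i], point))


def kmeans(points, centroids, iterations=10):
    """Plain Lloyd iterations starting from the given centroids."""
    centroids = [list(c) for c in centroids]
    for _ in range(iterations):
        groups = [[] for _ in centroids]
        for p in points:
            groups[nearest(centroids, p)].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(group) for col in zip(*group)]
    return centroids


def cluster_vector(centroids, keypoints):
    """Count how many of an image's keypoints fall into each cluster."""
    counts = [0] * len(centroids)
    for kp in keypoints:
        counts[nearest(centroids, kp)] += 1
    return counts


# Hypothetical keypoints per image (real SIFT descriptors have 128 dimensions).
images = {
    "img_a": [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1]],
    "img_b": [[5.2, 4.9], [4.8, 5.0], [0.1, 0.1]],
}

universe = [kp for kps in images.values() for kp in kps]
clusters = kmeans(universe, centroids=[universe[0], universe[2]])
vectors = {name: cluster_vector(clusters, kps) for name, kps in images.items()}
```

Each image ends up as a fixed-length vector regardless of how many keypoints it had, which is what makes the later classifier and similarity search possible. The counts here are unweighted; weighting is a separate step.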

C. & R. JOBS

• Load whole dataset

• Extract keypoints

• Reduce the keypoints universe

• Transform the features space

• Create the dictionary (aka Codebook)

• Train, test & evaluate the classifier

• Query and get prediction

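[Diagram: the DATA → TRAIN CLASSIFIER → MODEL → PREDICTION pipeline, expanded. Components in order of appearance: KMeansCLASSIFIER, ImageLOADER, SiftEXTRACTOR, KMeansQUANTISER, CLUSTERS, CfIifTRANSFORMER, ClusterVectorPIVOTER, CODEBOOK. Stages: Features Engineering, Build the Dictionary (output: DICTIONARY). Calls shown: .transform, .fit. Legend: TRANSFORMER, ESTIMATOR.]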
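The name CfIifTRANSFORMER suggests a TF-IDF-style weighting applied to cluster counts (cluster frequency times inverse image frequency). That reading is an assumption, not something the slides state. Under that assumption, a minimal sketch looks like this; the function name `cf_iif`, the smoothed logarithm and the toy counts are all illustrative choices.

```python
import math


def cf_iif(cluster_counts):
    """Weight per-image cluster counts, TF-IDF style.

    cluster_counts: image id -> list with one count per cluster.
    """
    n_images = len(cluster_counts)
    n_clusters = len(next(iter(cluster_counts.values())))

    # In how many images does each cluster appear at least once?
    image_freq = [
        sum(1 for counts in cluster_counts.values() if counts[j] > 0)
        for j in range(n_clusters)
    ]
    # Smoothed inverse image frequency: clusters present everywhere get weight 0.
    iif = [math.log((n_images + 1) / (freq + 1)) for freq in image_freq]

    weighted = {}
    for image_id, counts in cluster_counts.items():
        total = sum(counts) or 1  # guard against images with no keypoints
        weighted[image_id] = [(c / total) * w for c, w in zip(counts, iif)]
    return weighted


# Hypothetical cluster counts for three images and three clusters.
counts = {
    "img_a": [2, 1, 0],
    "img_b": [1, 0, 3],
    "img_c": [1, 0, 0],
}
codebook = cf_iif(counts)
```

Cluster 0 occurs in every image, so it carries no discriminative information and its weight drops to zero, while clusters that occur in a single image are emphasised.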

[Diagram: the "Train classifier" and "Evaluate classifier" stages. Components in order of appearance: VectorASSEMBLER, LabelINDEXER, KNNCLASSIFIER, KMeansCLASSIFIER, TRAIN/TEST (.split), EVALUATOR. Outputs: IN-SAMPLE PREDICTION, OUT-OF-SAMPLE PREDICTION, CLASSIFIER. Calls shown: .transform, .fit. Legend: TRANSFORMER, ESTIMATOR.]
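The train/test split, the kNN classifier and the evaluator can be sketched together in a few lines of plain Python. This is a brute-force illustration, not the pipeline's implementation: the helper names, the fixed seed, the 75/25 split and the toy one-dimensional "cat"/"dog" data are all assumptions.

```python
import math
import random
from collections import Counter


def split(rows, train_fraction, seed):
    """Shuffle a copy of the rows and cut it into train and test parts."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def knn_predict(train, vector, k=3):
    """Majority label among the k training rows closest to the vector."""
    neighbours = sorted(train, key=lambda row: math.dist(row[0], vector))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, rows, k=3):
    """Fraction of rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, vector, k) == label for vector, label in rows)
    return hits / len(rows)


# Hypothetical, well-separated toy data: (feature vector, label).
rows = [([i * 0.1, 0.0], "cat") for i in range(10)]
rows += [([10 + i * 0.1, 0.0], "dog") for i in range(10)]

train, test = split(rows, train_fraction=0.75, seed=42)
in_sample = accuracy(train, train)   # prediction on data the model has seen
out_sample = accuracy(train, test)   # prediction on held-out data
```

Comparing the in-sample and out-of-sample scores is the point of the split: a large gap between the two would indicate overfitting. On this toy data both are perfect because the classes do not overlap.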

KNN IMPLEMENTATION

• kNN is a comparison-based model: the similarity metric is crucial!

• Nearest-neighbour search (in the codebook) is the pain point:

• KDTree: not parallel (although…)

• LSH: hyperparameters are difficult to tune

• Metric tree: partitions the feature points into disjoint regions

• Spill tree: too many shared points

=> Hybrid Tree

HYBRID TREE

• The TopTree is a metric tree

• The SubLeaf trees are spill trees, trained in parallel

• Nodes can be:

• OVERLAP => defeatist search

• NON OVERLAP => backtracking
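The difference between the two node types can be illustrated with a small single-machine tree in plain Python. This is a sketch under several assumptions, not the deck's implementation: the pivot-based split, the `overlap` margin, the balance threshold `rho` (an overlapping split is kept only if neither child holds more than that fraction of the points) and the dict-based node layout are all illustrative choices, and the parallel training of sub-trees is not modelled.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def split(points, direction, threshold, overlap):
    """Split along `direction`; points within `overlap` of the boundary go to both sides."""
    left = [p for p in points if dot(p, direction) < threshold + overlap]
    right = [p for p in points if dot(p, direction) >= threshold - overlap]
    return left, right


def build(points, overlap=0.0, rho=0.7, leaf_size=2):
    if len(points) <= leaf_size:
        return {"leaf": points}
    # Two far-apart pivots define the splitting direction.
    a = max(points, key=lambda p: math.dist(p, points[0]))
    b = max(points, key=lambda p: math.dist(p, a))
    length = math.dist(a, b)
    if length == 0:  # all points coincide
        return {"leaf": points}
    direction = [(y - x) / length for x, y in zip(a, b)]
    threshold = dot([(x + y) / 2 for x, y in zip(a, b)], direction)
    left, right = split(points, direction, threshold, overlap)
    # Keep the overlapping split only if neither child is too large.
    is_overlap = overlap > 0 and max(len(left), len(right)) <= rho * len(points)
    if not is_overlap:
        left, right = split(points, direction, threshold, 0.0)
        if not left or not right:  # numerical edge case
            return {"leaf": points}
    return {
        "direction": direction, "threshold": threshold, "overlap": is_overlap,
        "left": build(left, overlap, rho, leaf_size),
        "right": build(right, overlap, rho, leaf_size),
    }


def search(node, query, best=None):
    """Return (distance, point) for the closest point found."""
    if best is None:
        best = (math.inf, None)
    if "leaf" in node:
        for p in node["leaf"]:
            d = math.dist(p, query)
            if d < best[0]:
                best = (d, p)
        return best
    offset = dot(query, node["direction"]) - node["threshold"]
    near, far = (node["left"], node["right"]) if offset < 0 else (node["right"], node["left"])
    best = search(near, query, best)
    # OVERLAP node: defeatist search, the other child is never visited.
    # NON OVERLAP node: backtrack only if the far side could hold a closer point.
    if not node["overlap"] and abs(offset) < best[0]:
        best = search(far, query, best)
    return best


rng = random.Random(0)
points = [[rng.random(), rng.random()] for _ in range(200)]
query = [0.5, 0.5]

exact = search(build(points), query)                 # no overlap: backtracking everywhere
approx = search(build(points, overlap=0.05), query)  # overlap nodes where balanced enough
```

With no overlap the search backtracks wherever needed and returns the true nearest neighbour. With overlap, most nodes are searched defeatist-style, which visits fewer nodes but may return a slightly farther point; the overlap margin is what keeps that error small.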

NEURAL NETWORK

• Convolutional networks work well with images

• Hyperparameter tuning is the pain point, but it can be automated (see the new algorithm)

• Training is not trivial, update the model is easy to complain
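As a reminder of what the convolutional building block does, here is a minimal 2-D convolution in plain Python. It is an illustrative sketch only: the `conv2d` helper, the toy image and the hand-picked edge kernel are assumptions, and in a real network the kernel values are learned. As in most deep-learning libraries, the kernel is not flipped (strictly a cross-correlation).

```python
def conv2d(image, kernel):
    """Slide the kernel over the image with no padding and stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


# Toy image: dark left half, bright right half.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-picked kernel that responds to a dark-to-bright vertical edge.
kernel = [
    [-1, 1],
    [-1, 1],
]
feature_map = conv2d(image, kernel)
```

The output is non-zero only where the edge sits, which is the kind of local, position-independent pattern detection that makes convolutions a good fit for images.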

WHAT MORE?

• Features engineering

• Hyperparameters tuning

• Parallel optimizations

• Persist/update steps

• Ensemble models

[Diagram components: DATA, Combiner, PREDICTION, Normalizer, pipelineModel, Cross Validator]
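Two of the pieces named in this diagram can be sketched in plain Python: a combiner for ensemble predictions and the fold generation behind cross-validation. Both are assumptions about what those boxes do; majority voting and round-robin fold assignment are illustrative choices, and the model predictions below are made up.

```python
from collections import Counter


def combine(predictions):
    """Majority vote across models.

    predictions: one list of labels per model, aligned by image.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def k_folds(n_rows, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    for fold in range(k):
        validation = [i for i in range(n_rows) if i % k == fold]
        train = [i for i in range(n_rows) if i % k != fold]
        yield train, validation


# Hypothetical predictions from three models for three images.
predictions = [
    ["cat", "dog", "dog"],
    ["cat", "cat", "dog"],
    ["dog", "cat", "dog"],
]
combined = combine(predictions)
folds = list(k_folds(n_rows=6, k=3))
```

In cross-validation each row lands in the validation set exactly once, so every hyperparameter setting is scored on data it was not trained on.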

https://github.com/gianvi

Thanks!