A New Approach To The Multiclass Classification Problem

Category Vector Space


Problem Motivation Discussion Preliminary Results

Multi-class classification through binary classification

One-vs-All One-vs-One

Multi-class classification can often be constructed as a generalization of binary classification

In practice multi-class classification is done by combining binary classifiers

Classification Problem

Object recognition


Automated protein classification



Digit recognition


Phoneme recognition

[Waibel, Hanzawa, Hinton, Shikano, Lang 1989]












The multi-class algorithm is computationally expensive

Multiclass Applications

Large Category Space

Handwriting recognition (e.g., USPS)

Text classification

Face detection

Facial expression recognition

Other Multiclass Applications

Data: $\{(x_i, y_i)\}$, $i = 1, \ldots, n$

Classification Setup

Question: design a classification rule

y = f(x)

such that, given a new x, this predicts y with minimal probability of error

$x \in \mathbb{R}^d$: feature vector

$y \in \{-1, +1\}$: label

Training and test data drawn i.i.d. from fixed but unknown probability distribution D

Labeled training set

$S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$














Training examples mapped to (usually high-dimensional) feature space by a feature map $F(x) = (F_1(x), \ldots, F_d(x))$

Learn linear decision boundary: trade-off between maximizing the geometric margin of the training data and minimizing margin violations

Support Vector Machines (SVMs)

Linear classifier defined in feature space by

$f(x) = \langle w, F(x) \rangle + b$

SVM solution gives $w$ as a linear combination of support vectors, a subset of the training vectors

Definition Of SVM Classifiers
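Because the weight vector is a linear combination of support vectors, the classifier can be evaluated with kernel evaluations against the support vectors only. The sketch below illustrates this standard kernel expansion; the Gaussian kernel, the `gamma` value, and all numbers are hypothetical choices for illustration, not values from the slides or the output of a trained SVM.

```python
import numpy as np

def rbf_kernel(u, v, gamma=0.5):
    """Gaussian kernel k(u, v) = exp(-gamma * ||u - v||^2)."""
    diff = np.asarray(u, dtype=float) - np.asarray(v, dtype=float)
    return float(np.exp(-gamma * np.dot(diff, diff)))

def svm_decision(x, support_vectors, alphas, labels, b, kernel=rbf_kernel):
    """f(x) = sum_i alpha_i * y_i * k(s_i, x) + b, summed over the support vectors."""
    expansion = sum(a * y * kernel(s, x)
                    for a, y, s in zip(alphas, labels, support_vectors))
    return expansion + b

# Hypothetical, hand-picked values (not produced by training).
support_vectors = np.array([[0.0, 0.0], [2.0, 2.0]])
alphas = np.array([1.0, 1.0])
labels = np.array([-1, +1])
b = 0.0

score_near_positive = svm_decision([1.9, 2.1], support_vectors, alphas, labels, b)
score_near_negative = svm_decision([0.1, -0.1], support_vectors, alphas, labels, b)
# The predicted label is the sign of the decision value.
predicted = [int(np.sign(s)) for s in (score_near_positive, score_near_negative)]
```

Only the support vectors enter the sum, which is why the solution can be sparse in the training set.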





Definition Of A Margin

History (Vapnik, 1965): if linearly separable, place the hyperplane “far” from the data: large margin



Maximize The Margin

Large margin classifier

Leads to good generalization (performance on test sets)
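For reference, "large margin" can be made precise with the textbook hard-margin formulation for linearly separable data. This is the standard form, not an equation recovered from the slides; it assumes labels $y_i \in \{-1, +1\}$ and a linear classifier $\langle w, x \rangle + b$.

```latex
\min_{w,\,b}\ \tfrac{1}{2}\lVert w\rVert^{2}
\quad\text{subject to}\quad
y_i\bigl(\langle w, x_i\rangle + b\bigr) \ge 1,\qquad i = 1,\dots,n
```

Under these constraints the closest training points lie at distance $1/\lVert w\rVert$ from the hyperplane, so minimizing $\lVert w\rVert$ maximizes the margin.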


One-vs-All (OVA): for each class, build a classifier for that class vs. the rest

Constructs k SVM models

Often very imbalanced classifiers

Asymmetry in the amount of training data

Earliest implementation for SVM multiclass

One-vs-One (OVO): constructs k(k-1)/2 classifiers

Rooted binary SVMs with k leaves

Traverse tree to reach leaf node

Combining Binary Classifiers
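The two combination schemes can be sketched as follows. To keep the example self-contained, a least-squares linear scorer stands in for the binary SVM (an assumption made for brevity), OVO predictions are combined by simple voting rather than by tree traversal, and the toy data are hypothetical.

```python
import numpy as np
from itertools import combinations

def fit_linear_scorer(X, t):
    """Least-squares linear scorer (stand-in for a binary SVM); t holds +/-1 targets."""
    A = np.hstack([X, np.ones((len(X), 1))])      # append a bias column
    coef, *_ = np.linalg.lstsq(A, t, rcond=None)
    return coef

def score(coef, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef

def fit_ova(X, y, classes):
    # One scorer per class: that class (+1) vs. the rest (-1), giving k models.
    return {c: fit_linear_scorer(X, np.where(y == c, 1.0, -1.0)) for c in classes}

def predict_ova(models, X):
    classes = list(models)
    scores = np.column_stack([score(models[c], X) for c in classes])
    return np.array(classes)[scores.argmax(axis=1)]

def fit_ovo(X, y, classes):
    # One scorer per unordered pair of classes, giving k(k-1)/2 models.
    models = {}
    for a, b in combinations(classes, 2):
        mask = (y == a) | (y == b)
        models[(a, b)] = fit_linear_scorer(X[mask], np.where(y[mask] == a, 1.0, -1.0))
    return models

def predict_ovo(models, X, classes):
    index = {c: j for j, c in enumerate(classes)}
    votes = np.zeros((len(X), len(classes)), dtype=int)
    for (a, b), coef in models.items():
        wins_a = score(coef, X) >= 0
        votes[wins_a, index[a]] += 1               # pair winner gets one vote
        votes[~wins_a, index[b]] += 1
    return np.array(classes)[votes.argmax(axis=1)]

# Toy data: three well-separated 2-D clusters (hypothetical).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [6.0, 0.0], [0.0, 6.0]])
y = np.repeat([0, 1, 2], 20)
X = centers[y] + 0.3 * rng.standard_normal((60, 2))
classes = [0, 1, 2]

ova = fit_ova(X, y, classes)   # k = 3 models
ovo = fit_ovo(X, y, classes)   # k(k-1)/2 = 3 models
```

Each OVA model sees one class against all the others, which is where the imbalance in training data comes from; each OVO model sees only the two classes it compares.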

Race categories {White, Black, Asian}

Task: Map the image training set to the race labels

Training (learning) and test (generalization)

Scenario: An ambiguous test image is presented

A mixed-race person

A person drawn from a race which is not represented by the system (e.g., Hispanic, Native American, etc.)

No way of assigning a mixed label

The system cannot represent the mixed-race person using a combination of categories

No way of representing an unknown race

Possible Solution:

Indicate that the incoming image is outside the margin of each learned category

Example 1

Musical samples generated by a single instrument

Electric guitar: a set of note categories {C, C#, D, D#, etc.}

Task: Map the training set musical notes to the labels

Reasonable learning and generalization properties

Scenario: Given musical sequences

Intervals (two notes struck simultaneously, such as {C, F#})

Chords (containing three or more notes)

Ambiguity at the training set level

Forced to assign new labels to intervals and chords even though they contain the same features (single notes) as the note categories

In the music sequence case, suppose we learned a conditional probability distribution p(L = l|x), where x is a music sequence and L = {C, C#, D, ..., B} is a set of note labels

When x is an interval (say, a tritone), there is no way of assigning high probability to the tritone

Possible Solution:

Accommodate the tritone by assigning it a new label

Large label space

Truncate because of exponential size considerations

Example 2

Categories are conceived as nominal labels

No underlying geometry for the categories

Inability of the conditional distribution to give us a measure (value) for interpolated categories

Non-represented interpolated categories are left out

Not easy to distinguish basic categories from compound categories

Problems With Combining Binary Classifiers

Invoke the notion of a category vector space

Categories are defined with a geometric structure

Assume that the set of categories (labels) forms a vector space

A music sequence would correspond to a label in a twelve-dimensional vector space {C,C#,D,D#,E,F,F#,G,G#,A,A#,B}

Each basic note (C, C#, D, etc.) would have its own coordinate axis

Learning problem:

Map the training set music sequences to vectors in a 12-dimensional space such that the training and test set errors are small

Map the training musical sequences to the 12-dimensional vector space and then (if a support vector machine approach is used) maximize the margin of the mapped vectors in the category space

The race classification example is analogous

Depends on how many races we wish to explicitly represent

Map the training set to the race category vector space and maximize the margin

Generalization problem:

Map a test set musical sequence or image into the category space and then ask if it lies within the margin of a note (or chord) or race category

Category Vector Spaces Solution

Note: Extensions to other multi-category learning applications are straightforward, assuming we can map category labels to coordinates.
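As a small illustration of labels living in a vector space, the sketch below encodes single notes, an interval, and a chord as 12-dimensional label vectors. Representing a compound category as the unnormalized sum of its basic-note axes is an assumption made here; the slides do not say how compound vectors are scaled. The names `NOTES` and `category_vector` are hypothetical.

```python
import numpy as np

NOTES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
NOTE_INDEX = {name: i for i, name in enumerate(NOTES)}

def category_vector(sounding_notes):
    """Return a 12-dimensional label vector with a 1 on the axis of each sounding note."""
    y = np.zeros(len(NOTES))
    for name in sounding_notes:
        y[NOTE_INDEX[name]] = 1.0
    return y

single_c = category_vector(["C"])            # basic category: one coordinate axis
tritone = category_vector(["C", "F#"])       # compound category: sum of two axes
c_major = category_vector(["C", "E", "G"])   # chord: sum of three axes
```

With this encoding the tritone needs no new label: it is the superposition of the C and F# axes, so the label space stays at twelve dimensions instead of growing with every combination.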



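Given the feature vectors $x_i \in \mathbb{R}^M$, $i = 1, \ldots, N$, D categories, and a projected set of features defined by $z_i = W^T x_i$, where $W$ is $M \times D$, the MC-FLD maximizes class separation, measured with the scatter matrices

$S_B = \sum_{a=1}^{D} N_a (m_a - m)(m_a - m)^T$

$S_W = \sum_{a=1}^{D} \sum_{i \in C_a} (x_i - m_a)(x_i - m_a)^T$

where $m_a$ is the mean of the $a$-th class, $N_a$ is the number of patterns in class $a$, $C_a$ is the set of their indices, and $m$ is the overall mean

Solution: The columns of W are the top D eigenvectors (corresponding to the largest eigenvalues) of $S_W^{-1} S_B$

Eigenvectors are orthonormal

Columns of W constitute a category vector space

Interpret as a category space projection

Optimal solution is a set of orthogonal weight vectors

Multiclass Fisher Related Idea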
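A minimal NumPy sketch of the multiclass Fisher projection follows, assuming the standard between-class and within-class scatter matrices. The function name `mc_fld`, the use of a pseudo-inverse, and the toy data are illustrative choices rather than details from the slides.

```python
import numpy as np

def mc_fld(X, labels, n_components):
    """Return W (M x n_components): leading eigenvectors of pinv(S_W) @ S_B."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    overall_mean = X.mean(axis=0)
    M = X.shape[1]
    S_B = np.zeros((M, M))
    S_W = np.zeros((M, M))
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        d = (mean_c - overall_mean)[:, None]
        S_B += len(Xc) * (d @ d.T)            # between-class scatter
        centered = Xc - mean_c
        S_W += centered.T @ centered          # within-class scatter
    eigvals, eigvecs = np.linalg.eig(np.linalg.pinv(S_W) @ S_B)
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real

# Toy data: three classes in four dimensions (hypothetical).
rng = np.random.default_rng(0)
centers = np.array([[0, 0, 0, 0], [4, 0, 0, 0], [0, 4, 0, 0]], dtype=float)
labels = np.repeat([0, 1, 2], 30)
X = centers[labels] + 0.5 * rng.standard_normal((90, 4))

W = mc_fld(X, labels, n_components=2)
Z = X @ W   # projected features z_i = W^T x_i, one row per sample
```

The between-class scatter has rank at most the number of classes minus one, so for three classes only two directions carry class information; the sketch therefore keeps two columns.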


Avoided this approach since margins are not maximized in category space

We have not seen a classifier take a three-class problem with labels {0,1,2}, map the input features into a vector space with basis vectors $[0,0,1]^T$, $[0,1,0]^T$, and $[1,0,0]^T$, and attempt to maximize the margin in the category vector space

We have not seen any previous work where a pattern from a compound category (say, a combination of labels 1 and 2) is also used in training with a conversion of the compound category to a vector

Disadvantage Of Multiclass Fisher

Input feature vectors are mapped to the category vector space using a kernel-based approach

In the category vector space, maximizing the margin is equivalent to forming hypercones

Mapped feature vectors that lie inside the hypercone have a distinct class label

Mapped vectors that lie in between hypercones are ambiguous

Hypercones are not allowed to intersect

Depicts basic categories

Description of Category Vector Spaces
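The hypercone picture can be illustrated with a simple angular test. This is only a sketch: it models each cone as all vectors within a fixed half-angle of a category vector, whereas in the approach described here the cones arise from margin maximization. The function name and the 30-degree half-angle are hypothetical; any half-angle below 45 degrees keeps cones around orthogonal axes from intersecting.

```python
import numpy as np

def cone_membership(z, category_vectors, half_angle_deg=30.0):
    """Return the index of the category whose cone contains z, or None if z is in no cone.

    A cone is modelled as all vectors within `half_angle_deg` of a category vector.
    """
    z = np.asarray(z, dtype=float)
    norm = np.linalg.norm(z)
    if norm == 0.0:
        return None
    cos_threshold = np.cos(np.deg2rad(half_angle_deg))
    for index, y in enumerate(category_vectors):
        cosine = float(z @ y) / (norm * np.linalg.norm(y))
        if cosine >= cos_threshold:
            return index
    return None

basis = np.eye(3)    # three basic categories as coordinate axes

inside_first = cone_membership([0.9, 0.1, 0.05], basis)   # close to the first axis
ambiguous = cone_membership([1.0, 1.0, 0.0], basis)       # halfway between two axes
inside_third = cone_membership([0.0, 0.0, 2.0], basis)    # on the third axis
```

A mapped vector that falls between cones returns `None`, which corresponds to the ambiguous case rather than a forced label.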

Each pattern now exists as a linear superposition of category vectors in the category space.

Ensures ambiguity is handled at a fundamental level

Compound categories can be directly represented in the category space

Can maximize the compound category margin as well as the margins for the basic categories

Advantages Of Category Vector Space

Regression: Each input training set feature vector $x_i \in \mathbb{R}^M$ must be mapped to a corresponding point $z_i \in \mathbb{R}^D$, where M is the number of feature dimensions and D the cardinality of the basic categories

Classification: Each mapped feature vector must maximize its margin relative to its own category vector $y$ against the other category vectors $y'$. Here $y$ is known and corresponds to a category vector

Technical Challenges


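$\epsilon$ controls the width of the interval for which there is no penalty

Slack variable vectors $\xi_i$ and $\xi_i^*$ are non-negative component-wise

Weight vector $w_a$ and bias $b_a$ help map the feature vector $x_i$ to its counterpart $z_i$

The choice of kernel K (GRBF or otherwise) is hidden in the operator $*$, which implements inner products by projecting vectors in $\mathbb{R}^M$ into a suitable space

The regularization parameter $C$ weighs the norm of $w$ against the data fitting error. The larger the value of $C$, the greater the emphasis on the data fitting error

Regression In Category Space

$\min E_{reg}(w, b, z, \xi, \xi^*) = \frac{1}{2}\sum_{a=1}^{D} w_a * w_a + C \sum_{i=1}^{N}\sum_{a=1}^{D} \left(\xi_{ia} + \xi_{ia}^*\right)$

subject to the constraints

$z_{ia} - w_a * x_i - b_a \le \epsilon + \xi_{ia}^*$

$w_a * x_i + b_a - z_{ia} \le \epsilon + \xi_{ia}$

$\xi_{ia} \ge 0, \quad \xi_{ia}^* \ge 0$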
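The role of the no-penalty interval in the regression step can be sketched with the usual epsilon-insensitive penalty. Using a linear penalty on the excess deviation is the standard support vector regression choice and is assumed here; the target and prediction values are hypothetical.

```python
import numpy as np

def eps_insensitive_penalty(z_target, z_predicted, eps):
    """Total slack: deviations inside [-eps, eps] cost nothing, the excess costs linearly."""
    residual = np.abs(np.asarray(z_target, dtype=float)
                      - np.asarray(z_predicted, dtype=float))
    return float(np.maximum(residual - eps, 0.0).sum())

z_target = np.array([1.0, 0.0, 0.0])          # hypothetical category-space target
z_predicted = np.array([0.95, 0.30, -0.05])   # hypothetical regression output
penalty = eps_insensitive_penalty(z_target, z_predicted, eps=0.1)
```

Only the second coordinate deviates by more than the interval width, so it alone contributes to the penalty; widening the interval reduces the penalty to zero.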



Associate each mapped vector $z_i$ with a category vector $y_{a(i)}$

Category vectors can be

Basis vectors (axes corresponding to basic categories) in the category space

Ordinary vectors (corresponding to compound categories)

In this definition of membership, no distinction is made between basic and compound categories.

We seek to maximize the margin in the category space

Minimizing the norm of the mapped vectors is equivalent to maximizing the margin provided the inequalities can be satisfied

Classification In Category Space

$\min E_{class}(z) = \frac{1}{2}\sum_{i} z_i^T z_i$

subject to the constraints

$z_i^T y_{a(i)} - z_i^T y_b \ge 1, \quad b = 1, \ldots, L, \; b \ne a(i)$

where $b$ indexes the $L$ category vectors


Integrated Classification and Regression Objective Function

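The objective function is designed so that we obtain an integrated dual classification and regression objective

The integrated objective $E(w, b, z, \ldots)$ combines the regression objective over $(w, b, z)$ with the classification objective over $z$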


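Gaussian radial basis function (GRBF) classifier with multiple outputs

One for each basic category

Given a training set of registered and cropped face images

Labels are {White, Black, Asian}

GRBF classifier to map the input feature vectors into the category space

Since we know the label of each training set pattern, we approximate the mapped category space

Multi-Category GRBF

$E_{GRBF}(w, b) = \sum_{i=1}^{N}\sum_{a=1}^{D} \left(z_{ia} - w_a^T x_i - b_a\right)^2 + \lambda \sum_{a=1}^{D} w_a^T w_a$

$x_i = [x_{i1}, x_{i2}, \ldots, x_{iN}]^T$

$\begin{bmatrix} b_a \\ w_a \end{bmatrix} = \left(X^T X + \lambda I\right)^{-1} X^T z_a$

where the rows of $X$ are the vectors $x_i^T$ augmented with a constant 1 for the bias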
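A multi-output GRBF mapping of this kind can be sketched as regularized least squares on Gaussian kernel features. The squared-error loss, the unpenalized bias, and the parameter names `sigma` and `lam` are assumptions of this sketch; the toy data are hypothetical and unrelated to the face images.

```python
import numpy as np

def gaussian_kernel_matrix(A, B, sigma):
    """K[i, j] = exp(-||A_i - B_j||^2 / (2 sigma^2))."""
    sq = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def fit_grbf(X, Y, sigma=1.0, lam=1e-3):
    """Regularized least squares from kernel features to category vectors Y (N x D)."""
    K = gaussian_kernel_matrix(X, X, sigma)
    A = np.hstack([K, np.ones((len(X), 1))])        # last column carries the bias
    reg = lam * np.eye(A.shape[1])
    reg[-1, -1] = 0.0                                # do not penalize the bias
    return np.linalg.solve(A.T @ A + reg, A.T @ Y)   # shape (N + 1, D)

def map_to_category_space(X_new, X_train, coef, sigma=1.0):
    """Map new feature vectors into the D-dimensional category space."""
    K = gaussian_kernel_matrix(X_new, X_train, sigma)
    return np.hstack([K, np.ones((len(X_new), 1))]) @ coef

# Toy data: three clusters with one-hot category vectors as targets (hypothetical).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
labels = np.repeat([0, 1, 2], 10)
X_train = centers[labels] + 0.3 * rng.standard_normal((30, 2))
Y_train = np.eye(3)[labels]

coef = fit_grbf(X_train, Y_train, sigma=1.0, lam=1e-3)
Z_new = map_to_category_space(centers, X_train, coef, sigma=1.0)
```

Each column of `coef` holds the weights and bias for one basic category, so a new pattern is mapped to one coordinate per category rather than to a single hard label.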






45 training images from the “Labeled Faces in the Wild” image database

Database contains over 13,000 images that were captured using the Viola-Jones face detector

Each face has been labeled with the name of the person pictured

Of the 5749 people featured in the database, 1680 individuals have multiple images, with each image being unique

In the 45 training images, 15 were from each of the three races considered

The 45 images were registered to one “standard” image (after first converting them to grayscale) using a landmark-based thin-plate spline (TPS)

The landmarks used were:

Three (3) for each eye

Two (2) for the nose

Two (2) for the two ears (very approximate since the ears are often not visible)

After registering the images, they were cropped and resized to 130×90 with the intensity scale adjusted to [0,1].

Free parameters were set to 1 and 10. These were carefully but qualitatively chosen to get a good training set separation in category space

Experimental Setup

White Basis = $y_1 = [1,0,0]^T$

Black Basis = $y_2 = [0,1,0]^T$

Asian Basis = $y_3 = [0,0,1]^T$

Preliminary Results

Training set images: Top row: Asian, Middle row: Black, Bottom row: White

Race Classification Training Images

Training set images mapped into the category vector space

Category Space For Training Images

Test set images: Top row: Asian, Middle row: Black, Bottom row: White

51 test set images (17 Asian, 16 Black, 18 White)

Used the weights discovered by the GRBF classifier to map the input test set images into the category space

Race Classification Testing Images

In the graph above we can see the separation in the category space

Category Space Testing Images

Pairwise classifications

Roughly separate each pair by drawing lines through the origin

Removing the orthogonal subspace that is not being compared against

Pairwise Projection Of Category Space Testing Images
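The pairwise projection can be sketched by dropping the category-space coordinate that is not being compared and separating the remaining two with a line through the origin. Using the diagonal line (equal coordinates) as the separator is an assumption of this sketch; the slides only say the lines were drawn roughly. The mapped vectors below are hypothetical.

```python
import numpy as np

def pairwise_projection(Z, keep):
    """Keep only the two category-space coordinates being compared; drop the rest."""
    return np.asarray(Z, dtype=float)[:, list(keep)]

def separate_by_line_through_origin(points_2d):
    """Assign each point to the first (0) or second (1) kept axis using the line u = v."""
    return np.where(points_2d[:, 0] >= points_2d[:, 1], 0, 1)

# Hypothetical mapped vectors in a three-category space.
Z = np.array([[0.9, 0.1, 0.2],
              [0.2, 0.8, 0.1],
              [0.1, 0.3, 0.7]])

first_vs_second = pairwise_projection(Z, keep=(0, 1))
assignment = separate_by_line_through_origin(first_vs_second)
```

Repeating this for each of the three coordinate pairs gives the three pairwise views of the category space.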

The pairwise separations in the category space show an improved visualization

One could in fact draw separating boundaries in the three pairwise comparisons and obtain an overall decision boundary in 3D

Preliminary Results

Nine ambiguous (from our perspective) faces

Wanted to exhibit the tolerance of ambiguity that is a hallmark of category spaces

The conclusion drawn from the result is a subjective one

Ambiguous faces mapped into the category space. Note how they cluster together.

Ambiguity Testing

Experiment With MPEG-7 Database





3 Class Training

3 Class Testing

4 Class Training

4 Class Testing


Fundamental contribution is learning of category spaces from patterns

Ensures ambiguity is handled at a fundamental level

Compound categories can be directly represented in the category space

Specific approach integrates regression and classification (iCAR)

Combines a regression objective function (map the patterns) with a maximum margin objective function (perform multicategory classification in category space)

Questions & Discussion

Thank You


