Scalable real-time object recognition and segmentation via cascaded, discriminative Markov random fields

Paul Vernaza, Ben Taskar, and Daniel D. Lee
GRASP Laboratory
University of Pennsylvania
{vernaza,taskar,ddlee}@seas.upenn.edu

April 15, 2009

Abstract

We present a method for real-time simultaneous object recognition and segmentation based on cascaded discriminative Markov random fields. A Markov random field models coupling between the labels of adjacent image regions. The MRF affinities are learned as linear functions of image features in a structured max-margin framework that admits a solution via convex optimization. In contrast to other known MRF/CRF-based approaches, our method classifies in real-time and has computational complexity that scales only logarithmically in the number of object classes. We accomplish this by applying a cascade of binary MRF-classifiers in a way similar to error-correcting output coding for general multiclass learning problems. Inference in this model is exact and can be performed very efficiently using graph cuts. Experimental results are shown that demonstrate a marked improvement in classification accuracy over purely local methods.

1 Introduction

We consider the problem of simultaneous object recognition and segmentation of color images. This is a problem that appears in the literature under many names and variations, some of which we will review later. Simply stated, our problem is this: given a training set of fully labeled images, label new images in a way similar to the way the training images were labeled. We define a labeled image to be an image in which every pixel is labeled with a class label, such as “robot,” “dog,” or “camera.”

We are motivated primarily by applications to robotics, where it is necessary to make such inferences in real-time. The method presented in this paper is accordingly efficient enough for evaluation in real-time. Unlike other known methods that perform real-time object recognition and/or segmentation, our method is able to learn contextual relationships via an MRF model and is able to efficiently and exactly evaluate the associated global inference problem in real-time.

Our method is also scalable in a number of ways that make it particularly suitable for practical applications. First, its worst-case computational complexity scales approximately linearly in the number of objects present in the scene, but only logarithmically in the total number of object classes. As we will show later, this is in contrast to more obvious approaches whose complexities scale at least linearly in the total number of object classes. From a computational perspective, this particular attribute easily allows our system to scale to tens of thousands of object classes or more, although we note that other practical considerations would likely become limiting factors before our system could be scaled to such magnitudes.

At training time, our method is also scalable in the sense that it can easily cope with vast amounts of training data. We employ a subgradient-based training algorithm that needs space that is only linear in the number of training examples. The algorithm is also inherently parallelizable in the sense that, given N processors, one could solve the training problem with N examples in approximately the same amount of time as necessary to solve it with only one example. The training algorithm is also inherently parallelizable with respect to the number of classes.

We begin our exposition with a brief summary of how our method works and how it relates to comparable approaches. We then describe our method and its derivation in detail in sections 3 and 4. Finally, we present experimental results in section 5 that show the method’s promise compared to purely local approaches.

2 Method overview

In this section, we describe at a high level how the classifier is evaluated and trained.

First, given an image to be classified, we perform an initial unsupervised oversegmentation of the image, as shown in Figure 1. We use the “superpixel” segmentation method of Felzenszwalb and Huttenlocher [5] for this step, primarily because it is efficient enough for real-time evaluation. We then create a graph where each node is a superpixel, and the graph has an edge between two nodes if and only if their corresponding superpixels are adjacent. Features that are properties of the nodes (superpixels) are computed and notionally attached to the graph, in addition to “edge features” that are properties of pairs of adjacent superpixels. This step results in a concise “feature graph” that represents the contextual structure of the image.
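As an illustration of the graph construction step (a minimal sketch, not the authors’ implementation), the following assumes the oversegmentation is available as a 2-D grid of superpixel ids and treats two superpixels as adjacent when they share a horizontal or vertical pixel border. The function name `build_adjacency` and the toy label grid are hypothetical.

```python
def build_adjacency(labels):
    """Return (nodes, edges) for a 2-D grid of superpixel ids.

    labels[r][c] is the superpixel id of pixel (r, c). Assumption: two
    superpixels are adjacent when they share a 4-connected pixel border.
    """
    nodes = set()
    edges = set()
    rows = len(labels)
    cols = len(labels[0]) if rows else 0
    for r in range(rows):
        for c in range(cols):
            a = labels[r][c]
            nodes.add(a)
            # Look only right and down so each pixel border is visited once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    b = labels[rr][cc]
                    if a != b:
                        edges.add((min(a, b), max(a, b)))
    return sorted(nodes), sorted(edges)


# Toy 3x3 "image" with three superpixels.
toy_labels = [
    [0, 0, 1],
    [0, 2, 1],
    [2, 2, 1],
]
nodes, edges = build_adjacency(toy_labels)
```

Node and edge features would then be stored against the entries of `nodes` and `edges`, respectively.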

Suppose for now that we have a good classifier that is able to use this contextual information, but this classifier is only a binary classifier. Our idea is to leverage this good binary classifier to build a multiclass classifier. The way in which this is done is similar to the way error-correcting output codes are used to transform multiclass learning problems into binary problems [4]; we assign a binary code to each class, and train binary classifiers to generate those codes from input examples.

Figure 1: Original image, initial superpixel segmentation, MRF graph structure, and ground truth labeling. Panels: (a) original image; (b) superpixels and graph structure; (c) ground truth labeling.

In particular, the binary code we assign to each class is an encoding of a path in a binary tree where each leaf is a class. The predicted code for a new example is obtained by evaluating a sequence of classifiers, each of which adds one bit to the predicted code. Since the length of any code is logarithmic in the number of classes, evaluating a single example requires only a logarithmic number of classifier evaluations. Equivalently, our method can be described as a spatial cascade of contextual binary classifiers.

Therefore, the key to this strategy is the underlying binary classifier. We take advantage of a contextual model that admits exact, tractable learning and inference methods when restricted to the binary case; namely, the maximum margin Markov network (M3N) [11]. By using M3N in this binary coding framework, we sidestep the intractability associated with the direct application of M3N to the multiclass case.
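The path encoding described above can be sketched as follows. This is an illustrative example only: the helper `assign_codes`, the even-split rule, and the class names are assumptions, and the text does not prescribe how classes are ordered within the tree.

```python
def assign_codes(classes):
    """Map each class to the bit string of its root-to-leaf path.

    Assumption: the tree is built by recursively splitting the class list
    into two halves, so the code length is about log2(number of classes).
    """
    codes = {}

    def split(scope, prefix):
        if len(scope) == 1:
            codes[scope[0]] = prefix      # reached a leaf
            return
        mid = (len(scope) + 1) // 2
        split(scope[:mid], prefix + "0")  # left child adds bit 0
        split(scope[mid:], prefix + "1")  # right child adds bit 1

    split(list(classes), "")
    return codes


codes = assign_codes(["robot", "dog", "camera", "mug", "ball"])
longest = max(len(code) for code in codes.values())
```

With five classes no code is longer than three bits, so labeling one example needs at most three binary classifier evaluations.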

In a nutshell, M3N learns mappings from features to MRF affinities that will hopefully induce the desired labeling as the maximum probability configuration of the MRF generated by applying these mappings to a given feature graph. The tractable version that we use here is able to learn “associative” potentials; i.e., it can learn positive correlations between the labels of adjacent nodes in the graph, also known as “guilt by association.”

2.1 Related work

There is a growing body of literature concerning contextual image segmentation that bears some resemblance to our work [7] [6] [12] [10]. As in our method, these methods build a probabilistic model that introduces spatial correlation between labels of neighboring pixels or image regions. These models are then trained in a discriminative way, usually with the goal of maximizing the likelihood of data under a parametrized likelihood function. Inference and learning in these models is typically intractable, and one must resort to approximate methods for both. None of these methods are suitable for real-time applications.

With respect to these methods, the type of M3N our method is built on has an advantage in that inference and learning are both tractable and exact, although some expressivity must be sacrificed to get these benefits. Such is the case in [14], which uses the same general framework as the one used here, but has the disadvantage of being restricted to binary prediction. We will show in this work how to circumvent this restriction while staying in the realm of tractable contextual inference and learning.

We are also concerned with the issue of computational scaling in terms of the number of object classes, which has been examined before in [13]. That work also claims approximately logarithmic scaling of computational complexity with respect to the number of classes. However, it is framed in the paradigm of using sliding window detectors for object detection, and does not leverage a contextual model. We note that such an approach could still be useful in our framework as a way of producing highly discriminative input features. Finally, we note that Isukapalli et al. have previously applied cascaded binary classifiers to solve the multiclass classification problem [8] and also commented on its applicability to the problem of scaling with respect to classes. In contrast to that work, however, we employ this trick to escape the inherent computational difficulty in evaluating multiclass contextual models.

3 Learning discriminative MRFs

We begin our detailed discussion by considering first the binary case that is the basis for the multiclass algorithm. Although an equivalent derivation of the following appears in [14], we present a simpler, more intuitive derivation of the algorithm here.

Suppose we have generated a feature graph from an image as described in section 2. Our first task is to define the “best” binary labeling of the nodes given this information. Let $y \in \{0,1\}^N$ be a labeling of the $N$ nodes in the graph, and let $x_i$ be a feature vector associated with the $i$th node. We define a unary energy function $E^U_w(y)$ parametrized by weights $w$:

$$
E^U_w(y) = \sum_i \begin{cases} -\langle w_0, x_i \rangle & \text{if } y_i = 0 \\ -\langle w_1, x_i \rangle & \text{if } y_i = 1 \end{cases}
= -\sum_i \left[ \langle w_1, x_i \rangle y_i + \langle w_0, x_i \rangle (1 - y_i) \right] \tag{1}
$$

Consider what happens if $w_1 = w$ and $w_0 = -w$ for a fixed vector $w$, and we find $y^* = \arg\min_y E^U_w(y)$. The energy is then $-\sum_i \langle w, x_i \rangle (2y_i - 1)$, so clearly $y^*_i = (\operatorname{sgn}\langle w, x_i \rangle + 1)/2$; i.e., we choose $y_i$ according to which side of the hyperplane with normal $w$ the point $x_i$ falls on. Conversely, given the labels $y$, finding $w$ amounts to the classical problem of finding a hyperplane that separates the positively labeled $x_i$ from the negatively labeled $x_i$.

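We add context to this model by adding the following binary energy function:

$$
E^B_w(y) = \sum_{\langle ij \rangle} \begin{cases} -\langle w_{00}, x_{ij} \rangle & \text{if } y_i = y_j = 0 \\ -\langle w_{11}, x_{ij} \rangle & \text{if } y_i = y_j = 1 \\ 0 & \text{otherwise} \end{cases}
= -\sum_{\langle ij \rangle} \left[ \langle w_{11}, x_{ij} \rangle y_i y_j + \langle w_{00}, x_{ij} \rangle (1 - y_i)(1 - y_j) \right] \tag{2}
$$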

This energy function depends on the edge features $x_{ij}$. Minimizing the combined energy $E_w(y) = E^U_w(y) + E^B_w(y)$ is like minimizing $E^U_w(y)$, except that adjacent nodes are given feature- and label-dependent bonuses for agreeing on a label. We note that a limitation of this model is that $\langle w_{00}, x_{ij} \rangle$ and $\langle w_{11}, x_{ij} \rangle$ are required to be nonnegative, which is usually achieved by enforcing that $w_{00}$, $w_{11}$, and $x_{ij}$ all be nonnegative. This is the so-called submodularity assumption required for tractable inference via graph cuts [9].
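To make the effect of the pairwise term concrete, the sketch below evaluates the combined energy of Eqs. 1 and 2 on a three-node chain and minimizes it by exhaustive search. The brute-force search stands in for the graph cut purely for illustration and is only feasible for tiny graphs; the function names, dictionary keys, and toy feature values are hypothetical.

```python
from itertools import product


def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def energy(w, node_feats, edge_feats, y):
    """Combined energy: unary term (Eq. 1) plus pairwise term (Eq. 2)."""
    e = 0.0
    for i, x in enumerate(node_feats):
        e -= dot(w["w1"], x) if y[i] == 1 else dot(w["w0"], x)
    for (i, j), x in edge_feats.items():
        if y[i] == 1 and y[j] == 1:
            e -= dot(w["w11"], x)   # bonus for agreeing on label 1
        elif y[i] == 0 and y[j] == 0:
            e -= dot(w["w00"], x)   # bonus for agreeing on label 0
    return e


def brute_force_argmin(w, node_feats, edge_feats):
    """Exhaustive minimization over all 2^n labelings (illustration only)."""
    n = len(node_feats)
    return min(product((0, 1), repeat=n),
               key=lambda y: energy(w, node_feats, edge_feats, y))


# Chain 0 - 1 - 2; the middle node has weak evidence for label 0.
node_feats = [[1.0], [-0.2], [1.0]]
edge_feats = {(0, 1): [1.0], (1, 2): [1.0]}

w_unary = {"w0": [-1.0], "w1": [1.0], "w00": [0.0], "w11": [0.0]}
w_context = {"w0": [-1.0], "w1": [1.0], "w00": [0.5], "w11": [0.5]}

y_unary = brute_force_argmin(w_unary, node_feats, edge_feats)
y_context = brute_force_argmin(w_context, node_feats, edge_feats)
```

With zero edge weights the middle node follows its own weak evidence, while with positive edge weights the agreement bonuses pull it to the label of its neighbors.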

This energy function therefore resolves both the matter of what is the best labeling given a parameter vector $w$, and how to find the best labeling efficiently. The minimizer of an energy function such as $E_w(y)$ can be obtained in polynomial time via network flow optimization [9]. We need not concern ourselves with the details of this optimization, except to note that very efficient codes exist to solve this problem [2].

We now turn to the issue of learning $w$. Let $\hat{y}$ be a desired labeling. The M3N objective is that the energy $E_w(\hat{y})$ of the desired labeling induced by the parameters $w$ should be less than the energy of any other labeling by a margin equal to the Hamming loss function

$$
L(\hat{y}, y) = \sum_i [\hat{y}_i \neq y_i] = \sum_i \left[ \hat{y}_i (1 - y_i) + y_i (1 - \hat{y}_i) \right] \tag{3}
$$

This yields the objective

$$
E_w(\hat{y}) \le E_w(y) - L(\hat{y}, y), \quad \forall y \in \{0,1\}^N \tag{4}
$$

We can replace this exponential number of constraints by the most restrictive constraint, yielding

$$
E_w(\hat{y}) \le \min_{y \in \{0,1\}^N} \left[ E_w(y) - L(\hat{y}, y) \right] \tag{5}
$$

Adding regularization on $w$ and a slack variable yields the optimization problem

$$
\begin{aligned}
\min_{w \in W} \quad & \|w\|^2 + C\xi \\
\text{subject to} \quad & E_w(\hat{y}) \le \xi + \min_{y \in \{0,1\}^N} \left[ E_w(y) - L(\hat{y}, y) \right] \\
& \xi \ge 0
\end{aligned} \tag{6}
$$

where $w \in W$ indicates necessary constraints on $w$; i.e., the components corresponding to edge features should be nonnegative, as previously discussed. Writing the constraint as

$$
E_w(\hat{y}) - \min_{y \in \{0,1\}^N} \left[ E_w(y) - L(\hat{y}, y) \right] \le \xi \tag{7}
$$

should make it clear that this constraint must hold with equality at an optimum solution. This allows us to substitute the expression on the left directly into the objective, yielding the following convex, nondifferentiable objective function:

$$
\min_{w \in W} \; \|w\|^2 + C \left[ E_w(\hat{y}) - \min_{y \in \{0,1\}^N} \left( E_w(y) - L(\hat{y}, y) \right) \right] \tag{8}
$$
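The bracketed slack term of Eq. 8 can be checked numerically on a toy problem. The sketch below is illustrative only: `slack` minimizes by exhaustive search rather than by graph cut, and `toy_energy` is a hypothetical unary-only energy with made-up scores. Note that the Hamming loss of Eq. 3 is a sum over nodes, so subtracting it from the energy only shifts the unary terms and the inner minimization keeps the same form as ordinary inference.

```python
from itertools import product


def hamming(y_hat, y):
    """Hamming loss written as in Eq. 3 for binary labels."""
    return sum(a * (1 - b) + b * (1 - a) for a, b in zip(y_hat, y))


def slack(energy_fn, y_hat):
    """E(y_hat) - min_y [E(y) - L(y_hat, y)], the bracketed term of Eq. 8."""
    n = len(y_hat)
    best = min(energy_fn(y) - hamming(y_hat, y)
               for y in product((0, 1), repeat=n))
    return energy_fn(y_hat) - best


# Hypothetical unary-only energy: each node has a signed score for label 1.
scores = [2.0, 0.25, -3.0]


def toy_energy(y):
    return -sum(s * (2 * b - 1) for s, b in zip(scores, y))


y_hat = (1, 1, 0)
xi = slack(toy_energy, y_hat)
```

Here the desired labeling already has the lowest energy, but the middle node wins by less than the required margin of one, so the slack is positive. The slack is never negative, because the labeling y = y_hat is always among the candidates of the inner minimization.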

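This objective can be minimized by a subgradient algorithm. The description of the subdifferential $\partial$ is given by Danskin’s theorem [1]. Let $\phi(w, y) = -(E_w(y) - L(\hat{y}, y))$, and let $Y_0(w) = \{ y \mid \phi(w, y) = \max_{y' \in \{0,1\}^N} \phi(w, y') \}$. By Danskin’s theorem,

$$
\partial_w \max_{y} \phi(w, y) = \operatorname{conv} \left\{ \frac{\partial \phi(w, y)}{\partial w} \;\middle|\; y \in Y_0(w) \right\} \tag{9}
$$

Therefore, we can find a subgradient of the objective by finding a gradient of $\phi$ with respect to $w$ at a $y^*$ that minimizes $E_w(y) - L(\hat{y}, y)$. To be explicit, we have the following subgradients of the slack term $\xi$:

$$
\sum_i x_i (\hat{y}_i - y^*_i) \in \partial_{w_0} \xi \tag{10}
$$

$$
\sum_i x_i (y^*_i - \hat{y}_i) \in \partial_{w_1} \xi \tag{11}
$$

$$
\sum_{\langle ij \rangle} x_{ij} \left( (1 - y^*_i)(1 - y^*_j) - (1 - \hat{y}_i)(1 - \hat{y}_j) \right) \in \partial_{w_{00}} \xi \tag{12}
$$

$$
\sum_{\langle ij \rangle} x_{ij} \left( y^*_i y^*_j - \hat{y}_i \hat{y}_j \right) \in \partial_{w_{11}} \xi \tag{13}
$$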

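In summary, this leads to the following iterative algorithm for minimizing the objective:

$$
w \leftarrow \Pi \left[ w - \eta \left( 2w + \begin{bmatrix} \partial_{w_0} \xi \\ \partial_{w_1} \xi \\ \partial_{w_{00}} \xi \\ \partial_{w_{11}} \xi \end{bmatrix} \right) \right] \tag{14}
$$

where $\partial \xi$ denotes any vector in the subgradient, and $\Pi$ denotes projection onto the feasible set:

$$
\Pi \begin{bmatrix} w_0 \\ w_1 \\ w_{00} \\ w_{11} \end{bmatrix} = \begin{bmatrix} w_0 \\ w_1 \\ \max(w_{00}, 0) \\ \max(w_{11}, 0) \end{bmatrix} \tag{15}
$$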
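The update of Eqs. 10 to 15 can be sketched on a two-node toy graph as follows. This is a hedged illustration rather than the authors’ code: the loss-augmented minimization is done by exhaustive search instead of a graph cut, the update is written exactly as in Eq. 14, and the names (`subgradient_step`, `weighted_sum`, the dictionary keys) and all numeric values are assumptions.

```python
from itertools import product


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def energy(w, X, XE, y):
    """Combined energy of Eqs. 1 and 2."""
    e = -sum(dot(w["w1"], x) * y[i] + dot(w["w0"], x) * (1 - y[i])
             for i, x in enumerate(X))
    e -= sum(dot(w["w11"], x) * y[i] * y[j]
             + dot(w["w00"], x) * (1 - y[i]) * (1 - y[j])
             for (i, j), x in XE.items())
    return e


def hamming(y_hat, y):
    return sum(a != b for a, b in zip(y_hat, y))


def loss_augmented_argmin(w, X, XE, y_hat):
    # Brute force stands in for the graph cut; tiny graphs only.
    return min(product((0, 1), repeat=len(X)),
               key=lambda y: energy(w, X, XE, y) - hamming(y_hat, y))


def weighted_sum(feats, coeffs, dim):
    out = [0.0] * dim
    for x, c in zip(feats, coeffs):
        for k in range(dim):
            out[k] += c * x[k]
    return out


def subgradient_step(w, X, XE, y_hat, eta):
    ys = loss_augmented_argmin(w, X, XE, y_hat)
    dn, de = len(w["w0"]), len(w["w00"])
    edges = list(XE)
    ef = [XE[e] for e in edges]
    g = {
        "w0": weighted_sum(X, [y_hat[i] - ys[i] for i in range(len(X))], dn),   # Eq. 10
        "w1": weighted_sum(X, [ys[i] - y_hat[i] for i in range(len(X))], dn),   # Eq. 11
        "w00": weighted_sum(ef, [(1 - ys[i]) * (1 - ys[j])
                                 - (1 - y_hat[i]) * (1 - y_hat[j])
                                 for i, j in edges], de),                       # Eq. 12
        "w11": weighted_sum(ef, [ys[i] * ys[j] - y_hat[i] * y_hat[j]
                                 for i, j in edges], de),                       # Eq. 13
    }
    new_w = {}
    for name in w:
        stepped = [wk - eta * (2 * wk + gk)
                   for wk, gk in zip(w[name], g[name])]                         # Eq. 14
        if name in ("w00", "w11"):
            stepped = [max(v, 0.0) for v in stepped]                            # Eq. 15
        new_w[name] = stepped
    return new_w


# Two nodes joined by one edge; desired labeling (1, 0).
X = [[1.0], [-1.0]]
XE = {(0, 1): [1.0]}
y_hat = (1, 0)
w_init = {"w0": [0.0], "w1": [0.0], "w00": [0.0], "w11": [0.0]}

w_trained = w_init
for _ in range(50):
    w_trained = subgradient_step(w_trained, X, XE, y_hat, eta=0.1)
```

After training, ordinary (not loss-augmented) energy minimization on this toy graph recovers the desired labeling, and the projection keeps the edge weights nonnegative throughout.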

This is only one possible subgradient update rule. In practice, we have found that it is helpful to employ a “heavy ball” method [1]. In this method, the weight update is passed through an IIR filter that smooths out discontinuous jumps caused by the nondifferentiability of the objective.
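One common form of the heavy-ball update is sketched below, assuming a first-order filter with momentum coefficient `beta`. The text does not specify the filter or its coefficients, so `beta`, the function name, and the example values are assumptions; the projection of Eq. 15 would still be applied to the edge weights after each step.

```python
def heavy_ball_step(w, velocity, grad, eta, beta=0.9):
    """One heavy-ball update.

    The step is filtered by a first-order IIR filter (assumed form):
        v <- beta * v - eta * g
        w <- w + v
    """
    new_velocity = [beta * v - eta * g for v, g in zip(velocity, grad)]
    new_w = [wi + vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity


w, v = [1.0], [0.0]
w, v = heavy_ball_step(w, v, grad=[2.0], eta=0.1, beta=0.5)
w_after_first = list(w)
# A zero subgradient on the next step still moves w, because of the filter memory.
w, v = heavy_ball_step(w, v, grad=[0.0], eta=0.1, beta=0.5)
w_after_second = list(w)
```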

We also remark that, as noted in [14], this type of subgradient method scales well for the case of vast amounts of training data, for a few reasons. First, the memory requirements are linear in the input size, since we only need to compute multiples and sums of the training features. Second, as mentioned earlier, the subgradient calculation itself is embarrassingly parallelizable, the reason being that the required graph-cut optimization can be solved independently for each training image.

4 Multiclass extension

The inference and learning method of the previous section may be used to solve the complete object recognition and segmentation problem given that there are only two object classes; this method takes as input a feature graph and outputs a binary labeling of the nodes that optimizes a global energy function that induces hopefully-correct segmentations on the training set. We now consider how to leverage this binary contextual learning algorithm to generate an effective multiclass learning algorithm.

Suppose we are given a binary tree such as the one depicted in Figure 2. Each node is associated with a scope of object classes. The root node contains all classes, each leaf node is assigned a distinct class, and each internal node evenly splits the classes within its scope between its two children.

Given such a tree, our method works as follows. Each internal tree node takes as input a feature graph from its parent. It then attempts to label each node in this graph according to which child’s scope contains the true class of the node. The tree node then creates two disjoint subgraphs. Each subgraph contains only those nodes within the scope of one of the child nodes, and only the edges between such nodes; i.e., it is the subgraph induced by the set of nodes belonging to the scope of a child. Classification begins by submitting the original image feature graph to the root node, and it ends when every node in the graph has propagated down to a leaf node. The final classification of a node (superpixel) is the class assigned to the leaf node it eventually reaches. As mentioned earlier, this method can be viewed as a certain type of binary coding of a multiclass problem: the code we assign to each class is the sequence of “left-right” moves along the unique path from the root to the class’s leaf.
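The propagation of subgraphs down the tree can be sketched as below. This is an illustrative skeleton under stated assumptions: each internal node holds an arbitrary callable standing in for the binary MRF classifier, and the `TreeNode` class, the `threshold` stub, and the toy features are hypothetical.

```python
class TreeNode:
    def __init__(self, label=None, classify=None, left=None, right=None):
        self.label = label        # class name for a leaf, None for an internal node
        self.classify = classify  # internal node: callable(nodes, edges) -> {node: 0 or 1}
        self.left = left
        self.right = right


def induced_edges(edges, keep):
    """Edges of the subgraph induced by the node set keep."""
    return [(i, j) for i, j in edges if i in keep and j in keep]


def cascade_classify(tree, nodes, edges):
    """Return {superpixel: class label} by pushing subgraphs down the tree."""
    result = {}
    stack = [(tree, list(nodes), list(edges))]
    while stack:
        t, ns, es = stack.pop()
        if not ns:
            continue                      # nothing reached this tree node
        if t.label is not None:
            for n in ns:
                result[n] = t.label       # leaf: final classification
            continue
        bits = t.classify(ns, es)         # one binary classifier evaluation
        for bit, child in ((0, t.left), (1, t.right)):
            keep = {n for n in ns if bits[n] == bit}
            stack.append((child, sorted(keep), induced_edges(es, keep)))
    return result


# Stub classifiers: threshold a scalar feature instead of running an MRF.
feat = {0: 0.1, 1: 0.4, 2: 0.6, 3: 0.9}


def threshold(t):
    return lambda nodes, edges: {n: int(feat[n] >= t) for n in nodes}


tree = TreeNode(
    classify=threshold(0.5),
    left=TreeNode(classify=threshold(0.25), left=TreeNode("A"), right=TreeNode("B")),
    right=TreeNode(classify=threshold(0.75), left=TreeNode("C"), right=TreeNode("D")),
)
labels = cascade_classify(tree, [0, 1, 2, 3], [(0, 1), (1, 2), (2, 3)])
```

Tree nodes that receive an empty subgraph are skipped, which is why the number of classifier evaluations depends on the classes actually predicted in the image.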

Figure 2: An illustration of the tree structure used for multiclass classification in a toy problem with four classes.

Each of these tree node classifiers is exactly an instance of a binary discriminative MRF classifier, as discussed in section 3. As in ECOC, this allows us to use that classifier as a subroutine in constructing the multiclass classifier. In order to train a given classifier on a given input feature graph, we label each node in the feature graph with the identity of the child (0 or 1) whose scope encompasses the node’s class. We then use the discriminative binary MRF classifier to learn parameters for this classifier that induce this labeling of the feature graph. We then create subgraphs induced by this split, as before, and feed the subgraphs as input to this classifier’s children classifiers. Some sample subgraphs generated during the training process are shown in Figure 3.

It is worth noting that each of the classifier nodes may be trained independently and in parallel, yielding a computational complexity that does not scale with the number of classes provided that enough processors are available.

Figure 3: Feature graphs and subgraphs thereof generated during the training process.

4.1 Performance considerations

We are now in a position to discuss the performance considerations that motivated this approach and why they are justified with respect to other possible variations of the M3N framework. First, we note that, in the worst case, our method requires min(|V|, N − 1) binary classifier evaluations, where |V| is the number of nodes in the feature graph and N is the number of classes. This occurs when every node in the image is labeled with a different class. At the other extreme, if the classifier performs perfectly, and there are K objects in the scene, the classifier performs no more than K log N internal classifier evaluations: in this case, exactly K groups propagate down to the leaves of the tree, encountering log N classifiers on the way. Some of these are shared between classes, so K log N is an upper bound on the total number of binary classifier evaluations. Therefore, if we make the reasonable assumption that the number of predicted classes is roughly equal to the number of classes present, we expect the running time to be O(K log N).
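The counting argument above can be checked with a short sketch. It assumes a complete binary tree over N classes with N a power of two, base-2 logarithms, and that class c sits at the leaf whose path is the binary representation of c; the function name and the example class sets are hypothetical.

```python
import math


def internal_evaluations(present_classes, num_classes):
    """Count distinct internal tree nodes visited when only the given
    classes reach leaves (assumes num_classes is a power of two)."""
    depth = math.ceil(math.log2(num_classes))
    visited = set()
    for c in present_classes:
        code = format(c, "0{}b".format(depth))
        for k in range(depth):
            visited.add(code[:k])   # each proper prefix names one internal node
    return len(visited)


N = 1024
few = internal_evaluations({0, 1, 512}, N)   # K = 3 classes present
everything = internal_evaluations(range(N), N)
```

For three classes out of 1024, the shared prefixes bring the count below the K log N bound of 30, and when every class is present the count equals the N − 1 internal nodes of the tree.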

We note that we initially considered other variations of the M3N framework, but none of the variations we considered have the potential for sublinear scaling in the total number of classes. One such variation arises by using a multi-label variation of Eq. 8. The derivation in section 3 is fairly general in that Eq. 8 admits a solution via a subgradient method under weak conditions, assuming that an inference method exists to perform the inner energy minimization. In the multi-label (Potts model) case, the α-expansion and α-β-swap algorithms [3] have enjoyed much success for certain problems; however, these methods require iterations of a number of graph-cut operations that is at least linear in the number of classes. Furthermore, these inference methods are only approximate.

We also note that the solution of the M3N objective, under certain restrictions, has a convex relaxation that has also seen success in the multi-label case [11]. However, this solution is based on the solution of a quadratic program that does not scale well with very large training sets.

5 Experiments

We will first briefly discuss the practical consideration of feature selection before discussing experimental results. Our features were fairly simple for two reasons: the first being that we are only interested in features that can be computed in real-time, and the second being that we wish to show that even very simple features have the potential to perform well when endowed with the benefit of context.

For node features, we chose to compute histogram-based features. The image was converted to HSV and independent histograms were computed for each image channel over the area of each superpixel. These histograms were then normalized and concatenated into the node feature vector. We additionally compute histograms of image gradients within each superpixel. A constant bias feature was also added.

Edge features were computed as functions of node features. In particular, for each adjacent pair $i, j$ of nodes with $k$th node features $x_i^{(k)}$ and $x_j^{(k)}$, respectively, we computed the $k$th edge feature $\sqrt{x_i^{(k)} x_j^{(k)}}$. These features are intended to convey co-occurrence of the $k$th feature. Finally, we computed the cosine of the angle between the adjacent nodes’ feature vectors, which is motivated from an entropy minimization perspective; if joining two superpixels results in a relatively low-entropy feature distribution, we may want to learn a preference to join them.
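The edge feature computation described above can be sketched as follows, assuming the node features are nonnegative (as normalized histograms are). The function name and the toy vectors are hypothetical.

```python
import math


def edge_features(xi, xj):
    """Elementwise sqrt(x_i^(k) * x_j^(k)), followed by the cosine of the
    angle between the two node feature vectors (assumes nonnegative inputs)."""
    co_occurrence = [math.sqrt(a * b) for a, b in zip(xi, xj)]
    inner = sum(a * b for a, b in zip(xi, xj))
    norm = math.sqrt(sum(a * a for a in xi)) * math.sqrt(sum(b * b for b in xj))
    cosine = inner / norm if norm > 0 else 0.0
    return co_occurrence + [cosine]


# Two toy 3-bin histograms that overlap only in the first bin.
x_a = [0.25, 0.75, 0.0]
x_b = [0.25, 0.0, 0.75]
f_ab = edge_features(x_a, x_b)
f_aa = edge_features(x_a, x_a)
```

With nonnegative node features, every edge feature produced this way is also nonnegative, which is compatible with the nonnegativity requirement on $x_{ij}$ discussed in section 3.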

We implemented our method in C++, using the graph cut code provided by Boykov and Kolmogorov [2], and the graph-based image segmentation code provided by Felzenszwalb [5]. We scaled our input images to 160x120 pixels before doing any subsequent processing. At this resolution, running on a 1.83 GHz Core 2 Duo computer, the total processing time per frame (including pre-segmentation, feature computation, and classifier evaluation) varied between approximately 90 and 120 ms.

Training images were obtained by recursive background segmentation. First, a static background would be learned, and an object placed against that background. Differencing was then used to obtain a segmentation mask for that object. This new scene was then held static while a new object was introduced to the scene, and so on. This resulted in somewhat imperfect ground truth segmentations, as seen in Figure 1.

Table 1: Summary of results of 5-fold cross-validation

Method    Mean accuracy
Local     47.2%
MRF       75.5%

We evaluated our method on a small dataset of 57 images obtained in this way, featuring 11 object classes, including the background (Figure 4). We performed 5-fold cross-validation to estimate the classification accuracy. Classification accuracy was calculated as the percentage of non-background pixels correctly classified. Background pixels were excluded because they are so prevalent in the images that a classifier could otherwise achieve a high accuracy rate by simply guessing background everywhere.
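The accuracy measure described above can be sketched as follows. The confusion matrix here is made-up toy data, not taken from the experiments, and it is assumed that rows index the true class and columns the predicted class; the function name is hypothetical.

```python
def non_background_accuracy(confusion, classes, background="background"):
    """Fraction of non-background pixels that are correctly classified.

    confusion[t][p] counts pixels of true class classes[t] that were
    predicted as classes[p].
    """
    bg = classes.index(background)
    rows = [t for t in range(len(classes)) if t != bg]
    correct = sum(confusion[t][t] for t in rows)
    total = sum(sum(confusion[t]) for t in rows)
    return correct / total if total else 0.0


classes = ["background", "mug", "ball"]
toy_confusion = [
    [900, 50, 50],   # true background
    [10, 80, 10],    # true mug
    [20, 0, 80],     # true ball
]
acc = non_background_accuracy(toy_confusion, classes)

# A classifier that predicts background everywhere gets no credit.
all_background = [[1000, 0, 0], [100, 0, 0], [100, 0, 0]]
acc_trivial = non_background_accuracy(all_background, classes)
```

On the second toy matrix, plain pixel accuracy would be 1000/1200, which illustrates why background pixels are excluded from the measure.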

We compared our method to a linear classifier on the node (local) features alone. This classifier was obtained by training the contextual classifier after constraining the edge weights to be zero.

5.1 Results

Table 1 summarizes the results of this experiment. The cascaded MRF classifier achieved nearly a 60% relative improvement in classification accuracy over the purely local classifier (75.5% versus 47.2%). Figure 5 shows some typical scenes labeled by both methods compared to the ground truth segmentation. The first thing immediately apparent from Figure 5 is the noisy nature of many of the images classified by the local classifier. The cascaded MRF results are significantly smoother, and tend to exhibit more “decisive” labelings, where an object will be either entirely correct or incorrect. Notable mistakes by the cascaded MRF happen on one of the robots, which is sometimes misclassified as a mouse pad, and the mouse pad, which is sometimes misclassified as a robot.

Class confusion matrices for the two approaches are also shown in Table 2. Context seems to help especially for “grayRedAibo” and “grayBlueAibo”, which are both gray with red and blue patches, respectively. Without context, these classes exhibit significant confusion between themselves and miscellaneous other classes.

Conclusions

We have demonstrated a unique method for real-time object recognition and image segmentation based on cascades of discriminative MRFs. This method is scalable with respect to parallelism, memory usage, and number of recognized classes, which makes it especially suitable for practical applications. Experimental results have shown that the method has significant potential to boost the performance of existing purely local classifiers.

A limitation of our method is that it seems to be sensitive to random variation in the pre-segmentation output, which causes temporal noise in the classified output. We are currently investigating what effects this instability has on learning performance, and how this might be ameliorated by the use of segmentation routines with alternative objectives.

We have also not discussed how to generate the assignment of classes to each scope in the tree of classifiers. So far, we have observed that a random assignment produces adequate results. However, it is possible that better results could be achieved by averaging over different random trees, in a manner similar to random forests, or by building a single tree with an information-like criterion.

References

[1] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 2003.

[2] Y. Boykov and V. Kolmogorov. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, September 2004.

[3] Y. Boykov, O. Veksler, and R. Zabih. Fast approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23:2001, 2001.

[4] T. G. Dietterich and G. Bakiri. Solving multiclass learning problems via error-correcting output codes. Journal of Artificial Intelligence Research, 2:263–286, 1995.

[5] P. F. Felzenszwalb and D. P. Huttenlocher. Efficient graph-based image segmentation. International Journal of Computer Vision, 2004.

[6] S. Gould, J. Rodgers, D. Cohen, G. Elidan, and D. Koller. Multi-class segmentation with relative location prior. International Journal of Computer Vision, 2008.

Figure 4: A sample of images from a test set

Figure 5: A comparison of typical labelings generated by the purely local classifier and the cascaded MRF classifier. Panels: (a) local; (b) MRF; (c) ground truth.

Table 2: Class confusion matrices accumulated over 5-fold cross-validation. Rows are ground-truth classes and columns are predicted classes.

PURELY LOCAL CLASSIFIER
              background  mousePad  rubberCement  videoCam    mug  rubix  whiteAibo  orangeBall  grayBlueAibo  grayRedAibo  greenCan
background        936798       904           752      3239   2197    196        584         281          7756         3180      2239
mousePad            3598      5008             0       999    110    660       1663         256           694         1521       751
rubberCement         221         0             0       240      0      0          0           0            17            0         0
videoCam            3214       595           919      8442    171      1        348           0          1214          608       819
mug                 3974       337           214       581   9182      0        136         374           577            4       511
rubix               2084       152             7        96     10   2838          0         668           442          341       410
whiteAibo            642         0             0       189   1569     19       1002           0            38           29        52
orangeBall           896         0             0         0      0      0         41        1360             0            0         0
grayBlueAibo        5587        28             0       151     22    182        203         111          2894         1323         0
grayRedAibo         4014       588             0       963    331      0        112         605           997         4891       190
greenCan            2589         0           675      1889    327    143       1124          65           552          366      6063

CASCADED MRF CLASSIFIER
              background  mousePad  rubberCement  videoCam    mug  rubix  whiteAibo  orangeBall  grayBlueAibo  grayRedAibo  greenCan
background        947528      2068            54      1846    929    220        320         418          2270         1182      1134
mousePad             848      8849             0         0      0      0          0           0             0         5525        38
rubberCement         156         0            34       213      0      0          0           0             0            0        75
videoCam            2520         0           361     12834      0     27          0           0             0          355       234
mug                 3139        50             2       804  11765      0          0           3             0            0        99
rubix               2223         0             0        48    133   3907          0           0             0          339       398
whiteAibo            436       120             0         0      0      0       2782           0           130           72         0
orangeBall           231        24             0         0      0      0          0        1927             0          115         0
grayBlueAibo        1051        61             0         0      0      0         31           0          9358            0         0
grayRedAibo          925      3865             0         0     66      0        283           3            17         7532         0
greenCan            1828         0           695         0    588      0        364          12             0            0     10306

[7] X. He, R. S. Zemel, and M. A. Carreira-Perpinan. Multiscale conditional random fields for image labeling. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004.

[8] R. Isukapalli, A. Elgammal, and R. Greiner. Learning to detect objects of many classes using binary classifiers. In European Conference on Computer Vision, 2006.

[9] V. Kolmogorov. What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, February 2004.

[10] S. Kumar and M. Hebert. Discriminative fields for modeling spatial dependencies in natural images. In Neural Information Processing Systems, 2003.

[11] B. Taskar, V. Chatalbashev, and D. Koller. Learning associative Markov networks. In Proceedings of the International Conference on Machine Learning, 2004.

[12] A. Torralba, K. P. Murphy, and W. T. Freeman. Contextual models for object detection using boosted random fields. In Neural Information Processing Systems, 2004.

[13] A. Torralba, K. P. Murphy, and W. T. Freeman. Sharing features: efficient boosting procedures for multiclass object detection. In Computer Vision and Pattern Recognition, 2004.

[14] P. Vernaza, B. Taskar, and D. D. Lee. Online, self-supervised terrain classification via discriminatively trained submodular Markov random fields. In IEEE International Conference on Robotics and Automation, May 2008.