
Learning Affinity Functions for Image Segmentation: Combining Patch-based and Gradient-based Approaches

    Charless Fowlkes, David Martin, Jitendra Malik

    Department of Electrical Engineering and Computer Science

    University of California, Berkeley, CA 94720

{fowlkes,dmartin,malik}@eecs.berkeley.edu

    Abstract

This paper studies the problem of combining region and boundary cues for natural image segmentation. We employ a large database of manually segmented images in order to learn an optimal affinity function between pairs of pixels. These pairwise affinities can then be used to cluster the pixels into visually coherent groups. Region cues are computed as the similarity in brightness, color, and texture between image patches. Boundary cues are incorporated by looking for the presence of an intervening contour, a large gradient along a straight line connecting two pixels.

We first use the dataset of human segmentations to individually optimize parameters of the patch and gradient features for brightness, color, and texture cues. We then quantitatively measure the power of different feature combinations by computing the precision and recall of classifiers trained using those features. The mutual information between the output of the classifiers and the same-segment indicator function provides an alternative evaluation technique that yields identical conclusions.

As expected, the best classifier makes use of brightness, color, and texture features, in both patch and gradient forms. We find that for brightness, the gradient cue outperforms the patch similarity. In contrast, using color patch similarity yields better results than using color gradients. Texture is the most powerful of the three channels, with both patches and gradients carrying significant independent information. Interestingly, the proximity of the two pixels does not add any information beyond that provided by the similarity cues. We also find that the convexity assumptions made by the intervening contour approach are supported by the ecological statistics of the dataset.

1. Introduction

Boundaries and regions are closely intertwined. A closed boundary generates a region, while every image region has a boundary. Psychophysics experiments suggest that humans use both boundary and region cues to perform segmentation [43]. In order to build a vision system capable of parsing natural images into coherent units corresponding to surfaces and objects, it is clearly desirable to make global use of both boundary and region information.

Historically, researchers have focused separately on the sub-problems of boundary and region grouping. Region-based approaches are motivated by the Gestalt notion of grouping by similarity. They typically involve integrating features such as color or texture over local patches of the image [8, 12, 32] and then comparing different patches [26, 31]. However, smooth changes in texture or brightness caused by shading and perspective within regions pose a problem for this approach, since two distant patches can be quite dissimilar despite belonging to the same image segment. To overcome these difficulties, gradient-based approaches detect local edge fragments marked by sharp, localized changes in some image feature [4, 24, 18, 30, 20]. The fragments can then be linked together in order to identify extended contours [28, 42, 6].

Less work has dealt directly with the problem of finding an appropriate intermediate representation in order to incorporate non-closed boundary fragments into segmentation. Mathematical formulations outlined by [11, 25, 23], along with algorithms such as [19, 13], have attempted to unify boundary and region information. More recently, [17, 40] have demonstrated the practical utility of integrating both in order to segment images of natural scenes.

There are widely held folk beliefs regarding the various cues used for image segmentation: brightness gradients (caused by shading) and texture gradients (caused by perspective) necessitate a boundary-based approach; edge detectors are confused by texture, so one must use patch-based similarity for texture segmentation; color integrated over local patches is a robust and powerful cue. However, these contradictory statements have not been empirically challenged. By using a dataset of human segmentations [21, 5] as ground truth, we are able to provide quantitative results regarding the ecological statistics¹ of patch- and gradient-based cues and gauge their relative effectiveness.

We treat the problem of integrating both gradient and patch information for segmentation within the framework of

¹ Our approach follows the lines of Egon Brunswik's suggestion nearly 50 years ago that the Gestalt factors made sense because they reflected the statistics of natural scenes [3].


[Figure 2: precision-recall plot; axes recall (horizontal) vs. precision (vertical). Curves: Patch F=0.60 MI=0.16; IC F=0.61 MI=0.14; Patch+IC F=0.65 MI=0.19; Humans F=0.77 MI=0.25; iso-F curve at F=0.77.]

Figure 2: Performance of humans compared to our best pixel affinity models. The dots show the precision and recall of each of 1366 human segmentations in the 250-image test set when compared to the other humans' segmentations of the same image. The large dot marks the median recall (99%) and precision (63%) of the humans. The iso-F-measure curve at F=77% is extended from this point to represent the frontier of human performance for this task. The three remaining curves represent our patch-only model, contour-only model, and patch+contour model. Neither patches nor contours are sufficient, as there is significant independent information in the patch and contour cues. The model used throughout the paper is a logistic function with quadratic terms, which performs the best among classifiers tried on this dataset.
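As a concrete illustration of this kind of model, the sketch below fits a logistic function over quadratic feature terms. The feature values and labels are placeholders, and scikit-learn stands in for whatever fitting procedure was actually used.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: each row holds the cue values for one pixel pair
# (e.g. patch chi-square distances and intervening-contour values); y says
# whether the pair lies in the same human-labelled segment.
X = np.random.rand(1000, 6)             # placeholder features
y = (X.sum(axis=1) < 3).astype(int)     # placeholder labels

# "Logistic function with quadratic terms": expand the features with all
# degree-2 products before fitting a plain logistic regression.
quad = PolynomialFeatures(degree=2, include_bias=False)
model = LogisticRegression(max_iter=1000)
model.fit(quad.fit_transform(X), y)

# The soft output in [0, 1] serves as the pairwise affinity.
affinity = model.predict_proba(quad.transform(X))[:, 1]
```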

The Berkeley Segmentation Dataset [21, 5] provides the ground-truth segmentation data. This dataset contains 12,000 manual segmentations of 1,000 images by 30 human subjects. Half of the images were presented to subjects in grayscale, and half in color. We use the color segmentations for 500 images, divided into test and training sets of 250 images each. Each image has been segmented by at least 5 subjects, so the ground truth is defined by a set of human segmentations. We declare two pixels to lie in the same segment only if all subjects declare them to lie in the same segment.
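The same-segment rule above is easy to state in code; this is a minimal sketch assuming each human segmentation is available as an integer label map.

```python
import numpy as np

def same_segment(segmentations, p, q):
    """Return True iff every human segmentation puts pixels p and q in the
    same segment.  `segmentations` is a list of integer label maps (one per
    subject); p and q are (row, col) tuples."""
    return all(seg[p] == seg[q] for seg in segmentations)

# Toy example with two 4x4 "segmentations" that agree on the left half.
seg_a = np.array([[0] * 2 + [1] * 2] * 4)
seg_b = np.array([[0] * 2 + [2] * 2] * 4)
print(same_segment([seg_a, seg_b], (0, 0), (3, 1)))  # True
print(same_segment([seg_a, seg_b], (0, 0), (0, 3)))  # False
```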

Given a classifier output, we can evaluate the classifier's performance in two ways. Our first evaluation technique uses the precision-recall (PR) framework, which is a standard method in the information retrieval community [33]. This framework was used by Abdou and Pratt [1] to evaluate edge detectors, and is similar to the ROC curve framework used by Bowyer et al. [2] for the same purpose. The approach produces a curve, parameterized by detector threshold, which shows the trade-off between noise and accuracy as the threshold varies; see Figure 2 for an example. Precision measures the probability that two pixels declared by the classifier to be in the same segment actually are in the same segment, i.e. P(same segment | declared same). Recall measures the probability that a same-segment pair is detected, i.e. P(declared same | same segment). The PR approach is particularly appropriate when the two classes are unbalanced. By focusing on the scarcer class (same-segment pairs in our case), performance is not inflated by the ease of detecting the dominant class.
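A minimal sketch of pairwise precision and recall under these definitions; the affinity values and ground-truth labels below are placeholders, and sweeping a threshold over the soft output traces out a PR curve.

```python
import numpy as np

def precision_recall(pred_same, true_same):
    """pred_same: boolean array, classifier says 'same segment' at some
    threshold; true_same: boolean ground-truth labels for the same pairs."""
    tp = np.sum(pred_same & true_same)           # correctly detected same-segment pairs
    precision = tp / max(np.sum(pred_same), 1)   # P(truly same | declared same)
    recall = tp / max(np.sum(true_same), 1)      # P(declared same | truly same)
    return precision, recall

# Sweeping a threshold over the soft affinity traces out the PR curve.
affinity = np.random.rand(10000)            # placeholder classifier output
true_same = np.random.rand(10000) < 0.1     # placeholder ground truth (unbalanced classes)
curve = [precision_recall(affinity > t, true_same) for t in np.linspace(0, 1, 20)]
```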

Precision and recall can be combined with the F-measure, which is simply a weighted harmonic mean: F = PR/(αR + (1-α)P). The weight α represents the relative importance of precision and recall for a particular application. We use α = 0.5 in our experiments. The F-measure can be evaluated along the precision-recall curve, and the maximum value used to characterize the curve with a single number. When two precision-recall curves do not intersect, the F-measure is a useful summary statistic.
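A small sketch of the F-measure and its use as a curve summary, assuming the weighted-harmonic-mean form above with equal weighting.

```python
def f_measure(precision, recall, alpha=0.5):
    """Weighted harmonic mean: F = P*R / (alpha*R + (1 - alpha)*P).
    alpha = 0.5 gives equal weight to precision and recall."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return precision * recall / (alpha * recall + (1 - alpha) * precision)

# A precision-recall curve is summarized by its maximum F value.  For example,
# the humans' median operating point (precision 0.63, recall 0.99) gives F ~ 0.77.
example_curve = [(0.63, 0.99), (0.70, 0.80), (0.85, 0.40)]   # (precision, recall) pairs
print(max(f_measure(p, r) for p, r in example_curve))
```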

The second approach to evaluating a classifier measures the mutual information between the classifier output C and the ground-truth data S. Given the joint distribution P(s, c), the mutual information is defined as the Kullback-Leibler divergence between the joint and the product of the marginals, so I(S;C) = Σ_{s,c} P(s,c) log [P(s,c) / (P(s)P(c))]. We compute the joint distribution by binning the soft classifier output.
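A sketch of this evaluation, assuming the soft output is binned uniformly into a fixed number of bins; the bin count and log base here are arbitrary choices.

```python
import numpy as np

def mutual_information(soft_output, true_same, n_bins=16):
    """I(S;C) = sum_{s,c} P(s,c) * log( P(s,c) / (P(s) P(c)) ), with the
    classifier output C discretized into n_bins uniform bins."""
    c = np.minimum((soft_output * n_bins).astype(int), n_bins - 1)
    joint = np.zeros((2, n_bins))
    for s, ci in zip(true_same.astype(int), c):
        joint[s, ci] += 1
    joint /= joint.sum()
    ps = joint.sum(axis=1, keepdims=True)   # marginal over ground truth
    pc = joint.sum(axis=0, keepdims=True)   # marginal over binned output
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log2(joint[nz] / (ps @ pc)[nz])))

affinity = np.random.rand(5000)             # placeholder soft classifier output
truth = np.random.rand(5000) < 0.2          # placeholder same-segment labels
print(mutual_information(affinity, truth))
```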

3. Features

We will model the affinity between two pixels as a function of both patch-based and gradient-based features. In each case, we can use brightness, color, or texture, producing a total of six features. We also consider the distance between the pixels as a seventh feature.

3.1. Patch-Based Features

Given a pair of pixels, we wish to measure the brightness, color, and texture similarity between circular neighborhoods of some radius centered at each pixel. Distributions of color in perceptual color spaces have been successfully used as region descriptors in image retrieval systems such as QBIC [26], as well as in many color segmentation algorithms. We employ the 1976 CIE L*a*b* color space separated into luminance and chrominance channels. We model brightness and color distributions with histograms constructed by binning kernel density estimates. Histograms are compared with the χ² histogram difference operator [32].

For the brightness cue, we use the L* histogram for each pixel. In the case of color, it is not necessary to compute the joint a*b* histogram. Instead, it suffices to compute separate a* and b* histograms and simply sum their χ² contributions. This is motivated by the fact that a* and b* correspond to the green-red and yellow-blue color opponent channels in the visual cortex, as well as by the perceptual orthogonality of the two channels (see Palmer [27]).

The χ² histogram difference does not make use of the perceptual distance between the bin centers. Therefore, without smoothing, perceptually similar colors can have large χ² differences. Because the distance between points in CIELAB space is perceptually meaningful in a local neighborhood, binning a kernel density estimate whose kernel bandwidth matches the scale of this neighborhood means that perceptually similar colors will have similar histogram contributions. Beyond this scale, where colors are incommensurate, χ² will regard them as equally different. The combination of a kernel density estimate in CIELAB with the χ² histogram difference is a good match to the structure of human color perception.
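A rough sketch of the brightness-patch comparison described above: bin a kernel density estimate of the L* values in each disc and compare with one common form of the χ² difference. The bin count, kernel bandwidth, and Gaussian kernel are illustrative assumptions. For the color cue, the same comparison would be applied to separate a* and b* histograms and the two χ² values summed.

```python
import numpy as np

def kde_histogram(values, n_bins=32, lo=0.0, hi=100.0, sigma=2.0):
    """Bin a Gaussian kernel density estimate of `values` (e.g. L* in a disc)."""
    centers = np.linspace(lo, hi, n_bins)
    # Each sample contributes a small Gaussian bump rather than a hard count,
    # so perceptually similar values fall into similar bins.
    hist = np.exp(-0.5 * ((centers[None, :] - values[:, None]) / sigma) ** 2).sum(axis=0)
    return hist / hist.sum()

def chi_square(h1, h2, eps=1e-10):
    """Chi-square histogram difference: 0.5 * sum (h1-h2)^2 / (h1+h2)."""
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

patch_a = np.random.uniform(20, 40, 200)   # placeholder L* values from one disc
patch_b = np.random.uniform(60, 80, 200)   # placeholder L* values from another disc
print(chi_square(kde_histogram(patch_a), kde_histogram(patch_b)))
```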

For the patch-based texture feature, we compare the distributions of filter responses in the two discs. There is an emerging consensus that for texture analysis, an image should first be convolved with a bank of filters tuned to various orientations and spatial frequencies [8, 18]. Our filter bank contains elongated quadrature-pair filters (Gaussian second derivatives and their Hilbert transforms) at six orientations, along with one center-surround filter. The empirical distribution of filter responses has been shown to be a powerful feature for both texture synthesis [12] and texture discrimination [31].

There are many options for comparing the distributions (see Puzicha et al. [31]), but we use the approach developed in [17], which is based on the idea of textons. The texton approach estimates the joint distribution of filter responses using adaptive bins, which are computed with k-means. The texture descriptor for a pixel is therefore a k-bin histogram of texton labels over the pixels in a disc of some radius centered on the pixel. As in [17], we compare descriptors with the χ² difference.
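A sketch of the texton descriptor: cluster filter responses with k-means to form adaptive bins (textons), then describe each pixel by the histogram of texton labels in a surrounding window. The small Gaussian-derivative filter bank and the square window are stand-ins for the paper's oriented quadrature-pair bank and circular disc; two such histograms would then be compared with the χ² difference as above.

```python
import numpy as np
from scipy import ndimage
from sklearn.cluster import KMeans

def texton_histograms(image, k=32, radius=8):
    """Return an (H, W, k) array of per-pixel texton histograms."""
    # Placeholder filter bank: a few Gaussian-derivative responses.
    responses = np.stack([
        ndimage.gaussian_filter(image, 2, order=(0, 1)),
        ndimage.gaussian_filter(image, 2, order=(1, 0)),
        ndimage.gaussian_filter(image, 2, order=(1, 1)),
        ndimage.gaussian_laplace(image, 2),
    ], axis=-1)
    h, w, d = responses.shape
    # Adaptive bins: k-means over the filter-response vectors defines the textons.
    labels = KMeans(n_clusters=k, n_init=4).fit_predict(responses.reshape(-1, d))
    labels = labels.reshape(h, w)
    # Histogram of texton labels around each pixel (square window for brevity).
    hists = np.zeros((h, w, k))
    for t in range(k):
        hists[:, :, t] = ndimage.uniform_filter((labels == t).astype(float), 2 * radius + 1)
    return hists

img = np.random.rand(64, 64)    # placeholder grayscale image
hists = texton_histograms(img)
```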

All of the patch-based features have parameters that require tuning, such as the radius of the discs, the binning parameters for brightness and color, and the texton parameters for texture. Section 4.3 covers the experiments that tune these parameters with respect to the training data.

3.2. Gradient-Based Features

Given a pair of pixels, consider the straight-line path connecting them in the image plane. If the pixels lie in different segments, then we expect to find, somewhere along the line, a photometric discontinuity or intervening contour [15]. If no such discontinuity is encountered, then the affinity between the pixels should be large.

In order to compute the intervening contour cue, we require a boundary detector that works robustly on natural images. For this we employ the gradient-based boundary detector of [20]. The output of the detector is a Pb image that provides the posterior probability of a boundary at each pixel. We consider the three Pb images computed using brightness, color, and texture gradients individually, as well as the Pb image that combines the three cues into a single boundary map. The combined model uses a logistic function trained on the dataset, which is well motivated by evidence in psychophysics that humans make use of multiple cues in localizing contours [34], perhaps using a linear combination [14]. Other classifiers besides the logistic function performed equally well.

The gradients are computed in a nearly identical manner to our patch features.

[Figure 3: two precision-recall scatter plots, 'Same Image' (left) and 'Different Images' (right).]

Figure 3: Agreement between human segmentations. The left panel shows the distribution of precision and recall for the 5555 human segmentations of all 1020 images in the dataset. The precision and recall are measured with respect to the class of same-segment pixel pairs, and each human is compared to the union of all other humans. High recall and lower precision supports the hypothesis that the different subjects perceive the same segmentation hierarchy, but segment at different levels of detail. The right panel shows the distribution of precision and recall when the left-out human and union-of-humans come from different images. This comparison provides a lower bound for the similarity between segmentations without changing the statistics of the data.

Instead of comparing histograms between two whole discs, the gradient is based on the histogram difference between the two halves of a single disc, similar to [36, 35]. The orientation of the dividing diameter sets the orientation of the gradient, and the radius of the disc sets the scale. All of the parameters of the gradients have been tuned by [20] on the same dataset to optimally detect the boundaries marked by the human subjects.
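A rough sketch of the half-disc gradient idea: split a disc at a given orientation, histogram the image values in each half, and take the χ² difference between the two histograms. The hard histogram, bin count, and radius are simplifications of the actual detector of [20].

```python
import numpy as np

def half_disc_gradient(image, cy, cx, radius=8, theta=0.0, n_bins=16):
    """Chi-square difference between value histograms of the two half-discs
    obtained by cutting a disc of `radius` at orientation `theta`."""
    ys, xs = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    in_disc = ys ** 2 + xs ** 2 <= radius ** 2
    # Sign of the offset relative to the dividing diameter picks the half.
    side = (np.cos(theta) * ys - np.sin(theta) * xs) > 0
    patch = image[cy - radius:cy + radius + 1, cx - radius:cx + radius + 1]
    h1, _ = np.histogram(patch[in_disc & side], bins=n_bins, range=(0, 1), density=True)
    h2, _ = np.histogram(patch[in_disc & ~side], bins=n_bins, range=(0, 1), density=True)
    return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + 1e-10))

img = np.random.rand(64, 64)    # placeholder image with values in [0, 1]
print(half_disc_gradient(img, 32, 32, theta=np.pi / 4))
```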

We compute the intervening contour cue for two pixels from the Pb values that occur along the straight-line path connecting the two pixels. We consider the Lp norms of Pb along this path for p = 1, 2, and 4, as well as the maximum and the mean of Pb. The next section will cover the choice of the intervening contour function, as well as the best way to combine the contour information from the brightness, color, and texture channels.
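A sketch of the intervening-contour feature under these definitions: sample Pb at points along the straight line between the two pixels and summarize with the max, an Lp norm, or the mean. The simple rounded-line sampling and the unnormalized Lp sum are assumptions made for illustration.

```python
import numpy as np

def intervening_contour(pb, p, q, measure="max", n_samples=None):
    """Summarize the Pb map along the straight line from pixel p to pixel q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    if n_samples is None:
        n_samples = int(np.ceil(np.abs(q - p).max())) + 1
    t = np.linspace(0.0, 1.0, n_samples)
    pts = np.round(p[None, :] + t[:, None] * (q - p)[None, :]).astype(int)
    values = pb[pts[:, 0], pts[:, 1]]
    if measure == "max":
        return values.max()
    if measure == "mean":
        return values.mean()
    p_norm = float(measure)                     # e.g. "1", "2", or "4" for the Lp family
    return np.sum(values ** p_norm) ** (1.0 / p_norm)

pb = np.random.rand(64, 64)                     # placeholder boundary-probability map
print(intervening_contour(pb, (5, 5), (40, 50), measure="max"))
print(intervening_contour(pb, (5, 5), (40, 50), measure="4"))
```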

4. Findings

4.1. Validating the Ground-Truth Dataset

Before applying the human segmentation data to the problem of learning and evaluating affinity functions, we must determine that the dataset is self-consistent. To this end, we validate the dataset by comparing each segmentation to the remaining segmentations of the same image. Treating the left-out segmentation as the signal and the remaining segmentations as ground truth, we apply our two evaluation methods.

The left panel of Figure 3 shows the distribution of precision and recall for the entire dataset of 1020 images. Since the signal in this case is binary-valued, we have a single point for each segmentation. The distribution is characterized by high recall, with a median value of 99%. This indicates that 99% of the same-segment pairs in the ground truth are contained in a left-out human segmentation.


[Figure 4: left panel, density of F-measure for same-image vs. different-image comparisons (overlap BR = 0.027, median 0.76 marked); right panel, density of mutual information (overlap BR = 0.022, median 0.25 marked).]

Figure 4: The left panel shows the same two distributions as Figure 3, with precision and recall combined using the F-measure. The right panel shows the distribution of mutual information for the same-image and different-image cases. The low overlap in each panel (2.7% and 2.2%) attests to the self-consistency of the data. As expected, significant information is shared between segmentations of the same image. The median F-measure of 0.76 and mutual information of 0.25 represent the target performance for our affinity models.

The median precision is 66%, indicating that 66% of the same-segment pairs in a left-out human segmentation are contained in the ground truth. These values support the interpretation that the different subjects provide consistent segmentations varying simply in their degree of detail. For comparison, the right panel of the figure shows the distribution when the left-out human segmentation and the ground-truth segmentations are of different images.

The left panel of Figure 4 shows the distribution of the F-measure for the same-image and different-image cases. Similarly, the right panel shows the distributions for mutual information. The clear separation between same-image and different-image comparisons attests to the consistency of segmentations of the same image. The median F-measure of 0.76 and mutual information of 0.25 represent the maximum achievable performance for pairwise affinity functions.

4.2. Validating Intervening Contour

Although the ecological statistics of natural images indicate that regions tend to be convex [9], the presence of an intervening contour does not necessarily indicate that two pixels belong in different segments. Concavities introduce intervening contours between same-segment pixel pairs. In this section, we analyze the frequency with which this happens.

[Inline figure: 'Loss from Straight-Line Intervening Contour'; probability of no intervening contour, and the frequency of same-segment pairs, as a function of pixel separation (0 to 250 pixels).]

Given the union of boundary maps for all human segmentations of an image, we measure the probability that same-segment pairs have no intervening boundary contour. The figure above shows this probability as a function of pixel separation, along with the number of same-segment pairs at each distance. If the regions were convex, then the curve would be fixed at one. Straight-line intervening contour is a good approximation to the same-segment indicator for small distances: 49% of same-segment pairs fall in the range of separations where it is at least 75% correct.

[Figure 5: four panels; columns show p(1-hop | SL) and p(1-hop | no SL), rows show small-scale and large-scale pixel separations; marked values 0.32 (small scale) and 0.14 (large scale).]

Figure 5: Each panel shows two pixels, marked with squares, that all humans declared to be in the same segment. The intensity at each point represents the empirical probability that a one-hop path through that point does not intersect a boundary contour. The left column is conditioned on there being an unobstructed straight-line (SL) path between the pixels, while the right column shows the probabilities when the SL path is obstructed. The top row shows data gathered from pixel pairs with small separation; the bottom row, for pairs with large separation. See Section 4.2 for further discussion.

When straight-line intervening contour fails for a same-segment pair, there exist more complex paths connecting the pixels. Consider the set of paths that consist of two straight-line segments, which we call one-hop paths. The situations where one-hop paths succeed but straight-line paths fail can give us intuition about how much can be gained by examining paths more complex than straight-line paths.
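To make the one-hop idea concrete, here is a sketch that tests whether any two-segment path through an intermediate pixel avoids a thresholded boundary map; the threshold and the coarse grid of intermediate points are arbitrary choices.

```python
import numpy as np

def line_clear(boundary, p, q, thresh=0.5):
    """True if no boundary value above `thresh` lies on the straight line p -> q."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    n = int(np.ceil(np.abs(q - p).max())) + 1
    t = np.linspace(0.0, 1.0, n)
    pts = np.round(p[None, :] + t[:, None] * (q - p)[None, :]).astype(int)
    return bool(np.all(boundary[pts[:, 0], pts[:, 1]] < thresh))

def one_hop_clear(boundary, p, q, stride=4):
    """True if some one-hop path p -> m -> q avoids the boundary map."""
    h, w = boundary.shape
    for my in range(0, h, stride):
        for mx in range(0, w, stride):
            if line_clear(boundary, p, (my, mx)) and line_clear(boundary, (my, mx), q):
                return True
    return False

b = np.zeros((64, 64)); b[:48, 32] = 1.0     # a wall with a gap at the bottom
print(line_clear(b, (10, 10), (10, 50)))     # False: straight line is blocked
print(one_hop_clear(b, (10, 10), (10, 50)))  # True: a detour below the wall succeeds
```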

Figure 5 shows the empirical spatial distribution of one-hop paths between same-segment pixel pairs, using the union of human segmentations. The probability of a one-hop path existing is conditioned on (left) there being a straight-line path with no intervening contour and (right) there being no such straight-line path. If the human subjects' regions were completely convex, then the right-column images would be zero. Instead, we see that when straight-line intervening contour fails, there is a small but significant probability that a more complex one-hop path will succeed, and the probability of such a path is larger for smaller scales. There is clearly some benefit from the more complex paths due to concavities in the regions. However, the degree to which an algorithm could take advantage of the more powerful one-hop version of intervening contour depends on the frequency with which the one-hop paths find holes in the estimated boundary map. In any case, the figure makes clear that the simple straight-line path is a good first-order approximation to the connectivity of same-segment pairs.

Since the straight-line version of intervening contour will underestimate connectivity in concave regions, it may have a tendency toward over-segmentation.


[Figure 6: left panel, precision-recall scatter 'Same Image w/IC'; right panel, density of precision for 'Original' vs. 'With IC'.]

Figure 6: The potential over-segmentation caused by the intervening contour approach agrees with the refinement of objects by human observers. The distribution of precision and recall at left is generated in an identical manner to the left panel of Figure 3. However, we add the constraint that the same-segment pairs from the left-out human must not have an intervening boundary contour. The recall naturally decreases from adding a constraint to the signal. However, from the marginal distributions shown at right, we see that precision increases with the added constraint. Because the union segmentation is, on average, a refinement of the left-out segmentation, intervening contour tends to break non-convex regions in a manner similar to the human subjects.

Figure 6 shows the effect on precision and recall for the human data when we add the constraint that same-segment pairs have no intervening boundary contour. As in Figure 3, we are comparing a left-out human to the union of the remaining humans. On average, the union segmentation will be more detailed than the left-out human's. The figure shows an increase in median precision from 66% to 75%, indicating that intervening contour tends to break up non-convex segments in a manner similar to the human subjects. This lends confidence to an approach to perceptual organization that first finds convex object pieces through low-level processes, and then groups the object pieces into whole objects using higher-level cues.

4.3. Performance of Patches

Each of the brightness, color, and texture patch features has several parameters that require tuning. We optimized each patch feature independently via coordinate ascent to maximize performance on the training data. Figure 7 shows the result of the coordinate-ascent experiments, where no change in any single parameter further improves performance.
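A generic sketch of coordinate-ascent tuning of this kind: sweep one parameter at a time over a candidate grid, keep any improvement, and stop when no single change helps. The parameter names, grids, and scoring function below are placeholders.

```python
def coordinate_ascent(score, grids, start):
    """Greedy coordinate ascent over a dict of parameter grids.
    `score(params)` returns e.g. the best F-measure on training data."""
    params = dict(start)
    best = score(params)
    improved = True
    while improved:
        improved = False
        for name, grid in grids.items():
            for value in grid:                       # try each candidate for this parameter
                trial = dict(params, **{name: value})
                s = score(trial)
                if s > best:
                    best, params, improved = s, trial, True
    return params, best

# Toy usage with a made-up score that peaks at radius=6, bins=24.
grids = {"radius": [3, 6, 9, 12], "bins": [8, 16, 24, 32]}
score = lambda p: -((p["radius"] - 6) ** 2 + (p["bins"] - 24) ** 2)
print(coordinate_ascent(score, grids, {"radius": 3, "bins": 8}))
```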

For brightness and color, a radius of 5.76 pixels was optimal, though performance is similar for larger and smaller discs. In contrast, the texture disc radius has a greater impact on performance, and the optimal radius is much larger at 16.1 pixels. The brightness and color patches also have parameters related to the binned kernel density estimates. The binning parameters for brightness are important for performance, while the color binning parameters are less critical. A larger kernel bandwidth indicates that small differences in the cue are less perceptually significant, or at least less useful for this task.

    Apart from the disc radius, the texture patch cue has

additional parameters related to the texton computation. The top-right table in Figure 7 shows the optimization over the number of textons, with 512 being optimal. In general, we found that the number of textons should be approximately half the number of pixels in the disc.

[Inline figure: precision-recall curves for the patch features. Distance F=0.404 @(0.504, 0.338) MI=0.0399; Brightness F=0.462 @(0.598, 0.377) MI=0.0728; Color F=0.499 @(0.606, 0.423) MI=0.0978; Texture F=0.549 @(0.593, 0.512) MI=0.119; B+C+T F=0.602 @(0.649, 0.562) MI=0.157.]

In addition, we find agreement with [20, 16] that the filter bank should contain a single scale, and that the scale should be as small as possible.

The graph above shows the performance of classifiers trained on each patch feature individually, along with a classifier that uses all three. It is clear that each patch feature contains independent information, and that texture is the most powerful cue. The performance of a classifier that uses distance as its only cue is shown for comparison.

4.4. Performance of Gradients

The gradient cues are based on Pb images, which give the posterior probability of a boundary at each image location. The Pb function can incorporate any or all of the brightness, color, and texture cues, though consider for the moment the version that uses all three.

[Inline figure: precision-recall curves for intervening contour functions. Distance F=0.409 @(0.488, 0.351) MI=0.041; Mean F=0.508 @(0.484, 0.534) MI=0.0882; L1 F=0.527 @(0.522, 0.532) MI=0.097; L2 F=0.592 @(0.563, 0.625) MI=0.126; L4 F=0.600 @(0.568, 0.636) MI=0.132; Max F=0.598 @(0.563, 0.637) MI=0.133.]

[Inline figure: precision-recall curves for combining gradient cues. Distance F=0.404 @(0.504, 0.338) MI=0.0399; ICb F=0.556 @(0.530, 0.585) MI=0.116; ICc F=0.529 @(0.553, 0.506) MI=0.103; ICt F=0.565 @(0.535, 0.598) MI=0.116; ICb+ICc+ICt F=0.609 @(0.580, 0.641) MI=0.142; ICbct F=0.595 @(0.553, 0.644) MI=0.132.]

Which intervening contour function should we use? The first figure above shows the performance of various functions, including the mean and the range of Lp norms from the sum (L1) to the max. The L4 and max versions clearly perform best; both the mean and the sum perform significantly worse. The results are the same no matter which cues the Pb function uses. Note that the max does not include any encoding of distance.

The second figure above compares the two ways in which we can combine the contour cues. We can either compute the intervening contour feature for brightness (ICb), color (ICc), and texture (ICt) separately and then combine them with a classifier (ICb+ICc+ICt), or we can use the Pb function that combines the three channels into a single boundary map for the intervening contour feature (ICbct). We achieve better performance by computing separate contour cues.
