Image Segmentation Stability: An Empirical Investigation

CScott Brown, William Bradley Glisson, Jordan Shropshire, Ryan Benton, Thomas Watts and Timothy Sullivan
School of Computing, University of South Alabama

Abstract—This paper is an evaluation of the stability of image segmentation algorithms. We provide a definition of what it means for a segmentation algorithm to be stable, as well as a procedure for formulating stability measures in a given context. Using this procedure, a systematic empirical analysis of the stability of three popular image segmentation algorithms is performed. The stability of the algorithms is examined in the context of three real-world transformations: video, compression and subimage alterations. The analysis provides evidence that stability is a meaningful property of an algorithm for a given set of parameters in a given application, and could be helpful for performing model and hyperparameter selection. This result lays the groundwork for developing methods for using stability as a means of automatic model and hyperparameter selection.

Index Terms—image segmentation, superpixels, stability

I. INTRODUCTION

Model validation and hyperparameter selection for unsupervised learning algorithms are not easy problems. Supervised algorithms have a natural procedure for hyperparameter selection: divide the data into test and training sets and choose the model that performs best against the test sets [1]. Unsupervised algorithms have no such procedure, and metrics for measuring the quality of a result are often ad hoc and limited.

Image segmentation and superpixelation algorithms are no exception to this rule. An image segmentation algorithm is a specific type of clustering algorithm concerned with clustering image pixels into semantically related groups; e.g. ‘sky’ vs. ‘sea’ vs. ‘tree’. As such, models can be validated with metrics of quality that are available to clustering algorithms, such as compactness, but also image-specific metrics such as explained color variation [2]. However, these metrics only capture a limited view of what makes a ‘good’ image segmentation. Compactness penalizes correct segmentations of oddly-shaped objects, and explained color variation penalizes segments with variations in color, even if they have consistent texture [3].

In addition to these metrics, segmentation and superpixelation algorithms are frequently benchmarked against a single ground-truth dataset, the Berkeley Segmentation Dataset and Benchmark (BSDS500) [4]. BSDS500 consists of a set of images that have been hand-segmented to provide a means of comparison of an automated segmentation against a human one. However, optimal models or hyperparameters for the BSDS500 dataset, which is approximately 50% images of ‘nature’ scenes, may not transfer to other applications, such as facial recognition. Furthermore, the hyperparameters of some algorithms are affected by image meta-information. For example, many superpixel algorithms require the user to pick the number of superpixels, which depends on resolution. Every image in the BSDS500 dataset is 481 × 321. This is convenient for comparing the results between images in the dataset, but could pose problems when applying optimal BSDS500 hyperparameters to other image sets. With this in mind, it would be incredibly useful to have a means of model and hyperparameter selection that does not depend on a ground-truth dataset, that does not depend on limited a priori expectations of the qualities of a ‘good’ segmentation, and that is apposite across datasets and applications.

Recently, the stability of a clustering has become a popular metric to aid parameter selection of a model, and to validate the results [5]. The idea is that a ‘correct’ clustering should be robust to small-scale perturbations in individual data points, because large-scale structure will be retained. Instability of a clustering may therefore indicate problems with the clustering, or with the clustering algorithm. The benefit of using stability as a measure of quality is that it is model-independent and does not require application-specific qualities of clustering ‘goodness’. Indeed, for some algorithms, theoretical results exist showing that the ‘true’ clustering is a stable one [6]. However, it is not foolproof. For example, an algorithm that always returns the same clustering regardless of input data has perfect stability. Even so, stability has been used successfully to measure the quality of a clustering [7] and also to optimize model hyperparameters [5].

Image segmentation and superpixelation is to the field of computer vision as clustering is to statistics and machine learning. It is one of the oldest and most studied problems in computer vision, and image segmentation algorithms are frequently used as a pre-processing step for other applications [8]. Areas that make use of image segmentations as a matter of course include image retrieval [9], medical imaging [10] and object detection [11]. It is therefore important to have objective metrics to evaluate an image segmentation algorithm, or a particular segmentation. For example, an image retrieval system would benefit from algorithms that avoid finding patterns in the addition of random noise due to image quality issues and non-random noise due to compression. An algorithm used in a motion-tracking system should ensure the retention of found patterns under motion (video) through time. An image tampering detection system would require an algorithm that retains found patterns under discrete changes, such as cropping.

Since stability has been successful in evaluating clustering algorithms, and image segmentation is such a closely related field, we hypothesize that stability will be a useful measure for model and hyperparameter selection of image segmentation algorithms. With this in mind, in the context of several real-world scenarios, the remainder of this paper examines stability maximization as a potential procedure for selecting amongst various algorithms and hyperparameters. Section 2 presents related works. Section 3 proposes a general framework for generating measures of stability corresponding to a particular class of transformation and a particular comparison measure. Section 4 lays out the experimental setup and our reasoning behind the classes of image perturbations, comparison measures and algorithms chosen for evaluation. Section 5 evaluates the stability of the selected algorithms for each comparison measure and class of perturbation, discussing possible reasons why certain algorithms may be more or less stable than others. Section 6 discusses potential application areas where the current analysis may be relevant, such as situations where there may be a trade-off between having a ground-truth-accurate algorithm and a transformation-stable algorithm. Finally, the conclusion summarizes lessons learned and opportunities to continue the current research or to extend it to related fields.

II. RELATED WORKS

Image segmentation algorithms are closely related to the general class of clustering algorithms. There is a reasonably large body of work about measuring the stability of various clustering algorithms to changes in the underlying dataset. There is also much work in the creation of clustering algorithms that are specifically designed to display some degree of stability. This section describes some of the work from this area that may be relevant to the subject of image segmentation stability. Furthermore, we identify some topics in image segmentation which are related to stability. Finally, we discuss some results for image segmentation specifically, and summarize how the present work differs.

A related problem addresses the statistical consistency of various clustering algorithms. A consistent estimator is one for which, as the size of the sample set is increased, the estimator converges to the “true” population value. In the case of clustering algorithms, evidence of consistency might exhibit as stability of the algorithm to the addition of new sample data from the same population. Theoretical results are available for a select few algorithms, such as k-means [12] and spectral clustering [13], but such results are not known or well-studied for most algorithms. However, since real-world images form an extremely narrow segment of the set of all possible images, an algorithm that is empirically consistent for a given application may be sufficient for most use cases. Either way, the problem of consistency in the context of image segmentation relates to stability under changes in resolution, which has not been studied to our knowledge, either theoretically or empirically.

The stability of general clustering algorithms has been studied in [5] [14]. Much of the work is confined to a small subset of clustering algorithms and comparison methods. In [6] clustering stability of the k-means algorithm is used as a way to select an appropriate k. In [15] clustering stability of hierarchical clustering algorithms using the Goodman-Kruskal gamma statistic (which is specific to hierarchical clusterings) is used as a measure of the validity of a given clustering. In [7] the stability of k-means and hierarchical clustering algorithms with reference to Jaccard similarity is studied. Various random fluctuations including Gaussian noise and subsampling of the original data are used with the goal of distinguishing individual “good” clusters from “bad”. However, relatively little attention is paid to the vast majority of clustering algorithms and clustering comparison measures present in the literature. No analysis was made in these works for the stability of algorithms with respect to the alteration of data in a meaningful but non-random fashion, such as might happen in time-series.

Incremental clustering algorithms are a class of algorithms which inherently account for stability over time-series changes in the training dataset. The task here is to efficiently update the previously generated clusters due to the arrival of new data in a data streaming model [16] [17]. New data are generated independently from some underlying statistical distribution and the model is updated accordingly. In this context the previous data are still relevant to the desired clustering, and are explicitly included in the parameter estimation of the new clustering. Although most image segmentation algorithms are designed for static image analysis, there has been some research on the creation of algorithms capable of incremental updates [18]. For both image segmentation algorithms and general clustering algorithms, temporal stability under incremental updates is an important characteristic [19], and a thorough investigation is probably warranted.

Further work along a similar vein is sometimes referred to as “evolutionary” clustering, i.e. clusterings that change over time due to changes in the underlying dataset. These works seek to explicitly enforce certain characteristics, including temporal stability, on existing clustering algorithms. In contrast to incremental clustering, instead of the previous data being carried over into the current model, the previous model parameters are incorporated into the formula for the present ones. A common application area for such algorithms is in social-network analysis, where points and edges change over time and prior states of the network provide relevant information to the current state, but are no longer valid data points. In [20] k-means and hierarchical clustering algorithms are modified to incorporate requirements on temporal stability. In [21], this idea is extended to spectral clustering methods. [22] applies temporal constraints to Dirichlet process models in order to address some shortcomings in the previous works such as a requirement for constant cluster sizes. These works are also limited in the number of algorithms that they evaluate, possibly due to the difficulty of imposing temporal restrictions on arbitrary algorithms. These methods are inherently stable by design, but these considerations are not taken into account in most work on image segmentation algorithms.

Financial time-series clustering is another application area in which temporal stability of clustering algorithms has been found to be of consequence. The use case here is one of individual data points being time-series, and the clustering algorithm is tasked with grouping time-series according to some similarity metric [23]. It is important in this context that an algorithm give time-stable clusterings to prevent completely upending an asset portfolio based on minor fluctuations in the market. In [24] a framework is proposed for evaluating the stability of time-series clusterings according to a wide variety of criteria, including temporal stability. However, to the best of our knowledge, no attempt has been made to translate the rigor of these stability measurements in financial analysis to the realm of image segmentation.

Another problem that is superficially similar to the task at hand is the generation of spatio-temporal clusterings [25] [26]. In the context of image segmentation this amounts to video segmentation such as with supervoxel algorithms [27]. The goal here is to generate clusterings over a point set that has both spatial dimensions as well as a temporal dimension. The problem of clustering stability is moot for these applications, since separate clusterings are not generated for different time points, but a single clustering is generated that includes time as a dimension in the individual data.

Specifically with respect to image segmentation and superpixels, there has been some research in evaluating the stability of algorithms over small changes to the input. However, algorithm comparisons have largely been confined to comparison of clusterings with ground-truth datasets, such as the Berkeley Segmentation Dataset and Benchmark [4] and the PASCAL VOC challenge dataset [28], and a priori measures of clustering quality such as compactness and color variation [2]. In [29] an evaluation is made of the stability of various segmentation algorithms under affine transformations. This is important primarily for video applications, since a change in camera angle of a flat stationary object will be observed as an affine transformation from the user’s viewpoint. However, this only covers a small proportion of possible image perturbations. Neubert and Protzel [30] propose a measure for evaluating the temporal stability of an image segmentation algorithm, and evaluate a number of algorithms on the KITTI [31] and Sintel [32] optical flow datasets. However, the measure introduced in this work is specifically based on undersegmentation error, which does not penalize the number of clusters. Also, their stability measure is only applicable in the context of continuous-motion transformations in video, and requires ground-truth data.

This paper adds to the depth of previous explorations of image segmentation algorithm stability by measuring stability in contexts other than video, and in the absence of ground-truth. In addition, it extends the definition in [5] of the stability of a clustering algorithm over random perturbations to include other classes of transformations. This definition is used to devise stability measures for a segmentation algorithm under transformations from video, compression and sub-image alteration. To our knowledge, a systematic study of the stability of image segmentation algorithms to these types of transformations, and others, in the absence of ground-truth, has not been investigated until now.

III. STABILITY

We generalize a procedure given in [5] for evaluating the instability of a clustering algorithm to clustering algorithms taking arbitrary parameters. Note that this procedure measures the instability of a clustering algorithm, since the distance increases as two clusterings diverge. Assume we are given some domain D from which data is acquired, a clustering algorithm A with parameters θ that produces clusterings C : D → N, a means of generating a dataset S and a perturbed version of that dataset S′ (e.g. by subsampling both from a larger dataset, by adding noise to a given S, by applying some arbitrary transformation to S, etc.) and a comparison (distance or similarity) measure d between two clusterings:

1. Generate two datasets S and S′

2. Cluster the datasets S, S′ with algorithm A to obtain clusterings C, C′

3. Compute the comparison measure d(C, C′) between the clusterings

The stability of the algorithm A can be seen as a property measurable from these distances computed at various points S, S′. In general, it will be useful to iterate this procedure over multiple perturbations and compute some summarizing statistic of the resulting set of distances. For example, in [5], the S, S′ are generated by subsampling from some given dataset, and a mean is computed for the generated set of pairwise distances. This summarizing statistic can be used to compare the performance of two algorithms A, A′. It can also potentially be minimized over the algorithm parameters θ to obtain the ‘best’ parameters.
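The procedure maps directly onto a short script. The following is a minimal sketch, not code from the paper, assuming hypothetical `segment`, `perturb` and `d` callables standing in for the algorithm A, the perturbation scheme and the comparison measure:

```python
# Minimal sketch of the instability procedure; `segment`, `perturb` and `d`
# are hypothetical stand-ins for A, the perturbation scheme, and the
# comparison measure.
import numpy as np

def instability(image, segment, perturb, d, n_trials=20, seed=0):
    """Summarize d(C, C') over repeated perturbations of one input."""
    rng = np.random.default_rng(seed)
    c = segment(image)                      # clustering C of the original S
    distances = []
    for _ in range(n_trials):
        s_prime = perturb(image, rng)       # perturbed dataset S'
        c_prime = segment(s_prime)          # clustering C'
        distances.append(d(c, c_prime))     # step 3: comparison measure
    # mean and standard deviation are the summarizing statistics used later
    return np.mean(distances), np.std(distances)
```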

This concept of stability mirrors the concept of function continuity; i.e. a clustering algorithm A is stable if small changes to the input S, S′ produce small changes in the output C, C′. In the sections that follow, we will be using the mean and standard deviation as useful summarizing statistics of stability.

Note that the means by which one generates the perturbed inputs S, S′ is important. For example, an image segmentation algorithm may be stable under continuous motion, but not stable under compression. Such an algorithm might be more appropriate for the task of tracking objects in a video stream than for performing image search on a database of compressed images. With this in mind, our algorithm comparison involves several different transformations on different datasets. Note also that there are a large number of possible distance measures that are available between two clusterings. Most of these distance measures involve some sort of operation on the contingency table, essentially counting points that clusters share or do not share. Some, however, do not, and are permissive of certain relationships between clusterings, such as refinement. In our evaluation, we will be using a handful of measures popularly used for image segmentation comparison to cover a variety of qualities.

IV. METHODOLOGY

The experiments outlined here are an exploratory investigation into the properties of image segmentation algorithm stability. Establishing an actual procedure for model validation or hyperparameter selection is beyond the scope of this paper, although such applications have been shown in the literature for general clustering algorithms. This work will focus instead on establishing the foundations of such a study: whether it is meaningful to talk about stability as a property of an image segmentation algorithm, instead of simply between two individual clusterings; whether stability is, in part, dependent on the perturbation that is being applied, and is thus not a one-size-fits-all measure; and whether some algorithms or hyperparameter selections are inherently more stable in a particular context than others.

A. Algorithms

The Experiments section will be concerned with evaluating the stability of the following image segmentation algorithms. First, we consider the graph-based method by Felzenszwalb and Huttenlocher (FH) [33]. FH begins by constructing a weighted graph with edges between neighboring pixels. Each pixel starts in its own segment. It proceeds to greedily combine segments when the internal variation of both of the components is less than the variation along a point of their boundary. The internal variation is defined as the maximum weighted edge in the minimum spanning tree of the component plus some user-defined parameter τ. Without τ, FH can have a tendency to create very small segments, since internal variation can be swayed by a single pixel’s relationship to its neighbors. For us, τ is chosen for a component C using the example in their paper (Eq. 1), where k is a constant hyperparameter of the algorithm.

τ(C) = k/|C| (1)

In addition, the image is pre-processed with a Gaussian filter per the suggestion in the original paper, and σ for the filter is considered to be a hyperparameter of the algorithm.
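For readers who wish to reproduce this setup, a sketch along the following lines should suffice, assuming the scikit-image implementation of FH, where `scale` plays the role of k and `sigma` is the width of the Gaussian pre-filter (the values shown are the σ = 0.8, k = 300 example from the original paper):

```python
# FH via scikit-image; `scale` corresponds to the hyperparameter k and
# `sigma` to the Gaussian pre-filter width discussed above.
from skimage import data
from skimage.segmentation import felzenszwalb

img = data.astronaut()                       # any RGB test image
labels = felzenszwalb(img, scale=300, sigma=0.8)
print(labels.max() + 1, "segments found")    # one integer label per pixel
```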

Second, we consider the Simple Linear Iterative Clustering (SLIC) algorithm [34]. SLIC performs k-means clustering in the 5D space formed by concatenating the CIELAB color and spatial information for a pixel. When combining the two, it is notable that distance in color space and distance in pixel position space have different meanings and extents. Thus, for a distance measure in the combined space SLIC introduces a scaling parameter m that controls the weight of pixel position space distances. m has the effect of controlling the compactness of clusters by (de)emphasizing the importance of pixel positions relative to cluster centers. k is the number of segments, analogous to k-means. Per the suggestion in the original paper, the algorithm is run for a maximum of 10 iterations. Since FH does not explicitly allow for control of the number of segments, to allow for comparison, we chose k to be the approximate number of segments found by FH for a data sequence.
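A comparable sketch for SLIC, again assuming scikit-image (whose default of 10 k-means iterations matches the setting above); the k and m values are illustrative placeholders rather than values from the paper:

```python
# SLIC via scikit-image; n_segments is k and compactness is m.
from skimage import data
from skimage.segmentation import slic

img = data.astronaut()
k = 400           # illustrative: in the paper k is matched to FH's output
m = 10.0          # weight of spatial distance vs. CIELAB color distance
labels = slic(img, n_segments=k, compactness=m)
```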

Finally, we consider the Superpixels Extracted via Energy-Driven Sampling (SEEDS) algorithm [35]. SEEDS begins by dividing the image into a regular grid. SEEDS then operates by swapping groups of pixels between segments, hill-climbing to a local optimum of an image-wide objective function. The objective function involves two terms. One penalizes the L2 norm of the superpixel color histograms. The other penalizes the L2 norm of the superpixel label histogram for boundary pixel neighborhoods. The size of the N×N neighborhoods to use for computing boundary pixel histograms is a hyperparameter of the algorithm, as well as γ, the relative weight of the two terms, and the number of desired superpixels s. We use the defaults of 3 × 3 neighborhoods and γ = 1 suggested in the original paper, and implemented in the publicly available code. s is chosen using the same procedure used for SLIC.
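SEEDS is available in the OpenCV contrib modules; a sketch of its use might look like the following, assuming the opencv-contrib-python package (the file name and superpixel count are placeholders):

```python
# SEEDS via OpenCV contrib; positional arguments are width, height,
# channels, desired superpixel count s, number of block levels,
# smoothing prior, and histogram bins.
import cv2

img = cv2.imread("frame.png")               # placeholder input image
h, w, c = img.shape
s = 400                                     # desired superpixels, as for SLIC
seeds = cv2.ximgproc.createSuperpixelSEEDS(w, h, c, s, 4, 2, 5)
seeds.iterate(img, 10)                      # hill-climb the objective
labels = seeds.getLabels()                  # per-pixel superpixel labels
```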

B. Datasets

The stability of the previously mentioned algorithms is examined with respect to the following perturbations. First, we study stability under sub-image alterations. This is achieved using a dataset of hard-disk images for which image segmentation is used to discover file boundaries. Drive contents are gathered by creating a stock Ubuntu 14.04 virtual machine installation, running a general system update, and then incrementally installing a variety of packages. The exact packages installed are detailed in Table ??. Images are then created using an established operation similar to byteplot [36] for the visualization of drive contents. Before each operation, each byte of the drive is mapped into a color space, and the resulting color string is converted to 2D by taking fixed-length windows of the string and stacking them. In the resulting image, files and related sub-file sectors tend to have similar colors and/or textures (Fig. ??). Image analysis techniques can then be applied to the resulting images for e.g. malware detection [37]. Since a file update will generally only affect a small section of the drive, instability in this context amounts to the addition of a file in one area of the drive affecting the computed segmentation elsewhere in the image. Stability is measured as the distance between segmentations on neighboring stages of the update process.
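As a concrete illustration of the byteplot-style conversion, the sketch below maps each byte to a single grayscale pixel and stacks fixed-length windows into rows; the paper's exact byte-to-color mapping is not specified here, so the grayscale mapping is an assumption:

```python
# Byteplot-style visualization sketch: one byte becomes one pixel, and the
# byte stream is stacked into rows of a fixed width (grayscale assumed).
import numpy as np

def byteplot(raw: bytes, width: int = 1024) -> np.ndarray:
    buf = np.frombuffer(raw, dtype=np.uint8)
    rows = len(buf) // width          # drop any ragged tail of the stream
    return buf[: rows * width].reshape(rows, width)
```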

Second, we study stability of a segmentation under continuous non-random changes with scenes from video data. This is achieved using inter-frame stability on the first still-camera video from the Video Image Retrieval and Analysis Tool (VIRAT) ground dataset (VIRAT_S_000001.mp4) [38]. Frames are extracted from the video at a rate of 5 frames per second. Stability is measured as the distance between segmentations on neighboring extracted frames. Since the videos are compressed, segmentation algorithms will produce different segmentations between two neighboring frames even in the absence of motion. In order to account for this, we specifically analyze a section of the video with no motion and compare to the remainder of the video, and also provide an analysis of stability under the effects of compression. This video was chosen simply because it was first, to eliminate selection bias. More were not evaluated because sufficient insights were gained from this video to continue the experiments, and because segmenting 5 frames per second of this video takes about 12 hours on a single i7-4700MQ core.

Third, we study stability under compression to an increasingly lossy format. For this, we use image ‘175083.jpg’ from the Berkeley Segmentation Dataset (BSDS500) [4]. The image is compressed in JPEG format incrementally in 50 steps, to the point of being almost entirely noise. We measure the stability as the distance between segmentations on neighboring levels of compression. This image was chosen because it featured prominently in [2]. Again, more were not evaluated to eliminate bias and because of the intense resource usage of repeated application of segmentation algorithms.
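One simple way to realize such a sequence is repeated re-encoding, as sketched below with Pillow; the fixed quality setting is an assumption, since the paper does not specify how the 50 compression levels were produced:

```python
# Incremental JPEG degradation sketch: each step re-encodes the previous
# step's output, compounding the loss over 50 steps (quality=10 assumed).
from PIL import Image

frames = [Image.open("175083.jpg").convert("RGB")]
for step in range(50):
    path = f"compressed_{step:02d}.jpg"
    frames[-1].save(path, "JPEG", quality=10)
    frames.append(Image.open(path).convert("RGB"))
```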

C. Metrics

For each combination of algorithm and perturbation, we use the following comparison measures in our stability calculations. Variation of Information (VOI) has pleasing theoretical roots and is a true metric [39] (Eq. 2). Here H is Shannon entropy and I is mutual information.

VOI(C, C′) := H(C) + H(C′) − 2I(C, C′)   (2)

The intuition behind VOI is that we can compute the information given by a clustering C that is over and above the information given by C′ by subtracting their mutual information from the entropy of C. That quantity can be made symmetric by computing the same quantity for C′ and adding the two values. The result is a quantification of the information the two clusterings contain that they do not both contain.
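A direct computation from the joint label distribution (the normalized contingency table) might look like this sketch, which assumes flattened integer label arrays and natural logarithms:

```python
# VOI (Eq. 2) from two equal-length integer label arrays.
import numpy as np

def voi(c1, c2):
    c1, c2 = np.asarray(c1).ravel(), np.asarray(c2).ravel()
    joint = np.zeros((c1.max() + 1, c2.max() + 1))
    np.add.at(joint, (c1, c2), 1)           # contingency table
    joint /= len(c1)                        # joint distribution
    p1, p2 = joint.sum(axis=1), joint.sum(axis=0)
    h1 = -np.sum(p1[p1 > 0] * np.log(p1[p1 > 0]))    # H(C)
    h2 = -np.sum(p2[p2 > 0] * np.log(p2[p2 > 0]))    # H(C')
    nz = joint > 0
    mi = np.sum(joint[nz] * np.log(joint[nz] / np.outer(p1, p2)[nz]))
    return h1 + h2 - 2 * mi                 # I(C, C') subtracted twice
```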

The Rand Index (RI) is the unsupervised equivalent of accuracy [40] (Eq. 3). Here n11 is the number of pairs of points that lie in the same cluster in both C and C′, n00 is the number of pairs of points that lie in different clusters in both C and C′, and n is the number of points.

RI(C, C′) := (n00 + n11) / (n choose 2)   (3)

The intuition behind RI is that C′ is ‘correct’ according to C exactly when a pair of points that are in the same cluster in C are in the same cluster in C′, or when a pair of points that are in different clusters in C are in different clusters in C′. The ratio of ‘correct’ pairs to all possible pairs results in the RI. Both VOI and RI can be computed by some operation on the contingency table, and can thus be viewed as summarizing statistics of the contingency table. In this sense, they are both measures of the degree to which two clusterings agree or disagree, weighting all points equally.
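Rather than enumerating all O(n²) pairs, the pair counts in Eq. 3 can be recovered from the contingency table with standard identities, as in this sketch:

```python
# Rand Index (Eq. 3) via the contingency table; comb2(x) is "x choose 2".
import numpy as np

def rand_index(c1, c2):
    c1, c2 = np.asarray(c1).ravel(), np.asarray(c2).ravel()
    n = len(c1)
    cont = np.zeros((c1.max() + 1, c2.max() + 1))
    np.add.at(cont, (c1, c2), 1)
    comb2 = lambda x: x * (x - 1) / 2.0
    n11 = comb2(cont).sum()                  # same cluster in both
    same1 = comb2(cont.sum(axis=1)).sum()    # same cluster in C
    same2 = comb2(cont.sum(axis=0)).sum()    # same cluster in C'
    total = n * (n - 1) / 2.0                # all pairs, (n choose 2)
    n00 = total - same1 - same2 + n11        # different clusters in both
    return (n00 + n11) / total
```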

Local Consistency Error (LCE) is a measure of the degree to which one segmentation is a refinement of the other, allowing for the refinement to be in different directions in different parts of the image [4] (Eq. 5). Here R(C, pi) is the segment corresponding to pixel pi, and n is the number of pixels.

E(C, C′, pi) := |R(C, pi) − R(C′, pi)| / |R(C, pi)|   (4)

LCE(C, C′) := (1/n) Σ_i min{E(C, C′, pi), E(C′, C, pi)}   (5)

Note that LCE(C, C′) will be exactly zero when the segments of C are subsegments of segments of C′ or vice versa, or some mixture of those two scenarios. This is a useful metric because image segmentations are frequently used as a preprocessing step for further analysis, and ‘oversegmentation’, having more segments than strictly necessary, is an acceptable situation.
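LCE also reduces to contingency-table arithmetic: for a pixel in segment i of C and segment j of C′, the set difference in Eq. 4 has size |segment i| − n_ij. A vectorized sketch:

```python
# LCE (Eq. 4-5) from two flattened integer label arrays.
import numpy as np

def lce(c1, c2):
    c1, c2 = np.asarray(c1).ravel(), np.asarray(c2).ravel()
    cont = np.zeros((c1.max() + 1, c2.max() + 1))
    np.add.at(cont, (c1, c2), 1)
    sizes1 = cont.sum(axis=1)                # |R(C, p)| per segment of C
    sizes2 = cont.sum(axis=0)                # |R(C', p)| per segment of C'
    nij = cont[c1, c2]                       # per-pixel overlap size
    e12 = (sizes1[c1] - nij) / sizes1[c1]    # E(C, C', p_i)
    e21 = (sizes2[c2] - nij) / sizes2[c2]    # E(C', C, p_i)
    return np.minimum(e12, e21).mean()       # (1/n) sum of minima
```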

Boundary Displacement Error (BDE) is a measure of the degree to which the boundary pixels of two segmentations align [41] (Eq. 6). Here B(C) is the set of boundary pixels in C (neighboring pixels that reside in different clusters).

BDE(C, C′) := Σ_{p∈B(C)} ||p, B(C′)||2   (6)

where ||p, B(C′)||2 denotes the Euclidean distance from p to the nearest pixel of B(C′).

Whereas the previous metrics weight all misclustered pixels equally, BDE weights pixels based on how close they are to the edge of a cluster. Like LCE, BDE is permissive of oversegmentations, but only in one direction.
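Using a Euclidean distance transform makes BDE inexpensive to compute, as in the sketch below, which assumes 2D label images and the scikit-image and SciPy libraries:

```python
# BDE (Eq. 6): sum, over boundary pixels of C, of the Euclidean distance
# to the nearest boundary pixel of C'. Inputs are 2D label images.
from scipy.ndimage import distance_transform_edt
from skimage.segmentation import find_boundaries

def bde(c1, c2):
    b1 = find_boundaries(c1)                 # boolean mask of B(C)
    b2 = find_boundaries(c2)                 # boolean mask of B(C')
    dist_to_b2 = distance_transform_edt(~b2) # distance to nearest B(C') pixel
    return dist_to_b2[b1].sum()
```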

V. EXPERIMENTS

In this section we detail the results of our experiments. The experiments proceed according to the aforementioned goal of determining the feasibility of using stability as a means of model and hyperparameter selection. Results are given chronologically, as the experiment progressed, since results from a particular analysis inform the process of subsequent experiments. Where particular examples are used, they are chosen arbitrarily, and not cherry-picked to give ‘good’ results.

In the Algorithm Comparison section, for each combination of dataset/perturbation procedure, distance measure and algorithm, we measure the stability as the sequence of distances between the perturbed images. Mean and standard deviation are evaluated as potential summarizing statistics to measure the ‘difference’ between the stability of various algorithms.

In the Hyperparameter Comparison section, we measure the VOI stability of FH using a variety of hyperparameters. In the absence of any fixed objective function, the mean and standard deviation of the sequence of distances are evaluated in tandem, loosely, to find the ‘most’ stable solution. Because of the lack of a concrete objective function, a manual search is performed. Starting with a grid, hyperparameter combinations are evaluated in between or outside of the grid points where they subjectively appear to offer a lower mean or standard deviation, or to lend evidence to the existence of a local minimum at a previously evaluated point. For all of the experiments performed, a discussion is included regarding the process of the investigation, and lessons learned.

A. Algorithm Comparison

The results of the first set of experiments are illustrated in Figs. 1, 2 and 3. These graphs are essentially a time series of the distances between segmentations of successive images. A few features of these graphs are worth noting.

One feature of note from every dataset is that although VOI, LCE and BDE tend to at least be of the same order of magnitude, RI is totally different. This can possibly be attributed to the variable number of segments given by FH, and the outsized effect of the number of segments on RI [42]. If stability with respect to the number of segments is a concern, RI might be useful. Otherwise, a better measure for stability of algorithms with variable numbers of clusters might be the Adjusted Rand Index [43], which is more robust to the number of segments. Owing to this problem, RI is generally left out of subsequent discussions.

Another feature that is apparent from all of the datasets (Figs. 1, 2 and 3) is that the distances between successive frames for the SEEDS model are reliably greater than corresponding distances for FH, with the only exception being BDE for the compression dataset. The existence of a logical pattern to these values suggests that stability is a meaningful property of an algorithm for a given set of hyperparameters and a distance measure. This is opposed to having seen noisy values that were sometimes larger for one algorithm, and sometimes larger for another, with little or no rhyme or reason. That is, in a given context, it is valid to talk about a model being ‘more’ stable or ‘less’ stable than another.

Lastly, these graphs suggest that the mean distance between successive images is, by itself, an insufficient summary of the stability of a model. This can be seen in several of the graphs, but appears most strikingly in the graph of VOI for the video dataset. First, it is important to realize that the successive images in this dataset are actually different images, so some level of difference between the segmentations of successive images is ‘true’, and should be expected. Since the ‘true’ segmentations are ill-defined, the ‘true’ amount of variation between them is as well, and is thus impossible to measure in practice. Nevertheless, the amount of motion over 200 ms in the example video is small, and never abrupt (in fact, the first minute and a half involve no motion at all, the only differences being compression artifacts). Thus, it is expected that the ‘true’ segmentations from image 1 to 2 to 3 etc. have an approximately constant distance between them. This means that variability is also an important summarizing statistic of model stability. Note from VOI in Fig. 1 that although the mean distances between frames for FH and SLIC are similar (x̄_FH ≈ 0.54 and x̄_SLIC ≈ 0.49), FH generates similar segmentations for video frames much less reliably (s_FH ≈ 0.19, s_SLIC ≈ 0.06). One possible reason for this disparity is that, because FH is a greedy algorithm, initial conditions have a larger effect on the final outcome. Either way, it is clear that the mean distance does not tell the whole story of stability in practice.

B. Hyperparameter Comparison

This section investigates the effects of hyperparameter selection on algorithm stability. This is performed with the video dataset using the FH algorithm and VOI distance. The results of this second set of experiments are illustrated in Figs. 6 and 7, intended to illustrate the effect of hyperparameter k, and ??, to illustrate the effect of σ. The actual segmentations at several points, corresponding to the hyperparameters under which the algorithm shows relatively high or low stability, are provided in Fig. 8. For comparison, the segmentation using σ = 0.8 and k = 300 from the original paper is also provided.

Fig. 1. Distances between the segmentations of neighboring frames in the first video of the VIRAT dataset.

Fig. 2. Distances between the segmentations of neighboring images in the sequential compression of BSDS500 image 175083.

As can be seen in Figs. 6 and 7, the stability of the FH algorithm with respect to choice of k follows a general pattern. The algorithm is highly unstable for low k, exhibits at least one local minimum, peaks, and then drops toward 0 as k gets very large.


Fig. 3. Distances between the segmentations of neighboring updates in the drive contents dataset.

Fig. 4. The mean VOI between segmentations of neighboring frames of the first VIRAT video for various FH hyperparameters.

Fig. 5. The standard deviation of VOI between segmentations of neighboring frames of the first VIRAT video for various FH hyperparameters.

REFERENCES

[1] J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning. Springer Series in Statistics, New York, 2001, vol. 1.

Fig. 6. The mean VOI between segmentations of neighboring frames of the first VIRAT video for FH for various σ. The first frame of this video can be seen in Fig. 8.

[2] D. Stutz, A. Hermans, and B. Leibe, “Superpixels: An evaluation of the state-of-the-art,” Computer Vision and Image Understanding, 2017.

[3] A. P. Moore, S. J. Prince, J. Warrell, U. Mohammed, and G. Jones, “Superpixel lattices,” in Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008, pp. 1–8.

[4] D. Martin, C. Fowlkes, D. Tal, and J. Malik, “A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics,” in Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on, vol. 2. IEEE, 2001, pp. 416–423.

[5] U. Von Luxburg et al., “Clustering stability: an overview,” Foundations and Trends® in Machine Learning, vol. 2, no. 3, pp. 235–274, 2010.

[6] A. Rakhlin and A. Caponnetto, “Stability of k-means clustering,” in Advances in Neural Information Processing Systems, 2007, pp. 1121–1128.

[7] C. Hennig, “Cluster-wise assessment of cluster stability,” Computational Statistics & Data Analysis, vol. 52, no. 1, pp. 258–271, 2007.

[8] X. Wei, Q. Yang, Y. Gong, M.-H. Yang, and N. Ahuja, “Superpixel hierarchy,” arXiv preprint arXiv:1605.06325, 2016.

[9] Y. Rui, T. S. Huang, and S.-F. Chang, “Image retrieval: Current techniques, promising directions, and open issues,” Journal of Visual Communication and Image Representation, vol. 10, no. 1, pp. 39–62, 1999.

[10] D. L. Pham, C. Xu, and J. L. Prince, “Current methods in medical image segmentation,” Annual Review of Biomedical Engineering, vol. 2, no. 1, pp. 315–337, 2000.

[11] J. Im, J. Jensen, and J. Tullis, “Object-based change detection using correlation image analysis and image segmentation,” International Journal of Remote Sensing, vol. 29, no. 2, pp. 399–423, 2008.

[12] D. Pollard et al., “Strong consistency of k-means clustering,” The Annals of Statistics, vol. 9, no. 1, pp. 135–140, 1981.

[13] U. Von Luxburg, M. Belkin, and O. Bousquet, “Consistency of spectral clustering,” The Annals of Statistics, pp. 555–586, 2008.

[14] U. Von Luxburg and S. Ben-David, “Towards a statistical theory of clustering,” in PASCAL Workshop on Statistics and Optimization of Clustering, 2005, pp. 20–26.

[15] A. Ben-Hur, A. Elisseeff, and I. Guyon, “A stability based method for discovering structure in clustered data,” in Pacific Symposium on Biocomputing, vol. 7, 2001, pp. 6–17.

[16] S. Guha and N. Mishra, “Clustering data streams,” in Data Stream Management. Springer, 2016, pp. 169–187.

[17] M. Charikar, C. Chekuri, T. Feder, and R. Motwani, “Incremental clustering and dynamic information retrieval,” SIAM Journal on Computing, vol. 33, no. 6, pp. 1417–1440, 2004.

[18] J.-M. Odobez and P. Bouthemy, “Direct incremental model-based image motion segmentation for video analysis,” Signal Processing, vol. 66, no. 2, pp. 143–155, 1998.

[19] S. Young, I. Arel, T. P. Karnowski, and D. Rose, “A fast and stable incremental clustering algorithm,” in Information Technology: New Generations (ITNG), 2010 Seventh International Conference on. IEEE, 2010, pp. 204–209.


Fig. 7. Segmentations of the first frame of the first VIRAT video for FH for various σ.

[20] D. Chakrabarti, R. Kumar, and A. Tomkins, “Evolutionary clustering,” in Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2006, pp. 554–560.

[21] Y. Chi, X. Song, D. Zhou, K. Hino, and B. L. Tseng, “Evolutionary spectral clustering by incorporating temporal smoothness,” in Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2007, pp. 153–162.

[22] T. Xu, Z. Zhang, P. S. Yu, and B. Long, “Generative models for evolutionary clustering,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 6, no. 2, p. 7, 2012.

[23] M. Corduas and D. Piccolo, “Time series clustering and classification by the autoregressive metric,” Computational Statistics & Data Analysis, vol. 52, no. 4, pp. 1860–1872, 2008.

[24] G. Marti, P. Very, P. Donnat, and F. Nielsen, “A proposal of a methodological framework with experimental guidelines to investigate clustering stability on financial time series,” in Machine Learning and Applications (ICMLA), 2015 IEEE 14th International Conference on. IEEE, 2015, pp. 32–37.

[25] S. Kisilevich, F. Mansmann, M. Nanni, and S. Rinzivillo, “Spatio-temporal clustering,” in Data Mining and Knowledge Discovery Handbook. Springer, 2009, pp. 855–874.

[26] P. Kalnis, N. Mamoulis, and S. Bakiras, “On discovering moving clusters in spatio-temporal data,” in SSTD, vol. 3633. Springer, 2005, pp. 364–381.

[27] O. Veksler, Y. Boykov, and P. Mehrani, “Superpixels and supervoxels in an energy optimization framework,” Computer Vision–ECCV 2010, pp. 211–224, 2010.

[28] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, “The PASCAL visual object classes challenge: A retrospective,” International Journal of Computer Vision, vol. 111, no. 1, pp. 98–136, 2015.

[29] P. Neubert and P. Protzel, “Superpixel benchmark and comparison,” in Proc. Forum Bildverarbeitung, 2012, pp. 1–12.

[30] ——, “Evaluating superpixels in video: Metrics beyond figure-ground segmentation,” in BMVC, 2013.

[31] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: The KITTI dataset,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.

[32] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, “A naturalistic open source movie for optical flow evaluation,” in European Conference on Computer Vision. Springer, 2012, pp. 611–625.

[33] P. F. Felzenszwalb and D. P. Huttenlocher, “Efficient graph-based image segmentation,” International Journal of Computer Vision, vol. 59, no. 2, pp. 167–181, 2004.

[34] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Susstrunk, “SLIC superpixels,” Tech. Rep., 2010.

[35] M. Van den Bergh, X. Boix, G. Roig, B. de Capitani, and L. Van Gool, “SEEDS: Superpixels extracted via energy-driven sampling,” in European Conference on Computer Vision. Springer, 2012, pp. 13–26.

[36] G. Conti, E. Dean, M. Sinda, and B. Sangster, “Visual reverse engineering of binary and data files,” Visualization for Computer Security, pp. 1–17, 2008.

[37] L. Nataraj, S. Karthikeyan, G. Jacob, and B. Manjunath, “Malware images: visualization and automatic classification,” in Proceedings of the 8th International Symposium on Visualization for Cyber Security. ACM, 2011, p. 4.

[38] S. Oh, A. Hoogs, A. Perera, N. Cuntoor, C.-C. Chen, J. T. Lee, S. Mukherjee, J. Aggarwal, H. Lee, L. Davis et al., “A large-scale benchmark dataset for event recognition in surveillance video,” in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 3153–3160.

[39] M. Meila, “Comparing clusterings by the variation of information,” in COLT, vol. 3. Springer, 2003, pp. 173–187.

[40] W. M. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.

[41] J. Freixenet, X. Munoz, D. Raba, J. Martí, and X. Cufí, “Yet another survey on image segmentation: Region and boundary information integration,” Computer Vision–ECCV 2002, pp. 21–25, 2002.

[42] J. M. Santos and M. Embrechts, “On the use of the adjusted rand index as a metric for evaluating supervised classification,” in International Conference on Artificial Neural Networks. Springer, 2009, pp. 175–184.

[43] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.