
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 24, NO. 11, NOVEMBER 2015 3321

Adaptive Metric Learning for Saliency Detection

Shuang Li, Huchuan Lu, Senior Member, IEEE, Zhe Lin, Member, IEEE,

Xiaohui Shen, Member, IEEE, and Brian Price

Abstract— In this paper, we propose a novel adaptive metric learning algorithm (AML) for visual saliency detection. A key observation is that the saliency of a superpixel can be estimated by its distance from the most certain foreground and background seeds. Instead of measuring distances in the Euclidean space, we present a learning method based on two complementary Mahalanobis distance metrics: 1) generic metric learning (GML) and 2) specific metric learning (SML). GML aims at the global distribution of the whole training set, while SML considers the specific structure of a single image. Considering that multiple similarity measures from different views may enhance the relevant information and alleviate the irrelevant, we fuse GML and SML together and experimentally find that the combined result works well. Different from most existing methods, which are directly based on low-level features, we devise a superpixel-wise Fisher vector coding approach to better distinguish salient objects from the background. We also propose an accurate seeds selection mechanism and exploit contextual and multiscale information when constructing the final saliency map. Experimental results on various image sets show that the proposed AML performs favorably against the state of the art.

Index Terms— Metric learning, saliency detection, Mahalanobis distance, Fisher vector.

I. INTRODUCTION

Visual saliency aims at finding the regions of an image that are more visually distinctive or important, and it often serves as a pre-processing procedure for many vision tasks, such as image categorization [1], image retrieval [2], image compression [3], content-aware image/video resizing [4], etc. Visual saliency basically breaks down into the problem of separating the salient regions from the non-salient ones by measuring differences in their features. Numerous models and algorithms have been proposed to this end. Unsupervised approaches [5]–[9] are stimuli-driven and rely largely on distinguishing low-level visual features. Early unsupervised models, such as Gaussian pyramids [5], center-surround contrast [5] and fuzzy growing [10], are mainly inspired by biological vision stimuli.

Manuscript received August 21, 2014; revised February 10, 2015 and April 10, 2015; accepted May 26, 2015. Date of publication June 3, 2015; date of current version June 23, 2015. This work was supported in part by the Natural Science Foundation of China under Grant 61472060 and in part by the Fundamental Research Funds for the Central Universities under Grant DUT14YQ101. The associate editor coordinating the review of this manuscript and approving it for publication was Mr. Pierre-Marc Jodoin.

S. Li and H. Lu are with the School of Information and Communication Engineering, Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian 116024, China (e-mail: [email protected]; [email protected]).

Z. Lin, X. Shen, and B. Price are with Adobe Research, San Jose, CA 95110 USA (e-mail: [email protected]; [email protected]; [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2015.2440755

Fig. 1. The comparison between the Euclidean distance space and the Mahalanobis distance space. The Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient.

Later studies address saliency detection from broader views, e.g., the convex hull [7], [11] and the frequency domain [12], [13]. In contrast, supervised methods [14]–[16] incorporate high-level prior information to better distinguish the salient regions by learning salient visual information from a large number of images with ground truth labels.

Despite the differences among these methods, they all require the basic ability to compute a difference measure on region features. To the best of our knowledge, all existing models address saliency detection based on the Euclidean distance. However, the Euclidean distance weights features equally without considering the distribution of the data, so it becomes unreliable when detecting objects in complex images. This situation arises frequently in saliency detection, especially when the salient regions and the background are similar, in which case the Euclidean distances between the foreground and the similar background can be smaller than the distances within the foreground. Figure 1 illustrates this problem. Given an image, we first select some initial seeds, including foreground and background seeds; the seeds selection process is the same as that described in Section III-C. We compute the distance between each superpixel and the seeds and plot the distance distributions in Figure 1. We observe that the Mahalanobis distance is more discriminative than the Euclidean distance, since its background part is less salient. This motivates us to train a discriminative distance metric that assigns appropriate weights to features so that objects can be precisely separated from the background.

We use metric learning to compute a more discriminative distance measure.



Fig. 2. The comparison between low-level features and our SFV feature. (a) input image. (b) saliency map based on low-level features. (c) saliency map based on SFV. (d) ground truth.

Fig. 3. Pipeline of the adaptive metric learning algorithm. IT [5], GB [19], LR [14], RC [6] are four other saliency methods.

Distance metric learning has been widely adopted for this purpose in different applications, since it takes into account covariance information when estimating data distributions and significantly improves the performance of learning methods. To our knowledge, we are the first to successfully formulate saliency detection as a metric learning problem, and our method works well on different databases. We also propose a Superpixel-wise Fisher Vector (SFV) coding approach which maps low-level features, such as RGB and LAB, to a high-dimensional sparse vector. Compared with using low-level features directly, the SFV is more discriminative in challenging environments, as shown in Figure 2. Thus we use SFV features to describe each superpixel.

In this paper, we adopt an effective feature coding method and propose a novel metric learning based saliency detection model which incorporates both supervised and semi-supervised information. Our algorithm considers both the global distribution of the whole training dataset (GML) and the typical structure of a specific image (SML), and we successfully fuse them together to extract the clustering characteristics for estimating the final saliency map. Figure 3 shows the pipeline of our method. First, as an extension of traditional Fisher Vector coding [17], Superpixel-wise Fisher Vector coding is proposed to describe superpixels by learning the parameters of a Gaussian mixture model (Section III-A). Second, we train a Generic metric from the training set (Section III-B1) and apply it to a single image to find the saliency seeds with the assistance of the superpixel-wise objectness map generated by [18] (Section III-C). Third, a Specific metric based on kernel classification is learnt from the chosen seeds for each image (Section III-B2). Finally, by integrating the Generic metric and Specific metric together (Section III-D), we obtain the clustering information for each superpixel and use it to generate the final saliency map (Section III-E). The GML and SML maps shown in Figure 3 are two intermediate results that are not actually generated when computing saliency maps; they serve as comparisons to demonstrate the effectiveness of the fused results in Section IV-A. The main contributions of our work include:

• Two metric learning approaches are applied to saliency detection for the first time, serving as the optimal distance measure between two superpixels. GML is learnt from the global training set, while SML is learnt from the training samples of the specific image. They are complementary to each other and achieve promising results after affinity aggregation.

• A superpixel-wise Fisher vector coding method is put forward which incorporates image contextual information when representing superpixels and makes supervised learning methods more suitable for single-image processing.

• An accurate seeds selection method is presented based on the Mahalanobis distance metric. The selected seeds serve as training samples for the Specific metric learning and as reference nodes when evaluating saliency values.


Experimental results on various image sets show that our method is comparable with most of the state-of-the-art approaches, and the proposed metric learning approaches can be extended to other fields as well.

II. RELATED WORK

Significant improvement and prosperity in saliency detection have been witnessed in recent years. Numerous unsupervised approaches have been proposed under different theoretical models. Cheng et al. [6] propose a global region contrast algorithm which simultaneously considers the spatial coherence across regions and the global contrast over the entire image. However, low-level color contrast becomes unreliable when dealing with challenging scenes. Li et al. [20] compute dense and sparse reconstruction errors based on background templates extracted from the image boundaries. They propose several integration strategies, such as multi-scale reconstruction error and Bayesian integration, which significantly improve the performance of saliency detection. In [21], boundary connectivity, a robust background measure, is first applied to saliency detection. It characterizes the spatial layout of image regions and comes with a specific geometrical interpretation. Perazzi et al. [22] formulate saliency estimation and compute contrast using high-dimensional Gaussian filters. They modify SLIC [23] and demonstrate the effectiveness of their superpixel segmentation approach in detecting salient objects.

Furthermore, lacking knowledge of the sizes and locations of objects, boundary priors and objectness are often adopted to highlight the salient regions or suppress the backgrounds. Jiang et al. [18] construct saliency by integrating three visual cues: uniqueness, focusness and objectness (UFO), where uniqueness represents color contrast; focusness indicates the degree of focus, often appearing as the reverse of blurriness; and objectness, proposed by Alexe et al. [24], is the likelihood of a given image window containing an object. In [25], Wei et al. define the saliency value of each patch as the shortest distance to the image boundary, observing that image boundaries are more likely to be background. However, this assumption is less convincing, especially when the scene is challenging.

Compared with unsupervised approaches, supervised methods are comparatively rare. In [26] and [27], Jiang et al. propose a multi-scale learning approach which maps each regional feature vector to a saliency score and fuses these scores across multiple levels to generate the final saliency map. They introduce a novel feature vector, integrating regional contrast, regional property and regional backgroundness descriptors, to represent each region, and learn a discriminative random forest regressor to predict the regional scores. Shen and Wu [14] treat an image as the combination of sparse noises and a low-rank matrix. They extract low-level features to form high-level priors and then incorporate the priors into a low-rank matrix recovery model for constructing the saliency map. However, the saliency assignment near the object is unsatisfactory due to the ambiguity of the prior maps. Liu et al. [28] formulate saliency detection

as a partial differential equation (PDE) problem and solve it under an adaptive PDE learning framework. They learn the optimal saliency seeds via discrete submodularity and use the seeds as boundary conditions to solve the linear elliptic system.

Inspired by these works, we construct a metric fusion framework which contains two complementary metric learning approaches to generate robust and accurate saliency maps even in complex scenes. Our method encodes low-level features into a high-dimensional feature space and incorporates multi-scale and objectness information when measuring saliency values. Therefore, our method can uniformly highlight objects with explicit object boundaries.

III. PROPOSED ALGORITHM

In this section, we present an effective and robust adaptive metric learning method for visual saliency detection. The proposed algorithm proceeds through five steps to generate the final saliency map. Firstly, we extract low-level features to encode the superpixels generated by the simple linear iterative clustering (SLIC) [23] algorithm with a Superpixel-wise Fisher Vector representation. Secondly, two Mahalanobis distance metric learning approaches, Generic metric learning and Specific metric learning, are introduced to learn the optimal distance measure between superpixels. Thirdly, we propose a novel seeds selection strategy based on the Mahalanobis distance to generate saliency seeds, which are used both as training samples for the Specific metric and as reference nodes when evaluating saliency values. Fourthly, a metric fusion framework is presented to fuse the Generic and Specific metrics together. Finally, we obtain graceful and smooth saliency maps by combining spectral clustering and multi-scale information.

A. Superpixel-Wise Fisher Vector Coding (SFV)

Appropriate feature coding approaches can effectively extract the main information and remove redundancies, thus greatly improving the performance of saliency detection. The Fisher Vector can be regarded as an extension of the well-known bag-of-words representation, since it captures the first-order and second-order differences between local features and the centers of a Gaussian mixture. Recently, Chen et al. [29] extended the Fisher Vector to point-level image representation for object detection. For a different purpose, we propose to further extend FV coding to the superpixel level and experimentally verify the superiority of our Superpixel-wise Fisher Vector coding method.

Given a superpixel $i = \{p_t,\ t = 1,\dots,T\}$, where $p_t$ is a $\delta$-dimensional image pixel and $T$ is the number of pixels within $i$, we train a Gaussian mixture model (GMM) $\Psi_\lambda(p_t) = \sum_{k=1}^{K}\upsilon_k\psi_k(p_t)$ from all the pixels of an image using the Maximum Likelihood (ML) criterion. The parameters of the $K$-component GMM are defined as $\lambda = \{\upsilon_k, \mu_k, \Sigma_k,\ k = 1,\dots,K\}$, where $\upsilon_k$, $\mu_k$ and $\Sigma_k$ are the mixture weight, mean vector and covariance matrix of Gaussian $k$, respectively. Similar to the FV coding method, the SFV representation can be written as a $\Delta = 2\delta K$-dimensional concatenated form:

$$\varphi_i = \{\zeta_{\mu 1}, \zeta_{\sigma 1}, \dots, \zeta_{\mu K}, \zeta_{\sigma K}\} \qquad (1)$$

where $\zeta_{\mu k}$ and $\zeta_{\sigma k}$ are defined as

$$\zeta_{\mu k} = \frac{1}{T\sqrt{\upsilon_k}}\sum_{t=1}^{T}\eta_t(k)\,\frac{p_t - \mu_k}{\sigma_k}, \qquad \zeta_{\sigma k} = \frac{1}{T\sqrt{\upsilon_k}}\sum_{t=1}^{T}\eta_t(k)\,\frac{1}{\sqrt{2}}\left\{\frac{(p_t - \mu_k)^2}{\sigma_k^2} - 1\right\},$$

$\sigma_k$ is the square root of the diagonal values of $\Sigma_k$, and $\eta_t(k)$ is the soft assignment of $p_t$ to Gaussian $k$.

The SFV representation $\varphi_i$ is hereby used to describe superpixel $i$ in this paper. It has several advantages:

• As an extension of Fisher Vector coding, SFV realizes superpixel-level coding, making the Fisher Vector more suitable for single-image processing. Instead of averaging the low-level features of the contained pixels, SFV statistically analyzes the internal feature distribution of each superpixel, providing a more accurate and reliable representation. Experiments show that our SFV generates smoother and more uniform saliency maps and improves the precision-recall curve by about 2 percent compared with low-level features on the MSRA-1000 database, as shown in Figure 7.

• SFV can be regarded as an adaptive Fisher Vector coding, since the parameters of the GMM are trained online on a specific image. This means that even identical superpixels in different images have different coding representations. Therefore, our SFV better captures image contextual information.

• Due to the small number of superpixels in an image and their disjoint nature, SFV is much faster than existing state-of-the-art FV variants. Besides saliency detection, SFV can also be applied to other vision tasks, such as image segmentation and content-aware image resizing.
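To make the coding step concrete, the following minimal sketch fits a per-image GMM and encodes one superpixel as in Eqn (1). It assumes 6-D RGB+LAB pixel features and uses scikit-learn's GaussianMixture; the function names are ours, not from the authors' implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_image_gmm(pixels, K=1):
    # Fit the GMM lambda = {v_k, mu_k, Sigma_k} on all pixels of one image
    # (pixels: (num_pixels, delta) array of, e.g., RGB+LAB values).
    return GaussianMixture(n_components=K, covariance_type="diag").fit(pixels)

def sfv_encode(gmm, superpixel_pixels):
    # Encode one superpixel {p_t} as the Delta = 2*delta*K vector of Eqn (1).
    P = np.asarray(superpixel_pixels, dtype=float)   # (T, delta)
    T = P.shape[0]
    eta = gmm.predict_proba(P)                       # soft assignments eta_t(k)
    sigma = np.sqrt(gmm.covariances_)                # (K, delta), diagonal stds
    parts = []
    for k in range(gmm.n_components):
        u = (P - gmm.means_[k]) / sigma[k]           # standardized residuals
        w = eta[:, k:k + 1]
        zeta_mu = (w * u).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
        zeta_sigma = (w * (u ** 2 - 1)).sum(axis=0) / (T * np.sqrt(2 * gmm.weights_[k]))
        parts += [zeta_mu, zeta_sigma]
    return np.concatenate(parts)                     # phi_i, Eqn (1)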

B. Adaptive Metric Learning

Learning a discriminative metric can better distinguish samples in different classes, as well as shortening the distances within the same class. Numerous models and methods have been proposed in the last decade, especially for Mahalanobis distance metric learning, such as information theoretic metric learning (ITML) [30], large margin nearest neighbor (LMNN) [31], [32], and logistic discriminant based metric learning (LDML) [33].

However, most existing metric learning approaches learn a fixed metric for all samples without considering the deeper structure of the data, and thereby break down in the presence of irrelevant or unreliable features. In this paper, we propose an adaptive metric learning approach which considers both the global distribution of the whole training set (GML) and the specific structure of a single image (SML) to better separate objects from the background. Our approach can also be viewed as an integration of a supervised distance metric learning model (GML) and a semi-supervised distance metric learning model (SML). Since GML and SML are complementary to each other, we obtain promising results after fusing them together under an affinity aggregation framework (Section III-D).

1) Generic Metric Learning (GML): Metric learning has been widely applied to vision tasks, but has never been used for saliency detection because of its long training time, which is infeasible for single-image processing. In this part, we solve this problem by pre-training a Generic metric $M_g$ from the first 500 images of the MSRA-1000 database using gradient descent, and we verify, both experimentally and empirically, that $M_g$ is generally suitable for all images.

First, we construct a training set $\{\varphi_i,\ i = 1, 2, \dots, M\}$ consisting of superpixels extracted from all training images, where $\varphi_i$ is the SFV representation of superpixel $i$. To find the most discriminative $M_g$, we minimize

$$M_g^* = \arg\min_{M_g}\ \frac{1}{2}\alpha\left\|M_g\right\|^2 + \sum_n \sum_{\{ij\,|\,\delta_i^n = 1,\ \delta_j^n = 0\}} D(i,j) \qquad (2)$$

$$D(i,j) = \exp\{-(\varphi_i - \varphi_j)^T M_g (\varphi_i - \varphi_j)/\sigma_1^2\} \qquad (3)$$

where $\delta_i^n$ indicates whether the $i$th superpixel in the $n$th image belongs to the foreground or the background, and $D(i,j)$ is the exponential Mahalanobis distance between $i$ and $j$ under the distance metric $M_g$. We set $\sigma_1 = 0.1$ to control the strength of distances.

Considering that the background is varied and cluttered, and that different object regions are distinctive as well, we only impose restrictions on pairwise distances between positive samples and negative ones, which is more reliable and reasonable given that salient objects are always distinct from the background. This minimization maximizes the feature distances between foreground and background samples, thereby significantly improving the performance of saliency detection. Eqn (2) can be easily solved by gradient descent. The Generic metric encodes the information of all superpixels in the whole training set, so it is appropriate for most images.
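A minimal sketch of this training step, assuming plain gradient descent on Eqns (2)-(3) over foreground/background superpixel pairs; the step size, iteration count, and the PSD projection are our assumptions, not values reported in the paper.

import numpy as np

def train_generic_metric(fg, bg, alpha=1e-3, sigma1=0.1, lr=1e-2, iters=200):
    # fg, bg: (Nf, D), (Nb, D) SFV features of foreground/background superpixels.
    D = fg.shape[1]
    M = np.eye(D)                                   # start from the Euclidean metric
    for _ in range(iters):
        grad = alpha * M                            # gradient of (alpha/2)*||M||^2
        for f in fg:
            for b in bg:
                d = (f - b)[:, None]                # column vector phi_i - phi_j
                e = np.exp(-float(d.T @ M @ d) / sigma1 ** 2)
                grad -= (e / sigma1 ** 2) * (d @ d.T)   # gradient of an Eqn (3) term
        M -= lr * grad
        w, V = np.linalg.eigh((M + M.T) / 2)        # project back to a PSD metric
        M = (V * np.clip(w, 0.0, None)) @ V.T
    return M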

2) Specific Metric Learning (SML): Recently, Wang et al. [34] proposed a novel doublet-SVM metric learning approach based on a kernel classification framework, formulating metric learning as an SVM problem and achieving desirable results with less training time. However, experiments show that directly applying doublet-SVM to saliency detection cannot ensure good detection accuracy. Therefore, we modify this approach by adding a constraint $\omega_{(\tau_1,\tau_2)}$, which significantly improves the performance of the final saliency map.

Let $\{\varphi_i,\ i = 1, 2, \dots, m\}$ be the training dataset, where $\varphi_i$ is the SFV representation of a labeled superpixel extracted from a specific image. The detailed process of extracting labeled superpixels from an image is discussed in Section III-C. We first divide these samples into foreground seeds and background seeds and label them as 1 and 0, respectively. Given a training sample $\varphi_i$ with label $h_i$, we find its $q_1$ nearest neighbors with the same label and $q_2$ nearest neighbors with different labels, and then $(q_1 + q_2)$ doublets are constructed for it. Each doublet consists of the training sample $\varphi_i$ and one of its nearest neighbors. By combining the doublets of all samples together, a doublet set $\chi = \{x_1, x_2, \dots, x_Z\}$ is established, where $x_\tau = (\varphi_{\tau,1}, \varphi_{\tau,2})$, $\tau = 1, 2, \dots, Z$, is one of the doublets, and $\varphi_{\tau,1}$ and $\varphi_{\tau,2}$ are the SFVs of superpixels $\tau_1$ and $\tau_2$ in doublet $x_\tau$. We assign $x_\tau$ a label as follows: $l_\tau = -1$ if $h_{\tau,1} = h_{\tau,2}$, and $l_\tau = 1$ if $h_{\tau,1} \neq h_{\tau,2}$.


As an extension of the degree-2 polynomial kernel, we define the doublet-level degree-2 polynomial kernel as:

$$K_p(x_\tau, x_\iota) = \operatorname{tr}\left(\omega_{(\tau_1,\tau_2)}(\varphi_{\tau,1} - \varphi_{\tau,2})(\varphi_{\tau,1} - \varphi_{\tau,2})^T\, \omega_{(\iota_1,\iota_2)}(\varphi_{\iota,1} - \varphi_{\iota,2})(\varphi_{\iota,1} - \varphi_{\iota,2})^T\right)$$
$$= \omega_{(\tau_1,\tau_2)}\,\omega_{(\iota_1,\iota_2)}\,\{(\varphi_{\tau,1} - \varphi_{\tau,2})^T(\varphi_{\iota,1} - \varphi_{\iota,2})\}^2 \qquad (4)$$

where $\omega_{(\tau_1,\tau_2)} = \theta_{(\tau_1,\tau_2)} * O_{(\tau_1,\tau_2)}$ is a weight parameter,

$$\theta_{(\tau_1,\tau_2)} = 1 - \exp(-dist(\tau_1,\tau_2)/\sigma_2) \qquad (5)$$

$$O_{(\tau_1,\tau_2)} = 1 - \exp\{-(O_{\tau_1} - O_{\tau_2})^2/\sigma_2\} \qquad (6)$$

where $dist(\tau_1,\tau_2)$ is the spatial distance between superpixels $\tau_1$ and $\tau_2$, and $\theta_{(\tau_1,\tau_2)}$ is the corresponding exponential spatial distance. $O_{\tau_1}$ is the objectness score of superpixel $\tau_1$ as defined in Eqn (11), and $O_{(\tau_1,\tau_2)}$ is the superpixel-wise objectness distance between $\tau_1$ and $\tau_2$. We set $\sigma_2 = 0.1$. The weight parameter $\omega_{(\tau_1,\tau_2)}$ provides crucial spatial and prior information about the objects of interest, so it is more robust in evaluating the similarity between a pair of superpixels than the feature distance alone. In order to determine the similarity of the two samples in a doublet, we further define a kernel decision function as follows:

$$E(x) = \operatorname{sgn}\left\{\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta\right\} \qquad (7)$$

where $\alpha_\tau$ is the weight of doublet $x_\tau$ and $\beta$ is a bias parameter. We have

$$\sum_\tau \alpha_\tau l_\tau K_p(x_\tau, x) + \beta = \omega_{(x_1,x_2)}(\varphi_{x,1} - \varphi_{x,2})^T M_s (\varphi_{x,1} - \varphi_{x,2}) + \beta \qquad (8)$$

$$M_s = \sum_\tau \alpha_\tau l_\tau \omega_{(\tau_1,\tau_2)}(\varphi_{\tau,1} - \varphi_{\tau,2})(\varphi_{\tau,1} - \varphi_{\tau,2})^T \qquad (9)$$

For ease of computation, we set $\omega_{(x_1,x_2)} = 1$. The proposed Specific metric $M_s$ can be easily solved by existing SVM solvers. The Specific metric is trained only on the test image, and it is much faster to obtain than with existing metric learning approaches: according to [34], the doublet-SVM is, on average, 2000 times faster than ITML [30]. Therefore, it is feasible to train a Specific metric for each image to better distinguish its objects from the background.
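The following sketch illustrates how $M_s$ of Eqn (9) can be assembled from an off-the-shelf SVM solver with the precomputed doublet kernel of Eqn (4), as the kernel classification view suggests. Doublet construction (the $q_1$/$q_2$ nearest neighbors) is omitted, and the interface is hypothetical.

import numpy as np
from sklearn.svm import SVC

def train_specific_metric(diffs, weights, labels, C=1.0):
    # diffs: (Z, D) rows phi_{tau,1} - phi_{tau,2} of the doublet set chi;
    # weights: (Z,) omega values of Eqn (4); labels: (Z,) +1 cross-class, -1 same-class.
    G = diffs @ diffs.T                              # inner products of difference vectors
    K = np.outer(weights, weights) * G ** 2          # doublet kernel of Eqn (4)
    svm = SVC(kernel="precomputed", C=C).fit(K, labels)
    # Eqn (9): M_s = sum_tau alpha_tau * l_tau * omega_tau * d_tau d_tau^T;
    # svm.dual_coef_ already stores alpha_tau * l_tau for the support doublets.
    Ms = np.zeros((diffs.shape[1], diffs.shape[1]))
    for coef, idx in zip(svm.dual_coef_[0], svm.support_):
        d = diffs[idx][:, None]
        Ms += coef * weights[idx] * (d @ d.T)
    return Ms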

In this part, we propose two metric learning approaches: GML and SML. The first considers the global distribution of the whole training set, while the second aims at exploring the deeper structure of a specific image. GML can be pretrained offline and is generally suitable for all images, while SML is much faster, since it can be solved by existing SVM solvers. We note that the image-specific metric is not always better than the Generic metric, as it has fewer training samples and less reliable labels. Instead, these two metrics are complementary to each other and can be fused together to improve the performance of the final detection results.

C. Iterative Seeds Selection by Mahalanobis Distance (ISMD)

As a preliminary criterion of saliency detection, saliency seeds directly influence the performance of seed-based solutions. Recently, Liu et al. [28] proposed an optimal seeds selection strategy via submodularity. By adding a stopping criterion, the submodularity problem can be solved and the optimal seed set obtained accordingly. In [35], Lu et al. learn optimal seeds by combining bottom-up saliency maps and mid-level vision cues. Inspired by their works, we propose a compact but efficient iterative seeds selection scheme based on Mahalanobis distance assessment (ISMD).

Alexe et al. [24] present a novel objectness method to measure the likelihood of a given image window containing an object. Jiang et al. [18] extend the original objectness to Pixel-level Objectness $O(p)$ and Region-level Objectness $O_i$ by defining:

$$O(p) = \sum_{w=1}^{W} P(w) \qquad (10)$$

$$O_i = \frac{1}{T}\sum_{p \in i} O(p) \qquad (11)$$

where $W$ is the number of sampling windows that contain pixel $p$, $P(w)$ is the probability score of the $w$th window, and $T$ is the number of pixels within region $i$. We redefine the region-level objectness as superpixel-wise objectness in this paper.

Motivated by the fact that highlights of the superpixel-wise objectness map are more likely to be foreground seeds, a set of initial foreground seeds is constructed from the lightest two percent of regions of the objectness map. Considering that the background is massive and scattered, we pick several of the lowest objectness values from each boundary of the superpixel-wise objectness map as initial background seeds. The intuition is that if superpixel $i$ is a foreground seed, the ratio of its distances from the foreground seeds and the background seeds should be small. We formulate the ratio as follows:

$$\Gamma_i = \frac{\sum_{fs} d_{rat}(i, fs)}{\sum_{bs} d_{rat}(i, bs)} \qquad (12)$$

where

$$d_{rat}(i, fs) = \phi(i, fs)(\varphi_i - \varphi_{fs}) M_g (\varphi_i - \varphi_{fs})^T \qquad (13)$$

is the Mahalanobis distance between superpixel $i$ and one of the foreground seeds $fs$ under the Generic metric $M_g$, and $\phi(i, fs) = d(i, fs) * O(i, fs)$ is a weight parameter, where

$$d(i, fs) = \exp(-dist^2(i, fs)/\sigma_2) \qquad (14)$$

is another kind of exponential spatial distance between superpixel $i$ and $fs$. Only when $\Gamma_i \le \Gamma_0$ or $\Gamma_i \ge \Gamma_1$ can $i$ be added to the foreground or background seed set, where $\Gamma_0$ and $\Gamma_1$ are two thresholds. With the newly added seeds each time, we iterate this process $N_1$ times. Since most of the area in an image belongs to the background, in order to generate more background seeds the iteration continues $N_2$ more times, but only selects seeds with $\sum_{bs} d_{rat}(i, bs) \le \Gamma_2$, where $\Gamma_2$ is a threshold. We then obtain the final seed set as illustrated in Figure 4.

Fig. 4. Iterative seeds selection by Mahalanobis distance. Initial saliency seeds are first selected from the lightest and the darkest parts of the superpixel-wise objectness map. By computing the Mahalanobis distance between any superpixel and the chosen seeds, we iteratively grow the foreground and background seeds.

As elaborated in Section III-B2, the Specific metric $M_s$ can be learnt from the labeled seeds via doublet-SVM. One may be concerned that $M_s$ relies too much on $M_g$, since the labeled seeds are generated under $M_g$. Fortunately, by learning a generally suitable metric, we can enforce a very high seed accuracy (98.82% on the MSRA-1000 database), which means the seed-based Specific metric is reliable enough to measure the distance.
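A simplified sketch of the ISMD loop under the stated assumptions: seeds are held fixed within each pass, the thresholds and iteration counts are placeholders, and the extra $N_2$ background-only passes are omitted. Here phi_w stands for the weight $\phi(i, seed)$ of Eqn (13).

import numpy as np

def d_rat(F, Mg, i, s, phi_w):
    # Weighted Mahalanobis distance of Eqn (13) under the Generic metric Mg.
    d = F[i] - F[s]
    return phi_w(i, s) * float(d @ Mg @ d)

def ismd(F, Mg, fg, bg, phi_w, g0=0.3, g1=3.0, N1=3):
    # F: (r, D) SFV features; fg, bg: initial seed indices from the objectness map.
    fg, bg = set(fg), set(bg)
    for _ in range(N1):
        cur_fg, cur_bg = list(fg), list(bg)          # seeds are fixed within one pass
        for i in range(len(F)):
            if i in fg or i in bg:
                continue
            ratio = (sum(d_rat(F, Mg, i, s, phi_w) for s in cur_fg) /
                     sum(d_rat(F, Mg, i, s, phi_w) for s in cur_bg))   # Eqn (12)
            if ratio <= g0:
                fg.add(i)                            # close to the foreground seeds
            elif ratio >= g1:
                bg.add(i)                            # close to the background seeds
    return fg, bg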

D. Metric Fusion for Extracting Spectral Clustering Characteristics

Aggregating several affinity matrices appropriately may enhance the relevant and useful information and, at the same time, alleviate the irrelevant and unreliable information. Spectral clustering is an important unsupervised clustering algorithm for transferring the feature representation into a more discriminative indicator space, and we call this property the "spectral clustering characteristics". Spectral clustering has been applied to many fields for its effective and outstanding performance.

In this section, we merge the metric fusion into a spectral clustering feature extraction process [36] and learn the optimal aggregation weight for each affinity matrix. The fusion strategy significantly improves the results of saliency detection, as shown in Figure 5.

Fig. 5. Evaluation of metrics. (a) input images. (b) Generic metric. (c) Specific metric. (d) fused results. (e) ground truth.

Based on the two metrics learnt above, two affinity matrices $\Pi_g$ and $\Pi_s$ are constructed with the corresponding $ij$th elements

$$\pi_{i,j}^g = \exp\{-\phi(i,j)(\varphi_i - \varphi_j) M_g (\varphi_i - \varphi_j)^T/\sigma_3\}, \qquad \pi_{i,j}^s = \exp\{-\phi(i,j)(\varphi_i - \varphi_j) M_s (\varphi_i - \varphi_j)^T/\sigma_3\} \qquad (15)$$

where $\sigma_3 = 0.1$. The affinity aggregation strategy aims at finding the optimal clustering characteristic vector $\Phi$ of all the superpixels in an image and the weight parameter $\vartheta = [\vartheta_g, \vartheta_s]^T$ associated with $\Pi_g$ and $\Pi_s$, so the fusion problem can be formulated as:

$$\min_{\vartheta_g,\vartheta_s;\ \Phi_1,\dots,\Phi_r}\left\{\sum_{i,j}\vartheta_g^2\pi_{i,j}^g\|\Phi_i - \Phi_j\|^2 + \sum_{i,j}\vartheta_s^2\pi_{i,j}^s\|\Phi_i - \Phi_j\|^2\right\}$$
$$= \min_{\vartheta_g,\vartheta_s;\ \Phi_1,\dots,\Phi_r}\left\{\vartheta_g^2\Phi^T(H_g - \Pi_g)\Phi + \vartheta_s^2\Phi^T(H_s - \Pi_s)\Phi\right\} = \min_{\vartheta_g,\vartheta_s}\left(\beta_g\vartheta_g^2 + \beta_s\vartheta_s^2\right) \qquad (16)$$

where $\Phi_i$ is the clustering characteristic indicator of superpixel $i$, $r$ is the number of superpixels in an image, $H_g = \operatorname{diag}\{h_{11},\dots,h_{rr}\}$ is the diagonal matrix of $\Pi_g$ with diagonal elements $h_{ii} = \sum_j \pi_{i,j}^g$, and $\beta_g = \Phi^T(H_g - \Pi_g)\Phi$. To solve this problem, we first employ two constraints: the normalized weight constraint $\vartheta_g + \vartheta_s = 1$ and the normalized spectral clustering constraint $\Phi^T H \Phi = 1$. By fixing $\vartheta$, the clustering characteristic vector can be easily obtained using standard spectral clustering. If $\Phi$ is given, Eqn (16) can be formulated as:

$$\min_{\vartheta_g,\vartheta_s}\left(\beta_g\vartheta_g^2 + \beta_s\vartheta_s^2\right) = \min_{\mu_g,\mu_s}\left(\rho_g\mu_g^2 + \rho_s\mu_s^2\right) \qquad (17)$$

subject to

$$\mu_g^2 + \mu_s^2 = 1, \qquad \frac{\mu_g}{\sqrt{\alpha_g}} + \frac{\mu_s}{\sqrt{\alpha_s}} = 1 \qquad (18)$$

where $\alpha_g = \Phi^T H_g \Phi$, $\rho_g = \beta_g/\alpha_g$, and $\mu_g = \sqrt{\alpha_g}\,\vartheta_g$. This can be easily solved by existing 1D line-search methods.

To summarize, metric fusion finds the optimal clustering characteristic vector $\Phi$ and the optimal weight parameter $\vartheta$ via a two-step iterative strategy. Since the affinity matrices incorporate $\phi(i,j)$ in Eqn (15), convergence is very fast, about three iterations per image. We use the indicator representation to compute saliency maps (Section III-E).
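A minimal sketch of this two-step iteration, assuming a coarse grid search in place of the constrained 1-D line search of Eqns (17)-(18); n_vec and n_iter are our choices, not values from the paper.

import numpy as np
from scipy.linalg import eigh

def fuse_metrics(Pg, Ps, n_vec=5, n_iter=3):
    # Pg, Ps: (r, r) affinity matrices built from Eqn (15).
    tg = ts = 0.5                                    # initial weights, tg + ts = 1
    for _ in range(n_iter):
        W = tg ** 2 * Pg + ts ** 2 * Ps              # aggregated affinity
        H = np.diag(W.sum(axis=1))
        # Step 1: spectral clustering under fixed weights, i.e. the smallest
        # generalized eigenvectors of (H - W) Phi = lambda * H * Phi
        # (the trivial constant eigenvector is dropped).
        _, vecs = eigh(H - W, H)
        Phi = vecs[:, 1:n_vec + 1]                   # clustering characteristics
        # Step 2: beta_g, beta_s of Eqn (16), then a coarse 1-D search over tg.
        bg = np.trace(Phi.T @ (np.diag(Pg.sum(axis=1)) - Pg) @ Phi)
        bs = np.trace(Phi.T @ (np.diag(Ps.sum(axis=1)) - Ps) @ Phi)
        grid = np.linspace(0.0, 1.0, 101)
        tg = grid[np.argmin(bg * grid ** 2 + bs * (1.0 - grid) ** 2)]
        ts = 1.0 - tg
    return Phi, (tg, ts)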

E. Context-Based Multi-Scale Saliency Detection

In this section, we propose a context-based multi-scale saliency detection algorithm to compute the saliency map for each image. Lacking knowledge of the sizes of objects, we first generate superpixels at $S$ different scales. Then the K-means algorithm is applied at each scale to segment the image into $N$ clusters via their SFV features.


Fig. 6. The distribution of saliency values of ground truth foregrounds and backgrounds. (a) Generic metric on MSRA-1000. (b) Specific metric on MSRA-1000. (c) AML on MSRA-1000. (d) AML on MSRA-5000.

According to the intuition that a superpixel is salient if its cluster neighbors are close to the foreground seeds and far from the background seeds, we define the distance between superpixel $i$ and the saliency seeds at scale $s$ as:

$$D_{i,fs}^{(s)} = \sum_{q=1}^{fn^{(s)}}\left\{\gamma\|\Phi_i - \Phi_q\| + (1 - \gamma)\sum_{j=1}^{N_c^{(s)}} W_{i,j}\|\Phi_j - \Phi_q\|\right\}$$
$$D_{i,bs}^{(s)} = \sum_{q=1}^{bn^{(s)}}\left\{\gamma\|\Phi_i - \Phi_q\| + (1 - \gamma)\sum_{j=1}^{N_c^{(s)}} W_{i,j}\|\Phi_j - \Phi_q\|\right\} \qquad (19)$$

where

$$W_{i,j} = Q_1\exp\{-dist(i,j)/\sigma_2\} * Q_2\exp\{-(O_i - O_j)^2/\sigma_2\} \qquad (20)$$

is the weighted distance between superpixel $i$ and its cluster neighbor $j$, $\Phi_i$ is the clustering characteristic indicator of superpixel $i$, and $fn$ and $bn$ are the numbers of foreground and background seeds chosen by our ISMD seeds selection approach. $Q_1$, $Q_2$ and $\gamma$ are weight parameters, and $N_c$ is the number of cluster neighbors of superpixel $i$. The saliency value of superpixel $i$ can be formulated as:

$$sal(i) = \sum_{s=1}^{S}\frac{\nu_s * \exp(O_i)}{1 + \{1 - \exp(-D_{i,fs}^{(s)}/\sigma_4)\}/D_{i,bs}^{(s)}} = \sum_{s=1}^{S}\frac{\nu_s * \exp(O_i) * D_{i,bs}^{(s)}}{D_{i,bs}^{(s)} + 1 - \exp(-D_{i,fs}^{(s)}/\sigma_4)} \qquad (21)$$

where $\nu_s$ is the weight of scale $s$, and $\sigma_4 = 0.1$.

Considering all the other superpixels belonging to the same cluster, together with multiple scales, smooths the saliency map effectively and makes our approach more robust in dealing with complicated scenes.
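For concreteness, a minimal sketch of the per-superpixel score of Eqn (21), assuming the seed distances of Eqn (19) have been precomputed for each scale:

import numpy as np

def saliency(D_fs, D_bs, O_i, nu, sigma4=0.1):
    # D_fs, D_bs, nu: length-S arrays over scales; O_i: objectness of superpixel i.
    D_fs, D_bs, nu = map(np.asarray, (D_fs, D_bs, nu))
    per_scale = nu * np.exp(O_i) * D_bs / (D_bs + 1.0 - np.exp(-D_fs / sigma4))
    return per_scale.sum()                           # Eqn (21), summed over scales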

IV. EXPERIMENTS

We evaluate the proposed method on four benchmark datasets. The first is MSRA-1000 [13], a subset of MSRA-5000, which has been widely used in previous works for its accurate human-labelled masks. The second is the MSRA-5000 dataset [15], which includes 5000 more comprehensive images. The third is THUS-10000 [37], which consists of 10000 images, each containing an unambiguous salient object with pixel-wise ground truth labeling.

Fig. 7. (a) Precision-recall curves for the Generic metric, the Specific metric, and the fused results without neighbor smoothness (MSRA-1000 and Berkeley-300); precision-recall curves based on SFV and low-level features; precision-recall curves for two other fusion methods. (b) Images of fused results based on SFV and low-level features.

The last one is Berkeley-300 [38], which contains more challenging scenes with multiple objects of different sizes and locations. Since we have already used the first 500 images of MSRA-1000 for training, we evaluate our algorithm and compare it with other methods on the remaining 500 images of MSRA-1000, the 4500 images of MSRA-5000 that exclude the 500 training images (MSRA-5000 contains all the images of MSRA-1000), the 9501 images of THUS-10000 (THUS-10000 contains 499 of the training images), and Berkeley-300.

A. Evaluation of Metrics

We perform several comparative experiments, as shown in Figure 5, Figure 6 and Figure 7(a), to demonstrate the effectiveness of the Generic metric (GML), the Specific metric (SML), and their combination (AML based on SFV). In order to eliminate the influence of neighbor smoothness (Eqn (19)) when comparing metrics, we compute only the distance between each superpixel and the seeds, instead of the sum of weighted distances of its cluster neighbors:

$$D_{i,fs}^{(s)} = \sum_{q=1}^{fn^{(s)}}\|\Phi_i - \Phi_q\|, \qquad D_{i,bs}^{(s)} = \sum_{q=1}^{bn^{(s)}}\|\Phi_i - \Phi_q\| \qquad (22)$$

The precision-recall curves of the Generic metric and the Specific metric are almost the same, but their combination outperforms both of them. We also tried adding or multiplying the saliency maps generated by these two metrics directly, but the resulting PR curves are much lower than those of our fusion approach in Figure 7(a). This is consistent with our motivation: $M_g$ is trained from the whole training dataset, capturing the global distribution of the data, while $M_s$ aims at a single image, considering the specific structure of its samples.


Fig. 8. Results of different methods. (a), (b) Precision-recall curves on MSRA-1000. (c) Average precisions, recalls, F-measures and AUC on MSRA-1000. (d), (e) Precision-recall curves on MSRA-5000. (f) Average precisions, recalls, F-measures and AUC on MSRA-5000.

Fig. 9. Results of different methods. (a), (b) Precision-recall curves on THUS-10000. (c) Average precisions, recalls, F-measures and AUC on THUS-10000. (d), (e) Precision-recall curves on Berkeley-300. (f) Average precisions, recalls, F-measures and AUC on Berkeley-300.

Figure 5 demonstrates that the fused results significantly remove the light saliency values in the background regions produced by GML and SML. Since most steps in computing saliency maps under different metrics are the same, e.g., the objectness prior map, seeds selection, etc., it is reasonable that Figure 5 (b) and (c) are similar, though there are still differences between them. To further demonstrate this, we conduct an extra experiment as shown in Figure 11. The second row shows the results generated by fusing the GML with itself, the third row the results generated by fusing the SML with itself, and the fourth row the results obtained by fusing the GML and SML. We call them GG, SS, and AML, respectively. Limited by the image resolution, some differences between the GML and SML may not be visible in Figure 5, but integrating a metric with itself apparently enlarges their distinctiveness. Furthermore, if one metric is incorrect, the other can compensate for it. SS performs better than GG in Figure 11 (a)-(e), while GG is better in (f)-(g), and AML tends to take the best of both, which demonstrates that the GML and SML are indeed complementary to each other and improve the performance of saliency detection after fusion. Figure 11 (k)-(m) show that if both the GML and SML produce bad results, the fused results are still bad.

In addition, we plot the distribution of saliency values in Figure 6. Ground truth masks provide a specific label, 1 or 0, for each pixel, and we regard a superpixel as foreground when more than 80% of its pixels are labelled 1; otherwise, the superpixel is background. We put all the foreground superpixels from the whole dataset together and plot the distribution of their saliency values computed by different saliency methods as the red line. The blue line is the distribution of saliency values of the background superpixels. Figure 6(a), (b), (c) show the saliency distributions produced by GML, SML and AML on MSRA-1000, respectively; Figure 6(d) shows AML on MSRA-5000. AML is better than GML and SML, since its background saliency values are closer to 0.

Furthermore, our Generic metric is robust across different databases. We apply the metric trained on MSRA-1000 to all the databases, including MSRA-1000, MSRA-5000, THUS-10000, and Berkeley-300. As shown in Figure 8 and Figure 9, the results remain promising even on the other databases, which demonstrates the effectiveness and adaptiveness of our Generic metric. Overall, the fused results based on two strong and complementary metrics achieve higher precision and recall values and generate more accurate saliency maps.

B. Evaluation of Superpixel-Wise Fisher Vector

We have mentioned that our Superpixel-wise Fisher Vector coding approach can improve the performance of saliency detection by capturing the average first-order and second-order differences between local features and the centers of a Gaussian mixture. In experiments, we extract the low-level features RGB and LAB to learn a 12D SFV representation for each superpixel ($\delta = 6$, $K = 1$, $\Delta = 2\delta K = 12$). Figure 7(a) shows the effectiveness of our SFV coding approach by comparing the precision-recall curves of low-level features and the SFV on the MSRA-1000 database. Figure 7(b) shows the corresponding images.

C. Evaluation of Saliency Maps

We compare the proposed saliency detection model with several state-of-the-art methods: IT [5], GB [19], FT [13], GC [39], UFO [18], SVO [40], HS [41], PD [42], AMC [43], RCJ [37], DSR [20], DRFI [26], CB [44], RC [6], LR [14] and XL [45]. We use source codes provided by the authors or implement the methods based on the available codes or software.


Fig. 10. The comparison of previous methods, our algorithm and ground truth. (a) Test image. (b) IT [5]. (c) GB [19]. (d) GC [39]. (e) CB [44]. (f) UFO [18]. (g) Proposed. (h) Ground truth.


We conduct several quantitative comparisons with typical saliency detection methods. Figure 8(a), (b), (d) and (e) show that the proposed AML is comparable with most of the state-of-the-art methods on the MSRA-1000 and MSRA-5000 databases. Figure 8(c) and (f) compare average precision, recall, F-measure and AUC. We use AUC as an evaluation criterion, since it represents the area under the PR curve and effectively reflects the global properties of different algorithms. Instead of using bounding boxes to evaluate saliency detection performance on the MSRA-5000 database, we adopt the accurate human-labeled masks provided by [26] to ensure more reliable comparative results. We also perform experiments on the THUS-10000 and Berkeley-300 databases, as shown in Figure 9. The precision-recall curves show that AML reaches 97.4%, 94.0%, 96.5% and 81.5% precision on MSRA-1000, MSRA-5000, THUS-10000, and Berkeley-300, respectively. All of these results demonstrate the effectiveness of our method.

Figure 10 shows some sample results of five previous approaches and our AML algorithm. The IT and GB methods are capable of finding the salient regions in most cases, but they tend to highlight boundaries and miss much of the object because of the blurriness of their saliency maps. The GC method cannot capture all the salient pixels and often mislabels small background patches as salient regions. The CB and UFO models can highlight the objects uniformly, but they fail in challenging scenes. Our method can catch both small and large salient objects even in complex environments. In addition, it highlights the objects uniformly with accurate boundaries and does not depend on the number or locations of the salient objects.

We also measured the average computational cost on the different datasets: 18.15s on MSRA-1000, 18.42s on MSRA-5000, 17.90s on THUS-10000 and 18.78s on Berkeley-300. The proposed algorithm is implemented in MATLAB on a PC with an Intel i7-3370 CPU (3.4 GHz) and 32 GB memory.

D. Evaluation of Selected Seeds

We train an effective Specific metric based on the assumption that the selected seeds are correct.


Fig. 11. Example results of different metrics. The first row shows the input images, the second row the results generated by fusing the GML with itself, the third row the results generated by fusing the SML with itself, the fourth row the results obtained by fusing the GML and SML, and the last row the ground truth images.

In experiments, we cannot ensure that the chosen seeds are completely accurate, but we can enforce a very high seed accuracy. The accuracy of the selected seeds is defined as follows:

$$sa = \frac{fs_c + bs_c}{fs_t + bs_t} = \frac{fs_c + bs_c}{(fs_c + fs_{ic}) + (bs_c + bs_{ic})} \qquad (23)$$

where

$$fs_c = \sum_n\sum_i\left(gt_i^n \,\&\, seed_i^n\right), \qquad bs_c = \sum_n\sum_i\left(\overline{gt_i^n} \,\&\, \overline{seed_i^n}\right) \qquad (24)$$

Here $i$ represents the $i$th superpixel extracted from the $n$th image of a given database, and $gt_i^n$ and $seed_i^n$ are the ground truth label of $i$ and the label assigned by our seeds selection mechanism, respectively. The accuracy rates on the four databases are 0.9882 on MSRA-1000, 0.9769 on MSRA-5000, 0.9822 on THUS-10000 and 0.8874 on Berkeley-300. We experimentally verify that the seeds are accurate enough to generate a reliable Specific metric for each image.
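A minimal sketch of this accuracy measure, assuming boolean foreground labels over all selected seed superpixels of a database:

import numpy as np

def seeds_accuracy(gt, seed_label):
    # gt, seed_label: boolean arrays over all selected seeds (True = foreground).
    fs_c = np.sum(gt & seed_label)                   # correctly labeled foreground seeds
    bs_c = np.sum(~gt & ~seed_label)                 # correctly labeled background seeds
    return (fs_c + bs_c) / gt.size                   # Eqn (23)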

V. CONCLUSION

In this paper, we propose two Mahalanobis distance metric learning models and a superpixel-wise Fisher vector representation for visual saliency detection. To our knowledge, we are the first to apply metric learning to saliency detection and to employ a metric fusion mechanism to improve detection accuracy. Different from previous methods, we adopt a new feature coding strategy and make supervised metric learning more suitable for single-image processing. In addition, we propose an accurate seeds selection method based on the Mahalanobis distance measure to train the Specific metric and construct the final saliency map. We estimate the saliency value of each superpixel from a multi-scale view and include contextual information when computing it. Experimental results against sixteen state-of-the-art algorithms on four benchmark image databases demonstrate the effectiveness of our metric learning approach and the saliency detection model. In the future, we plan to explore more robust object detection approaches to further improve the accuracy of saliency detection.

REFERENCES

[1] C. Siagian and L. Itti, "Rapid biologically-inspired scene classification using features shared with visual attention," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 2, pp. 300–312, Feb. 2007.

[2] H. Liu, X. Xie, X. Tang, Z.-W. Li, and W.-Y. Ma, "Effective browsing of Web image search results," in Proc. 6th ACM SIGMM Int. Workshop Multimedia Inf. Retr., 2004, pp. 84–90.

[3] C. Christopoulos, A. Skodras, and T. Ebrahimi, "The JPEG2000 still image coding system: An overview," IEEE Trans. Consum. Electron., vol. 46, no. 4, pp. 1103–1127, Nov. 2000.

[4] Y. Niu, F. Liu, X. Li, and M. Gleicher, "Warp propagation for video resizing," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2010, pp. 537–544.

[5] L. Itti, C. Koch, and E. Niebur, "A model of saliency-based visual attention for rapid scene analysis," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 11, pp. 1254–1259, Nov. 1998.

[6] M.-M. Cheng, G.-X. Zhang, N. J. Mitra, X. Huang, and S.-M. Hu, "Global contrast based salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2011, pp. 409–416.

[7] Y. Xie, H. Lu, and M.-H. Yang, "Bayesian saliency via low and mid level cues," IEEE Trans. Image Process., vol. 22, no. 5, pp. 1689–1698, May 2013.

[8] C. Yang, L. Zhang, H. Lu, X. Ruan, and M.-H. Yang, "Saliency detection via graph-based manifold ranking," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3166–3173.

[9] J. Sun, H. Lu, and X. Liu, "Saliency region detection based on Markov absorption probabilities," IEEE Trans. Image Process., vol. 24, no. 5, pp. 1639–1649, May 2015.

[10] Y.-F. Ma and H.-J. Zhang, "Contrast-based image attention analysis by using fuzzy growing," in Proc. 11th ACM Int. Conf. Multimedia, 2003, pp. 374–381.

[11] J. Sun, H. Lu, and S. Li, "Saliency detection based on integration of boundary and soft-segmentation," in Proc. IEEE Int. Conf. Image Process., Sep./Oct. 2012, pp. 1085–1088.

[12] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2007, pp. 1–8.

[13] R. Achanta, S. Hemami, F. Estrada, and S. Süsstrunk, "Frequency-tuned salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2009, pp. 1597–1604.

[14] X. Shen and Y. Wu, "A unified approach to salient object detection via low rank matrix recovery," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 853–860.

[15] T. Liu et al., "Learning to detect a salient object," IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 2, pp. 353–367, Feb. 2011.

[16] J. Yang and M.-H. Yang, "Top-down visual saliency via joint CRF and dictionary learning," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 2296–2303.

[17] J. Sánchez, F. Perronnin, T. Mensink, and J. Verbeek, "Image classification with the Fisher vector: Theory and practice," Int. J. Comput. Vis., vol. 105, no. 3, pp. 222–245, 2013.


[18] P. Jiang, H. Ling, J. Yu, and J. Peng, "Salient region detection by UFO: Uniqueness, focusness and objectness," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1976–1983.

[19] J. Harel, C. Koch, and P. Perona, "Graph-based visual saliency," in Proc. Adv. Neural Inf. Process. Syst., 2006, pp. 545–552.

[20] X. Li, H. Lu, L. Zhang, X. Ruan, and M.-H. Yang, "Saliency detection via dense and sparse reconstruction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 2976–2983.

[21] W. Zhu, S. Liang, Y. Wei, and J. Sun, "Saliency optimization from robust background detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2814–2821.

[22] F. Perazzi, P. Krahenbuhl, Y. Pritch, and A. Hornung, "Saliency filters: Contrast based filtering for salient region detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 733–740.

[23] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.

[24] B. Alexe, T. Deselaers, and V. Ferrari, "Measuring the objectness of image windows," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2189–2202, Nov. 2012.

[25] Y. Wei, F. Wen, W. Zhu, and J. Sun, "Geodesic saliency using background priors," in Proc. 12th Eur. Conf. Comput. Vis. (ECCV), 2012, pp. 29–42.

[26] H. Jiang, Z. Yuan, M.-M. Cheng, Y. Gong, N. Zheng, and J. Wang. (2014). "Salient object detection: A discriminative regional feature integration approach." [Online]. Available: http://arxiv.org/abs/1410.5926

[27] H. Jiang, J. Wang, Z. Yuan, Y. Wu, N. Zheng, and S. Li, "Salient object detection: A discriminative regional feature integration approach," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 2083–2090.

[28] R. Liu, J. Cao, Z. Lin, and S. Shan, "Adaptive partial differential equation learning for visual saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 3866–3873.

[29] Q. Chen et al., "Efficient maximum appearance search for large-scale object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 3190–3197.

[30] J. V. Davis, B. Kulis, P. Jain, S. Sra, and I. S. Dhillon, "Information-theoretic metric learning," in Proc. 24th Int. Conf. Mach. Learn., 2007, pp. 209–216.

[31] K. Q. Weinberger, J. Blitzer, and L. K. Saul, "Distance metric learning for large margin nearest neighbor classification," in Proc. Adv. Neural Inf. Process. Syst., 2005, pp. 1473–1480.

[32] K. Q. Weinberger and L. K. Saul, "Fast solvers and efficient implementations for distance metric learning," in Proc. 25th Int. Conf. Mach. Learn., 2008, pp. 1160–1167.

[33] M. Guillaumin, J. Verbeek, and C. Schmid, "Is that you? Metric learning approaches for face identification," in Proc. IEEE 12th Int. Conf. Comput. Vis., Sep./Oct. 2009, pp. 498–505.

[34] F. Wang, W. Zuo, L. Zhang, D. Meng, and D. Zhang. (2013). "A kernel classification framework for metric learning." [Online]. Available: http://arxiv.org/abs/1309.5823

[35] S. Lu, V. Mahadevan, and N. Vasconcelos, "Learning optimal seeds for diffusion-based salient object detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2014, pp. 2790–2797.

[36] H.-C. Huang, Y.-Y. Chuang, and C.-S. Chen, "Affinity aggregation for spectral clustering," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2012, pp. 773–780.

[37] M.-M. Cheng, N. J. Mitra, X. Huang, P. H. S. Torr, and S.-M. Hu, "Global contrast based salient region detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 3, pp. 569–582, Mar. 2015.

[38] V. Movahedi and J. H. Elder, "Design and perceptual validation of performance measures for salient object segmentation," in Proc. IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2010, pp. 49–56.

[39] M.-M. Cheng, J. Warrell, W.-Y. Lin, S. Zheng, V. Vineet, and N. Crook, "Efficient salient region detection with soft image abstraction," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1529–1536.

[40] K.-Y. Chang, T.-L. Liu, H.-T. Chen, and S.-H. Lai, "Fusing generic objectness and visual saliency for salient object detection," in Proc. IEEE Int. Conf. Comput. Vis., Nov. 2011, pp. 914–921.

[41] Q. Yan, L. Xu, J. Shi, and J. Jia, "Hierarchical saliency detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., Jun. 2013, pp. 1155–1162.

[42] R. Margolin, A. Tal, and L. Zelnik-Manor, "What makes a patch distinct?" in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2013, pp. 1139–1146.

[43] B. Jiang, L. Zhang, H. Lu, C. Yang, and M.-H. Yang, "Saliency detection via absorbing Markov chain," in Proc. IEEE Int. Conf. Comput. Vis., Dec. 2013, pp. 1665–1672.

[44] H. Jiang, J. Wang, Z. Yuan, T. Liu, N. Zheng, and S. Li, "Automatic salient object segmentation based on context and shape prior," in Proc. BMVC, 2011, pp. 110.1–110.12.

[45] Y. Xie and H. Lu, "Visual saliency detection based on Bayesian model," in Proc. 18th IEEE Int. Conf. Image Process., Sep. 2011, pp. 645–648.

Shuang Li is currently pursuing the B.E. degree with the School of Information and Communication Engineering, Dalian University of Technology (DUT), China. From 2012 to 2015, she was a Research Assistant with the Computer Vision Group, DUT. Her research interests focus on saliency detection and object recognition.

Huchuan Lu (SM'12) received the M.Sc. degree in signal and information processing and the Ph.D. degree in system engineering from the Dalian University of Technology (DUT), Dalian, China, in 1998 and 2008, respectively. He joined as a Faculty Member in 1998, and is currently a Full Professor with the School of Information and Communication Engineering, DUT. His current research interests include the areas of computer vision and pattern recognition, with a focus on visual tracking, saliency detection, and segmentation. He is also a member of the Association for Computing Machinery and an Associate Editor of the IEEE Transactions on Systems, Man, and Cybernetics, Part B.

Zhe Lin (M'10) received the B.Eng. degree in automatic control from the University of Science and Technology of China, in 2002, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology, in 2004, and the Ph.D. degree in electrical and computer engineering from the University of Maryland, College Park, in 2009. He has been a Research Intern with Microsoft Live Labs Research. He is currently a Senior Research Scientist with Adobe Research, San Jose, CA. His research interests include deep learning, object detection and recognition, image classification and tagging, content-based image and video retrieval, human motion tracking, and activity analysis.

Xiaohui Shen (M'11) received the B.S. and M.S. degrees from the Department of Automation, Tsinghua University, China, and the Ph.D. degree from the Department of Electrical Engineering and Computer Science, Northwestern University, in 2013. He is currently a Research Scientist with Adobe Research, San Jose, CA. He is generally interested in research problems in the area of computer vision, in particular image retrieval, object detection, and image understanding.

Brian Price received the Ph.D. degree in computer science from Brigham Young University under the advisement of Dr. B. Morse. He has contributed new features to many Adobe products, such as Photoshop, Photoshop Elements, and After Effects, mostly involving interactive image segmentation and matting. He is currently a Senior Research Scientist with Adobe Research, specializing in computer vision. His research interests include semantic segmentation, interactive object selection and matting, stereo and RGBD, and a broad interest in computer vision and its intersections with machine learning and computer graphics.