
Page 1: A linear optimal transportation framework for quantifying ...faculty.virginia.edu/rohde/wang_ijcv_13.pdfA linear optimal transportation framework for quantifying and visualizing variations

Noname manuscript No. (will be inserted by the editor)

A linear optimal transportation framework for quantifying and visualizing variations in sets of images

Wei Wang · Dejan Slepčev · Saurav Basu · John A. Ozolek · Gustavo K. Rohde

the date of receipt and acceptance should be inserted later

Abstract Transportation-based metrics for comparing images have long been applied to analyze images, especially where one can interpret the pixel intensities (or derived quantities) as a distribution of ‘mass’ that can be transported without strict geometric constraints. Here we describe a new transportation-based framework for analyzing sets of images. More specifically, we describe a new transportation-related distance between pairs of images, which we denote as linear optimal transportation (LOT). The LOT can be used directly on pixel intensities, and is based on a linearized version of the Kantorovich-Wasserstein metric (an optimal transportation distance, as is the earth mover’s distance). The new framework is especially well suited for computing all pairwise distances for a large database of images efficiently, and thus it can be used for pattern recognition in sets of images. In addition, the new LOT framework also allows for an isometric linear embedding, greatly facilitating the ability to visualize discriminant information in different classes of images. We demonstrate the application of the framework to several tasks such as discriminating nuclear chromatin patterns in cancer cells, decoding differences in facial expressions, galaxy morphologies, and subcellular protein distributions.

Wei Wang
Center for Bioimage Informatics, Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213 USA. E-mail: [email protected]

Dejan Slepčev
Department of Mathematical Sciences, Carnegie Mellon University, Pittsburgh, PA, 15213 USA. E-mail: [email protected]

Saurav Basu
Center for Bioimage Informatics, Department of Biomedical Engineering, Carnegie Mellon University, Pittsburgh, PA, 15213 USA. E-mail: [email protected]

John A. Ozolek
Department of Pathology, Children’s Hospital of Pittsburgh, Pittsburgh, PA, 15224 USA. E-mail: [email protected]

Gustavo K. Rohde
Corresponding Author, Center for Bioimage Informatics, Department of Biomedical Engineering, Department of Electrical and Computer Engineering, Lane Center for Computational Biology, Carnegie Mellon University, Pittsburgh, PA, 15213 USA. Tel.: +1-412-268-3684, Fax: +1-412-268-9580, E-mail: [email protected]

Keywords Optimal transportation · linear embedding

1 Introduction

1.1 Background and motivation

Automated image analysis methods are often used for extracting important information from image databases. Common applications include mining information in sets of microscopy images [22], understanding mass distribution in celestial objects from telescopic images [34], and analyzing facial expressions [36]. The prevalent technique for quantifying information in such large datasets has been to reduce the entire information (described by pixel intensity values) contained in each image in the database to a numerical feature vector (e.g. size, form factor, etc.). This approach (coupled with clustering, classification, and other machine learning techniques) remains the dominant method through which researchers extract quantitative information about image databases and cluster important image subgroups [6,44,19].

In addition to feature-based methods, several techniques for analyzing image databases based on explicit modeling approaches have recently emerged. Examples


2 Wei Wang et al.

include contour-based models [29], medial axis models [8], as well as model-based deconvolution methods [15]. Moreover, when analyzing images within a particular type (e.g. brain images), researchers have often used a more geometric approach. Here the entire morphological exemplar as depicted in an image is viewed as a point in a suitably constructed metric space (see for example [4]), often facilitating visualization. These approaches have been used to characterize the statistical variation of a particular object in a given population (or set of populations) [4,24]. The main idea in these approaches is to understand the variation of similar objects through analysis of the deformation fields required to warp one object (as depicted in its image) onto another.

Alternatively, transportation-based metrics have also been used to analyze image data in problems where pixel intensities (or derived quantities) can be interpreted as ‘mass’ free to move without strict geometric constraints [32,18,11,16,43]. They are interesting alternatives to other methods since, when applied directly to pixel intensities, they have the potential to quantify both texture and, to some extent, shape information combined (see for example [18,43]). In particular, there has been a continuing effort to develop fast and reliable methods for computing transportation-related distances [26,20,35,28,2,3,5,12,17]. While the computational complexity of these methods ranges from quadratic to linear with respect to image size (for smooth enough images), the computations are still expensive in terms of time, in particular for rough images (with large gradients), and there are issues with convergence (e.g. due to local minima in PDE-based variational implementations).

1.2 Overview of our approach and contributions

Our contribution is to develop a linear framework closely related to the optimal transportation metric (OT) [43] for analyzing sets of images. This framework, which we call the linear optimal transportation (LOT), not only provides a fast way for computing a metric (and geodesics) between all pairs in a dataset, permitting one to use pixel intensities directly, but it also provides an isometric embedding for a set of images. Therefore our method takes as input a potentially large set of images and outputs an isometric embedding of the dataset (endowed with the LOT metric) onto the standard Euclidean space. Our approach achieves this task by utilizing the following series of steps:

– Step 1: compute a template image that will serve as a reference point for analyzing the given image dataset.

– Step 2: for each image in the input dataset, as well as the estimated template, compute a particle approximation that will enable a linear programming-based computation of the OT distance between it and the template computed in Step 1.

– Step 3: normalize each particle approximation with respect to translation, rotation, and coordinate inversions.

– Step 4: compute the quadratic-cost OT distance between the particle approximation of each image and the template.

– Step 5: from the output of Step 4, compute the LOT distances and embedding.

As compared to previous works that make use of transportation-related metrics in image analysis, we highlight the following innovations of our approach. The first is that, given a database of M images, the number of transportation-related optimizations required for computing the distance between all pairs of images is M when utilizing our approach (versus M(M−1)/2 when utilizing other approaches). This provides a substantial increase in speed, especially when performing pattern recognition tasks (e.g. classification) in large databases. Secondly, as mentioned above, our LOT framework also provides an isometric linear embedding that has a couple of convenient properties. One is that the embedding of an image newly added to the database can be computed exactly and with only one transportation optimization. Another is that any point in the embedded space can be visualized as an image. This includes measured points (existing images in the database) but also any other point in this space. Below we show how this embedding greatly facilitates the use of standard geometric data analysis techniques such as principal component analysis (PCA) and linear discriminant analysis (LDA) for visualizing interesting variations in a set of images.

1.3 Paper organization

In section 2 we begin by reviewing the mathematical underpinnings of the traditional optimal transportation framework and then describe its linearized version and some of its properties. Equation (1) gives the definition of the traditional OT metric, and formulas (2) and (3) the linearized version of the metric. Then in section 3 we describe our computational approach in the discrete setting. This includes an algorithm for ‘estimating’ the content of a relatively sparse image with particles (full details of this algorithm are provided in appendix A). The definition of the OT distance in the discrete setting is provided in equation (5), while our approximation of


its linearized version in the discrete setting is given in formula (9). Equations (10) and (11) provide the isometric linear embedding for a given particle approximation according to a reference point. Section 4 describes how one can utilize the linear embedding provided by the LOT framework to extract useful information from sets of images using principal component analysis and linear discriminant analysis. The applications of LOT towards decoding subcellular protein patterns and organelle morphology, galaxy morphology, as well as facial expression differences are presented in section 5.

2 Optimal transportation framework

Transportation-based metrics are especially suited for quantifying differences between structures (depicted in quantitative images) that can be interpreted as a distribution of mass with few strict geometric constraints. More precisely, we utilize the optimal transportation (Kantorovich-Wasserstein) framework to quantify how much mass, in relative terms, is distributed in different regions of the images. We begin by describing the mathematics of the traditional OT framework, and in particular the geometry behind it, which is crucial to our subsequent introduction of the linearized OT distance.

2.1 Optimal transportation metric

Let Ω represent the domain (the unit square [0, 1]², for example) over which images are defined. To describe images we use the mathematical notion of a measure. While this is somewhat abstract, it enables us to treat simultaneously the situation when we consider the image to be a continuous function and when we deal with actual computations and consider the image as a discrete array of pixels. It is important for us to do so, because many notions related to optimal transportation are simpler in the continuous setting, and in particular we can give an intuitive explanation for the LOT distance we introduce. On the other hand, the discrete approximation necessitates considering a more general setting, because the mass (proportional to intensity) from one pixel often needs to be redistributed over several pixels during transport.

We note that, in the current version of the technique, we normalize all images in a given dataset so that the intensity of all pixels in each image sums to one. Thus we may interpret images as probability measures. The assumption is adequate for the purpose of analyzing shape and texture in the datasets analyzed in this paper. We note that OT-related distances can be used when masses are not the same [32,27].

Recall that probability measures are nonnegative and that the measure of the whole set Ω is 1: µ(Ω) = ν(Ω) = 1. Let c : Ω × Ω → [0, ∞) be the cost function; that is, c(x, y) is the ‘cost’ of transporting unit mass located at x to the location y. The optimal transportation distance measures the least possible total cost of transporting all of the mass from µ to ν. To make this precise, consider Π(µ, ν), the set of all couplings between µ and ν, that is, the set of all probability measures on Ω × Ω with the first marginal µ and the second marginal ν. More precisely, if π ∈ Π(µ, ν) then for any measurable set A ⊂ Ω we have π(A × Ω) = µ(A) and π(Ω × A) = ν(A). Each coupling describes a transportation plan; that is, π(A₀ × A₁) tells one how much ‘mass’ originally in the set A₀ is being transported into the set A₁.

We consider optimal transportation with quadratic cost c(x, y) = |x − y|². The optimal transportation (OT) distance, also known as the Kantorovich-Wasserstein distance, is then defined by

$$d_W(\mu, \nu) = \left( \inf_{\pi \in \Pi(\mu, \nu)} \int_{\Omega \times \Omega} |x - y|^2 \, d\pi \right)^{1/2}. \qquad (1)$$

It is well known that the above infimum is attained and that the distance defined is indeed a metric (satisfying the positivity, the symmetry, and the triangle inequality requirements); see [39]. We denote the set of minimizers π above by Π_OT(µ, ν).
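To make definition (1) concrete: on the real line, when both measures consist of N equal-mass particles, the minimizing coupling is the monotone one, so d_W reduces to pairing sorted samples. A minimal sketch (NumPy, hypothetical data; not code from the paper):

```python
import numpy as np

def w2_1d_equal_mass(x, y):
    """d_W between two 1-D measures, each a sum of N equal-mass
    particles: the optimal coupling pairs sorted samples, so the
    cost is the mean squared displacement of the sorted lists."""
    x, y = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((x - y) ** 2))

mu = np.array([0.0, 1.0, 2.0])
nu = np.array([0.5, 1.5, 2.5])
print(w2_1d_equal_mass(mu, nu))  # 0.5 (every particle moves by 0.5)
```

This sorting shortcut is specific to one dimension; in higher dimensions the infimum in (1) has to be found by optimization, which is what the discrete linear program of section 3 does.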

2.2 Geometry of optimal transportation in continuous setting

The construction of the LOT metric, which we introduce below, is best motivated in the continuous setting. Consider measures µ and ν which have densities α and β, that is,

dµ = α(x)dx and dν = β(x)dx.

Then the following mathematical facts, available in [39], hold. The OT plan between µ and ν is unique, and furthermore the mass from each point x is sent to a single location, given as the value of the function ϕ(x), called the optimal transportation map (see Figure 1). The relation between ϕ and the optimal transportation plan π introduced above is that π(A₀ × A₁) = µ({x ∈ A₀ : ϕ(x) ∈ A₁}); that is, π is concentrated on the graph of ϕ. We note that ϕ is a measure-preserving map from µ to ν, that is, for

Page 4: A linear optimal transportation framework for quantifying ...faculty.virginia.edu/rohde/wang_ijcv_13.pdfA linear optimal transportation framework for quantifying and visualizing variations

4 Wei Wang et al.

Fig. 1 Optimal transport map ϕ between measures µ and ν whose supports are outlined.

any measurable A, ∫_{ϕ⁻¹(A)} α(x) dx = ∫_A β(y) dy. The Kantorovich-Wasserstein distance is then

$$d_W(\mu, \nu)^2 = \int_\Omega |\varphi(x) - x|^2 \, \alpha(x) \, dx.$$

In addition to being a metric space, the set of measures is formally a Riemannian manifold [10]; that is, at any point there is a tangent space endowed with an inner product. In particular, the tangent space at the measure σ with density γ (i.e. dσ = γ(x)dx) is the set of vector fields

$$T_\sigma = \left\{ v : \Omega \to \mathbb{R}^d \; : \; \int_\Omega |v(x)|^2 \, \gamma(x) \, dx < \infty \right\},$$

and the inner product is the weighted L²:

$$\langle v_1, v_2 \rangle_\sigma = \int_\Omega v_1(x) \cdot v_2(x) \, \gamma(x) \, dx.$$

The OT distance is just the length of the shortest curve (geodesic) connecting two measures [5].

A very useful fact is that the geodesics have a form that is simple to understand. Namely, let µ_t, 0 ≤ t ≤ 1, be the geodesic connecting µ to ν. Then µ_t is the measure obtained when mass from µ is transported by the transportation map ϕ_t : x → (1 − t)x + tϕ(x); that is,

$$\mu_t(A) = \int_{\varphi_t^{-1}(A)} \alpha(x) \, dx.$$
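In the same 1-D equal-mass setting used earlier (where the optimal map ϕ pairs sorted samples), the straight-line form of the geodesic can be sketched directly; illustrative code, not from the paper:

```python
import numpy as np

def ot_geodesic_1d(x, y, t):
    """Particle positions of the W2 geodesic mu_t between two 1-D
    equal-mass particle measures: each particle follows the line
    (1 - t) * x + t * phi(x), with phi pairing sorted samples."""
    x, y = np.sort(x), np.sort(y)
    return (1 - t) * x + t * y

mu = np.array([0.0, 2.0])
nu = np.array([1.0, 5.0])
print(ot_geodesic_1d(mu, nu, 0.5))  # [0.5 3.5], halfway along each line
```

At t = 0 this recovers µ and at t = 1 it recovers ν, with mass moving along straight lines in between.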

2.3 Linear optimal transportation metric

Computing a pairwise OT distance matrix is expensive for large datasets. In particular, let T_OT be the time it takes to compute the OT distance and T₂ the time it takes to compute the Euclidean distance between two moderately complex images. Using the method described below, T_OT is typically on the order of tens of seconds, while T₂ is on the order of milliseconds. For a dataset with M images, computing the pairwise distances takes time on the order of M(M − 1)T_OT/2. Here we introduce a version of the OT distance that is much faster to compute. In particular, computing the distance matrix takes approximately MT_OT + M(M − 1)T₂/2. For a large set of images, in which the number of pixels in each image is fixed, as the number of images in the set tends to infinity, the dominant term is of order M²T₂.
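Plugging the representative timings above into the two operation counts makes the gap concrete; the numbers below are the illustrative values quoted in the text, not measurements:

```python
# Pairwise-OT vs. LOT cost for an M-image dataset, using the
# representative timings from the text (illustrative values).
T_OT = 30.0   # seconds per OT optimization
T_2 = 1e-3    # seconds per Euclidean distance between embeddings
M = 1000

pairwise_ot = M * (M - 1) / 2 * T_OT       # one OT solve per image pair
lot = M * T_OT + M * (M - 1) / 2 * T_2     # one OT solve per image

print(f"pairwise OT: {pairwise_ot:.0f} s, LOT: {lot:.1f} s")
# pairwise OT: 14985000 s (~173 days); LOT: 30499.5 s (~8.5 hours)
```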

The distance we compute is motivated by the geometric nature of the OT distance. Heuristically, instead of computing the geodesic distance on the manifold, we compute a ‘projection’ of the manifold to the tangent plane at a fixed point and then compute the distances on the tangent plane. This is the main reason we use the term linear when naming the distance. To consider the projection one needs to fix a reference image σ (where the tangent plane is set). Computing the projection for each image requires computing the OT plan between the reference image σ and the given image. Once computed, the projection provides a natural linear embedding of the dataset, which we describe in section 3.5.

Fig. 2 The identification of the manifold with the tangent plane. Distances to σ as well as angles at σ are preserved.

We first describe the LOT distance in the continuous setting. Let σ be a probability measure with density γ. We introduce the identification (‘projection’) P of the manifold with the tangent space at σ. Given a measure µ, consider the optimal transportation map ψ between σ and µ. Then

P(µ) = v, where v(x) = ψ(x) − x.

Note that v ∈ T_σ and that P is a mapping from the manifold to the tangent space. See Figure 2 for a visualization. Also, P(σ) = 0 and

$$d_W(\sigma, \mu)^2 = \int_\Omega |\psi(x) - x|^2 \, \gamma(x) \, dx = \int_\Omega |v(x)|^2 \, \gamma(x) \, dx = \langle v, v \rangle_\sigma = \| P(\mu) - P(\sigma) \|_\sigma^2,$$

where ‖v‖²_σ is defined to be ⟨v, v⟩_σ. So the mapping preserves distances to σ. Let us mention that in cartography such a projection is known as the azimuthal equidistant projection. In differential geometry this would be the inverse of the exponential map. We define the LOT as:

$$d_{LOT}(\mu, \nu) = \| P(\mu) - P(\nu) \|_\sigma. \qquad (2)$$
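Under the same 1-D equal-mass simplification (optimal maps to the reference pair sorted samples, and the weighted L² norm reduces to a uniform mean), the projection P and the distance (2) can be sketched as follows; hypothetical data, not the paper's implementation:

```python
import numpy as np

def lot_embed_1d(sigma, mu):
    """P(mu) evaluated at the reference particles: v = psi - id,
    where the optimal map psi pairs sorted samples of sigma with
    sorted samples of mu (valid for 1-D equal-mass measures)."""
    return np.sort(mu) - np.sort(sigma)

def d_lot_1d(sigma, mu, nu):
    """||P(mu) - P(nu)||_sigma with uniform particle weights 1/N."""
    v = lot_embed_1d(sigma, mu) - lot_embed_1d(sigma, nu)
    return np.sqrt(np.mean(v ** 2))

sigma = np.array([0.0, 1.0, 2.0])
mu = sigma + 0.2   # reference shifted by 0.2
nu = sigma + 0.5   # reference shifted by 0.5
print(d_lot_1d(sigma, mu, nu))  # ≈ 0.3, the difference of the two shifts
```

Each image is embedded once (one OT solve against σ); every further comparison is a plain Euclidean computation in the tangent space.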

When σ is not absolutely continuous with respect to the Lebesgue measure, the situation is more complicated. The purely discrete setting (when the measures


are made of particles) is the one relevant for computations, and we discuss it in detail in section 3.3. In the general setting, the proper extension of the LOT distance above is the shortest generalized geodesic (as defined in [1]) connecting µ and ν. More precisely, given a reference measure σ, and measures µ and ν, with transportation plans π_µ ∈ Π_OT(σ, µ) and π_ν ∈ Π_OT(σ, ν), let Π(σ, µ, ν) be the set of all measures on the product Ω × Ω × Ω such that the projection to the first two coordinates is π_µ and the projection to the first and the third is π_ν. The linearized version of the OT distance between µ and ν is given by:

$$d_{LOT,\sigma}(\mu, \nu)^2 = \inf_{\pi \in \Pi(\sigma, \mu, \nu)} \int_{\Omega \times \Omega \times \Omega} |x - y|^2 \, d\pi. \qquad (3)$$

2.4 Translation and rotation normalization

It is important to note that the OT metric defined in equation (1) is not invariant under translations or rotations. It can be rendered translation invariant by simply aligning all measures in a dataset µ₁, ..., µ_N by setting their center of mass to a common coordinate. This is based on the fact that if µ is a measure on Ω with center of mass

$$x_\mu = \frac{1}{\mu(\Omega)} \int_\Omega x \, d\mu$$

and ν is a measure of the same mass, then among all translates of the measure ν, ν_x(A) = ν(A − x), the one that minimizes the distance to µ is the one with the center of mass x_µ.
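For a particle measure the centering step amounts to subtracting the weighted mean of the particle locations; a minimal sketch with hypothetical data:

```python
import numpy as np

def center_particles(x, m):
    """Translate a particle measure sum_i m_i * delta_{x_i} so its
    center of mass x_mu = (1 / mu(Omega)) * int x dmu is at the
    origin.  x: (N, d) particle locations, m: (N,) masses."""
    x_mu = np.average(x, axis=0, weights=m)
    return x - x_mu

x = np.array([[0.0, 0.0], [2.0, 0.0]])
m = np.array([0.25, 0.75])
print(center_particles(x, m))  # the center of mass (1.5, 0) is subtracted
```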

From now on, we will always assume that the measures are centered at the origin (the implementation of this normalization step is given below). The Euclidean-transformation-invariant Kantorovich-Wasserstein distance is then defined in the following way. Given a measure µ and an orthogonal matrix T, define µ_T by µ_T(A) = µ(T⁻¹A). Then the invariant distance is defined by d(µ, ν) = min_{T ∈ O(n)} d_W(µ_T, ν). In two dimensions we have developed an algorithm for finding the minimum above. However, we know of no reasonable algorithm for computing the rotation-invariant linearized OT distance (d_LOT). Below we present an algorithm that greatly reduces the effect of rotation, but does not eliminate it completely.

3 Computing LOT distances from image data

To compute the LOT distance for a set of images we need to first compute the OT distance from a template image to each of the images. There exist several approaches for computing OT distances. Perhaps the most direct idea is to discretize the problem defined in (1). This results in a linear programming problem (see (5) below). While this approach is very robust and leads to the global minimizer, it is computationally expensive (the fastest strongly polynomial algorithm for this case is the Hungarian algorithm, with time complexity of O(n³), where n is the number of pixels in an image). Other approaches are based on continuum ideas, where PDEs and variational techniques are often used [2,3,5,12,17]. Some of them achieve close to linear scaling with the number of pixels. On the other hand, their performance can deteriorate as one considers images that lack regularity (are not smooth, have large gradients). A particular difficulty with some of the approaches that are not based on linear programming is that they may not converge to the global solution if they arrive at a local minimum first.

Our approach is based on combining linear programming with a particle approximation of the images. It is a refinement of the algorithm we used in [43]. That is, each image is first carefully approximated by a particle measure (a convex combination of delta masses) that has far fewer particles than there are pixels in the image. This significantly reduces the complexity of the problem. Furthermore, it takes advantage of the fact that many images we consider are relatively sparse: one then needs fewer particles for an accurate approximation of the image, which accelerates the computation. In addition, as the number of particles approaches the number of pixels in the image, the approximation error tends to zero. We now present the details of the algorithm. In particular, we give a description of the particle approximation, and detailed algorithms for translation and rotation normalization and for the computation of the LOT distance. The full details pertaining to the particle approximation algorithm, which is more involved, are presented in Appendix A.

3.1 Particle approximation

The first step in our computational approach is to represent each image as a weighted combination of ‘particles’, each with mass (intensity) m_i and location x_i:

$$\mu = \sum_{i=1}^{N_\mu} m_i \, \delta_{x_i}, \qquad (4)$$

with N_µ being the number of masses used to represent the measure (image) µ. Since the images are digital, each image could be represented exactly using the model above by selecting the m_i to be the pixel intensity values and the x_i the pixel coordinate grid. Given that


images can contain a potentially large number of pixels, and that the cost of a linear programming solution for computing the OT is generally O(N_µ³), we do not represent each image exactly and instead choose to approximate them.

The goal of the algorithm is to use at most approximately N particles to approximate each image, while maintaining a similar approximation error (by approximation error we mean the OT distance between the image and the particle approximation) for all images of a given dataset. For a dataset of M images, our algorithm requires the user to specify a number N for the approximate number of particles that could be used to approximate any image, and consists of these four steps:

– Step 1: use a weighted K-means algorithm [21] to approximate each image, with the number of clusters set to the chosen N. As we describe in the appendix, the weighted K-means is an optimal local step for reducing the OT error between a given particle approximation and an image.

– Step 2: improve the approximation of each image by investigating regions where each image is not approximated well, and add particles to those regions to improve the approximation locally. Details are provided in Appendix A.

– Step 3: compute the OT distance between each digital image and its approximation. For images not well approximated by the current algorithm (details provided in Appendix A), add particles to improve the approximation.

– Step 4: for images where the approximation error is significantly smaller than the average error (for the entire dataset), remove (merge) particles until the error between the digital image and its approximation is more similar to the other images. (This reduces the time to compute the distances, while it does not increase the typical error.)

The algorithm above was chosen based on mathematical observations pertaining to the OT metric. These observations, including a careful description of each of the steps above, are provided in Appendix A. We have experimented with several variations of the steps above, and through this experience we have converged on the algorithm presented. A few of the details of the algorithm could be improved further, though we view further improvements to be beyond the scope of this paper. We emphasize that this particle approximation algorithm works particularly well for relatively sparse images, and less so for non-sparse (e.g. flat) images. Below we demonstrate, however, that the algorithm can nonetheless be used to extract meaningful quantitative information for datasets of images which are not sparse.
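Step 1 of the procedure above (the weighted K-means) can be sketched as a Lloyd iteration in which each pixel location is weighted by its intensity. This is an illustrative reimplementation with a simple deterministic initialization (the first k pixels), not the authors' code:

```python
import numpy as np

def weighted_kmeans(x, w, k, iters=50):
    """Lloyd's algorithm with sample weights: cluster pixel
    locations x (N, d), with intensities w (N,), into k particles.
    Returns particle locations and normalized masses."""
    centers = x[:k].copy()  # deterministic init: first k pixels
    for _ in range(iters):
        # assign each pixel to its nearest center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d2.argmin(1)
        # move each center to the intensity-weighted mean of its pixels
        for j in range(k):
            sel = labels == j
            if sel.any():
                centers[j] = np.average(x[sel], axis=0, weights=w[sel])
    masses = np.array([w[labels == j].sum() for j in range(k)])
    return centers, masses / masses.sum()

# two well-separated blobs of 'pixels' collapse to two particles
x = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
w = np.array([1.0, 1.0, 2.0, 2.0])
centers, masses = weighted_kmeans(x, w, k=2)
print(sorted(masses))  # masses ≈ [1/3, 2/3]
```

The refinement steps of the full algorithm (adding particles in badly approximated regions, merging where the error is small) then adjust the particle count per image, as described in Appendix A.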

Fig. 3 When measures are discrete as in (4), finding the transportation plan which minimizes the transportation cost (5) necessitates splitting the particles. The arrows are drawn only for positive f_{i,j}.

Translation and rotation normalization

In two dimensions, it suffices to consider translation, rotation, and mirror symmetry. To achieve that, we simply align the center of mass of each particle approximation to a fixed coordinate, and rotate each particle approximation to a standard reference frame according to a principal axis (Hotelling) transform. Each particle approximation is then flipped left to right, and up and down, simply by reversing its coordinates, until the skewness (the third standardized moment) of the coordinates of each sample has the same sign.
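A sketch of this normalization for a 2-D particle set: weighted centering, rotation onto the principal axes of the weighted covariance, then axis flips driven by the sign of the third moment (the raw third moment shares its sign with the standardized one). The details are illustrative choices and may differ from the authors' exact implementation:

```python
import numpy as np

def normalize_pose(x, m):
    """Center a weighted 2-D particle set, rotate it onto its
    principal axes (weighted PCA), then flip each axis so the
    weighted third moment of each coordinate is nonnegative."""
    x = x - np.average(x, axis=0, weights=m)   # translation
    cov = (x * m[:, None]).T @ x / m.sum()     # weighted covariance
    _, vecs = np.linalg.eigh(cov)
    x = x @ vecs                               # principal-axis frame
    for d in range(x.shape[1]):                # coordinate inversions
        if np.average(x[:, d] ** 3, weights=m) < 0:
            x[:, d] = -x[:, d]
    return x

rng = np.random.default_rng(1)
pts = rng.normal(size=(100, 2)) * [3.0, 1.0]  # anisotropic particle cloud
w = np.ones(100) / 100
out = normalize_pose(pts, w)
```

After this step the coordinates are uncorrelated and the flips are fixed, removing most (though, as noted above for rotations, not all) pose variability before the OT computation.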

3.2 OT distance in particle setting

We start by explaining how the general framework of section 2 applies to particle measures (such as the particle approximations obtained above). A particle probability measure µ which approximates the image is given as µ = Σ_{i=1}^{N} m_i δ_{x_i}, where x_i ∈ Ω, m_i ∈ (0, 1], and Σ_{i=1}^{N} m_i = 1. We note that for A ⊂ Ω, µ(A) = Σ_{i : x_i ∈ A} m_i. An integral with respect to the measure µ is ∫_A f(x) dµ(x) = Σ_{i : x_i ∈ A} f(x_i) m_i.

Now we turn to our main goal of the section: obtaining the OT distance. For µ = Σ_{i=1}^{N_µ} m_i δ_{x_i} and ν = Σ_{j=1}^{N_ν} p_j δ_{y_j}, the set of couplings Π(µ, ν) is given by a


set of N_µ × N_ν matrices, as follows:

$$\Pi(\mu, \nu) = \Bigg\{ \sum_{i=1}^{N_\mu} \sum_{j=1}^{N_\nu} f_{i,j} \, \delta_{(x_i, y_j)} \; : \; f_{i,j} \ge 0 \text{ for } i = 1, \dots, N_\mu, \; j = 1, \dots, N_\nu, \quad \sum_{j=1}^{N_\nu} f_{i,j} = m_i \text{ for } i = 1, \dots, N_\mu, \quad \sum_{i=1}^{N_\mu} f_{i,j} = p_j \text{ for } j = 1, \dots, N_\nu \Bigg\}.$$

Since it is clear from the context, we will make no distinction between measures in Π(µ, ν) and matrices f = [f_{i,j}] that satisfy the conditions above.

The optimal transportation distance between µ and ν defined in (1) is the solution to the following linear programming problem:

$$d_W^2(\mu, \nu) = \min_{f \in \Pi(\mu, \nu)} \sum_{i=1}^{N_\mu} \sum_{j=1}^{N_\nu} |x_i - y_j|^2 \, f_{i,j}. \qquad (5)$$

See Figure 3 for a visualization. For convenience, we utilize Matlab's implementation of a variation of Mehrotra's dual interior point method [23] to solve the linear program. We note, however, that better-suited alternatives, which take advantage of the special structure of the linear programming problem, exist and could be used [26,32] to increase the computational efficiency of the method even further. The solution of the linear program above is relatively expensive [43], with typical computation times being around 30 seconds (with 500 particles per image) on an Apple laptop with a 2.2 GHz Intel dual-core processor and 2 GB of RAM.

For datasets containing tens of thousands of images this is practical only if we need to compute a single OT distance per image (the one to the reference image), and is prohibitive if we need the pairwise distances between all images. This is the reason we introduce the linear optimal transportation framework to compute a related distance between µ and ν.

3.3 LOT distance in particle setting

Computing the linear optimal transportation (LOT) distance requires setting a reference template σ, which we also choose to be a particle measure, σ = ∑_{k=1}^{N_σ} q_k δ_{z_k} (how the reference template is computed is described in Section 3.4). The distance between µ and ν given by (3) becomes, in the discrete setting:

Fig. 4 Given f and g in (6), for each fixed k, h_{k,i,j} gives the optimal transportation plan between the 'f-image' and the 'g-image' of the particle at z_k. Arrows are drawn only for positive coefficients, and only for the particle at z_k.

d²_{LOT,σ}(µ, ν) = min_{f ∈ Π_OT(σ,µ), g ∈ Π_OT(σ,ν)} min_h { ∑_{k=1}^{N_σ} ∑_{i=1}^{N_µ} ∑_{j=1}^{N_ν} h_{k,i,j} |x_i − y_j|² :
  h_{k,i,j} ≥ 0,  ∑_{i=1}^{N_µ} h_{k,i,j} = g_{k,j},  ∑_{j=1}^{N_ν} h_{k,i,j} = f_{k,i} }.   (6)

We remark that the sets of optimal transportation plans, Π_OT(σ, µ) and Π_OT(σ, ν), typically have only one element, so there is only one possibility for f and g. See Figure 4 for a visualization. In the case of particle measures, however, there is usually more than one possibility for h.

We now introduce the 'distance' which is an approximation of the one above (hence we denote it as d_aLOT) and which is used in most computations in this paper. Namely, given OT plans f and g, as indicated in Figure 5, we replace the f-image of the particle at z_k, namely the measure ∑_i f_{k,i} δ_{x_i}, by one particle at its center of mass, namely the measure q_k δ_{x̄_k}. We recall from Section 2.2 that if σ, µ and ν were measures which had a density function (and hence no particles), then there would exist an optimal transportation map and the image of any z would be a single point x. On the discrete level we have an approximation of that situation, and typically the f-image of the particle at z_k is spread over a few nearby particles. Thus the error made by replacing the f-image by a single particle at the image's center of mass is typically small.

To precisely define the new distance, let f be an optimal transportation plan between σ = ∑_{k=1}^{N_σ} q_k δ_{z_k} and µ = ∑_{i=1}^{N_µ} m_i δ_{x_i} obtained in (5), and let g be an optimal transportation plan between σ and ν = ∑_{j=1}^{N_ν} p_j δ_{y_j}. Then

x̄_k = (1/q_k) ∑_{i=1}^{N_µ} f_{k,i} x_i   and   ȳ_k = (1/q_k) ∑_{j=1}^{N_ν} g_{k,j} y_j   (7)

are the centroids of the forward images of the particle q_k δ_{z_k} under the transportation plans f and g, respectively (see Figure 5). We define

d²_{aLOT,σ}(µ, ν) = min_{f ∈ Π_OT(σ,µ), g ∈ Π_OT(σ,ν)} ∑_{k=1}^{N_σ} q_k |x̄_k − ȳ_k|².   (8)

We clarify that computing (8) in practice does not require a minimization over f and g, since f and g are unique with probability one. The reason is that in the discrete case the condition for optimality of transportation plans can be formulated in terms of cyclic monotonicity (see [40]). Nonuniqueness of the OT plan can only happen if the inequalities in some cyclic monotonicity conditions become equalities, which means that the coordinates of the particles satisfy an algebraic equation; this happens with probability zero. For example, if both measures have exactly two particles of the same mass, then the condition for nonuniqueness is that |x_1 − y_1|² + |x_2 − y_2|² = |x_2 − y_1|² + |x_1 − y_2|².

Thus we compute

d²_{aLOT,σ}(µ, ν) = ∑_{k=1}^{N_σ} q_k |x̄_k − ȳ_k|².   (9)
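Given the optimal plans f and g from (5), evaluating the centroids (7) and the distance (9) is a direct computation. A minimal NumPy sketch (the function name is ours):

```python
import numpy as np

def alot_distance_sq(f, g, x, y, q):
    """Approximate LOT distance (9): replace the forward images of each
    template particle z_k by their centroids (7), then take the q-weighted
    squared distance between the centroids."""
    xbar = (f @ x) / q[:, None]   # row k: (1/q_k) sum_i f_{k,i} x_i
    ybar = (g @ y) / q[:, None]   # row k: (1/q_k) sum_j g_{k,j} y_j
    return float(np.sum(q * np.sum((xbar - ybar) ** 2, axis=1)))
```

Here f has shape (N_σ, N_µ), g has shape (N_σ, N_ν), x and y hold the particle coordinates as rows, and q holds the template masses.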

One should notice that the d_{aLOT,σ} distance between two particle measures may be zero even when µ ≠ ν. This is related to the 'resolution' achieved by the base measure σ: in particular, if σ has more particles, then it is less likely that d_{aLOT,σ}(µ, ν) = 0 for µ ≠ ν. It is worth observing that d²_{aLOT,σ}(µ, ν) computed in (9) is always less than or equal to d²_{LOT,σ}(µ, ν) introduced in equation (6).

Furthermore, if we consider measures σ, µ, and ν with continuous densities and approximate them by particle measures σ_n, µ_n, and ν_n, then as the number of particles goes to infinity, d_{LOT,σ_n}(µ_n, ν_n) → d_{LOT,σ}(µ, ν), where the latter object is defined in (2). This follows from the stability of optimal transport [40].

Let us also remark that, while we use the linearprogramming problem as described in (5) to computethe optimal transportation plan, other methods can beused as well. In particular the LOT distance is appli-cable even when a different approach to computing theOT plan is used.

Fig. 5 For the LOT distance we replace the full f-image and the g-image of the particle at z_k by their centers of mass x̄_k and ȳ_k, (7). When there are many particles, the images of the particle at z_k tend to be concentrated on a few nearby particles. Thus the error introduced by replacing them by their centers of mass is small.

3.4 Template selection

Selecting the template (reference) σ above is important since, in our experience, a template that is distant from all images (particle approximations) is likely to increase the difference between d_W and d_aLOT. In the results presented below, we use a reference template computed from the 'average' image. To compute this average image, we first align all images to remove translation, rotation, and horizontal and vertical flips, as described in Section 3.3. All images are then averaged, and the particle approximation algorithm described above is used to obtain a particle representation of the average image. We denote this template as σ and note that it contains N_σ particles. Once the particle approximation for the template is computed, it is also normalized to a standard position and orientation as described above.
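The template construction just described can be sketched as follows. Here `normalize` and `approximate` are placeholders standing in for the paper's alignment step and the particle approximation algorithm of Section 3.1; both names are ours, shown only to fix the order of operations:

```python
import numpy as np

def compute_template(images, normalize, approximate):
    """Sketch of Section 3.4: align, average, then particle-approximate.
    `normalize` and `approximate` are hypothetical stand-ins for the
    paper's alignment and particle approximation routines."""
    aligned = [normalize(im) for im in images]  # remove translation/rotation/flips
    average = np.mean(aligned, axis=0)          # pixel-wise 'average' image
    z, q = approximate(average)                 # particle positions z_k and masses q_k
    return z, q
```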

3.5 Isometric linear embedding

When applying our approach to a given set of images (measures) I_1, …, I_M, we first compute the particle approximation for each image with the algorithm described in Section 3.1. We then compute a template σ (also chosen to be a particle measure) as described in Section 3.4, and compute the OT distance (5) between the template and the particle approximation of each image. Once these are computed, the LOT distance between any two particle sets is given by (9).

The lower bound of the linear optimal transportation distance, d_{aLOT,σ} defined in equation (9), provides a method to map a sample measure ν_n (estimated from image I_n) into a linear space. Let ν_n = ∑_{j=1}^{N_{ν_n}} m_j δ_{y_j}, and recall that the reference measure is σ = ∑_{k=1}^{N_σ} q_k δ_{z_k}.

The linear embedding is obtained by applying the discrete transportation map between the reference measure σ and ν_n to the coordinates y_j via

x_n = ( √q_1 a¹_n, …, √q_{N_σ} a^{N_σ}_n )^T   (10)

where a^k_n is the centroid of the forward image of the particle q_k δ_{z_k} under the optimal transportation plan g_{k,j} between σ and ν_n:

a^k_n = (1/q_k) ∑_{j=1}^{N_{ν_n}} g_{k,j} y_j.   (11)

This results in an N_σ-tuple of points in R², which we call the linear embedding x_n of ν_n; that is, x_n ∈ R^{N_σ×2}. We note that the embedding is interpretable in the sense that any point in this space can be visualized by simply plotting the vector coordinates (each in R²) in the image space Ω.
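The map (10)-(11) reduces to a weighted matrix product. A minimal sketch (function name ours), whose pairwise Euclidean distances reproduce (9) when both measures are embedded against the same template:

```python
import numpy as np

def lot_embedding(g, y, q):
    """Linear embedding (10)-(11): row k is sqrt(q_k) * a^k_n, where
    a^k_n is the centroid of the forward image of q_k delta_{z_k}
    under the plan g between the template sigma and nu_n."""
    a = (g @ y) / q[:, None]        # a^k_n = (1/q_k) sum_j g_{k,j} y_j
    return np.sqrt(q)[:, None] * a  # an (N_sigma x 2) array
```

The squared Frobenius distance between the embeddings of two measures then equals their d²_aLOT distance.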

4 Statistical analysis

When a linear embedding for the data can be assumed, standard geometric data processing techniques such as principal component analysis (PCA) can be used to extract and visualize major trends of variation in morphology [8,33,38]. Briefly, given a set of data points x_n, for n = 1, …, M, we can compute the covariance matrix S = (1/M) ∑_n (x_n − x̄)(x_n − x̄)^T, with x̄ = (1/M) ∑_{n=1}^M x_n representing the center of the entire data set. PCA computes the major trends of a dataset (the directions over which the projection of the data has the largest variance) via the solution of the following optimization problem:

w*_PCA = arg max_{‖w‖=1} w^T S w.   (12)

The problem above can be solved via eigenvalue and eigenvector decomposition, with each eigenvector corresponding to a major mode of variation in the dataset (its corresponding eigenvalue is the variance of the data projected onto that direction).
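A direct implementation of (12) via the eigendecomposition of S; this is a generic sketch, not tied to the paper's code:

```python
import numpy as np

def pca_modes(X):
    """PCA as in (12): eigendecomposition of the covariance matrix S.
    X holds one data point x_n per row; returns eigenvalues in
    descending order and the matching eigenvectors as columns."""
    Xc = X - X.mean(axis=0)         # subtract the center x_bar of the data set
    S = (Xc.T @ Xc) / X.shape[0]    # covariance matrix S
    vals, vecs = np.linalg.eigh(S)  # S is symmetric, so eigh applies
    order = np.argsort(vals)[::-1]  # sort modes by explained variance
    return vals[order], vecs[:, order]
```

For data lying exactly on a line, the first eigenvector points along the line and the remaining eigenvalues vanish.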

In addition to visualizing the main modes of variation for a given dataset, important applications involve visualizing the modes of variation that best discriminate between two or more separate populations (e.g. as in control vs. effect studies). To that end, we apply the methodology we developed in [41], based on the well-known Fisher linear discriminant analysis (FLDA) technique [14]. As explained in [41], simply applying the FLDA technique in morphometry visualization problems can lead to erroneous interpretations, since the directions computed by the FLDA technique are not

constrained to pass through the data. To alleviate this effect we modified the method as follows. Briefly, given a set of data points x_n, for n = 1, …, N, with each index n belonging to a class c, we modified the original FLDA by adding a least squares-type representation penalty to the function being optimized. The representation-constrained optimization can then be reduced to

w*_LDA = arg max_w  (w^T S_T w) / (w^T (S_W + αI) w)   (13)

where S_T = ∑_n (x_n − x̄)(x_n − x̄)^T represents the 'total scatter matrix', S_W = ∑_c ∑_{n∈c} (x_n − x̄_c)(x_n − x̄_c)^T represents the 'within-class scatter matrix', and x̄_c is the center of class c. The solution to the problem above is given by the well-known generalized eigenvalue decomposition S_T w = λ(S_W + αI)w [7]. In short, the penalized LDA method above seeks a direction w that has both a low reconstruction error and that best discriminates between the two populations. We use the method described in [41] to select the penalty weight α.
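The generalized eigenproblem S_T w = λ(S_W + αI)w can be solved directly. A minimal sketch assuming SciPy; in practice the weight α would be chosen as in [41], and the value here is only a placeholder:

```python
import numpy as np
from scipy.linalg import eigh

def penalized_lda_direction(X, labels, alpha=1.0):
    """Penalized LDA direction (13), via the generalized eigenproblem
    S_T w = lambda (S_W + alpha I) w."""
    xbar = X.mean(axis=0)
    ST = (X - xbar).T @ (X - xbar)          # total scatter matrix
    SW = np.zeros_like(ST)                  # within-class scatter matrix
    for c in np.unique(labels):
        Xc = X[labels == c]
        SW += (Xc - Xc.mean(axis=0)).T @ (Xc - Xc.mean(axis=0))
    # eigh solves the symmetric-definite generalized eigenproblem;
    # eigenvalues come back in ascending order, so take the last column.
    vals, vecs = eigh(ST, SW + alpha * np.eye(ST.shape[0]))
    return vecs[:, -1]
```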

5 Computational results

In this section, we describe results of applying the LOT method to quantify the statistical variation of three types of datasets. We analyze (sub)cellular protein patterns and organelles from microscopy images, visualize the variation of shape and brightness in a galaxy image database, and characterize the variation of expressions in a facial image database. We begin by introducing the datasets used. For the cellular image datasets, we then show results that 1) evaluate the particle approximation algorithm described earlier, 2) evaluate how well the LOT distance approximates the OT distance, 3) evaluate the discrimination power of the LOT distance (in comparison to more traditional feature-based methods), and 4) use the linear embedding provided by the LOT to visualize the geometry (summarizing trends), including discrimination properties. For both the galaxy and facial images, we evaluate the performance of the particle approximation and the discrimination power of the LOT metric, and visualize the discriminating information. For the facial image dataset, we also visualize the summarizing trends of variation of facial expressions. In the case of facial expressions, the results show that LOT obtains a similar quantification of expression as the original paper [36], yet LOT has the distinct advantage of being landmark free.


Fig. 6 Sample images from the data sets used. A: Sample images from the liver nuclear chromatin data set. B: Sample images from the Golgi protein data set. C: Sample images from the galaxy image data set. D: Sample images from the facial expression data set.

5.1 Datasets and pre-processing

The biological imaging dataset has two sub-datasets. The nuclear chromatin patterns were extracted from histopathology images obtained from tissue sections. Tissue for imaging was obtained from the tissue archives of the Children's Hospital of Pittsburgh of the University of Pittsburgh Medical Center. The extraction process included a semi-automatic level set-based segmentation, together with a color to gray scale conversion and pixel normalization. The complete details related to this dataset are available in our previous work [43], [42]. The dataset we use in this paper consists of five human cases of liver hepatoblastoma (HB), a liver tumor in children, each containing adjacent normal liver tissue (NL). In total, 100 nuclei were extracted from each case (50 NL, 50 HB). The second sub-dataset we use consists of fluorescent images of two golgi proteins: giantin and GPP130. The dataset is described in detail in [9]. In total, we utilize 81 GPP and 66 giantin protein patterns. The galaxy dataset we use contains two types of galaxy images: 225 elliptical galaxies and 223 spiral galaxies. The dataset is described in [34]. The facial expression dataset is the same as described in [36]. We use only the normal and smiling expressions, with each expression containing 40 images. Sample images for the nuclear chromatin, golgi, galaxy and facial expression datasets are shown in Figure 6(A), (B), (C) and (D), respectively.

5.2 Particle approximation error

Here we quantify how well the particle-based algorithm described above can approximate the different datasets we use in this paper. Rather than an absolute figure, of interest is the OT error between each image and its particle approximation, relative to the OT distances that are present in a dataset. Figure 7(A) gives an indication of the relative error (in percent) for the particle approximation of each image in each dataset. For each image in a dataset, we compute the error

e_i = φ_i / ( (1/M) ∑_{j=1}^{M} d_{i,j} ),

where d_{i,j} is the OT distance computed through equation (5), and φ_i is the upper bound on the OT distance between image i and its particle approximation (defined in equation (16) in the appendix). The computation is repeated for M images chosen randomly from a given dataset. Figure 7(A) contains the histogram of the errors for all the datasets: the x-axis represents the relative error in percent, and the y-axis represents the portion of samples with the corresponding relative error. Although, by definition, we can always reduce the approximation errors by adding more particles, considering the computational cost we use 500 initial particles for the nuclei, golgi and galaxy data sets, and 2000 initial particles for the face data set. The data show that the average errors were 14.1% for the nuclear chromatin dataset, 3.2% for the golgi protein dataset, 13.1% for the galaxy image dataset and 18.5% for the face data set.

5.3 Difference between OT and LOT

We compare the LOT distance given in equation (9) with the OT distance defined in equation (5). To that end, we compute the full OT distance matrix (all pairwise distances) for both biological image datasets and compare it to the LOT distance via e_{OT,LOT} = (d_OT − d_LOT)/d_OT. Figure 7(B) contains the relative error for the nuclear chromatin and golgi datasets. The average absolute error is e_{OT,LOT} = 2.94% for the nuclear chromatin data, and e_{OT,LOT} = 3.28% for the golgi dataset.


Fig. 7 Histograms of particle approximation error and LOT error. A: Histogram of the relative error generated during the particle approximation process. B: Histogram of the relative errors between LOT and OT for the nuclei and golgi data sets. In both panels the x-axis represents the relative error in percent, and the y-axis the portion of samples with the corresponding relative error. In B, the dotted line represents the relative errors for the liver nuclear chromatin dataset, and the solid line the relative errors for the Golgi dataset.

5.4 Discrimination power

In [43], we have shown that the OT metric can be used to capture nuclear morphological information to classify different classes of liver and thyroid cancers. We now show that the linear OT approximation we propose above can be used for discrimination purposes with little or no loss in accuracy when compared to traditional feature-based approaches. To test classification accuracy we use a standard implementation of the support vector machine method [7] with a radial basis function (RBF) kernel.
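Since the classifier only needs pairwise distances, one convenient route is a precomputed kernel. The following is a hypothetical sketch using scikit-learn (the paper's SVM implementation is the one from [7], and `gamma` here is only an illustrative parameter):

```python
import numpy as np
from sklearn.svm import SVC

def svm_from_distances(D, labels, gamma=1.0):
    """Train an SVM on an RBF kernel built from a precomputed pairwise
    distance matrix D (e.g. LOT distances): K = exp(-gamma * D^2)."""
    K = np.exp(-gamma * D ** 2)
    clf = SVC(kernel="precomputed")
    clf.fit(K, labels)
    return clf, K
```

At prediction time, the kernel matrix between test and training samples (built the same way) is passed to `clf.predict`.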

For the nuclear chromatin in the liver cancer dataset we followed the same leave-one-case-out cross-validation approach described in [43]. The classification results are listed in Table 1. For comparison, we also computed classification results using a numerical feature approach, where the same 125 numerical features (including shape parameters, Haralick features, and multi-resolution-type features) as described in [43] were used. Stepwise discriminant analysis was used to select the significant features for classification (12 of them were consequently selected). In addition, we also provide classification accuracies utilizing the method described in [28], which we denote here as EMD-L1. The EMD-L1 in this case was applied to measure the L1 weighted earth mover's distance (EMD) between image pairs. The radius above which pixel displacements were not computed was set to 15 for fast computation. When using the EMD-L1 method, each image was downsampled by a factor of four, after Gaussian filtering, to allow the computations to be performed in a reasonable time frame (see the discussion section for a comparison and discussion of computation times).

Classification results are shown in Table 1. We note that because we used a kernel-based SVM method, the only difference between all implementations was the pairwise distances computed (OT vs. LOT vs. EMD-L1 vs. feature-based). In Table 1, each row corresponds to a test case, and the numbers correspond to the average classification accuracy (per nucleus) for normal liver and liver cancer. The first column contains the classification accuracy for the feature-based approach, the second column the accuracy for the OT metric, the third column the accuracy for the LOT metric, and the final column contains the computations utilizing the EMD-L1 approach [28]. We followed a similar approach for comparing classification accuracies in the golgi protein dataset. In this case we utilized the feature set described in [9], which was specifically designed for classification tasks of this type. A five-fold cross-validation strategy was used. Results are shown in Tables 2 and 3.

We also computed the LOT metric for the galaxy and facial expression datasets and applied the same support vector machine (SVM) strategy. In the galaxy case we utilized the feature set described in [34]. A five-fold cross-validation strategy was used. Compared with the accuracy of the feature-based approach (93.6%), the classification accuracy of the LOT metric was 87.7%. For the facial expression data set, the classification was performed based on an SVM with the LOT metric, and a 90% classification accuracy was obtained.


Table 1 Average classification accuracy in liver data

          Feature   OT      LOT     EMD-L1
Case 1    89%       87%     86%     84%
Case 2    92%       88%     89%     86%
Case 3    94%       91%     90%     87%
Case 4    80%       87%     85%     84%
Case 5    71%       76%     74%     76%
Average   85.2%     85.8%   84.8%   83.4%

Table 2 Average classification accuracy in golgi protein data

        Feature           OT                LOT
        gia     gpp       gia     gpp       gia     gpp
gia     79.2%   20.8%     86.2%   13.8%     83.8%   16.2%
gpp     28.6%   71.4%     35.6%   64.3%     30.7%   69.3%

Table 3 Average classification accuracy for EMD-L1 in golgi protein data

        EMD-L1
        gia     gpp
gia     85.8%   14.2%
gpp     31.3%   68.7%

5.5 Visualizing summarizing trends and discriminatinginformation

We applied the PCA technique as described in Section 4 to the nuclei and facial expression data sets. The first three modes of variation for the nuclear chromatin dataset (containing both normal and cancerous cells) are displayed in Figure 8(A). The first three modes of variation for the facial expression dataset are shown in Figure 8(B). For both datasets, the first three modes of variation correspond to over 99% of the variance of the data. The modes of variation are displayed from the mean to plus and minus four times the standard deviation of each corresponding mode. For the chromatin dataset we can visually detect size (first mode), elongation (second mode), and differences in chromatin concentration from the center to the periphery of the nucleus (third mode). In the face dataset, since we did not normalize for the size of the face, one can detect variations in size (first mode), facial hair texture (second mode), as well as face shape (third mode). The variations detected from the second and third modes are similar to the results reported in [36].

We also applied the method described in Section 4 to visualize the most significant differences between the two classes in each dataset. The most discriminating modes for the nuclear chromatin, golgi, galaxy and facial expression datasets are shown in Figure 9(A,B,C,D), respectively. In the nuclear chromatin case, the normal tissue cells tend to look more like the images on the left, while the cancerous ones tend to look more like the images on the right. In this case we can see that the most discriminating effect seems to be related to differences in chromatin placement (near the periphery vs. concentrated at the center). The p-value for this direction was 0.019. For the golgi protein dataset, the discriminating direction (p = 0.049) shows that the giantin golgi protein tends to be more scattered than the gpp protein, which tends to be more elongated in its spatial distribution. For the galaxy dataset, the discriminant direction (p = 0.021) seems to vary from a spiral structure to a bright dense disk. For the facial expression dataset, the discriminating direction (p = 0.011) shows a trend from a smiling face to a serious or neutral facial expression.

6 Summary and Discussion

We described a new method for quantifying and visualizing morphological differences in a set of images. Our approach, called the LOT, is based on the optimal transportation framework with quadratic cost, and we utilize a linearized approximation based on a tangent space representation to make the method amenable to large datasets. Our current implementation is based on a discrete 'particle' approximation of each pattern, with subsequent linear programming optimization, and is therefore best suited for images which are sparse. We evaluated the efficacy of our method in several respects. As our results show, the error between an image and its particle approximation, relative to the distances in the dataset, is relatively small for a moderate number of particles when the images are sparse (e.g. golgi protein images), and can be made small for general images if more time is allowed for computation. We also showed that the LOT distance closely approximates the standard OT distance between particle measures, with absolute percentage errors on the order of a few percent in the datasets we used. In experiments not shown, we also evaluated the reproducibility of our particle-based LOT computation with respect to the random particle initializations required by the algorithm. These experiments revealed that the average coefficient of variation was on the order of a couple of percent. Finally, we also evaluated how well the LOT distance can perform in classifying sets of images in comparison to standard feature-based methods, the traditional OT distance [43], and the EMD-L1 method of [28]. Overall, considering all classification tests, no significant loss of accuracy was detected.

A major advantage of the LOT framework is the re-duced cost of computing the LOT distance between allimage pairs in a dataset. For a database of M images,


Fig. 8 The modes of variation given by the PCA method combined with the LOT framework, for the nuclear chromatin (A) and facial expression (B) datasets. Each row corresponds to a mode of variation, starting from the first PCA mode (top).

Fig. 9 Modes of discrimination computed using the penalized LDA method in combination with the LOT framework. Each row contains the mode of variation that best discriminates the two classes in each dataset. Parts (A),(B),(C),(D) refer to discrimination modes in the nuclear morphology, golgi protein, galaxy morphology, and facial expression datasets, respectively.

Section 2.3 explains that the number of transportation-related optimization problems that need to be solved for computing all pairwise distances is M. In comparison, to our knowledge, the number of transportation-related optimizations necessary for this purpose with all other available methods is M(M − 1)/2. In concrete terms, if all computations were performed using a single Apple laptop computer with 2GB of RAM and a 2.2GHz dual core processor, the total computation time for producing all pairwise distances needed for the results shown in Table 1 of our paper, for example, would be approximately 4.1 hours for the LOT method, 15.6 hours for the EMD-L1 method [28], and 1200 hours for the OT method described in [43]. We


clarify that in order to utilize the EMD-L1 method, images had to be downsampled to allow the computations to be performed in reasonable times. The time quoted above (and the results shown in Tables 1 and 3) was computed by downsampling each image by a factor of four (after spatial filtering). We clarify again, however, that we have used a generic linear programming package for solving the OT optimizations associated with our LOT framework. The computation times for the LOT framework could be decreased further by utilizing linear programming techniques more suitable for this problem.

In addition to fast computation, a significant innovation of the LOT approach is the convenient isometric linear embedding it provides. In contrast to other methods that also obtain linear embeddings (see for example [31,37]), the LOT embedding space allows one to synthesize and visualize any point in the linear space as an image. Therefore the LOT embedding can facilitate visualization of the data distribution (via PCA) as well as differences between distributions (via penalized LDA). Hence, it allows for the automatic visualization of differences in both shape and texture in different classes of images. In biological applications, such meaningful visualizations can be important in helping elucidate effects and mechanisms automatically from the image data. For example, the results with the nuclear chromatin dataset suggest that, in this disease, cancer cells tend to have their chromatin more evenly spread (euchromatin) throughout the nucleus, whereas normal cells tend to have their chromatin in more compact form (heterochromatin). This suggests that the loss of heterochromatin is associated with the cancer phenotype in this disease. This finding is corroborated by earlier studies [42,43] (which used different methodology), as well as other literature which suggests that cancer progression is associated with loss of heterochromatin [25,13].

We note that the use of the quadratic cost in (1) allows for a Riemannian geometric interpretation of the space of images. With this point of view, the LOT framework we describe can be viewed as a tangent plane approximation of the OT manifold, and the linear embedding mentioned above can be viewed as a projection onto the tangent plane. While computing theoretical bounds for the difference between the OT and LOT distances is difficult, since it would depend on the curvature of the OT manifold (which we have no way of estimating for a real distribution of images), several important facts about this approximation are worth noting. Firstly, the LOT is a proper mathematical distance on its own, even when the tangent plane approximation is poor. Secondly, the projection is a one-to-one mapping. Thus information cannot be

lost by projecting two images onto the same point. Finally, we note that utilizing the LOT distances (instead of the OT ones) results in no significant loss in classification accuracy. This suggests that, for these datasets, there is no evidence to prefer the OT space over the LOT one.

As mentioned above, the LOT embedding obtained via the tangent space approximation can facilitate the application of geometric data analysis methods such as PCA, enabling one to visualize interesting variations in texture and shape in the dataset. Compared to applying the PCA method directly on pixel intensities (data not shown for brevity), the PCA method applied with the LOT yields sharper and easier to interpret variations, since it can account for both texture and shape variations via the transportation of mass. As already noted above, several other graph-based methods exist for the same purpose [31,37]. However, in contrast to our LOT method, these are not analytic, meaning that only measured images can be visualized. Moreover, the LOT distances we provide here can be used as 'local' distances, based on which such graphs can be constructed. Thus both classes of methods could be used jointly, for even better performance (one example is provided in [30]).

The LOT framework presented here depends on a 'particle' approximation of the input images. The advantage of such a particle approximation is that the underlying OT optimization becomes a linear program whose global optimum can be found exactly with standard methods. Drawbacks include the fact that such an approach can become computationally intensive for non-sparse images. We investigated the average classification accuracy for the experiments described in the previous section as a function of the number of particles used to approximate each image. For the experiment involving nuclear chromatin patterns, for example, the average classification accuracy for the LOT method using 300, 500, and 900 particles was 79%, 83.3%, and 83.4%, respectively. We note that for the golgi protein dataset, utilizing the LOT with 100 particles produced nearly the same accuracy as reported in Table 2. These results, together with the comparison to other methods provided above, suggest the particle approximation was sufficient to classify these datasets.

Finally, we wish to point out several limitations inherent in the approach we describe here. First, as already noted, the particle approximation is best able to handle images which are sparse. For images that are not sparse, many more particles are needed to approximate the image well, thus increasing significantly the computational cost of the method. In addition, like all transportation-related distances, our approach is not


able to handle transport problems which include bound-aries. Instead the approach is best suited for analyzingimages whose intensities can be viewed as a distribu-tion of mass that is ‘free’ to move about the image.In addition, we note that transportation-type distancesare not good at telling the difference between smoothversus punctate patterns. For this specific task, otherapproaches (such as Fourier analysis, for example), maybe more fruitful. Finally, we note that for some typesof images, the traditional OT and the LOT can yieldvery different distances. This could occur when the im-ages to be compared have very sharp transitions, ofhigh frequency, and if the phase of these transitions aremismatched. We note that this is rarely the case in theimages we analyze in this paper. These and other as-pects of our current LOT framework will be the subjectof future work.

Acknowledgements:

The authors wish to thank the anonymous reviewers for helping significantly improve this paper. WW, SB, and GKR acknowledge support from NIH grants GM088816 and GM090033 (PI GKR) for supporting portions of this work. DS was also supported by NIH grant GM088816, as well as NSF grant DMS-0908415. He is also grateful to the Center for Nonlinear Analysis (NSF grant DMS-0635983 and NSF PIRE grant OISE-0967140) for its support.

Appendix: Particle approximation algorithm

Here we present the details of the particle approximation algorithm outlined in Section 3.1. The goal of the algorithm is to approximate a given probability measure µ on Ω by a particle measure with approximately N particles, where N is given. For images where the estimated error of the initial approximation is much larger than the typical error over the given dataset, the number of particles is increased to reduce the error, while for images where the error is much smaller than typical, the number of particles is reduced in order to save time when computing the OT distance. The precise criterion and implementation for adjusting the number of particles is described in step 8 below.

In our application the measure µ represents the image. It can itself be thought of as a particle measure, µ = ∑_{j=1}^{Nµ} m_j δ_{y_j}, where Nµ is the number of pixels, y_j are the coordinates of the pixel centers, and m_j are the intensity values. Our goal, however, is to approximate it with a particle measure with far fewer particles. Below we use the fact that the restriction of the measure µ to a set V is defined as µ|_V = ∑_{j=1, y_j∈V}^{Nµ} m_j δ_{y_j}. The backbone of the algorithm rests on the following mathematical observations.

Observation 1. The best approximation for fixed particle locations. Assume that x_1, . . . , x_N are fixed. Consider the Voronoi diagram with centers x_1, . . . , x_N, and let V_i be the cell of the Voronoi diagram that corresponds to the center x_i. Then among all measures of the form ∑_{i=1}^{N} m_i δ_{x_i}, the one that approximates µ best is

∑_{i=1}^{N} µ(V_i) δ_{x_i} = argmin { d_W(µ, µ_N) : µ_N = ∑_{i=1}^{N} m_i δ_{x_i} with m_i ≥ 0 and ∑_{i=1}^{N} m_i = µ(Ω) }.   (14)

Let us remark that if µ is a particle measure as above, then µ(V_i) is just the sum of the intensities of all pixels whose centers lie in V_i.
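Observation 1 can be sketched in a few lines for a pixel-level particle measure: assign each pixel to its nearest center (its Voronoi cell) and sum the intensities per cell. This is a minimal illustration in our own notation, not the paper's implementation.

```python
def voronoi_masses(pixels, masses, centers):
    """Optimal particle masses for fixed centers (Observation 1):
    m_i = mu(V_i), the total pixel mass in each Voronoi cell."""
    def sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    m = [0.0] * len(centers)
    for p, w in zip(pixels, masses):
        # the Voronoi cell of a pixel is simply its nearest center
        i = min(range(len(centers)), key=lambda k: sq(p, centers[k]))
        m[i] += w
    return m

# toy image: four pixels split between two candidate particle locations
pixels = [(0.0, 0.0), (0.0, 1.0), (3.0, 0.0), (3.0, 1.0)]
masses = [0.1, 0.2, 0.3, 0.4]
centers = [(0.0, 0.5), (3.0, 0.5)]
cell_masses = voronoi_masses(pixels, masses, centers)
```

Note that the total mass is preserved by construction, matching the constraint ∑ m_i = µ(Ω) in (14).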

Given a measure µ supported on Ω, we define its center of mass to be

x̄(µ) = (1/µ(Ω)) ∫_Ω x dµ.   (15)

Observation 2. The best local 'recentering' of particles. It is easy to prove that among all one-particle measures, the one that approximates µ best is the one placed at the center of mass of µ:

µ(Ω) δ_{x̄(µ)} = argmin { d_W(µ, µ(Ω) δ_y) : y ∈ R^n }.

We can apply this to 'recenter' the delta measure in each Voronoi cell:

∑_{i=1}^{N} µ(V_i) δ_{x̄(µ|_{V_i})} = argmin { ∑_{i=1}^{N} d_W(µ|_{V_i}, µ(V_i) δ_{y_i}) : y_i ∈ R^n }.

Here µ|_{V_i} is the restriction of the measure µ to the set V_i; more precisely, for A ⊂ Ω, µ|_{V_i}(A) = µ(A ∩ V_i).
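One recentering step amounts to moving each center to the mass-weighted centroid of the pixels in its Voronoi cell. A minimal sketch (names and toy data are ours):

```python
def recenter(pixels, masses, centers):
    """One Lloyd recentering step (Observation 2): move each center to
    the center of mass of mu restricted to its Voronoi cell."""
    def sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    sums = [[0.0, 0.0] for _ in centers]   # mass-weighted coordinate sums
    tot = [0.0] * len(centers)             # total mass per cell
    for p, w in zip(pixels, masses):
        i = min(range(len(centers)), key=lambda k: sq(p, centers[k]))
        sums[i][0] += w * p[0]
        sums[i][1] += w * p[1]
        tot[i] += w
    # empty cells keep their old center
    return [
        (s[0] / t, s[1] / t) if t > 0 else tuple(c)
        for s, t, c in zip(sums, tot, centers)
    ]

# toy example: two pixels of unequal mass pull the single center to (0, 0.75)
new_centers = recenter([(0.0, 0.0), (0.0, 1.0)], [1.0, 3.0], [(0.0, 0.4)])
```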

Observation 3. The error of the current approximation is easy to estimate. Given a particle approximation as above, one can compute a good upper bound on the error. In particular,

d_W^2(µ, ∑_{i=1}^{N} µ(V_i) δ_{x̄(µ|_{V_i})}) ≤ ∑_{i=1}^{N} d_W^2(µ|_{V_i}, µ(V_i) δ_{x̄(µ|_{V_i})}) = ∑_{i=1}^{N} ∫ |x − x̄(µ|_{V_i})|^2 dµ|_{V_i}.   (16)

If µ = ∑_{j=1}^{Nµ} m_j δ_{y_j}, the upper bound on the right-hand side becomes ∑_{i=1}^{N} ∑_{j=1, y_j∈V_i}^{Nµ} m_j |y_j − x̄_i|^2, where x̄_i = (∑_{j=1, y_j∈V_i}^{Nµ} m_j y_j) / (∑_{j=1, y_j∈V_i}^{Nµ} m_j). We should also note that the estimate on the right-hand side gives very useful information on the 'local' error in each cell V_i. This enables us to determine which cells need to be refined.

Based on these observations, we use the following 'weighted Lloyd' algorithm to produce a particle approximation to the measure µ. The idea is to use enough particles to approximate each image well (according to the transport criterion), but not more than necessary given a chosen accuracy. The steps of our algorithm are:

1. Distribute N particles (with N an arbitrarily chosen number), x_1, . . . , x_N, over the domain Ω by weighted random sampling (without replacement) from the measure µ, with weights given by the intensity values of the image.

2. Compute the Voronoi diagram for the centers x_1, . . . , x_N. Based on Observation 1, set m_i = µ(V_i). The measure µ_N = ∑_{i=1}^{N} µ(V_i) δ_{x_i} is the current approximation.

3. Using Observation 2, recenter the cells by setting x_{i,new} = x̄(µ|_{V_i}) (the center of mass of µ restricted to cell V_i).

4. Repeat steps 2 and 3 until the algorithm stabilizes (i.e., the change in the error upper bound between successive iterations is less than 0.5%).

5. Repeat steps 1 to 4 ten times for each image and choose the approximation with the lowest error.

6. We then seek to improve the approximation in regions of the image which have not been well approximated by the steps above. We do so by introducing more particles in the Voronoi cells V_i where the cost of transporting the intensities in the cell to its center of mass exceeds a certain threshold. To do so, we compute the transportation error err_i = d_W(µ|_{V_i}, µ(V_i) δ_{x_i}) in cell i. Let err be the average transportation error. We add particles to all cells where err_i > 1.7 err. More precisely, if there are any cells with err_i > 1.7 err, we split the cell with the largest error into two cells V_{i1}, V_{i2}, and then recompute the Voronoi diagram to determine the masses m_{i1} = µ(V_{i1}), m_{i2} = µ(V_{i2}) of the two new cells. We repeat this splitting process until all Voronoi cells have error less than 1.7 err. The choice of 1.7 was based on empirical evaluation with several datasets.

7. In order to reduce the number of particles in areas of an image where many particles are not needed, we merge nearby particles. Two particles are merged when doing so increases the theoretical upper bound defined in (16) by only a small amount. When particles of masses m_1 and m_2 at locations x_1 and x_2 are merged into a single particle of mass m = m_1 + m_2 located at the center of mass x = (m_1 x_1 + m_2 x_2)/(m_1 + m_2), the increase in the squared distance is bounded by

d_W^2(m_1 δ_{x_1} + m_2 δ_{x_2} + µ_rest, m δ_x + µ_rest) ≤ (m_1 m_2)/(m_1 + m_2) |x_1 − x_2|^2.

We merge the particles that add the smallest error first, and continue merging as long as d_W(µ_N, µ_merged) < 0.15 d_W(µ, µ_N), where the inequalities are used in the sense of the available upper bounds on the distances (Observation 3 and the estimate above), not the actual distances. This has the overall effect of shifting the histogram (taken over all images) of the approximation error to the right, and it ensures that the final error for each image after merging is at most about 1.15 d_W(µ, µ_N).

8. Finally, while the two steps above seek to find a good particle approximation for each image, with N particles or more, the final step is designed to make the particle approximation error more uniform across all images in a given dataset. For a set of images I_1, I_2, . . . , I_M, we estimate the average transportation error, E_avg, between each digital image and its approximation, as well as the standard deviation τ of the errors. We set E_small = E_avg − 0.5τ and E_big = E_avg + 0.5τ.
   – For images with error greater than E_big, we split particles as in step 6 until the error for those images is less than E_big.
   – For images with error smaller than E_small, we instead merge nearby particles as in step 7: we merge the particles that add the smallest error first, and continue merging until d_W(µ, µ_merged) ≥ E_small.
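The core of the algorithm (steps 1–4) and the merging bound of step 7 can be sketched as follows. This is a simplified illustration in our own notation, with assumptions to note: it samples initial particles with replacement (the paper samples without replacement), runs a fixed number of iterations instead of the 0.5% stopping criterion, and omits the splitting and dataset-wide equalization of steps 6 and 8.

```python
import random

def weighted_lloyd(pixels, masses, n_particles, iters=20, seed=0):
    """Sketch of steps 1-4: weighted sampling, then alternating Voronoi
    mass assignment (Observation 1) and recentering (Observation 2).
    Returns particle locations, masses, and the error upper bound (16)."""
    def sq(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def assign(centers):
        # step 2: Voronoi assignment -- each pixel joins its nearest center
        cells = [[] for _ in centers]
        for p, w in zip(pixels, masses):
            i = min(range(len(centers)), key=lambda k: sq(p, centers[k]))
            cells[i].append((p, w))
        return cells

    rng = random.Random(seed)
    # step 1: weighted random sampling of initial particle locations
    centers = [tuple(c) for c in rng.choices(pixels, weights=masses, k=n_particles)]
    for _ in range(iters):   # step 4: iterate (fixed count in this sketch)
        cells = assign(centers)
        # step 3: recenter each cell at its center of mass
        new_centers = []
        for c, cell in zip(centers, cells):
            t = sum(w for _, w in cell)
            if t == 0:
                new_centers.append(c)   # empty cell: leave the center in place
            else:
                new_centers.append(tuple(
                    sum(w * p[d] for p, w in cell) / t for d in range(len(c))))
        centers = new_centers
    # final masses (Observation 1) and error bound (Observation 3 / eq. 16)
    cells = assign(centers)
    cell_masses = [sum(w for _, w in cell) for cell in cells]
    err = sum(w * sq(p, c) for c, cell in zip(centers, cells) for p, w in cell)
    return centers, cell_masses, err

def merge_cost(m1, x1, m2, x2):
    """Increase in the squared-distance bound when merging two particles
    (step 7): m1*m2/(m1+m2) * |x1 - x2|^2."""
    return m1 * m2 / (m1 + m2) * sum((a - b) ** 2 for a, b in zip(x1, x2))

def merged(m1, x1, m2, x2):
    """The merged particle: total mass placed at the joint center of mass."""
    m = m1 + m2
    return m, tuple((m1 * a + m2 * b) / m for a, b in zip(x1, x2))

# toy usage: approximate a uniform 4x4 'image' with 3 particles
pixels = [(float(x), float(y)) for x in range(4) for y in range(4)]
masses = [1.0] * 16
centers, cell_masses, err = weighted_lloyd(pixels, masses, 3)
```

Since total mass is preserved at every step, the returned `cell_masses` always sum to µ(Ω), and `err` is exactly the discrete upper bound whose histogram over the dataset is balanced in step 8.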

References

1. Ambrosio, L., Gigli, N., Savare, G.: Gradient flows in metric spaces and in the space of probability measures, second edn. Lectures in Mathematics ETH Zurich. Birkhauser Verlag, Basel (2008)

2. Angenent, S., Haker, S., Tannenbaum, A.: Minimizing flows for the Monge-Kantorovich problem. SIAM J. Math. Anal. 35(1), 61–97 (electronic) (2003)

3. Barrett, J.W., Prigozhin, L.: Partial L1 Monge-Kantorovich problem: variational formulation and numerical approximation. Interfaces Free Bound. 11(2), 201–238 (2009)

4. Beg, M., Miller, M., Trouve, A., Younes, L.: Computing large deformation metric mappings via geodesic flows of diffeomorphisms. International Journal of Computer Vision 61(2), 139–157 (2005)

5. Benamou, J.D., Brenier, Y.: A computational fluid mechanics solution to the Monge-Kantorovich mass transfer problem. Numer. Math. 84(3), 375–393 (2000)


6. Bengtsson, E.: Fifty years of attempts to automate screening for cervical cancer. Med. Imaging Tech. 17, 203–210 (1999)

7. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics). Springer (2006)

8. Blum, H., et al.: A transformation for extracting new descriptors of shape. Models for the Perception of Speech and Visual Form 19(5), 362–380 (1967)

9. Boland, M.V., Murphy, R.F.: A neural network classifier capable of recognizing the patterns of all major subcellular structures in fluorescence microscope images of HeLa cells. Bioinformatics 17(12), 1213–23 (2001)

10. do Carmo, M.P.: Riemannian Geometry. Mathematics: Theory & Applications. Birkhauser Boston Inc., Boston, MA (1992). Translated from the second Portuguese edition by Francis Flaherty

11. Chefd'hotel, C., Bousquet, G.: Intensity-based image registration using earth mover's distance. In: Proceedings of SPIE, vol. 6512, p. 65122B (2007)

12. Delzanno, G.L., Finn, J.M.: Generalized Monge-Kantorovich optimization for grid generation and adaptation in Lp. SIAM J. Sci. Comput. 32(6), 3524–3547 (2010)

13. Dialynas, G.K., Vitalini, M.W., Wallrath, L.L.: Linking heterochromatin protein 1 (HP1) to cancer progression. Mutat Res 647(1-2), 13–20 (2008)

14. Fisher, R.A.: The use of multiple measurements in taxonomic problems. Annals of Eugenics 7, 179–188 (1936)

15. Gardner, M., Sprague, B., Pearson, C., Cosgrove, B., Bicek, A., Bloom, K., Salmon, E., Odde, D.: Model convolution: a computational approach to digital image interpretation. Cellular and Molecular Bioengineering 3(2), 163–170 (2010)

16. Grauman, K., Darrell, T.: Fast contour matching using approximate earth mover's distance. In: Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2004 (2004)

17. Haber, E., Rehman, T., Tannenbaum, A.: An efficient numerical method for the solution of the L2 optimal mass transfer problem. SIAM J. Sci. Comput. 32(1), 197–211 (2010)

18. Haker, S., Zhu, L., Tannenbaum, A., Angenent, S.: Optimal mass transport for registration and warping. Intern. J. Comp. Vis. 60(3), 225–240 (2004)

19. Kong, J., Sertel, O., Shimada, H., Boyer, K.L., Saltz, J.H., Gurcan, M.N.: Computer-aided evaluation of neuroblastoma on whole slide histology images: classifying grade of neuroblastic differentiation. Pattern Recognition 42, 1080–1092 (2009)

20. Ling, H., Okada, K.: An efficient earth mover's distance algorithm for robust histogram comparison. IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 840–853 (2007)

21. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

22. Loo, L., Wu, L., Altschuler, S.: Image-based multivariate profiling of drug responses from single cells. Nature Methods 4(5), 445–454 (2007)

23. Mehrotra, S.: On the implementation of a primal-dual interior point method. SIAM Journal on Optimization 2, 575–601 (1992)

24. Miller, M.I., Priebe, C.E., Qiu, A., Fischl, B., Kolasny, A., Brown, T., Park, Y., Ratnanather, J.T., Busa, E., Jovicich, J., Yu, P., Dickerson, B.C., Buckner, R.L.: Collaborative computational anatomy: an MRI morphometry study of the human brain via diffeomorphic metric mapping. Hum Brain Mapp 30(7), 2132–41 (2009)

25. Moss, T.J., Wallrath, L.L.: Connections between epigenetic gene silencing and human disease. Mutat Res 618(1-2), 163–74 (2007)

26. Orlin, J.B.: A faster strongly polynomial minimum cost flow algorithm. Operations Research 41(2), 338–350 (1993)

27. Pele, O., Werman, M.: A linear time histogram metric for improved SIFT matching. In: ECCV (2008)

28. Pele, O., Werman, M.: Fast and robust earth mover's distances. In: Computer Vision, 2009 IEEE 12th International Conference on, pp. 460–467. IEEE (2009)

29. Pincus, Z., Theriot, J.A.: Comparison of quantitative methods for cell-shape analysis. J Microsc 227(Pt 2), 140–156 (2007)

30. Rohde, G.K., Ribeiro, A.J.S., Dahl, K.N., Murphy, R.F.: Deformation-based nuclear morphometry: capturing nuclear shape variation in HeLa cells. Cytometry A 73(4), 341–50 (2008)

31. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000). DOI 10.1126/science.290.5500.2323

32. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. Intern. J. Comp. Vis. 40(2), 99–121 (2000)

33. Rueckert, D., Frangi, A.F., Schnabel, J.A.: Automatic construction of 3-D statistical deformation models of the brain using nonrigid registration. IEEE Trans. Med. Imag. 22(8), 1014–1025 (2003)

34. Shamir, L.: Automatic morphological classification of galaxy images. Monthly Notices of the Royal Astronomical Society (2009)

35. Shirdhonkar, S., Jacobs, D.: Approximate earth mover's distance in linear time. In: Proceedings of the 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR 2008 (2008)

36. Stegmann, M., Ersboll, B., Larsen, R.: FAME - a flexible appearance modeling environment. IEEE Transactions on Medical Imaging 22(10), 1319–1331 (2003)

37. Tenenbaum, J.B., de Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. Science 290(5500), 2319–2323 (2000). DOI 10.1126/science.290.5500.2319

38. Vaillant, M., Miller, M., Younes, L., Trouve, A.: Statistics on diffeomorphisms via tangent space representations. NeuroImage 23, S161–S169 (2004)

39. Villani, C.: Topics in Optimal Transportation, Graduate Studies in Mathematics, vol. 58. American Mathematical Society, Providence, RI (2003)

40. Villani, C.: Optimal Transport: Old and New, Grundlehren der Mathematischen Wissenschaften [Fundamental Principles of Mathematical Sciences], vol. 338. Springer-Verlag, Berlin (2009). DOI 10.1007/978-3-540-71050-9

41. Wang, W., Mo, Y., Ozolek, J.A., Rohde, G.K.: Penalized Fisher discriminant analysis and its application to image-based morphometry. Pattern Recognition Letters (accepted, 2011)

42. Wang, W., Ozolek, J., Rohde, G.: Detection and classification of thyroid follicular lesions based on nuclear structure from histopathology images. Cytometry Part A 77(5), 485–494 (2010)

43. Wang, W., Ozolek, J., Slepcev, D., Lee, A., Chen, C., Rohde, G.: An optimal transportation approach for nuclear structure-based pathology. IEEE Transactions on Medical Imaging 30(3), 621–631 (2011)


44. Yang, L., Chen, W., Meer, P., Salaru, G., Goodell, L., Berstis, V., Foran, D.: Virtual microscopy and grid-enabled decision support for large scale analysis of imaged pathology specimens. IEEE Trans Inf Technol Biomed (2009)