SUBMITTED TO IEEE TRANSACTIONS ON MEDICAL IMAGING 1 ...

SUBMITTED TO IEEE TRANSACTIONS ON MEDICAL IMAGING 1

Simultaneous Sparsity Model for HistopathologicalImage Representation and Classification

Umamahesh Srinivas, Student Member, IEEE, Hojjat Seyed Mousavi, Student Member, IEEE,Vishal Monga, Senior Member, IEEE, Arthur Hattel, and Bhushan Jayarao

Abstract—The multi-channel nature of digital histopathologicalimages presents an opportunity to exploit the correlated colorchannel information for better image modeling. Inspired byrecent work in sparsity for single channel image classification,we propose a new simultaneous Sparsity model for multi-channel Histopathological Image Representation and Classifi-cation (SHIRC). Essentially, we represent a histopathologicalimage as a sparse linear combination of training examples undersuitable channel-wise constraints. Classification is performed bysolving a newly formulated simultaneous sparsity-based optimiza-tion problem. A practical challenge is the correspondence ofimage objects (cellular and nuclear structures) at different spatiallocations in the image. We propose a robust locally adaptivevariant of SHIRC (LA-SHIRC) to tackle this issue. Experimentson two challenging real-world image data sets: (i) mammaliantissue images acquired by pathologists of the Animal DiagnosticsLab (ADL) at Pennsylvania State University, and (ii) humanintraductal breast lesions, reveal the merits of our proposal overstate-of-the-art alternatives. Further, we demonstrate that LA-SHIRC exhibits a more graceful decay in classification accuracyagainst the number of training images which is highly desirable inpractice where generous training per class is often not available.

Index Terms—Classification, histopathological image analysis,multichannel images, sparse representation.

I. INTRODUCTION

The advent of digital pathology [1] has ushered in anera of computer-assisted diagnosis and treatment of medicalconditions based on the analysis of medical images. Of activeresearch interest is the development of quantitative imageanalysis tools to complement the efforts of radiologists andpathologists towards better disease diagnosis and prognosis.This research thrust has been fueled by a variety of factors,including the availability of large volumes of patient-relatedmedical data, dramatic improvements in computational re-sources (both hardware and software), and algorithmic ad-vances in image processing, computer vision and machinelearning theory. An important emerging sub-class of problemsin medical imaging pertains to histopathological images. Re-cent work has identified the nascency of quantitative researchin histopathological image analysis and classification, and its

Copyright (c) 2010 IEEE. Personal use of this material is permitted.However, permission to use this material for any other purposes must beobtained from the IEEE by sending a request to [email protected].

U. Srinivas, H. S. Mousavi, and V. Monga are with the Department of Elec-trical Engineering, Pennsylvania State University, USA ([email protected]).

A. Hattel and B. Jayarao are pathologists with the Department of Veterinaryand Biomedical Sciences, Pennsylvania State University, USA.

This work has been supported by a grant from the Pennsylvania Dept. ofAgriculture, and by ONR under Grant N00014-12-1-0765.

potential for important practical applications [2]–[7]. Wholeslide digital scanners process tissue slides to generate thesedigital images. Examples of histopathological images areshown in Figs. 6 and 9. It is evident that these images carryrich structural information, making them invaluable for thediagnosis of many diseases including cancer [6], [8]–[10].

Pathologists often look for visual cues at the nuclear andcellular level in order to categorize a tissue image as eitherhealthy or diseased. Motivated by this, a variety of low-level image features have been developed based on texture,morphometric characteristics (shape and spatial arrangement)and image statistics. The gray level co-occurrence matrix byHaralick et al. [11] estimates the texture of an image interms of the distribution of co-occurring pixel intensities atspecified offset positions. Morphological image features [12]have been used in medical image segmentation for detectionof vessel-like patterns [13]. Image histograms are a popularchoice of features for medical imaging [14]. Wavelet featureshave been deployed for prostate cancer diagnosis in [15].Esgiar et al. [16] have captured the self-similarity in colontissue images using fractal-based features. Tabesh et al. [17]have combined color, texture and morphometric features forprostate cancer diagnosis. Doyle et al. [18] introduced graph-based features using Delaunay triangulation and minimumspanning trees to exploit spatial structure. In [6], a hybridclassification model that combines structural and statisticalinformation has been developed. Orlov et al. [19], [20] haverecently proposed a multi-purpose feature set that aggregatestransform domain coefficients, image statistics and textureinformation. Experimental success in many different clas-sification problems has demonstrated the versatility of thisfeature set. It must be mentioned that all the features discussedabove are applicable broadly for image analysis and have beenparticularly successful in medical imaging. For classification,these features are combined with powerful classifiers such assupport vector machines (SVMs) [14], [21] and boosting [22],[23]. A comprehensive discussion of features and classifiersfor histopathological analysis is provided in [4].

A. Motivation and Challenges

While histopathology shares some commonalities with otherpopular imaging modalities such as cytology and radiology,it also exhibits two principally different characteristics [24]that pose challenges to image analysis. First, histopathologicalimages are invariably multi-channel in nature (commonlyusing three color channels - red, green and blue (RGB)). Key


geometric information is spread across the color channels. Itis well known that color information in the hematoxylin-eosin(H&E) stained slides is essential to identify the discriminativeimage signatures of healthy and diseased tissue [7], [25],[26]. Specifically, the nuclei assume a bluish tinge due tohematoxylin, while the cytoplasmic structures and connectivetissue appear red due to eosin. As seen from Fig. 6, thereis a higher density of nuclei in diseased tissue. Typically inhistopathological image analysis, features are extracted fromeach color channel of the images [17], and the classifierdecisions based on the individual feature sets are then fusedfor classification. Alternately, only the luminance channelinformation - image edges resulting mainly from illuminationvariations - is considered [26]. The former approach ignoresthe inherent correlations among the RGB channels while thelatter strategy fails to exploit chrominance channel geometry,i.e. the edges and image textures caused by objects withdifferent chrominance.

The second challenge posed by histopathology is the relativedifficulty in obtaining good features for classification due tothe geometric richness of tissue images. Tissues from differentorgans have structural diversity and often, the objects of inter-est occur at different scales and sizes [4]. As a result, featuresare usually customized for specific classification problems,most commonly cancer of the breast and prostate.

In this paper, we address both these challenges through anovel simultaneous sparsity model inspired by recent workusing sparse representations for image classification (SRC)[27]. Given a sufficiently diverse collection of training imagesfrom each class, Wright et al. [27] conjecture that any otherimage from the same class can be expressed approximately asa linear combination of those training images. So, each testimage has a sparse representation in terms of a basis matrix ordictionary comprising training images from all classes. Thislinear additive model alleviates the challenge of designingsophisticated task-specific features. Additionally, the class-specific design of such dictionaries enables class assignmentvia a simple comparison of class-specific reconstruction errors.The robustness of sparse features to real-world imaging dis-tortions has led to their widespread use in application domainssuch as remote sensing (hyperspectral imaging [28], syntheticaperture radar [29]) and biometrics (face recognition [27],[30]). SRC has also been proposed for single-channel medicalimages, in cervigram segmentation [31], [32] and colorectalpolyp and lung nodule detection [33]. To the best of ourknowledge, ours is the first discriminative sparsity model formulti-channel histopathological images1.

B. Our Contributions

The relevance of color information for image classificationtasks has been identified previously [35], [36]. We propose anew sparsity model that recognizes and exploits the diversityof information in multiple color channels of histopathologicalimages. Specifically, our key contributions are listed next.

1Preliminary version of this work was presented at IEEE InternationalSymposium on Biomedical Imaging, 2013 [34].

1) Simultaneous sparsity model for classification. Essen-tially, our simultaneous Sparsity model for multi-channelHistopathological Image Representation and Classifica-tion (SHIRC) extends the standard SRC approach [27]–[33] by designing three color dictionaries, correspondingto the RGB channels. Each multi-channel histopatho-logical image is represented as a sparse linear combi-nation of training examples under suitable channel-wiseconstraints, which capture color correlation information.The constraints agree with intuition since a sparse linearmodel for a color image necessitates identical modelsfor each of its constituent color channels with no cross-channel contributions.

2) Novel optimization problem. Our approach considersa multi-task scenario that is qualitatively similar to thevisual classification problem addressed very recently byYuan et al. [37]. In [37], three types of image features- containing texture, shape and color information - areextracted from images and a joint sparsity model isproposed to classify the images. The joint (simultaneous)sparsity model employed in [37] and the one we develophowever have some differences. First, [37] does notconsider the problem of multi-channel or color imagemodeling. Second and more crucially, the cost functionin [37] is a sum of reconstruction error terms fromeach of the feature dictionaries which results in thecommonly seen row sparsity structure on the sparsecoefficient matrix. The resulting optimization problem issolved using the popular Accelerated Proximal Gradient(APG) method [38]. In our work however, to conformto imaging physics, we introduce color channel-specificconstraints on the structure of the sparse coefficients,which do not directly conform to row sparsity, leadingto a new optimization problem. This in turn requires amodified greedy matching pursuit approach to solve theproblem (see Appendix A).

3) Locally adaptive SHIRC: A practical solution tohandle image correspondence. As discussed earlier,feature design in medical imaging is guided by theobject-based perception of pathologists. Depending onthe type of tissue, the objects could be nuclei, cells,glands, lymphocytes, etc. Crucially it is the presence orabsence of these local objects in an image that mattersto a pathologist; their absolute spatial location mattersmuch less. As a result, the object of interest may bepresent in the test image as well as the representativetraining images, albeit at different spatial locations, caus-ing a seeming breakdown of the image-level SHIRC.This scenario can occur in practice if the acquisitionprocess is not carefully calibrated.So we infuse the SHIRC with a robust locally adaptiveflavor by developing a Locally Adaptive SHIRC (LA-SHIRC). Crucially, we rely on the pathologist’s insightto carefully select multiple local regions that containthese objects of interest from training as well as testimages and use them in our linear sparsity model insteadof the entire images. Local image features often possessbetter discriminative power than image-level features


[39]. LA-SHIRC is a well-founded instantiation of thisidea to resolve the issue of spatial correspondence be-tween objects, i.e. similar structures such as cells andnuclei present at different spatial locations in the differ-ent images. As a consequence, it obviates the need forimage registration, a challenging problem. LA-SHIRCoffers flexibility in the number of local blocks chosenper image and the size of each such block (tunable tothe size of the object).

4) Experimental insights. We present a variety of resultsfor two different data sets in Section IV. In addition tothe usual overall classification rates, we report receiveroperating characteristic (ROC) curves that help visualizea key trade-off of practical relevance to pathologists(see discussion in Section III-C). Next, we evaluatethe performance of the methods as a function of thesize of the training set. In practice the success of SRChinges on the richness and size of the training dictionary.This is an important practical concern since it is noteasy to acquire labeled ground-truth medical images inlarge numbers. A highly beneficial consequence of LA-SHIRC is the reduced burden on the number of trainingimages required. To the best of our knowledge, we arethe first to explicitly perform such an evaluation forhistopathological image classification. We believe thatthe remarkable consistency of LA-SHIRC (see Fig. 9(b))as training decreases has significant practical benefits.We carry out a thorough experimental validation of ouralgorithm on two different histopathological image datasets. The first set of images is provided by pathologistsat the Animal Diagnostics Lab (ADL), PennsylvaniaState University, and will henceforth be referred to asthe ADL data set. It contains tissue images from threemammalian organs - kidney, lung, and spleen. For eachorgan, images belonging to two categories - healthy orinflammatory - are provided. The second data set iscourtesy the Clarian Pathology Lab and Computer andInformation Science Dept., Indiana University-PurdueUniversity Indianapolis (IUPUI). The images correspondto human intraductal breast lesions (IBL) and have beenacquired according to the process described in [10].For our experiments, we consider the two well-definedcategories: usual ductal hyperplasia (UDH) and ductalcarcinoma in situ (DCIS). UDH is considered benignwhile DCIS is actionable. The ADL and IBL data setsare good examples to demonstrate the wide variability inhistopathological imagery and the consequent need foradaptive classification strategies such as the LA-SHIRC.

The remainder of this paper is organized as follows. InSection II, we review recent pioneering work in sparserepresentation-based image classification [27], which formsthe foundation for our contribution. The proposed simultane-ous sparsity model for histopathological image classificationis introduced in Section III. In Section IV, extensive experi-mental results are presented for two different histopathologicalimage sets. Section V concludes the paper.

II. BACKGROUND: SPARSE REPRESENTATION-BASEDIMAGE CLASSIFICATION (SRC)

Suppose that there are K different image classes (cor-responding to categories of medical conditions), labeledC1, . . . ,CK . Let there be Ni training samples (each in Rn) cor-responding to class Ci, i = 1, . . . ,K. It is understood that eachsample is the vectorized version of the corresponding grayscale(or single channel) image. The training samples correspondingto class Ci can be collected in a matrix DDDi ∈ Rn×Ni , and thecollection of all training samples is expressed using the matrix:

DDD = [DDD1 DDD2 . . . DDDK ], (1)

where DDD∈Rn×T , with T = ∑Kk=1 Nk. A new test sample yyy∈Rn

can be expressed as a sparse linear combination of the trainingsamples:

yyy ' DDD1ααα1 + . . .+DDDKαααK =DDDααα, (2)

where ααα is ideally expected to be a sparse vector (i.e., only afew entries in ααα are nonzero). The classifier seeks the sparsestrepresentation by solving the following problem:

ααα = argmin‖ααα‖0 subject to ‖DDDααα−yyy‖2 ≤ ε, (3)

where ‖·‖0 denotes the number of nonzero entries in the vectorand ε is a suitably chosen reconstruction error tolerance. Theproblem in (3) can be solved by greedy pursuit algorithms [40],[41]. Once the sparse vector is recovered, the identity of yyy isgiven by the minimal class-specific reconstruction residual:

Class(yyy) = argmini‖yyy−DDDδi(ααα)‖ , (4)

where δi(ααα) is a vector whose only nonzero entries are thesame as those in ααα which are associated with class Ci.

Rooted in optimization theory, the robustness of the sparsefeature to real-world image distortions like noise and occlusionhas led to its widespread application in practical classificationtasks. Modifications to (3) include relaxing the non-convex l0-term to the l1-norm [42] and introducing regularization termsto capture physically meaningful constraints [32].

In many scenarios, we have access to multiple sets ofmeasurements that capture information about the same image.The SRC model is extended to incorporate this additionalinformation by enforcing a common support set of trainingimages for the T correlated test images yyy1, . . . ,yyyT :

YYY =[yyy1 yyy2 · · · yyyT

]=[DDDααα1 DDDααα2 · · · DDDαααT

]=DDD

[ααα1 ααα2 · · · αααT

]︸︷︷︸SSS

=DDDSSS. (5)

The vectors αααi, i = 1, . . . ,T , all have non-zero entries at thesame locations, albeit with different weights, leading to therecovery of a sparse matrix SSS with only a few nonzero rows:

SSS = argmin‖YYY −DDDSSS‖F subject to ‖SSS‖row,0 ≤ K0, (6)

where ‖SSS‖row,0 denotes the number of non-zero rows of SSSand ‖·‖F is the Frobenius norm. The greedy SimultaneousOrthogonal Matching Pursuit (SOMP) [43] algorithm andconvex relaxations of the row-sparsity norm [44] have beenproposed to solve the non-convex problem in (6). Examplereal-world manifestations of this multi-variate scenario occur


Fig. 1. Color channel-specific dictionary design for SHIRC. The constituent RGB channels of the i-th sample training image occupy the i-th columns ofthe dictionaries DDDr,DDDg and DDDb respectively. Coefficient vectors αααr,αααg and αααb are color-coded to indicate the dictionary corresponding to each coefficient.Filled-in blocks indicate non-zero coefficients. The filling pattern illustrates that identical linear representation models hold for each color channel of the testimage, with possibly different weights in the coefficient vector.

in the form of spatially local pixels for hyperspectral targetclassification [28], multiple feature sets for automatic imageannotation [45], kernel features for object categorization, or asquery images for video-based face recognition [37].

III. CONTRIBUTIONS

A. SHIRC: A simultaneous Sparsity model for Histopatholog-ical Image Representation and Classification

Section II has identified the central analytical formulationunderlying the simultaneous sparsity methods in literature.Typically a single dictionary DDD is used, as in [28]. In othercases, each event is in fact characterized by multiple het-erogeneous sources, resulting in multi-task versions of SRC[37], [45]. Although different dictionaries are used for thedifferent sources, the issue of correlation among different rep-resentations of the same image is not thoroughly investigated.Our contribution in this paper is an example of multi-taskclassification, with separate dictionaries designed from theRGB channels of histopathological images.

For ease of exposition, we consider the binary classificationproblem of classifying images as either healthy or diseased.DDDh and DDDd indicate the training dictionaries of healthy anddiseased images respectively. A total of N training imagesare chosen. We represent a color image as a matrix YYY :=[yyyr yyyg yyyb] ∈ Rn×3, where the superscripts r,g,b correspondto the RGB color channels respectively. The dictionary DDDis redefined as the concatenation of three color-specific dic-tionaries, DDD :=

[DDDr DDDg DDDb

]∈ Rn×3N . Each color dictionary

DDDc,c ∈ {r,g,b}, is the concatenation of sub-dictionaries fromboth classes belonging to the c-th color channel:

DDDc := [DDDch DDDc

d ],c ∈ {r,g,b}. (7)

⇒DDD := [DDDr DDDg DDDb] = [DDDrh DDDr

d DDDgh DDDg

d DDDbh DDDb

d ]. (8)

The color dictionaries are designed to obey column correspon-dence, i.e., the i-th column from each of the color dictionaries

Fig. 2. Illustration to motivate LA-SHIRC. Shown here are four images fromthe IBL data set. The test image cannot be represented accurately as a linearcombination of the training images. However, the four local regions markedin yellow (individual cells in the tissue) have structural similarities, althoughthey are located at different spatial locations in the image.

DDDc taken together correspond to the i-th training image. Fig. 1shows the arrangement of training images into channel-specificdictionaries. A test image YYY can now be represented as a linearcombination of training samples as follows:

YYY =DDDSSS =[DDDr

h DDDrd DDDg

h DDDgd DDDb

h DDDbd

][ααα

rααα

gααα

b], (9)

where the coefficient vectors αααc ∈ R3N ,c ∈ {r,g,b}, and SSS =[αααr αααg αααb

]∈ R3N×3.

A natural question to ask at this juncture is: why do weneed a model that permits a separate dictionary per channel?A naive alternative would be to use a single color dictionary,wherein the channels of each color image are stacked into asingle column. Let xxx = [xxxT

r xxxTg xxxT

b ]T ∈ R3n denote one such

training image. Define the matrix WWW as follows:

WWW := [wrIII wgIII wbIII] ∈ Rn×3n,

where III ∈Rn×n is the identity matrix. xxx =WWWxxx = wrxxxr +wgxxxg+wbxxxb ∈ Rn is the grayscale version of xxx when the weightswr,wg,wb are chosen appropriately. Now consider a trainingdictionary AAA obtained by stacking the N vectorized color


images xxx into a 3n×N matrix:

AAA :=

DDDr

DDDg

DDDb

A test image yyy can now be written as:

yyy =AAAααα,

where ααα ∈RN is the coefficient vector. Applying the transfor-mation matrix WWW to both sides of the above equation,

WWWyyy = WWWAAAααα

⇒ yyy = AAAααα,

which is the linear representation model for the grayscale(single channel) case. Crucially, the coefficient vector ααα hasnot changed. This shows that the model which uses a singlevector per color image is identical to the grayscale imagemodel in SRC.

It remains to be justified that a separate coefficient perchannel is physically meaningful. There is high variabilityin the staining process inherent to histopathological imaging[4]. This can lead to training and test images with differentgains in the color channels. Using a coefficient per channelincorporates color balancing in the model.

A closer investigation of SSS reveals interesting characteristics:1) It is reasonable to assume that the c-th channel of the

test image (i.e. yyyc) can be represented by the linear spanof the training samples belonging to the c-th channelalone (i.e. only those training samples in DDDc). So thecolumns of SSS ideally have the following structure:

αααr =

αααr

hαααr

d000000000000

,αααg =

000000

αααgh

αααgd

000000

,αααb =

000000000000

αααbh

αααbd

,

where 000 denotes the conformal zero vector. In otherwords, SSS exhibits block-diagonal structure.

2) Each color channel representation yyyc of the test imageis in fact a sparse linear combination of the trainingsamples in DDDc. Suppose the image belongs to classh (healthy); then only those coefficients in αααc thatcorrespond to DDDc

h are expected to be non-zero.3) The locations of non-zero weights of color training

samples in the linear combination exhibit one-to-onecorrespondence across channels. If the j-th trainingsample in DDDr has a non-zero contribution to yyyr, then forc ∈ {g,b}, yyyc has non-zero contribution from the j-thtraining sample in DDDc.

B. Optimization Problem

This immediately suggests a joint sparsity model similarto (6). However, the row sparsity constraint leading to theSOMP solution is not obvious from this formulation. Instead,

Fig. 3. Local cell regions selected from the IBL data set. Top row: DCIS(actionable), bottom row: UDH (benign). Structural differences between thetwo classes are readily apparent.

we introduce a new matrix SSS′ ∈RN×3 as the transformation ofSSS with the redundant zero coefficients removed:

SSS′ =[

αααrh ααα

gh αααb

hαααr

d αααgd αααb

d

]. (10)

This is possible by first defining HHH ∈ R3N×3 and JJJ ∈ RN×3N ,

HHH =

111N 000 000000 111N 000000 000 111N

, JJJ = [IIIN IIIN IIIN ] , (11)

where 111N ∈ RN is the vector of all ones, and IIIN denotes theN-dimensional identity matrix. Now,

SSS′ = JJJ (HHH ◦SSS) , (12)

where ◦ denotes the Hadamard product, (HHH ◦SSS)i j , hi jsi j ∀ i, j.Finally, we formulate a sparsity-enforcing optimization prob-lem that is novel to the best of our knowledge:

SSS′= argmin

∥∥SSS′∥∥

row,0 subject to ‖YYY −DDDSSS‖F ≤ ε. (13)

Solving the problem in (13) presents a challenge in that astraightforward application of SOMP [43] is not possible dueto the non-invertibility of the Hadamard operator. We havedeveloped a greedy algorithmic modification of SOMP thatfares well in practice. In the interest of presenting all ourcontributions in a logically motivated sequence, the analyticaldetails of the algorithm are presented in Appendix A.

The final classification decision is made by comparing theclass-specific reconstruction errors to a threshold τ:

R(YYY ) =∥∥∥YYY −DDDhSSSh

∥∥∥2−∥∥∥YYY −DDDdSSSd

∥∥∥2

h≷d

τ, (14)

where SSSh and SSSd are matrices whose entries are those in SSSassociated with DDDh and DDDd respectively.

Shown in Fig. 5 is an illustration of the reconstructedversions of the original image, which is in fact that of acancerous (DCIS) breast lesion from the IBL data set. Thereconstructed images using training from the healthy andinflammatory classes are separately shown. It is clear thatreconstruction from the true class (DCIS in this case) moreclosely resembles the original image than the reconstructionfrom the UDH class. The reconstruction residuals for the twoclasses are 0.31 and 0.81 respectively, showing that the twoclasses are clearly separable using this metric.

Multi-class problems: In the ensuing experimental validation,we consider binary (healthy versus inflammatory/diseased)classification problems from two different data sets. In general,


Fig. 4. LA-SHIRC: Locally adaptive variant of SHIRC. The yellow boxes indicate local objects of interest such as cells and nuclei. The new dictionary DDDis built using multiple local blocks from each training image. In every test image, the local objects are classified using the simultaneous sparsity model andtheir decisions are fused for overall image-level classification.

(a) (b) (c)

Fig. 5. An illustration of class-specific reconstruction showing clear separability in the reconstruction residual. a) Original DCIS test image, b) Reconstructedimage from UDH dictionary; reconstruction error = 0.81, c) Reconstructed image from DCIS dictionary; reconstruction error = 0.31.

a more challenging problem is that of grading inflamma-tions. Our approach extends to such a K-class scenario ina straightforward manner by incorporating additional class-specific dictionaries in DDD. The classification rule is thenmodified as follows:

Class(YYY ) = arg mink=1,...,K

∥∥∥YYY −DDDδk(SSS)∥∥∥

F, (15)

where δk(SSS) is the matrix whose only non-zero entries are thesame as those in SSS associated with class Ck.

C. LA-SHIRC: Locally Adaptive SHIRC

Some histopathological image collections present a uniquepractical challenge in that the presence or absence of desiredobjects (e.g. cells, nuclei) in images is more crucial - comparedto their actual locations - for pathologists for disease diagnosis.Consequently, if these discriminative objects (sub-images) arenot in spatial correspondence in the test and training images,it would seem that SHIRC cannot handle this scenario. Fig. 2illustrates this for sample images from the IBL data set.

This issue can be handled practically to some extent bycareful pre-processing that manually segments out the objects

of interest for further processing [10]. However this approachcauses a loss in contextual information that is also crucial forpathologists to make class decisions. We propose a robust al-gorithmic modification of SHIRC, known as Locally AdaptiveSHIRC (LA-SHIRC), to address this concern.

It is well known that local image features are often moreuseful than global features from a discriminative standpoint[39]. This also conforms with the intuition behind patholo-gists’ decisions. Accordingly, we modify SHIRC to classifylocal image objects instead of the global image. Several localobjects of the same dimension (vectorized versions lie inRm,m� n) are identified in each training image based on therecommendation of the pathologist. Fig. 3 shows individualcells from the IBL data set. In fact, these obvious differencesin cell structure have been exploited for classification [10] bydesigning morphological features such as cell perimeter andratio of major-to-minor axis of the best-fitting ellipse.

The dictionary DDD in SHIRC is now replaced by a newdictionary DDD that comprises the local blocks. In Fig. 4, theyellow boxes indicate the local regions. Assuming N full-sizetraining images, the selection of B local blocks per image


(a) Healthy lung. (b) Healthy lung. (c) Inflamed lung. (d) Inflamed lung.

(e) Healthy kidney. (f) Healthy kidney. (g) Inflamed kidney. (h) Inflamed kidney.

(i) Healthy spleen. (j) Healthy spleen. (k) Inflamed spleen. (l) Inflamed spleen.

Fig. 6. Sample images from the ADL data set. Each row corresponds to tissues from one mammalian organ.

results in a training dictionary of size NB. Note that evenfor fixed N, the dictionary DDD ∈ Rm×NB has more samples (orequivalently, leads to a richer representation) than DDD ∈ Rn×N .Therefore, a test image block is expressed as a sparse linearcombination of all local image blocks from the trainingimages. B blocks are identified in every test image, and aclass decision is obtained for each such block by solving (13)but with the dictionary DDD. Finally the identity of the overallimage is decided by combining the local decisions from itsconstituent blocks.

D. Local Block Selection: A Practical Approach

Fig. 7 shows the implementation pipeline of LA-SHIRCfor any general histopathological image data set. The figureis divided into two phases: training and test. The input tothe training phase is a collection of labeled training images.We also benefit from the pathologist’s insight to identifythe discriminative local regions in each image. For example,in Fig. 3, we see the differences between the (local) cellstructures from the two classes. Our first goal is to segment outmultiple such regions from all training images. In the processof identifying the local regions, the pathologist estimates therelative size of such regions in comparison with the size ofthe whole image. To illustrate, in the IBL data set (Fig. 2),a possible estimate could be that the local regions roughlyoccupy 5% of the entire image (or say, a block of 40× 40pixels). With this information, we can choose our local blocks

using either of two strategies.

We can incorporate an informed image segmentation pre-processing step to identify the local regions. In our experi-ments on the IBL data set, we identify that the local regionscan be captured very well using a classical morphology-basedblob detection technique [12]. Using the estimate of the localregion’s size, we choose an appropriate blob dimension andidentify all such blobs (or cells) in each of the training images.Since each image in our data set contains many cells, weextract a reasonable number per training image, typicallyabout 10. More generally, we can leverage state-of-the-artsegmentation algorithms that have been designed specificallyfor certain image data sets.

Alternately, we fix the size of a rectangular block basedon the cell size estimate and adopt a blind random tilingstrategy to select multiple overlapping blocks in each trainingimage. Fig. 8 an illustration. This idea is in fact inspired bysimilar strategies employed for robust face recognition [46]and secure image hashing [47]. The goal is to uniformly selectfrom different regions of the image; choosing overlappingregions ensures that most of the image is covered. Here, thenumber of such blocks per training image is a function ofhow densely the discriminative regions are distributed in theoverall image, a decision that is again made by the pathologist.It is reasonable to expect that classification performance willimprove as the number of blocks in the test image increases,albeit possibly at the cost of more computation. While both


Fig. 7. Pipeline for practical implementation of LA-SHIRC.

Fig. 8. Random tiling strategy for selection of local blocks.

strategies are viable, preference for either is based on factorssuch as: (i) availability of custom segmentation tools for aparticular data set, and (ii) the additional computation incurredin the selection (and classification) of a larger number ofblocks to make the performance of the random tiling strategycomparable to the informed approach.

In both strategies, the dictionary DDD is constructed from

all these local regions from all training images. It must beemphasized that only the structural information about the localblocks is captured in the dictionary; where they come from(spatially) in the training images, and which training imagesthey come from, are not relevant.

The block selection process for a new test image is con-sistent with the strategy chosen for the training images. It isreasonable to believe that the relative scale of the local blocksis similar to the scale in training images - this can be calibratedfairly accurately during the image acquisition process; if not,the test images can be suitably resized. As a result, we nowhave an instantiation of our SHIRC framework but at the localblock level.Benefits of LA-SHIRC:• We represent a single local block from a test image as

a sparse linear combination of all local blocks from alltraining images. Therefore, this approach obviates theneed for image registration at the global (whole image)level, which is in itself a challenging problem.

• We classify several individual blocks from the test imageseparately and combine the decisions as discussed next.The number of local blocks in the test image should becomparable to the number used in the training process.

• The adaptive term in the name LA-SHIRC indicates theflexibility offered by the algorithm in terms of choosingthe number of blocks B and their dimension m. Objectsof interest in histopathological image analysis exist atdifferent scales [4] and the tunability of LA-SHIRCmakes it amenable to a variety of histopathological clas-sification problems. LA-SHIRC satisfactorily handles theissue of spatial correspondence of objects. Additionally,as will be demonstrated through experiments in SectionIV-D, it ensures high classification accuracy even with asmall number of (global) training images. This has highpractical relevance since generous number of traininghistopathological images per condition (healthy/diseased)may not always be available.

Decision fusion: A common way of combining local classdecisions is majority voting. Suppose YYY i, i= 1, . . . ,B, representthe B local blocks from image YYY . Then,

Class(YYY ) = maxk=1,...,K

|{i : Class(YYY i) = k}| , i = 1, . . . ,B, (16)

where | · | denotes set cardinality and Class(YYY i) is determinedby (15). This scheme suffers from the drawback that it isnot a soft measure, i.e. there is no indicator of the degree ofconfidence in the class decision. Additionally, the individualblocks are classified using class-specific residuals akin to SRC[27] and it is desirable to utilize this information for fusion.

Accordingly, we use a different approach to fuse individualdecisions based intuitively on the maximum likelihood. Let SSSibe the recovered sparse representation matrix of the block YYY i.The probability of YYY i belonging to the k-th class is defined tobe inversely proportional to the residual associated with thedictionary atoms in the k-th class:

pki = P(Class(YYY i) = k) =

1/Rki

∑Kk=1(1/Rk

i) , (17)


Fig. 9. Sample breast lesion images from the IBL data set. Top row: healthy (UDH) lesions, bottom row: cancerous (DCIS) lesions.

where Rki =

∥∥∥YYY i−DDDδδδk

(SSSi

)∥∥∥2. The identity of the test image

YYY is then given by:

Class(YYY ) = arg maxk=1,...,K

(B

∏i=1

pki

). (18)

The pki may be interpreted intuitively in terms of the

(frequentist) maximum likelihood. Essentially, the idea is thatthe class k that leads to the lowest reconstruction error for localblock i corresponds to the largest “probability” or “likelihood”among all the pk

i . The subsequent fusion step merely seeks theclass k that maximizes the product of all such “probabilities”from all the B local blocks.

E. The Role of Feature Extraction

SRC [27] was originally proposed as a method that views avectorized image as a linear combination of vectorized trainingimages. The richness (size of training set) of the training dic-tionary coupled with the low dimensions of the images leadsto an over-complete dictionary. Expectedly, the curse of di-mensionality strikes and renders the problem computationallyintensive. Nonetheless, there is a precedent for the successfulapplication of SRC directly on raw images in a variety oftasks such as face recognition [27], SAR automatic targetrecognition [29], hyperspectral image classification [48] andmany more. Our approach is a continuation of efforts in thisdirection, the novelty being in the application to histopathol-ogy. The local version of our algorithm, LA-SHIRC, involveslocal image blocks of much smaller dimension compared to theentire image. This alleviates the issue of high dimensionalityto an extent.

A more common way of handling high dimensionality is towork with image features instead of the images themselves.However, it is easier to interpret the linear representationmodel in SRC for images rather than on features. It is notapparent at first glance why a test feature should be a linear

combination of training features. Also, one of the motivationsfor our work is the relative difficulty in designing histopatho-logical image features due to the geometric richness of tissueimages. The linear representation model alleviates this diffi-culty. That said, many recent approaches have employed SRC(or extensions) on image features (see [32], [37] for example).Our classifier can be similarly employed on image featuresthat are chosen based on domain knowledge. Importantly, thestructure of the coefficient matrix remains unchanged, andthe same optimization problem is solved by Algorithm 1 inAppendix A. We present experimental evaluation of SHIRCand LA-SHIRC on features in Section IV.

IV. VALIDATION AND EXPERIMENTAL RESULTS

A. Experimental Set-Up: Image Data Sets

We compare the performance of SHIRC and LA-SHIRCagainst state-of-the-art alternatives for two challenging real-world histopathological image data sets.

ADL data set: These images are provided by pathologists atthe Animal Diagnostics Lab, Pennsylvania State University.The tissue images have been acquired from three differentmammalian organs - kidney, lung, and spleen. For each organ,images belonging to two categories - healthy or inflammatory- are provided. The H&E-stained tissues are scanned usinga whole slide digital scanner at 40x optical magnification, toobtain digital images of pixel size 4000×3000. All images aredownsampled to 100×75 pixels in an aliasing-free manner forthe purpose of computational speed-up. The algorithm in factworks at any image resolution. Example images2 are shownin Fig. 6. There are a total of 120 images for each organ, ofwhich 40 images are used for training and the rest for testing.The ground truth labels for healthy and inflammatory tissue

2While the entire data set cannot be made publicly available, sample full-resolution images can be viewed at: http://signal.ee.psu.edu/histimg.html.


are assigned following manual detection and segmentation per-formed by ADL pathologists. We present classification resultsseparately for each organ. In the experiments to follow, resultsusing LA-SHIRC on the ADL data set are not reported since asingle block (i.e. the entire image) was deemed by pathologiststo have sufficient discriminative information. Essentially, theperformance of LA-SHIRC is identical to SHIRC in this case.

It is worthwhile to briefly understand the biologicalmechanisms underlying the different conditions in theseimages. Inflammatory cell tissue in cattle is often a signof a contagious disease, and its cause and duration areindicated by the presence of specific types of white bloodcells. Inflammation due to allergic reactions, bacteria, orparasites is indicated by the presence of eosinophils. Acuteinfections are identified by the presence of neutrophils, whilemacrophages and lymphocytes indicate a chronic infection.In Fig. 6, we observe that a healthy lung is characterizedby large clear openings of the alveoli, while in the inflamedlung, the alveoli are filled with bluish-purple inflammatorycells. Similar clusters of dark blue nuclei indicate the onsetof inflammation in the other organs.

IBL data set: The second data set comprises images of humanintraductal breast lesions [10]. The images belong to either oftwo well-defined categories: usual ductal hyperplasia (UDH)and ductal carcinoma in situ (DCIS). UDH is consideredbenign and the patients are advised follow-up check-ups,while DCIS is actionable and the patients require surgicalintervention. Ground truth class labels for the images areassigned manually by the pathologists. A total of 40 patientcases - 20 well-defined DCIS and 20 UDH - are identifiedfor experiments in the manner described in [10]. Each casecontains a number of Regions of Interest (RoIs), and we havechosen a total of 120 images (RoIs), consisting of a randomlyselected set of 60 images for training and the remaining 60RoIs for test. Each RoI represents a full-size image for ourexperiments. Smaller local regions are chosen carefully withineach such RoI for LA-SHIRC as described in III-C, using aclassical morphology-based blob detection technique [12].

We compare the performance of SHIRC and LA-SHIRCwith two competing approaches:• SVM: this method combines state-of-the-art feature ex-

traction and classification. We use the collection of fea-tures from WND-CHARM [19], [20] which is knownto be a powerful toolkit of features for medical images.A support vector machine is used for decisions unlikeweighted nearest neighbor in [19] to further enhanceclassification. We pick the most relevant features forhistopathology [4], including but not limited to (colorchannel-wise) histogram information, image statistics,morphological features and wavelet coefficients fromeach color channel. The source code for WND-CHARMis made available by the National Institutes of Healthonline at: http://ome.grc.nia.nih.gov/wnd-charm/.

• SRC: the single channel sparse representation-based clas-sification approach reviewed in Section II. Specifically,we employ SRC directly on the luminance channel(obtained as a function of the RGB channels) of the

TABLE ICONFUSION MATRIX: LUNG.

Class Healthy Inflammatory MethodHealthy 0.8875 0.1125 SVM

0.7250 0.2750 SRC0.7500 0.2500 SHIRC

Inflammatory 0.3762 0.6238 SVM0.2417 0.7583 SRC0.1500 0.8500 SHIRC

TABLE IICONFUSION MATRIX: KIDNEY.


0.8750 0.1250 SRC0.8250 0.1750 SHIRC


histopathological images, as proposed initially for facerecognition and applied widely thereafter.

For results on the IBL data set, we also directly reportcorresponding numbers from [10] - the multiple-instancelearning (MIL) algorithm - which is a full image analysis andclassification system developed custom for the IBL data set.

In supervised classification, it is likely that some particu-larly well-chosen training sets can lead to high classificationaccuracy. In order to mitigate this issue of selection bias,we perform 10 different trials of each experiment. In eachtrial, we randomly select a set of training images – all resultsreported are the average of the classification accuracies fromthe individual trials.

B. Validation of Central Idea: Overall Classification Accuracy

First, we provide experimental validation of the centralhypothesis of this paper: that exploiting color information in aprincipled manner through simultaneous sparsity models leadsto better classification performance over existing techniquesfor histopathological image classification. To this end, wepresent overall classification accuracy for each of the threeorgans from the ADL data set, in the form of bar graphs inFig. 10(a). SHIRC outperforms SVM and SRC in each ofthe three organs, thereby confirming the merits of our multi-channel simultaneous sparsity model in (13). The selectionof application-specific features coupled with the inclusionof features from the RGB channels ensures that the SVMclassifier performs competitively, particularly for the lung.

A similar experiment using the full-size images from theIBL data set illustrates the variability in histopathologicalimagery. Each image in the data set contains multiple cellsat different spatial locations, as seen in Fig. 9. SHIRC is notdesigned to handle this practical challenge. The bar graph inFig. 10(b) shows that the SVM classifier and the systemicMIL approach in [10] offer the best classification accuracy.This is not surprising because MIL [10] incorporates elaboratesegmentation and pre-processing followed by feature extrac-tion strategies customized to the acquired set of images. Thisexperimental scenario occurs frequently enough in practice andserves as our motivation to develop LA-SHIRC.


(a) ADL data set. (b) IBL data set.

Fig. 10. Bar graphs indicating the overall classification accuracies of the competing methods.

TABLE IIICONFUSION MATRIX: SPLEEN.


0.7083 0.2917 SRC0.6500 0.3500 SHIRC


TABLE IVCONFUSION MATRIX: INTRADUCTAL BREAST LESIONS.

Class UDH DCIS MethodUDH 0.8636 0.1364 SVM

0.6800 0.3200 SRC0.6818 0.3182 SHIRC0.9333 0.0667 LA-SHIRC

DCIS 0.0909 0.9091 SVM0.4400 0.5600 SRC0.3600 0.6400 SHIRC0.1000 0.9000 LA-SHIRC

C. Detailed Results: Confusion Matrices and ROC Curves

Next, we present a more elaborate interpretation of classi-fication performance in the form of confusion matrices andROC curves. Each row of a confusion matrix refers to theactual class identity of test images and each column indicatesthe classifier output.

Tables I-III show the mean confusion matrices for the ADLdata set. In continuation of trends from Fig. 10(a), SHIRCoffers the best disease detection accuracy - a quantitativemetric of high relevance to pathologists - for each organ, whilemaintaining high classification accuracy for healthy imagestoo. An interesting observation can be made from Table III.The SVM classifier reveals a tendency to classify the diseasedtissue images much more accurately than the healthy tissues.In other words, there is a high false alarm rate (healthy imagemistakenly classified as inflammatory) associated with theSVM classifier. SHIRC however offers a more consistent class-specific performance, resulting in the best overall performance.The corresponding results using LA-SHIRC are identical toSHIRC and hence not shown, since a single block (i.e. theentire image) was deemed by pathologists to have sufficient

TABLE VFALSE ALARM PROBABILITY FOR FIXED DETECTION RATE.

Images Fixed rate False alarm rateof detection SVM SRC SHIRC LA-SHIRC

Lung (ADL) 0.15 0.71 0.42 0.26 0.26Kidney (ADL) 0.15 0.50 0.27 0.22 0.22Spleen (ADL) 0.15 0.45 0.40 0.33 0.33

IBL 0.10 0.17 0.69 0.65 0.10

discriminative information.Table IV shows the mean confusion matrix for the IBL

data set. SHIRC provides an average classification accuracyof 66.09%, in comparison with about 87.9% using the MILapproach [10]. However, LA-SHIRC results in a significantimprovement in performance, even better than the rates re-ported using SVM, MIL or SRC. The informed version ofLA-SHIRC as described in Section III-C was implementedfor the results in Table IV (and the ROC curves in Fig. 10)using - a pathologist recommended - 9 local blocks per image.

It is noteworthy that a pre-processing stage involving carefulimage segmentation is performed prior to feature extraction inMIL [4], implying that MIL is representative of state-of-the-artclassification techniques using local image information.

Typically in medical image classification problems, pathol-ogists desire algorithms that reduce the probability of miss(classifying diseased image as healthy) while also ensuringthat the false alarm rate remains low. However, there is atrade-off between these two quantities, conveniently describedusing a receiver operating characteristic (ROC) curve. Fig.11 shows the ROC curves for the ADL and IBL data sets.The lowest curve (closest to the origin) has the best overallperformance and the optimal operating point minimizes thesum of the miss and false alarm probabilities. In Figs. 11(a)-(c), SHIRC offers the best trade-off. In Fig. 11(d), the LA-SHIRC outperforms SVM, and both methods are much betterthan SRC and SHIRC.Remarks: Note that ROCs for MIL [10] could not be reportedbecause the image analysis and classification system in [10]has a variety of pre-processing, segmentation and other imageprocessing and classification steps which makes exact repro-duction impossible in the absence of publicly available code.


(a) Lung (ADL). (b) Kidney (ADL).

(c) Spleen (ADL). (d) IBL.

Fig. 11. Receiver operating characteristic (ROC) curves for different organs, methods, and data sets.

Also note that in Fig. 11(a)-(c), the ROC curve for LA-SHIRChas not been reported because SHIRC and LA-SHIRC areidentical; the pathologists recommended using a single block,i.e. the entire image, that captures discriminative information.

Depending on the inherent degree of difficulty in classifyinga particular image set and the severity of the penalty formisclassifying a diseased image3, a pathologist can choose anacceptable probability of miss and corresponding false alarmrate for each method. Table V shows that for each organ inthe ADL data set, a higher false alarm must be tolerated withthe SVM method, compared to SRC and SHIRC, in order tomaintain a fixed rate of miss. For the IBL data set, the LA-SHIRC incurs the lowest false alarm rate to achieve a missrate of 10%.

D. Performance vs. Size of Training Image Set

This experiment offers new insight into the practical perfor-mance of our algorithms. Real-world classification tasks oftensuffer from the lack of availability of large training sets. We

3For example, DCIS requires immediate surgical attention, while a mildviral infection may only prolong for a few more days if not diagnosed early.

present a comparison of overall classification accuracy as afunction of the training set size for the different methods.

In Fig. 12(a), overall classification accuracy is reported forADL data set (kidney) corresponding to five different trainingscenarios. The case of 20 training images per class pertains tolow training, while 40 training images per class representsadequate training for this data set. As before, comparisonis made against the single channel SRC and state-of-the-artfeature extraction plus SVM classifier. Unsurprisingly, all threemethods suffer in performance as training is reduced.

Fig. 12(b) reports analogous results for the IBL data set.Here the regime of low training is defined by 20 imagesper class, while the adequate training scenario is capturedby 60 images per class. Analyzing the results in Fig. 12(b)for the IBL data set, a more interesting trend reveals itself.As discussed before in Section III-C, LA-SHIRC can lead toricher dictionaries made out of local image blocks even as thenumber of training images is not increased. This allows LA-SHIRC to perform extremely well even under low training- offering about 90% accuracy - as is evident from Fig.12(b). This benefit however comes at the cost of increasedcomputational complexity at the time of inference because thedictionary size (number of columns) is significantly increased


(a) Kidney (ADL). (b) IBL.

Fig. 12. Overall classification accuracy as a function of training set size for different number of training images per class.

Fig. 13. Overall classification accuracy as a function of the number of localblocks used for decision-making in LA-SHIRC. The results are reported forthe IBL data set.

in LA-SHIRC vs. SHIRC.

E. Performance vs. Number of Local Blocks

In Section III-D, we propose two different strategies toselect the local blocks that capture discriminative information.For each strategy - informed and blind selection - we selectdifferent numbers of local blocks from the test images, classifythem separately according to the pipeline in Fig. 7, and fusethe decisions using the technique discussed in Section III-D.

For the results presented so far, we have used 9 well-selected local blocks for the IBL data set. Fig. 13 shows thevariation in classification performance of LA-SHIRC as thenumber of local blocks is increased, both for informed andblind selection. We use 20 training images per class, which is

in fact representative of low training as discussed in SectionIV-D. Two significant trends emerge from this figure. First,unsurprisingly informed block selection leads to better perfor-mance even with a smaller number of blocks. The performanceof the blind selection strategy is poor when a small numberof blocks is chosen. However, when sufficiently large numberof blocks are included for decision-making, its performanceis very similar to that of informed selection. The benefit ofblind selection is that it can potentially circumvent the needfor elaborate segmentation/block detection techniques albeit atthe cost of additional computation. Second, we observe that inthe informed selection, the performance only improves slightlyas the number of local blocks is increased beyond a certainnumber (in this case 9). This is also intuitively satisfyingbecause informed selection is based on explicit pathologistinput in terms of both size and number of blocks necessaryfor adequately good discrimination.

F. Evaluation: Classification of Image Features

In this experiment, we evaluate SRC and SHIRC on a goodchoice of histopathological image features obtained from [19].The results are shown in Fig. 14 and Table VI. In comparisonwith Table II, it can be seen that the classification accuracyfor SHIRC with features is in fact inferior to the classificationaccuracy on the raw images.This shows that feature design forhistopathology is a challenging task. It also confirms that ourproposed SHIRC alleviates this burden on feature selectioneffectively.

G. Runtime Comparison

It is well-known that SRC (and its extensions) have a(computationally) trivial training stage - stacking the vec-tors into a dictionary - unlike the SVM, which involvesthe solution of a quadratic program to identify the supportvectors. Significantly, this training stage is entirely performed


Fig. 14. ROC curve for Kidney, and for different methods

TABLE VICONFUSION MATRIX: KIDNEY. CLASSIFICATION RATES ON FEATURES


0.6925 0.3075 SRC0.8171 0.1829 SHIRC

Inflmmatory 0.2812 0.7188 SVM0.3036 0.6964 SRC0.2000 0.8000 SHIRC

offline. When a previously unseen vector is being classifiedin the test stage, SRC solves the optimization problem in Eq.(3) in realtime via greedy approaches [40], [41] or convexrelaxations. The classification for SVM, on the other hand,amounts to evaluating a linear combination of a small set ofinner products (with the support vectors). In our comparison,we only report the time taken to classify test vectors for SRC,SHIRC and LA-SHIRC. The results are obtained by runningMATLAB (version 2012a) on a 64-bit Windows 7 systemequipped with Intel(R) Core i7-2600 3.4 GHz processor and8 GB RAM.

SRC (using the greedy basis pursuit) takes about 0.13seconds on average, while SHIRC takes about 0.55 seconds toclassify a single test image. Assuming that the classification ofeach local block is performed in sequence on a single core, itis not surprising that the performance of LA-SHIRC increaseslinearly as the number of local blocks used. The time takento classify a single local block is about 0.35 seconds. Thisis lesser than the SHIRC runtime since the dimension of thelocal block is typically much smaller in comparison to thesize of the entire image. For three and nine local blocks, theLA-SHIRC runtime is about 0.48 seconds and 1.09 secondsrespectively.

There is a trade-off between accuracy and computationalrequirements. The upside of LA-SHIRC is the robustnessin classification performance, even with a small number oftraining samples, which is more essential than fast realtimeoperation in histopathological imaging applications.

H. Outlier Rejection

Sometimes in histopathological image classification, an im-age may show up from a hitherto unseen class. Thus it isdesirable for a classification algorithm to exhibit the propertyof outlier rejection. The class decision fusion in LA-SHIRC,shown in (18), naturally leads to a metric to identify outliers.The overall class decision is the class that maximizes theproduct of the individual class “likelihoods.” We flag a testimage as an outlier if this product is less than a well-chosenthreshold τ.

A more effective way of checking for outliers is inspiredby the sparsity concentration index (SCI) in SRC [27]. As anatural extension of this metric for the joint sparsity problem,we use a joint version of SCI as follows:

JSCI(SSS′) :=K maxk ‖δk(SSS′)‖row,0/‖SSS′‖row,0−1

K−1, (19)

where K is the number of classes. This index takes on valuesin the range [0,1]. A value of JSCI(SSS′) close to 1 indicates thatthe test sample belongs to one of the trained classes, whereasa value close to 0 indicates an outlier.

In this experiment, we build our dictionary from two classesof liver tissue images - healthy and inflammatory - and test foroutlier rejection using a third class (necrotic). These imageshave been acquired as part of the ADL data set. We trainedthe dictionary with 20 images per class and tested over 40outlier test images. We chose a threshold of 0.13 which isobtained experimentally. Then if JSCI value is greater thanthis threshold it means the test sample is indeed from one ofthe trained classes, otherwise it means an outlier. Using theJSCI results in outlier rejection rates of about 68% in SRCand 75% in SHIRC. Similar trends can be seen when we usea threshold on the product of probabilities in (18).

I. Summary of Results and Reproducibility

Here, we summarize the key messages from the variousexperiments described in Section IV-B-IV-F. Our first aim is tovalidate our central hypothesis, that a multi-channel simultane-ous sparsity model can be used to represent and classify colorhistopathological images. Accordingly, we demonstrate theimprovements offered by our proposed approach (SHIRC) overstate-of-the-art alternatives. While the expected classificationtrends are seen for the ADL data set, the IBL data set presentsa challenging scenario where the SHIRC method, if applieddirectly, could lead to worse performance than a traditionalSVM classifier. We then show more elaborate experimentalresults, in the form of confusion matrices and ROC curves,to demonstrate that LA-SHIRC, the locally adaptive variantof our SHIRC, indeed outperforms all other techniques bycreating richer dictionaries that are synthesized by combininglocal image blocks. A new experimental insight is gainedin that LA-SHIRC offers high classification accuracy evenwith a limited number of training images. Next, we showthe performance of LA-SHIRC as a function of the numberof local blocks to highlight the complementary benefits ofthe local block selection strategies. Finally, we discuss thefollowing issues of practical interest: the role of image features


for classification, a comparison of algorithm runtimes and theability of the proposed algorithms to reject images from unseenclasses.

In order to facilitate the use of our proposed algorithmsfor medical image classification and other multi-variate/multi-task classification problems, the MATLAB code for the algo-rithms and all experiments described here is posted online at:http://signal.ee.psu.edu/histimg.html.

V. DISCUSSION AND CONCLUSION

A. Summary

In this paper, we have proposed a simultaneous sparsitymodel for histopathological image representation and classi-fication. The central idea of our approach is to exploit thecorrelations among the red, green and blue channels of thecolor images in a sparse linear model setting with attendantcolor channel constraints. We formulate and solve a newsparsity-based optimization problem. We also introduce arobust locally adaptive version of the simultaneous sparsitymodel to address the issue of correspondence of local imageobjects located at different spatial locations. This modificationresults in benefits that have significant practical relevance: wedemonstrate that the sparsity model for classification can workeven under limited training if local blocks are chosen carefully.

B. Future Work

A natural future direction for this work is to deploy SHIRC(and LA-SHIRC) widely as an important diagnostic tool inexisting histopathological image analysis systems such as [10].Particular problems such as grading in histopathology [9],[17]and multimodal fusion for disease diagnosis [3] can beinvestigated using our propsoed techniques.

Sparse representation-based image classification is an areaof ongoing research interest, and here we identify some con-nections to our work in published literature. Our frameworkcan be generalized for any multi-variate/multi-task classifica-tion problem [49] by simply including training from thosetasks as new sub-dictionaries. Recent work in multi-taskclassification has explored the idea of sparse models on imagefeatures [37]. Admittedly, the sparse linear model may notbe justifiable for all types of histopathological classificationproblems. One way of incorporating non-linear sparse modelsis to consider the sparse model in a feature space inducedby a kernel function [48]. Recent work has focused attentionon solving the costly sparsity optimization problem more ef-fectively [50]–[52]. We believe a deeper investigation towardsefficient solutions to our modified optimization problem is aworthwhile research pursuit.

ACKNOWLEDGMENT

The authors thank Prof. Murat Dundar of IUPUI for pro-viding the breast lesion images used in our experiments.

APPENDIX AA GREEDY PURSUIT APPROACH TO MULTI-TASK

CLASSIFICATION

Notation: Let yyyi ∈ Rm, i = 1, . . . ,T be T different represen-tations of the same physical event, which is to be classifiedinto one of K different classes. Let YYY := [yyy1 . . . yyyT ] ∈ Rm×T .Assuming n training samples/events in total, we design Tdictionaries DDDi ∈ Rm×n, i = 1, . . . ,T , corresponding to theT representations. We define a new composite dictionaryDDD := [DDD1 . . . DDDT ] ∈ Rm×nT . Further, each dictionary DDDi isrepresented as the concatenation of the sub-dictionaries fromall classes corresponding to the i-th representation of the event:

DDDi := [DDD1i DDD2

i . . . DDDKi ], (20)

where DDD ji represents the collection of training samples for

representation i that belong to the j-th class. So, we have:

DDD := [DDD1 . . . DDDT ] = [DDD11 DDD2

1 . . . DDDK1 . . . DDD1

T DDD2T . . . DDDK

T ].(21)

A test event YYY can now be represented as a linear combi-nation of training samples as follows:

YYY = [yyy1 . . . yyyT ] =DDDSSS

=[DDD1

1 DDD21 . . . DDDK

1 . . . DDD1T DDD2

T . . . DDDKT][ααα1 . . . αααT ] ,

where the coefficient vectors αααi ∈ RnT , i = 1, . . . ,T , and SSS =[ααα1 . . . αααT ] ∈ RnT×T .

Since SSS obeys column correspondence, we introduce a newmatrix SSS′ ∈ Rn×T as the transformation of SSS with the zerocoefficients removed,

SSS′ =

ααα11 . . . ααα1

i ααα1T

......

......

...αααK

1 . . . αααKi . . . αααK

T

,where ααα

ji refers to the sub-vector extracted from αααi that

corresponds to coefficients from the j-th class. Note that, inthe i-th column of SSS′, only the coefficients corresponding toDDDi are retained (for i = 1, . . . ,T ).

We can now apply row-sparsity constraints similar to the ap-proach in [43]. Our modified optimization problem becomes:

SSS′= argmin

SSS′

∥∥SSS′∥∥

row,0 subject to ‖YYY −DDDSSS‖F ≤ ε, (22)

for some tolerance ε> 0. We minimize the number of non-zerorows, while the constraint guarantees a good approximation.

The matrix SSS can be transformed into SSS′ by introducingmatrices HHH ∈ RnT×T and JJJ ∈ Rn×nT ,

HHH = diag [111 111 . . . 111] ,JJJ = [IIIn IIIn . . . IIIn] ,

where 111 ∈ Rn is the vector of all ones, and IIIn denotes the n-dimensional identity matrix. Finally, we obtain SSS′ = JJJ (HHH ◦SSS),where ◦ denotes the Hadamard product, (HHH ◦SSS)i j , hi jsi j forall i, j. Eq. (22) represents a hard optimization problem dueto presence of the non-invertible transformation from SSS to SSS′.We bypass this difficulty by proposing a modified version ofthe SOMP algorithm for the multi-task multivariate case.

Recall that the original SOMP algorithm gives K distinctatoms (assuming K iterations) from a dictionary DDD that best


Algorithm 1 SOMP for multi-task multivariate sparserepresentation-based classificationInput: Dictionary DDD, signal matrix YYY , number of iterations K

Initialization: residual RRR0 =YYY , index set Λ0 = φ, iterationcounter k = 1while k ≤ K do

(1) Find the index of the atom that best approximates allresiduals:λi,k = arg max

j=1,...,n∑

Tq=1 wq

∥∥RRRtk−1dddq, j

∥∥p , p≥ 1

(2) Update the index set Λi,k = Λi,k−1⋃{

λi,k}, i =

1, . . . ,T(3) Compute the orthogonal projector pppi,k =(

DDDtΛi,k

DDDΛi,k

)−1DDDt

Λi,kyyyi, for i = 1, . . . ,T , where

DDDΛi,k ∈ Rn×k consists of the k atoms in DDDi indexedin Λi,k(4) Update the residual matrix RRRk = YYY −[DDDΛ1,k ppp1,k . . . DDDΛT,k pppT,k

](5) Increment k: k← k+1

end whileOutput: Index set Λi = Λi,K , i = 1, . . . ,T ; sparse representa-

tion SSS′

whose non-zero rows indexed for each representa-tion by Λi, i, i = 1, . . . ,T , are the K rows of the matrix(

DDDtΛi,K

DDDΛi,K

)−1DDDt

Λi,KYYY .

represent the data matrix YYY . In every iteration k, SOMPmeasures the residual for each atom in DDD and creates anorthogonal projection with maximal correlation. Extending thisto the multi-task setting, for every representation i, i= 1, . . . ,T ,we can identify the index set that gives the highest correlationwith the residual at the k-th iteration as follows:

λi,k = arg maxj=1,...,n

T

∑q=1

wq∥∥RRRt

k−1dddq, j∥∥

p , p≥ 1,

where wq denotes the weight (confidence) assigned to the q-th representation, dddq, j represents the j-th column of DDDq,q =1, . . . ,T , and the superscript (·)t indicates the matrix transcriptoperator. After finding λi,k, we modify the index set to:

Λi,k = Λi,k−1⋃

λi,k, i = 1, . . . ,T.

Thus, by finding the index set for the T distinct representa-tions, we can create an orthogonal projection with each of theatoms in their corresponding representations. The algorithm issummarized in Algorithm 1.

REFERENCES

[1] A. J. Mendez, P. G. Tahoces, M. J. Lado, M. Souto, and J. J. Vidal,“Computer-aided diagnosis: Automatic detection of malignant masses indigitized mammograms,” Med Phys., vol. 25, pp. 957–964, June 1998.

[2] L. E. Boucheron, “Object- and spatial-level quantitative analysis ofmultispectral histopathology images for detection and characterizationof cancer,” Ph.D. dissertation, Dept. of ECE, University of California,Santa Barbara, 2008.

[3] A. Madabhushi, “Digital pathology image analysis: Opportunities andchallenges,” Imaging Med., vol. 1, no. 1, pp. 7–10, 2009.

[4] M. N. Gurcan, et al., “Histopathological image analysis: A review,”IEEE Rev. Biomed. Eng., vol. 2, pp. 147–171, 2009.

[5] H. Fox, “Is H&E morphology coming to an end?” J. Clin. Pathol.,vol. 53, pp. 38–40, 2000.

[6] E. Ozdemir and C. Gunduz-Demir, “A hybrid classification model fordigital pathology using structural and statistical pattern recognition,”IEEE Trans. Med. Imag., vol. 32, no. 2, pp. 474–483, Feb. 2013.

[7] M. Gavrilovic, et al., “Blind color decomposition of histological im-ages,” IEEE Trans. Med. Imag., vol. 32, no. 6, pp. 983–994, June 2013.

[8] G. Alexe, et al., “Towards improved cancer diagnosis and prognosisusing analysis of gene expression data and computer aided imaging,”Exp. Biol. Med., vol. 234, pp. 860–879, 2009.

[9] A. Basavanhally, et al., “Computerized image-based detection and grad-ing of lymphocytic infiltration in HER2+ breast cancer histopathology,”IEEE Trans. Biomed. Eng., vol. 57, no. 3, pp. 642–653, 2010.

[10] M. M. Dundar, et al., “Computerized classification of intraductal breastlesions using histopathological images,” IEEE Trans. Biomed. Eng.,vol. 58, no. 7, pp. 1977–1984, 2011.

[11] R. M. Haralick, K. Shanmugam, and I. Dinstein, “Textural features forimage classification,” IEEE Trans. Syst., Man, Cybern., vol. SMC-3,no. 6, pp. 610–621, 1973.

[12] J. Serra, Image Analysis and Mathematical Morphology. AcademicPress, 1982.

[13] F. Zana and J.-C. Klein, “Segmentation of vessel-like patterns usingmathematical morphology and curvature evaluation,” IEEE Trans. ImageProcessing, vol. 10, no. 7, pp. 1010–1019, 2001.

[14] O. Chapelle, P. Haffner, and V. N. Vapnik, “Support vector machinesfor histogram-based image classification,” IEEE Trans. Neural Networks,vol. 10, no. 5, pp. 1055–1064, 1999.

[15] A. Wetzel, et al., “Evaluation of prostate tumor grades by content basedimage retrieval,” in Proc. SPIE, vol. 3584, 1999, pp. 244–252.

[16] A. Esgiar, R. Naguib, B. Sharif, M. Bennett, and A. Murray, “Fractalanalysis in the detection of colonic cancer images,” IEEE Trans. Inform.Technol. Biomed., vol. 6, no. 1, pp. 54–58, 2002.

[17] A. Tabesh, et al., “Multifeature prostate cancer diagnosis and gleasongrading of histological images,” IEEE Trans. Med. Imag., vol. 26, no. 10,pp. 1366–1378, 2007.

[18] S. Doyle, S. Agner, A. Madabhushi, M. Feldman, and J. Tomaszewski,“Automated grading of breast cancer histopathology using spectralclustering with textural and architectural image features,” in Proc. IEEEInt. Symp. Biomed. Imag., 2008, pp. 496–499.

[19] N. Orlov, et al., “WND-CHARM: Multi-purpose image classificationusing compound image transforms,” Pattern Recogn. Lett., vol. 29,no. 11, pp. 1684–1693, 2008.

[20] L. Shamir, et al., “Wndchrm - an open source utility for biological imageanalysis,” Source Code for Biology and Medicine, vol. 3, no. 13, 2008.

[21] V. N. Vapnik, The nature of statistical learning theory. New York,USA: Springer, 1995.

[22] Y. Freund and R. E. Shapire, “A short introduction to boosting,” J. Jap.Soc. Artificial Intell., vol. 14, no. 5, pp. 771–780, Sept. 1999.

[23] A. Basavanhally, et al., “A boosted classifier for integrating multiplefields of view: Breast cancer grading in histopathology,” in Proc. IEEEInt. Symp. Biomed. Imag., 2011, pp. 125–128.

[24] J. D. Hipp, A. Fernandez, C. C. Compton, and U. J. Balis, “Why apathology image should not be considered as a radiology image,” J.Pathol. Inform., 2011, 2:26.

[25] S. Naik, et al., “Automated gland and nuclei segmentation for grading ofprostate and breast cancer histopathology,” in ISBI Workshop Comput.Histopathology, 2008, pp. 284–287.

[26] O. Sertel, et al., “Histopathological image analysis using model-basedintermediate representations and color texture: Follicular lymphomagrading,” J. Signal Processing Syst., vol. 55, no. 1-3, pp. 169–183, 2009.

[27] J. Wright, A. Y. Yang, A. Ganesh, S. Sastry, and Y. Ma, “Robustface recognition via sparse representation,” IEEE Trans. Pattern Anal.Machine Intell., vol. 31, no. 2, pp. 210–227, Feb. 2009.

[28] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral imageclassification using dictionary-based sparse representation,” IEEE Trans.Geosci. Remote Sensing, vol. 49, no. 10, pp. 3973–3985, Oct. 2011.

[29] H. Zhang, N. M. Nasrabadi, Y. Zhang, and T. S. Huang, “Multi-viewautomatic target recognition using joint sparse representation,” IEEETrans. Aerosp. Electron. Syst., vol. 48, no. 3, pp. 2481–2497, July 2012.

[30] A. Wagner, et al., “Towards a practical face recognition system: Robustalignment and illumination by sparse representation,” IEEE Trans.Pattern Anal. Machine Intell., vol. 34, no. 2, pp. 372–386, Feb. 2012.

[31] S. Zhang, J. Huang, D. Metaxas, W. Wang, and X. Huang, “Discrimina-tive sparse representations for cervigram image segmentation,” in Proc.IEEE Int. Symp. Biomed. Imag., 2010, pp. 133–136.

[32] Y. Yu, et al., “Group sparsity based classification for cervigram segmen-tation,” in Proc. IEEE Int. Symp. Biomed. Imag., 2011, pp. 1425–1429.


[33] M. Liu, L. Lu, X. Ye, S. Yu, and M. Salganicoff, “Sparse classificationfor computer aided diagnosis using learned dictionaries,” in Proc. Med.Image Comput. Computer Assisted Intervention, vol. 6893, 2011, pp.41–48.

[34] U. Srinivas, et al., “SHIRC: A simultaneous Sparsity model forHistopathological Image Representation and Classification,” in Proc.IEEE Int. Symp. Biomed. Imag., 2013.

[35] Z. Liu, J. Yang, and C. Liu, “Extracting multiple features in discriminantcolor space for face recognition,” IEEE Trans. Image Processing, vol. 19,no. 9, pp. 2502–2509, 2010.

[36] S.-J. Wang, J. Yang, N. Zhang, and C.-G. Zhou, “Tensor discriminantcolor space for face recognition,” IEEE Trans. Image Processing, vol. 20,no. 9, pp. 2490–2501, 2011.

[37] X.-T. Yuan, X. Liu, and S. Yan, “Visual classification with multitask jointsparse representation,” IEEE Trans. Image Processing, vol. 21, no. 10,pp. 4349–4360, Oct. 2012.

[38] Y. Nesterov, “Gradient methods for minimizing composite objectivefunction,” CORE, Catholic Univ. Louvain, Louvain-la-Neuve, Belgium,Tech. Rep. 2007/076, 2007.

[39] J. Zou, Q. Ji, and G. Nagy, “A comparative study of local matchingapproach for face recognition,” IEEE Trans. Image Processing, vol. 16,no. 10, pp. 2617–2628, 2007.

[40] J. Tropp and A. Gilbert, “Signal recovery from random measurementsvia orthogonal matching pursuit,” IEEE Trans. Inform. Theory, vol. 53,no. 12, pp. 4655–4666, 2005.

[41] W. Dai and O. Milenkovic, “Subspace pursuit for compressive sensingsignal reconstruction,” IEEE Trans. Inform. Theory, vol. 55, no. 5, pp.2230–2249, 2009.

[42] D. L. Donoho, “For most large underdetermined systems of linearequations the minimal l1-norm solution is also the sparsest solution,”Comm. Pure Appl. Math., vol. 59, no. 6, pp. 797–829, 2006.

[43] J. A. Tropp, A. C. Gilbert, and M. J. Strauss, “Algorithms for simulta-neous sparse approximation. Part I: Greedy pursuit,” Signal Processing,vol. 86, pp. 572–588, Apr. 2006.

[44] J. A. Tropp, “Algorithms for simultaneous sparse approximation. PartII: Convex relaxation,” Signal Processing, vol. 86, pp. 589–602, Apr.2006.

[45] S. Zhang, J. Huang, H. Li, and D. N. Metaxas, “Automatic imageannotation and retrieval using group sparsity,” IEEE Trans. Syst., Man,Cybern. B, vol. 42, no. 32, pp. 838–849, June 2012.

[46] Y. Chen, T. T. Do, and T. D. Tran, “Robust face recognition usinglocally adaptive sparse representation,” in Proc. IEEE Int. Conf. ImageProcessing, 2010, pp. 1657–1660.

[47] V. Monga and K. Mihcak, “Robust and secure image hashing vianon-negative matrix approximations,” IEEE Trans. Inform. ForensicsSecurity.

[48] Y. Chen, N. M. Nasrabadi, and T. D. Tran, “Hyperspectral imageclassification using dictionary-based sparse representation,” IEEE Trans.Geosci. Remote Sensing, vol. 51, no. 1, pp. 217–231, Jan. 2013.

[49] G. Obozinski, B. Taskar, and M. I. Jordan, “Joint covariate selectionand joint subspace selection for multiple classification problems,” Stat.Comput., vol. 20, no. 2, pp. 231–252, 2009.

[50] A. M. Bruckstein, D. L. Donoho, and M. Elad, “From sparse solutions ofsystems of equations to sparse modeling of signals and images,” SIAMReview, vol. 51, no. 1, pp. 34–81, 2009.

[51] A. Y. Yang, A. Ganesh, S. S. Sastry, and Y. Ma, “Fast l1-minimizationalgorithms and an application in robust face recognition: A review,” inProc. IEEE Int. Conf. Image Processing, 2010, pp. 1849–1852.

[52] V. Shia, A. Y. Yang, S. S. Sastry, A. Wagner, and Y. Ma, “Fast l1-minimization and parallelization for face recognition,” in Proc. IEEEAsilomar Conf., 2011, pp. 1199–1203.

SUBMITTED TO IEEE TRANSACTIONS ON MEDICAL IMAGING 1 ...

Documents

Transcript of SUBMITTED TO IEEE TRANSACTIONS ON MEDICAL IMAGING 1 ...