Supervised Multi-View Canonical Correlation Analysis: Fused Multimodal Prediction of Disease Diagnosis and Prognosis

Asha Singanamalli a, Haibo Wang a, George Lee a, Natalie Shih b, Mark Rosen b, Stephen Master b, John Tomasewski c, Michael Feldman b, Anant Madabhushi a

a Case Western Reserve University, Cleveland, OH; b University of Pennsylvania, Philadelphia, PA; c University of Buffalo, Buffalo, NY.

ABSTRACT

While the plethora of information from multiple imaging and non-imaging data streams presents an opportunity for discovery of fused multimodal, multiscale biomarkers, these streams also introduce multiple independent sources of noise that hinder their collective utility. The goal of this work is to create fused predictors of disease diagnosis and prognosis by combining multiple data streams, which we hypothesize will provide improved performance as compared to predictors from individual data streams. To achieve this goal, we introduce supervised multi-view canonical correlation analysis (sMVCCA), a novel data fusion method that attempts to find a common representation for multiscale, multimodal data where class separation is maximized while noise is minimized. In doing so, sMVCCA assumes that the different sources of information are complementary and thereby act synergistically when combined. Although this method can be applied to any number of modalities and to any disease domain, we demonstrate its utility using three datasets. We fuse (i) 1.5 Tesla (T) magnetic resonance imaging (MRI) features with cerebrospinal fluid (CSF) proteomic measurements for early diagnosis of Alzheimer’s disease (n = 30), (ii) 3T Dynamic Contrast Enhanced (DCE) MRI and T2w MRI for in vivo prediction of prostate cancer grade on a per slice basis (n = 33) and (iii) quantitative histomorphometric features of glands and proteomic measurements from mass spectrometry for prediction of 5 year biochemical recurrence post-radical prostatectomy (n = 40). A Random Forest classifier applied to the sMVCCA fused subspace, as compared to that of MVCCA, PCA and LDA, yielded the highest classification AUC of 0.82 +/- 0.05, 0.76 +/- 0.01 and 0.70 +/- 0.07, respectively, for the aforementioned datasets. In addition, the sMVCCA fused subspace provided a 13.6%, 7.6% and 15.3% increase in AUC as compared with that of the best performing individual view in each of the three datasets, respectively. For the biochemical recurrence dataset, Kaplan-Meier curves generated from classifier prediction in the fused subspace reached the significance threshold (p = 0.05) for distinguishing between patients with and without 5 year biochemical recurrence, unlike those generated from classifier predictions of the individual modalities.

1. INTRODUCTION

Increasing accessibility to multiscale, multimodal biomedical data has begun to pave the way for personalized medicine. In particular, the advent of high-throughput molecular assays has yielded a plethora of molecular markers associated with diagnosis and prognosis in a number of different disease domains.1 However, few of these markers have translated into clinical practice.2 Alternatively, quantitative imaging features are now beginning to be considered as potential biomarkers, a term that has most commonly been associated with molecular signatures thus far. As a result, a number of promising quantitative imaging markers such as textural features on T2w magnetic resonance imaging (MRI)3 and textural kinetic features4 on dynamic contrast enhanced (DCE) MRI are beginning to emerge for disease characterization (e.g. prostate and breast cancer localization). Availability of multiple, complementary markers and data streams now presents an opportunity to fuse different sources of information in order to potentially improve prediction of disease diagnosis and prognosis as compared to any individual marker or data stream.

Although a number of data fusion strategies have been developed in the context of computer vision,5 only a few generalized techniques are available for fusion of heterogeneous biomedical data types such as imaging and non-imaging, and structural and functional imaging which present unique challenges. Previous approaches to

Medical Imaging 2014: Biomedical Applications in Molecular, Structural, and Functional Imaging, edited by Robert C. Molthen, John B. Weaver, Proc. of SPIE Vol. 9038, 903805

© 2014 SPIE · CCC code: 1605-7422/14/$18 · doi: 10.1117/12.2043762


data fusion can generally be categorized based on the level at which information is combined: (i) raw data level (low level fusion), (ii) feature level (intermediate level fusion) or (iii) decision level (high level fusion).2 Integration at the raw data level is limited to homogeneous data sources and is thus inapplicable for multiscale, biomedical data. Alternatively, decision level strategies6 bypass challenges associated with fusion of heterogeneous data types by combining independently derived decisions from each data source. In doing so, the information available at the intersection of different data channels may remain unexploited.7, 8 On the other hand, feature level integration involves converting raw data into quantitative feature representations which can then be combined using concatenation based,9, 10 kernel based11 or dimensionality reduction based methods.8 These methods transform quantitative features obtained from each data channel into an alternate, joint subspace termed metaspace where a meta-classifier is applied to distinguish between groups of patients with different diagnosis and/or prognosis. However, feature level fusion is complicated by differences in dimensionality as well as the ‘curse of dimensionality’.12 For instance, data sources residing at different scales often have significantly different dimensionalities, which renders simple concatenation of quantitative features sub-optimal, as high dimensional modalities such as ‘omics’ are likely to dominate the joint representation on account of the quantity, not necessarily the quality, of the data they provide.13 Furthermore, biomedical datasets often comprise small sample sizes, as a result of which concatenation based methods that increase data dimensionality are not suitable as they are prone to the curse of dimensionality.12 The curse of dimensionality states that the sample size required to build a good predictor increases exponentially with the number of features. Kernel-based methods11, 14, 15 alternatively transform raw data from the original space to a high dimensional embedding space where the different data types are more homogeneously represented, thereby making them more amenable for fusion. However, such methods are prone to overfitting16, 17 particularly given the small sample size and the noise associated with each of the biomedical data sources which, if unaccounted for, may drown the increase in signal achievable by fusion. As such, we seek a data fusion method that is able to extract information pertinent to the task of interest while accounting for various sources of noise and reducing dimensionality.

Dimensionality reduction methods have emerged as effective means of fusing data.8, 17 Canonical correlation analysis (CCA)18 is a linear dimensionality reduction method commonly used for data fusion as it accounts for relationships between multiple input variables. By capturing correlations between modalities, CCA seeks to identify the underlying structure common to the two views, thereby creating a subspace that is robust to modality-specific noise. As a result, CCA has been popular in the computer vision community for applications in image retrieval from text query,19 color demosaicing20 as well as imaging and non-imaging data fusion.17 Multi-view CCA (MVCCA) has emerged as an extension of traditional CCA for more than two views.21 MVCCA generalizes CCA by finding the linear subspace where pairwise correlations between multiple (more than two) modalities can be maximized. However, both CCA and MVCCA are unsupervised and thus do not guarantee a subspace that is optimal for class separation. Previously, Golugula et al.17 attempted to incorporate supervision into the CCA framework as a regularization step which, although shown to improve class separability in the reduced subspace, is computationally expensive. Alternatively, previous work has shown that embedding class labels as one of the two variable sets in CCA is equivalent to the supervised linear dimensionality reduction method, linear discriminant analysis (LDA).22 LDA seeks to find a linear subspace that is optimal for classification by maximizing the Euclidean distance between classes and minimizing the distance within each class.23 However, LDA, unlike CCA and MVCCA, is unable to account for relationships between multiple modalities, which may therefore result in overfitting.

In this work, we present a novel supervised multiview canonical correlation analysis (sMVCCA) scheme that combines properties of both MVCCA and LDA to provide a common, low dimensional subspace representation for fusing any number of heterogeneous forms of multidimensional, multimodal biomedical data. sMVCCA simultaneously maximizes correlations between multiple modalities and optimizes class separation by treating class labels as one of the views of MVCCA. In doing so, sMVCCA quantitatively transforms data into an alternate, reduced dimensional subspace that: (i) is able to ignore modality-specific noise thereby retaining information about the object of interest which is (ii) pertinent for the classification task under consideration. As all views capture information pertaining to the same object, information overlap is likely to increase with increasing number of views. While supervision enhances class discriminability in the joint-space, the correlation based representation ensures robustness to noise. We demonstrate the utility of sMVCCA in the context of learning fused predictors of (i) structural MRI and proteomics for early diagnosis of Alzheimer’s disease, (ii) DCE MRI and T2w MRI for in vivo determination of prostate cancer grade and (iii) histology and proteomics for prediction of 5-year biochemical recurrence associated with prostate cancer post surgery.

The rest of this paper is organized as follows. Section 2 reviews the theory and background of CCA and MVCCA, which is followed in Section 3 by a detailed description of our methodology, sMVCCA. Section 4 describes the experimental design, the results of which are presented and discussed in Section 5. We then conclude with a summary of the work and principal findings in Section 6.

2. THEORY AND REVIEW OF CCA AND MVCCA

We briefly introduce canonical correlation analysis (CCA) and its extension multiview canonical correlation analysis (MVCCA), which provide the theoretical framework for supervised MVCCA (sMVCCA). Table 1 lists all the notations used in subsequent formulations for reference.

Symbol | Description
$n, N$ | samples, total number of samples; $n \in \{1, \ldots, N\}$
$k, K$ | modalities, total number of modalities; $x_k$, $k \in \{1, \ldots, K\}$
$m, M_k$ | features, total number of features in each modality; $m \in \{1, \ldots, M_k\}$
$M$ | total number of features over all modalities; $M = \sum_k M_k$
$x_k^{n \times M_k}$ | sample $n$ described by modality $k$ with $M_k$ features
$x_k$ | data vector of all features $[x_1, x_2, \ldots, x_M]$, $\mathbb{R}^{1 \times M}$
$X$ | concatenated data matrix containing all features from all modalities $[x_1, \ldots, x_K]$, $\mathbb{R}^{n \times (M_1 + \ldots + M_K)}$
$w_k$ | weight vector for modality $k$, $\mathbb{R}^{M_k \times 1}$
$W_k$ | weight matrix for modality $k$, $\mathbb{R}^{M_k \times n}$
$w$ | concatenated weight vector over all modalities $[w_1^T, w_2^T, \ldots, w_K^T]^T$, $\mathbb{R}^{M \times 1}$
$W_x$ | weight matrix for all modalities $[W_1, W_2, \ldots, W_K]$, $\mathbb{R}^{M \times n}$
$Y$ | label matrix, $\mathbb{R}^{n \times g}$
$g$ | number of classes
$W_y$ | notation used in sMVCCA to denote $W$ for all labels, $\mathbb{R}^{g \times n}$

Table 1: Summary of Notations

2.1 Canonical Correlation Analysis

Consider a dataset $x_k^{n \times m}$ with $n \in \{1, 2, \ldots, N\}$ samples and $k \in \{1, 2, \ldots, K\}$ modalities, each of which comprises $m \in \{1, 2, \ldots, M_k\}$ features. Canonical correlation analysis (CCA) considers two sets of variables ($K = 2$), $x_1^{n \times M_1}$ and $x_2^{n \times M_2}$, and projects them onto basis vectors, $w_1$ and $w_2$, such that the correlation between the projections of the variables onto these basis vectors is mutually maximized. Formally, this can be expressed as

$$
\arg\max_{w_1, w_2} \frac{w_1^T C_{12} w_2}{\sqrt{w_1^T C_{11} w_1 \; w_2^T C_{22} w_2}}, \tag{1}
$$

where $C_{12} \in \mathbb{R}^{M_1 \times M_2}$, $C_{11} \in \mathbb{R}^{M_1 \times M_1}$ and $C_{22} \in \mathbb{R}^{M_2 \times M_2}$ are the covariance matrices of $x_1$ and $x_2$, $x_1$ and $x_1$, and $x_2$ and $x_2$, respectively.

2.2 Multi-View Canonical Correlation Analysis (MVCCA)

Multiview CCA (MVCCA) can be derived by extending the CCA formulation to account for more than two sets of variables ($K > 2$). Since the joint correlation of more than two variables does not formally exist, MVCCA maximizes the sum of correlations between each pair of modalities. Thus, MVCCA can be expressed as a generic form of Equation 1:

$$
\arg\max_{w_1, \ldots, w_k, \ldots, w_K} \sum\sum_{k \neq j} \frac{w_k^T C_{kj} w_j}{\sqrt{w_k^T C_{kk} w_k \; w_j^T C_{jj} w_j}}. \tag{2}
$$



The scaling of $w$ does not affect the arg max solution, allowing Equation 2 to be written as:

$$
\arg\max_{w_1, \ldots, w_K} \sum\sum_{k \neq j} w_k^T C_{kj} w_j \quad \text{s.t.} \quad w_1^T C_{11} w_1 = 1, \; \ldots, \; w_K^T C_{KK} w_K = 1. \tag{3}
$$

Previously, Equation 2 has been solved by sequentially considering correlations of each pair of variables.21 However, such an approach is sub-optimal as it requires iterative optimization, which is inefficient and can be susceptible to the order in which pairs of variable sets are chosen. Here, we present an alternative pairwise MVCCA approach by expressing correlations of all modalities in a combined correlation matrix which can be solved using the eigenvalue decomposition method. Letting $w = [w_1^T \; w_2^T \; \ldots \; w_K^T]^T$, $w \in \mathbb{R}^{M \times 1}$, allows us to rewrite Equation 3 in a compact matrix form:

$$
\arg\max_{w} \; w^T C w \quad \text{s.t.} \quad w^T C_d w = 1, \quad w_1^T C_{11} w_1 = \ldots = w_K^T C_{KK} w_K, \tag{4}
$$

where

$$
C = \begin{bmatrix} 0 & C_{12} & \cdots & C_{1K} \\ C_{21} & 0 & \ddots & \vdots \\ \vdots & \ddots & \ddots & C_{(K-1)K} \\ C_{K1} & \cdots & C_{K(K-1)} & 0 \end{bmatrix}, \qquad
C_d = \begin{bmatrix} C_{11} & 0 & \cdots & 0 \\ 0 & C_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & C_{KK} \end{bmatrix}. \tag{5}
$$
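As a brief sketch of why Equation 4 leads to an eigenvalue problem, consider only the constraint $w^T C_d w = 1$ and set aside the equal-norm constraint for the moment (an assumption made here for illustration). The multiplier $\lambda$ and the Lagrangian $\mathcal{L}$ are notation introduced for this sketch. Since $C_{kj} = C_{jk}^T$, the matrix $C$ is symmetric, and the standard Lagrangian argument gives:

$$
\begin{aligned}
\mathcal{L}(w, \lambda) &= w^T C w - \lambda \left( w^T C_d w - 1 \right), \\
\frac{\partial \mathcal{L}}{\partial w} &= 2 C w - 2 \lambda C_d w = 0 \;\; \Rightarrow \;\; C w = \lambda C_d w, \\
w^T C w &= \lambda \, w^T C_d w = \lambda .
\end{aligned}
$$

Under these assumptions the objective value at a stationary point equals the generalized eigenvalue $\lambda$, so the maximizer is the eigenvector associated with the largest generalized eigenvalue of the pair $(C, C_d)$.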

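In more general terms where $W \in \mathbb{R}^{M \times n}$, Equation 4 reduces to

$$
\arg\max_{W} \; \operatorname{trace}(W_x^T C W_x) \quad \text{s.t.} \quad W_x^T C_d W_x = I, \quad w_1^T C_{11} w_1 = \ldots = w_K^T C_{KK} w_K,
$$

where $I$ is an $n \times n$ identity matrix and the weight matrix is defined as $W_x = [W_1, W_2, \ldots, W_K] \in \mathbb{R}^{M \times n}$.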
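The trace formulation above can be illustrated with a minimal Python sketch, assuming NumPy and SciPy are available. The function name `mvcca`, the ridge term `reg` and the use of sample covariances are assumptions made for this illustration, not details taken from the paper, and the equal-norm constraint is not enforced here.

```python
import numpy as np
from scipy.linalg import eigh


def mvcca(views, n_components, reg=1e-3):
    """Pairwise MVCCA sketch via one generalized eigenproblem C w = lambda C_d w.

    views: list of (N, M_k) arrays that share the same N samples.
    reg:   small ridge added to each C_kk (an assumption, for stability).
    Returns a list of (M_k, n_components) weight matrices, one per modality.
    """
    centered = [v - v.mean(axis=0) for v in views]
    X = np.hstack(centered)                      # (N, M) concatenated data
    full_cov = X.T @ X / (X.shape[0] - 1)        # contains every block C_kj

    # Column boundaries of each modality inside the concatenated matrix.
    edges = np.cumsum([0] + [v.shape[1] for v in views])

    C = full_cov.copy()                          # keeps between-view blocks
    C_d = np.zeros_like(full_cov)                # keeps within-view blocks
    for a, b in zip(edges[:-1], edges[1:]):
        C[a:b, a:b] = 0.0
        C_d[a:b, a:b] = full_cov[a:b, a:b] + reg * np.eye(b - a)

    # eigh returns eigenvalues in ascending order, so reverse the columns
    # and keep the eigenvectors of the largest eigenvalues.
    _, vecs = eigh(C, C_d)
    W = vecs[:, ::-1][:, :n_components]
    return [W[a:b] for a, b in zip(edges[:-1], edges[1:])]
```

Projecting each centered view onto its own weight matrix gives the per-modality embeddings; how these are combined into a single fused representation is left open in this sketch.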

3. SUPERVISED MULTI-VIEW CANONICAL CORRELATION ANALYSIS (SMVCCA)

Although the MVCCA subspace provides information about the underlying object, it does not guarantee a representation that is optimal for class separation. We hereby present supervised MVCCA (sMVCCA), which explicitly accounts for class labels and thus attempts to provide a fused representation that selectively captures discriminative information of the underlying object. Previous work has shown that LDA is a special case of CCA where the correlation between data samples X and corresponding class labels Y is maximized.22 sMVCCA leverages this idea with pairwise MVCCA to improve class separability.



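3.1 Formulation

We define our data in a compact matrix form as $X \in \mathbb{R}^{n \times M}$. We extend the MVCCA formulation to incorporate an additional term that maximizes the correlation of $X$ with class labels $Y$, which can be expressed as follows:

$$
\begin{aligned}
\arg\max_{W_x, W_y} \; & \operatorname{trace}(W_x^T C W_x) + 2 \times \operatorname{trace}(W_x^T X^T Y W_y) \\
= \; & \operatorname{trace}\left( \begin{bmatrix} W_x^T & W_y^T \end{bmatrix} \begin{bmatrix} C & X^T Y \\ Y^T X & 0 \end{bmatrix} \begin{bmatrix} W_x \\ W_y \end{bmatrix} \right) \\
= \; & \operatorname{trace}(W^T C_{xy} W)
\end{aligned}
$$

$$
\text{s.t.} \quad \begin{bmatrix} W_x^T & W_y^T \end{bmatrix} \begin{bmatrix} C_d & 0 \\ 0 & Y^T Y \end{bmatrix} \begin{bmatrix} W_x \\ W_y \end{bmatrix} = I \;\; \Leftrightarrow \;\; W^T C_{dxy} W = I, \tag{6}
$$

$$
W_1^T C_{11} W_1 = \ldots = W_K^T C_{KK} W_K = W_y^T Y^T Y W_y, \tag{7}
$$

where $W = [W_x^T \; W_y^T]^T$, $C_{xy}$ and $C_{dxy}$ denote the two block matrices in the objective and in constraint (6), respectively, and $Y$ is a matrix in which class labels are encoded using the Soft-1-of-Class strategy.22

Solving Equation 6 consists of two steps: (i) ignoring the constraint in (7) leaves us with a quadratic programming problem, whose solution $W^*$ corresponds to the eigenvectors of the $n$ largest eigenvalues of a generalized eigenvalue system, $C_{xy} W = \lambda C_{dxy} W$; (ii) constraint (7) is then imposed on the optimal eigenvectors $W^*$ by normalizing the corresponding section of each modality: $W_j^{**} = W_j^* (W_j^{*T} C_{jj} W_j^*)^{-1/2}$, $j = 1, \ldots, K$.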
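The two-step solution can be sketched in Python as follows, assuming NumPy and SciPy. This is an illustrative sketch rather than the authors' implementation: the function name `smvcca` is hypothetical, plain one-hot coding stands in for the Soft-1-of-Class encoding, unnormalized scatter matrices replace the covariance blocks so that they match the $X^T Y$ term, a small ridge `reg` is added for numerical stability, and `n_components` is assumed not to exceed the smallest $M_k$.

```python
import numpy as np
from scipy.linalg import eigh


def smvcca(views, labels, n_components, reg=1e-3):
    """sMVCCA sketch: class labels enter as one extra view of pairwise MVCCA.

    views:  list of (N, M_k) arrays; labels: length-N integer class labels.
    Assumes n_components <= min(M_k) so the normalisation in step (ii)
    is well conditioned. One-hot labels and `reg` are assumptions.
    """
    centered = [v - v.mean(axis=0) for v in views]
    X = np.hstack(centered)                                     # (N, M)
    classes = np.unique(labels)
    Y = (np.asarray(labels)[:, None] == classes).astype(float)  # (N, g)

    edges = np.cumsum([0] + [v.shape[1] for v in views])
    M, g = X.shape[1], Y.shape[1]

    XtX = X.T @ X
    C = XtX.copy()
    C_d = np.zeros_like(XtX)
    for a, b in zip(edges[:-1], edges[1:]):
        C[a:b, a:b] = 0.0                        # drop within-view blocks
        C_d[a:b, a:b] = XtX[a:b, a:b] + reg * np.eye(b - a)

    XtY = X.T @ Y
    C_xy = np.block([[C, XtY], [XtY.T, np.zeros((g, g))]])
    C_dxy = np.block([[C_d, np.zeros((M, g))],
                      [np.zeros((g, M)), Y.T @ Y + reg * np.eye(g)]])

    # Step (i): generalized eigenproblem; keep the largest eigenvalues.
    _, vecs = eigh(C_xy, C_dxy)
    W = vecs[:, ::-1][:, :n_components]

    # Step (ii): renormalise each modality block, W_j (W_j^T C_jj W_j)^(-1/2).
    W_views = []
    for a, b in zip(edges[:-1], edges[1:]):
        W_j = W[a:b]
        G = W_j.T @ C_d[a:b, a:b] @ W_j
        vals, U = np.linalg.eigh(G)
        inv_sqrt = U @ np.diag(1.0 / np.sqrt(np.maximum(vals, 1e-12))) @ U.T
        W_views.append(W_j @ inv_sqrt)
    return W_views
```

The label block $W_y$ is discarded after the eigen-decomposition in this sketch, since only the modality weights are needed to embed new samples whose labels are unknown.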

4. EXPERIMENTAL DESIGN

To evaluate the presented sMVCCA method, we chose three unique datasets that enabled us to address some of the most relevant clinical problems in two different disease domains. Fusion tasks for the three datasets considered in this work can be categorized as (1) Radiology-Proteomics fusion, (2) Structural-Functional data fusion and (3) Histomorphometric-Proteomics fusion. In each case, the objective was to develop a fused predictor with a higher predictive performance as compared to that of individual data streams.

4.1 Dataset 1: Radiology-Proteomics Fusion for Early Diagnosis of Alzheimer’s Disease

Structural T1w MRI and cerebrospinal fluid (CSF) proteomic measurements were acquired for 30 adults between the ages of 55 and 90, among whom 12 were diagnosed with Alzheimer’s disease while the remaining 18 were healthy volunteers. This data was obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database∗. Given that Alzheimer’s is an irreversible disease, early detection of the disease may provide an opportunity to develop new treatments that may extend the patient’s quality of life.

Structural T1w MRI scans were acquired from 1.5T scanners at multiple sites across the United States and Canada. The imaging sequence was a 3-dimensional sagittal magnetization prepared rapid gradient-echo (MPRAGE). Additional details on the acquisition and pre-processing of MRI scans can be found in Ref. 25. The pre-processed MRI scans were subjected to FreeSurfer, a documented, freely available image analysis suite†, to extract features from 34 cortical ROIs in each hemisphere using the atlas detailed in Morra et al.26 For each ROI, the cortical thickness average (TA), standard deviation of thickness (TS), surface area (SA) and cortical volume (CV) were calculated as features. SA was calculated as the area of the surface layer equidistant between the gray/white matter and gray matter/CSF surfaces. CV at each vertex over the whole cortex was computed by the product of the SA and thickness at each surface vertex. Left and right hemisphere SA and total intracranial volume (ICV) were also included. For each subcortical structure, the subcortical volume (SV) was extracted. A number of these features were previously found to be correlated with neurodegenerative processes associated with Alzheimer’s disease.27

∗ http://www.loni.ucla.edu/ADNI
† http://surfer.nmr.mgh.harvard.edu/



Additionally, CSF proteomic biomarkers have previously shown promising results for diagnosis of Alzheimer’s disease.28 As such, we consider patients who have CSF proteomic measurements in addition to T1w MRI scans. For these patients, baseline CSF samples were obtained through lumbar puncture at all participating sites. Detailed protocols of CSF collection and transportation have previously been reported28 and are available on the ADNI website†. A total of 80 CSF concentrations of different proteins (such as Adiponectin, Angiopoietin, and Cortisol) were collected. Examples of features extracted across T1w MRI and protein expression data are provided in Table 2.

4.2 Dataset 2: Structural-Functional Imaging Fusion for Prostate Cancer Grading

T2w MRI and DCE MRI were acquired prior to radical prostatectomy (RP) for 16 patients with biopsy confirmed prostate cancer. Given that this dataset comprised intermediate Gleason score patients, the objective was to distinguish between primary Gleason grades 3 and 4 on MRI on a per slice basis. As such, we considered 2-3 MRI slices containing the most dominant tumor nodule in each patient, totaling 33 slices over 16 patients.

All MRI studies were performed on a 3T scanner (Verio, Siemens; Erlangen, GE) using a dedicated endorectal coil (Medrad, Pittsburgh, PA). Axial T1 and T2w imaging was performed with 3 mm slice thickness and 1.0 mm gap. DCE MRI was performed with T1w VIBE imaging at 3 mm slice thickness, with 24 cm FOV and matrix 256 by 192. Temporal resolution varied based on the number of prescribed slices. Twenty phases of imaging were performed, with IV gadolinium injection beginning 30 seconds after scan initialization. Surgical specimens were fixed in formalin and were subsequently sectioned into 3-4 mm slices, each of which was sectioned into 4 quadrants, stained with H&E and digitized at 20x magnification using an Aperio scanner. An expert pathologist provided cancer annotations and determined the Gleason grades on each slice. The 2-3 slices with the largest dominant tumor nodule were selected for analysis in each case. Ground truth cancer annotations were mapped from histologic sections onto MRI protocols via co-registration.29 Closest corresponding sections between the histologic and T2w MRI slices were determined by a radiologist and a pathologist. Manually selected landmarks were used to co-register histologic slices with corresponding T2w MRI and DCE MRI slices using thin plate splines (TPS), which then allowed for mapping of the tumor on MRI.29

Textural features and kinetic features from T2w and DCE MRI, respectively, were extracted from tumor voxels as detailed in Table 2. Previous work3 has shown that textural features are able to distinguish between cancerous and benign voxels on T2w MRI. We extract the same features to distinguish between aggressive and indolent tumors as we anticipate that textural features, which generally capture heterogeneity in local neighborhoods, will reflect the heterogeneity of the tumor, a characteristic known to be associated with aggressive tumors. To complement textural features, we extract kinetic features from signal intensity vs. time curves, which were previously shown to be associated with Gleason grades30 and a number of quantitative microvessel attributes.29

4.3 Dataset 3: Histomorphometry-Proteomics fusion for Early Prediction of 5-year Biochemical Recurrence in Prostate Cancer

40 biopsy confirmed prostate cancer patients with intermediate Gleason scores underwent radical prostatectomy. Patients were followed up and monitored for 5 years. Among all the patients, 21 experienced biochemical recurrence within 5 years of surgery while the other 19 did not experience biochemical recurrence.

Surgical specimens were sectioned and a representative slice containing the most dominant tumor nodule in each specimen was digitized at 20x magnification. Representative tumor areas as determined and annotated by a pathologist on H&E sections were collected via needle dissection, and formalin cross-links were removed by heating at 99 degrees Celsius. After peptide purification, samples were analyzed using C-18 reverse phase liquid chromatography/tandem mass spectrometry (nLC-MS/MS) on an LTQ Orbitrap mass spectrometer. Following data acquisition, a label free MaxQuant peptide identification package was used to extract ion chromatograms allowing for quantification of protein abundance. Proteins quantifiable in at least 50% of the studies were considered, which resulted in 650 proteomic expression values for each patient. Data imputation methods were used to replace missing values.

Proteomic expression values resulting from MaxQuant analysis of the raw mass spectrometry data were considered for analysis. On histology, previous work31, 32 has shown that quantitative histomorphometric features of glands may be able to predict the aggressiveness of tumor. As such, quantitative histomorphometric features



Dataset | Modality | Feature Type (num) | Examples / Description
D1 | T1w MRI | 34 ROIs extracted (30) | Cortical thickness average, standard deviation of surface area (SA), cortical volume, left and right hemisphere SA, and total intracranial volume.
D1 | Proteomics | Proteomics obtained from CSF (83) | Fatty Acid-Binding Protein, Resistin, Interleukin-3, Vascular Endothelial Growth Factor.
D2 | T2w MRI | Gradient & Gray-level statistics (25) | Features capturing summary statistics such as mean, standard deviation and derivative features of pixel values within a localized neighborhood computed using Sobel and Kirsch filters.
D2 | T2w MRI | Gabor wavelet transform (48) | Textural representation obtained via convolution of an image with Gabor filter bank, which comprises filters with different frequencies and orientations.
D2 | T2w MRI | Haralick statistics (39) | Statistics of gray-level co-occurrence matrices such as angular second moment, contrast and difference entropy.
D2 | DCE MRI | Per-voxel kinetic curve statistics (24) | Statistics such as mean, median and variance from characteristics of the signal-intensity vs. time curves including maximum uptake and rate of washout computed over all tumor voxels.
D2 | DCE MRI | ROI Modified Standard Logistic Fitted (MSLF) SI-Time Curve (14) | Signal intensity vs. time curves of all pixels within the tumor region were averaged and fitted to a modified standard logistic function. Features including the curve fitting parameters, maximum uptake, rate of washout and initial area under the curve were computed from this single summary kinetic curve.
D3 | Histology | Gland Morphology (100) | Statistics of gland area, boundary length, distance, perimeter, smoothness, fractal dimensions and descriptors of invariant moments and Fourier transforms.
D3 | Histology | Gland Architecture (51) | Statistics of graphical constructs such as Voronoi diagram, Delaunay Triangulation and Minimum Spanning Tree where gland centroids serve as nodes, thereby capturing characteristics of global glandular distribution.
D3 | Histology | Co-occurring Gland Tensors (39) | Gland orientation is quantified by measuring the angle of the principal axis of segmented gland boundaries. Statistics of co-occurrence matrices that capture gland directionality in local neighborhoods then serve as features.
D3 | Histology | Gland Subgraphs (26) | Statistics such as eccentricity and connected component coefficients of local subgraphs of gland distributions constructed using probabilistic decay function.
D3 | Histology | Haralick Texture (26) | Second order statistics computed from a symmetric co-occurrence matrix of neighboring pixel intensities within a given window size around a pixel.
D3 | Proteomics | Mass Spectrometry protein measurements (650) | Expression values of proteins that were expressed in more than 50% of the samples, which included heat shock protein, 40S ribosomal protein and a number of Ras proteins.

Table 2: Summary of features extracted from the various modalities across datasets



[Figure 1 plots: mean AUC (y-axis) versus number of reduced dimensions (x-axis, 0 to 15) for sMVCCA, MVCCA, PCA and LDA; panel (a) is titled "D1: Mean AUC vs. Dimensions".]

Figure 1: Mean AUC as a function of the number of dimensions in reduced subspace for (a) dataset 1, (b) dataset 2, and (c) dataset 3

capturing gland morphology, orientation as well as local and global architecture were extracted.17, 31, 32 A summary of the extracted features is provided in Table 2.

4.4 Feature Selection

Wilcoxon rank sum test (WRST) was used to select features from all modalities within each dataset. Features were ranked using training samples within each cross validation fold based on their p-values, where the feature with the lowest p-value was ranked the highest. The number of top features to retain was empirically determined separately for each dataset.
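The ranking step described above can be sketched as follows, assuming SciPy's `ranksums` as the rank sum test and binary labels coded 0/1. The function name `rank_features_wrst` and the parameter `n_keep` are hypothetical names introduced for this illustration.

```python
import numpy as np
from scipy.stats import ranksums


def rank_features_wrst(X_train, y_train, n_keep):
    """Rank features by Wilcoxon rank sum p-value on training samples only.

    X_train: (N, M) feature matrix from the training fold.
    y_train: length-N binary labels (0/1).
    Returns (indices of the n_keep top-ranked features, all p-values).
    """
    y_train = np.asarray(y_train)
    pos, neg = X_train[y_train == 1], X_train[y_train == 0]
    p_values = np.array([
        ranksums(pos[:, j], neg[:, j]).pvalue
        for j in range(X_train.shape[1])
    ])
    order = np.argsort(p_values)       # lowest p-value is ranked highest
    return order[:n_keep], p_values
```

Calling this inside each cross validation fold, rather than once on the full dataset, keeps the held-out samples from influencing which features are retained.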

4.5 Experimental Evaluation

Top ranked features from all modalities in each dataset were transformed into a reduced dimensional subspace via sMVCCA or other comparative strategies, which included MVCCA, LDA and PCA. Note that unlike sMVCCA and MVCCA, PCA and LDA are designed for a single feature set. As such, selected features from all modalities were concatenated into a single input matrix prior to the application of PCA and LDA. In the reduced subspace, a random forest (RF) classifier was used to evaluate the various fused and individual modality representations. RF is a widely used, well-established decision tree ensemble method that combines outputs of multiple decision trees. Three fold cross validation was performed for datasets 1 and 2, and ten fold stratified cross validation was performed for dataset 3 over 10 trials.
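As an illustration of the stratified splitting mentioned above, the following sketch generates train/test index pairs with NumPy only. The generator name `stratified_folds` and the round-robin assignment of samples to folds are assumptions for this example; the paper does not state how its folds were constructed.

```python
import numpy as np


def stratified_folds(labels, n_folds, seed=0):
    """Yield (train_idx, test_idx) pairs with classes spread across folds.

    Samples of each class are shuffled and dealt round-robin into folds,
    so every fold receives a near-equal share of each class.
    """
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        fold_of[idx] = np.arange(len(idx)) % n_folds
    for f in range(n_folds):
        yield np.flatnonzero(fold_of != f), np.flatnonzero(fold_of == f)
```

Repeating the split with a different `seed` for each trial would correspond to the repeated trials described in the text.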

Experiments 1, 2 and 3 were conducted to (i) explore the effect of parameters associated with fused representations, (ii) determine the value of considering the relationship between modalities (as in CCA and MVCCA) as well as the relationship with class labels (as in LDA) as is done by our method, sMVCCA, and (iii) test our hypothesis that combination of multiple sources of information yields better predictive power than any individual data source alone, respectively.

4.5.1 Experiment 1: Exploration of predictive performance vs. number of reduced dimensions

Dimensionality of the reduced subspace is the only parameter that requires tuning to compute sMVCCA, MVCCA, PCA and LDA fused representations. Thus, classification performance was evaluated across a range of dimensions.

4.5.2 Experiment 2: Comparing sMVCCA vs. other linear dimensionality reduction methods for fusion

At the dimensionality providing the highest performance in Experiment 1, which we will denote as d∗, sMVCCA was compared with other supervised and unsupervised linear dimensionality reduction based fusion methods which included MVCCA, PCA and LDA. In comparing sMVCCA with MVCCA and LDA, we test our assumption that considering associations between modalities as well as with class labels improves predictive power over considering either one of the two criteria individually.



[Figure 2 plots: box-and-whisker plots of area under ROC curve (y-axis) for sMVCCA, MVCCA, PCA and LDA (x-axis); panel (a) is titled "D1: sMVCCA vs. Comparative Methods".]

Figure 2: Box-and-whisker plots of AUC obtained over 10 runs of three, three and ten fold cross validations for datasets 1, 2 and 3, respectively using a random forest classifier on sMVCCA, MVCCA, PCA and LDA subspaces. The lower and upper bounds of each box indicate the 25th and 75th percentile of AUC whereas the red bar indicates the median AUC values. The dashed lines extend from the box to the maximum and minimum values. The red plus signs refer to outliers while the blue asterisk indicates statistically significant difference in AUC from that of sMVCCA, as determined by Tukey honest significance difference criterion.

4.5.3 Experiment 3: Comparing sMVCCA fused representation vs. individual modalities

At dimensionality d∗, RF classifier performance in the sMVCCA subspace was compared against that of individual modalities. For evaluation of individual modalities, raw features from each modality were first reduced to a PCA reduced subspace of d∗ dimensions where the classifier was applied.
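The per-modality PCA reduction can be sketched with a singular value decomposition in NumPy. The helper names `fit_pca` and `apply_pca` are hypothetical, and fitting the projection on training samples before applying it to held-out samples is an assumption about how the reduction would be used within cross validation.

```python
import numpy as np


def fit_pca(X_train, n_components):
    """Return the training mean and the top principal directions (M, d)."""
    mean = X_train.mean(axis=0)
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    return mean, Vt[:n_components].T


def apply_pca(X, mean, components):
    """Project samples onto the fitted principal directions."""
    return (X - mean) @ components
```

With `n_components` set to d∗, the output of `apply_pca` is the reduced representation on which a classifier would be trained for a single modality.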

4.5.4 Performance Metrics

For all datasets, the area under the curve (AUC) was computed over all folds. The mean and standard deviation of AUC were evaluated across 10 trials. AUCs across experimental conditions were compared via one-way analysis of variance (ANOVA), which tested the null hypothesis that the means of AUC across all experimental conditions were equal. An alpha of 0.05 was used to reject the null hypothesis. Following ANOVA, a post-hoc test was performed using Tukey’s honest significant difference (HSD) criterion to determine the means that were significantly different from that of sMVCCA.
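A minimal sketch of these metrics, assuming NumPy and SciPy, is given below. The AUC is computed through its rank-sum identity and the comparison uses `scipy.stats.f_oneway`; the function names `auc_score` and `compare_methods` are hypothetical, and the Tukey HSD post-hoc step is not included in this sketch.

```python
import numpy as np
from scipy.stats import f_oneway, rankdata


def auc_score(y_true, scores):
    """AUC via the rank-sum identity (tied scores get average ranks)."""
    y_true = np.asarray(y_true).astype(bool)
    ranks = rankdata(scores)
    n_pos, n_neg = y_true.sum(), (~y_true).sum()
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def compare_methods(auc_by_method, alpha=0.05):
    """One-way ANOVA over per-trial AUCs.

    auc_by_method: dict mapping a method name to its list of AUC values.
    Returns (p_value, reject_null) for the null of equal mean AUC.
    """
    result = f_oneway(*auc_by_method.values())
    return result.pvalue, result.pvalue < alpha
```

For example, `auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` evaluates to 0.75, since three of the four positive-negative pairs are ordered correctly.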

For D3, time to recurrence was available for 30 out of 40 patients. For these patients, Kaplan-Meier (KM) analysis with logrank significance test was used to evaluate the predictability of biochemical recurrence free survival using the individual and sMVCCA combined modalities. KM curves provide an alternate, independent measure of performance that allowed us to better assess which patients were being misclassified by accounting for the time to recurrence. In general, we would expect that more errors would occur in predicting the early recurrence patients and thereby would have overlapping recurrence and non-recurrence KM curves at earlier time points. The goal however is to correctly predict both early and late recurrence patients which would result in non-overlapping KM curves with significantly different trajectories for the recurrence and non-recurrence groups.
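For readers unfamiliar with KM analysis, the product-limit estimate behind such curves can be sketched in a few lines of NumPy. The function name `kaplan_meier` and the 0/1 coding of events are assumptions for this illustration, and the logrank test used to compare two curves is not shown.

```python
import numpy as np


def kaplan_meier(times, events):
    """Kaplan-Meier estimate of recurrence-free survival.

    times:  follow-up time for each patient.
    events: 1 if recurrence was observed at that time, 0 if censored.
    Returns (event_times, survival) at each distinct event time.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events == 1])
    survival, s = [], 1.0
    for t in event_times:
        at_risk = np.sum(times >= t)                 # still under follow-up
        recurred = np.sum((times == t) & (events == 1))
        s *= 1.0 - recurred / at_risk                # product-limit update
        survival.append(s)
    return event_times, np.array(survival)
```

Computing one curve for patients predicted to recur and one for those predicted not to recur gives the pair of trajectories whose separation the logrank test would assess.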

5. RESULTS AND DISCUSSION

5.1 Experiment 1: Exploration of predictive performance vs. number of reduced dimensions

Figure 1 shows mean AUC as a function of the number of reduced dimensions. While the AUC in D1 rises slowly and reaches a plateau after the first few dimensions, AUC values in D2 peak at the first dimension, after which they plateau. D3 shows a different trajectory altogether: the AUC peaks within the first few dimensions and then drops sharply. PCA closely follows the path of sMVCCA, particularly for D3, which indicates that the direction of correlation across modalities coincides with the direction of variance within the data, suggesting that the modalities in D3 may be highly redundant.
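The dimensionality d∗ used in the later experiments is the one with the highest mean AUC in this sweep. A minimal sketch of that selection step is shown below; the function name `best_dimensionality`, the tie-breaking rule (smallest dimension wins) and the AUC values are assumptions made for illustration and do not come from the study.

```python
from statistics import mean


def best_dimensionality(auc_by_dim):
    """Map {dimension: [AUC per trial]} to (d_star, mean AUC at d_star)."""
    means = {d: mean(values) for d, values in auc_by_dim.items()}
    # Iterating over sorted dimensions makes the smallest dimension win ties.
    d_star = max(sorted(means), key=means.get)
    return d_star, means[d_star]


# Made-up AUCs for three candidate dimensionalities over three trials.
sweep = {1: [0.62, 0.66, 0.64], 2: [0.71, 0.73, 0.72], 3: [0.69, 0.70, 0.71]}
print(best_dimensionality(sweep))
```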



[Figure 3 panels; y-axis: Area Under ROC Curve. (a) D1: sMVCCA vs. Individual Modalities (sMVCCA, T1w MRI, Proteomics). (b) D2: sMVCCA, T2w MRI, DCE MRI. (c) D3: sMVCCA, Histology, Proteomics.]

Figure 3: Box-and-whisker plots of AUC obtained over 10 runs of three-, three- and ten-fold cross-validation for datasets 1, 2 and 3, respectively, using a random forest classifier on the sMVCCA fused subspace and the individual subspaces. The lower and upper bounds of each box indicate the 25th and 75th percentiles of AUC, whereas the red bar indicates the median AUC. The dashed lines extend from the box to the maximum and minimum values. The red plus signs mark outliers, while the blue asterisk indicates a statistically significant difference in AUC from that of sMVCCA, as determined by Tukey’s honest significant difference criterion.

5.2 Experiment 2: Comparing sMVCCA vs. other linear dimensionality reduction methods for data fusion

Figure 2 summarizes the performance of sMVCCA and the comparative fusion strategies at the dimensionality that provided the highest mean AUC in Experiment 1. sMVCCA achieves the highest classification AUC for all three datasets, while PCA consistently emerges as the next best performing method. The improved performance of sMVCCA over PCA is evident in D2, as indicated by the blue asterisk, which denotes a statistically significant difference in AUC from that of sMVCCA. Although the difference in performance between sMVCCA and PCA is not significant in the other datasets, we note that, for D1, the AUCs derived from PCA include outliers, which suggests that the PCA-derived fused embedding is less reliable.

Compared to MVCCA, sMVCCA shows significantly higher performance across all datasets, suggesting that supervising the construction of the correlated subspace is likely to improve class discriminability. At the same time, we note that the supervised comparative method, LDA, in which features from all modalities are concatenated prior to computing the low dimensional embedding, has significantly lower AUCs across all datasets. This in turn emphasizes the importance of intelligently combining heterogeneous data streams while exposing label information so as to avoid over-fitting.

5.3 Experiment 3: Comparing sMVCCA fused representation vs. individual modalities

Figure 3 shows the distribution of AUCs achieved by the sMVCCA fused subspace as well as the individual views, reduced to the number of dimensions that achieved the maximum AUC value in Figure 1. Classification in the sMVCCA fused subspace consistently results in significantly better predictive performance as compared to individual modalities across all datasets. The three datasets achieve 13.6%, 7.6% and 15.3% increases in mean AUC in the sMVCCA subspace as compared with that of the best performing individual view in D1, D2 and D3, respectively.
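The text does not state whether these percentages are relative or absolute differences in mean AUC. Assuming a relative increase over the best performing individual view, the quantity would be computed as follows, where k indexes the individual views of a dataset and the bar denotes the mean over trials (both notational choices are introduced here for illustration):

```latex
\Delta_{\%} \;=\; 100 \times
\frac{\overline{\mathrm{AUC}}_{\mathrm{sMVCCA}} \;-\; \max_{k}\, \overline{\mathrm{AUC}}_{k}}
     {\max_{k}\, \overline{\mathrm{AUC}}_{k}}
```

Under the alternative reading, an absolute difference, the denominator would be dropped and the result expressed in percentage points.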

Unlike the individual views in D2, whose performances differ markedly from each other, the individual views in D1 and D3 show similar performances. Although ANOVA indicated that the performances of sMVCCA and the individual views were significantly different, a post-hoc pairwise comparison using Tukey’s honest significant difference indicated that the performances of the individual views in D1 and D3 were not significantly different from each other. Nonetheless, fusion appears to marginally but significantly improve the classification AUC in these datasets. Note that no significant differences were found between PCA and sMVCCA in the same two datasets, which suggests that although sMVCCA is driven by correlation or redundancies across views, it possibly converges to PCA when the views are highly redundant.

Furthermore, Kaplan-Meier analysis of 5 year biochemical recurrence free survival in D3 showed that the fused representation was better able to distinguish between the biochemical recurrence and non-recurrence groups as compared to the individual modalities. As shown in Figure 4, close to significant (p=0.05) differences were found between the Kaplan-Meier curves generated from the predicted biochemical recurrence and non-recurrence groups when both histology and proteomic data were fused using sMVCCA, whereas no significant differences were found when features from a single modality were used for classification.

[Figure 4 panels; x-axis: Time (months), 0 to 140; y-axis: Biochemical Recurrence Free Survival Rate, 0 to 1; legend: Recurrence, Non-Recurrence, Censored. (a) Histologic Features, p-value: 0.49. (b) Proteomic Features, p-value: 0.84. (c) Fused Histo-Proteomic Features, p-value: 0.05.]

Figure 4: Kaplan-Meier analysis of biochemical recurrence free survival rate using histologic features, proteomic features and fused features in the sMVCCA subspace.

6. CONCLUDING REMARKS

In this work, we introduced a novel supervised multi-view canonical correlation analysis (sMVCCA) method for multimodal data fusion in the context of combining (i) radiology and proteomics for early diagnosis of Alzheimer’s disease, (ii) structural and functional MRI for prostate cancer grading, and (iii) histomorphometry and proteomics for early prediction of 5-year biochemical recurrence post radical prostatectomy. sMVCCA leverages associations between multiple modalities as well as with class labels to provide a fused low dimensional representation that captures the most discriminatory attributes of the underlying biological state, as reflected in the various data channels. In the experimental evaluation, sMVCCA was compared against other linear dimensionality reduction based fusion methods to determine the optimal joint subspace for classification. In addition, we evaluated whether the sMVCCA fused subspace provides improved class discriminability as compared to the individual modalities. Our experimental evaluation yielded the following principal findings:

• Considering relationships (i) between modalities as well as (ii) with the class label, as is done by sMVCCA, yields a more predictive subspace than considering either of the two criteria alone.

• The fused representation provides greater predictive power than any individual modality.

Although this work introduces a promising platform for quantitative fusion of heterogeneous data channels, it is limited in a number of ways. All datasets used have small sample sizes and provide only two modalities. One of the strengths of sMVCCA is that it can fuse any number of data channels, a property that remains experimentally unexplored on account of the datasets considered. In addition, the sMVCCA representation depends on the input features from each modality, which were selected using a feature selection strategy. For datasets with small sample sizes, it is well known that feature selection strategies provide less than optimal feature sets [33], which is likely to result in a sub-optimal fused subspace. Despite these limitations, current findings indicate that sMVCCA provides a promising framework for fusion of multiscale, multimodal data and that it may be important to incorporate the properties of sMVCCA in future biomedical data fusion strategies.

7. ACKNOWLEDGMENTS

Research reported in this publication was supported by the National Cancer Institute of the National Institutes of Health under award numbers R01CA136535-01, R01CA140772-01, and R21CA167811-01; the National Institute of Diabetes and Digestive and Kidney Diseases under award number R01DK098503-02; the DOD Prostate Cancer Synergistic Idea Development Award (PC120857); the QED award from the University City Science Center and Rutgers University; and the Ohio Third Frontier Technology Development Grant. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. In addition, data collection and sharing for the Alzheimer’s Disease Dataset was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from a number of associations and companies ‡.

REFERENCES

[1] Lee, J. W., Figeys, D., and Vasilescu, J., “Biomarker assay translation from discovery to clinical studies in cancer drug development: quantification of emerging protein biomarkers,” Advances in cancer research 96, 269–298 (2006).

[2] Kern, S. E., “Why your new cancer biomarker may never work: recurrent patterns and remarkable diversity in biomarker failures,” Cancer research 72(23), 6097–6101 (2012).

[3] Viswanath, S., Bloch, N., Chappelow, J., Toth, R., Rofsky, N., Genega, E., Lenkinski, R., and Madabhushi, A., “Central gland and peripheral zone prostate tumors have significantly different quantitative imaging signatures on 3 tesla endorectal, in vivo T2-weighted MR imagery,” Journal of Magnetic Resonance Imaging (2012).

[4] Agner, S. C., Soman, S., Libfeld, E., McDonald, M., Thomas, K., Englander, S., Rosen, M. A., Chin, D., Nosher, J., and Madabhushi, A., “Textural kinetics: a novel dynamic contrast-enhanced (dce)-mri feature for breast lesion classification,” Journal of Digital Imaging 24(3), 446–463 (2011).

[5] Khaleghi, B., Khamis, A., Karray, F. O., and Razavi, S. N., “Multisensor data fusion: A review of the state-of-the-art,” Information Fusion (2011).

[6] Rohlfing, T. and Maurer, C. R., “Multi-classifier framework for atlas-based image segmentation,” Pattern Recognition Letters 26(13), 2070–2079 (2005).

[7] Tiwari, P., Viswanath, S., Lee, G., and Madabhushi, A., “Multi-modal data fusion schemes for integrated classification of imaging and non-imaging biomedical data,” in [Biomedical Imaging: From Nano to Macro, 2011 IEEE International Symposium on ], 165–168, IEEE (2011).

[8] Lee, G., Doyle, S., Monaco, J., Madabhushi, A., Feldman, M. D., Master, S. R., and Tomaszewski, J. E., “A knowledge representation framework for integration, classification of multi-scale imaging and non-imaging data: Preliminary results in predicting prostate cancer recurrence by fusing mass spectrometry and histology,” in [Biomedical Imaging: From Nano to Macro, 2009. ISBI’09. IEEE International Symposium on ], 77–80, IEEE (2009).

[9] Verma, R. Zacharaki, E. O. Y. a. a., “Multiparametric tissue characterization of brain neoplasms and their recurrence using pattern classification of mr images,” Academic Radiology 15(8), 966–977 (2008).

[10] Chan, I., Wells III, W., Mulkern, R. V., Haker, S., Zhang, J., Zou, K. H., Maier, S. E., and Tempany, C. M., “Detection of prostate cancer by integration of line-scan diffusion, t2-mapping and t2-weighted magnetic resonance imaging; a multichannel statistical classifier,” Medical physics 30, 2390 (2003).

[11] Tiwari, P., Kurhanewicz, J., Rosen, M., and Madabhushi, A., “Semi supervised multi kernel (SeSMiK) graph embedding: identifying aggressive prostate cancer via magnetic resonance imaging and spectroscopy,” MICCAI 13(Pt 3), 666–73 (2010).

[12] Bellman, R. E., [Adaptive control processes: a guided tour ], vol. 4, Princeton university press Princeton (1961).

[13] Madabhushi, A., Agner, S., Basavanhally, A., Doyle, S., and Lee, G., “Computer-aided prognosis: predicting patient and disease outcome via quantitative fusion of multi-scale, multi-modal data,” Comput. Med. Imaging and Graph. 35(7-8), 506–14 (2011).

[14] Lanckriet, G. R., Deng, M., Cristianini, N., Jordan, M. I., Noble, W. S., et al., “Kernel-based data fusion and its application to protein function prediction in yeast.,” in [Pacific symposium on biocomputing ], 9, 300–311 (2004).

‡http://adni.loni.usc.edu/about/funding/



[15] McFee, B., Galleguillos, C., and Lanckriet, G., “Contextual object localization with multiple kernel nearest neighbor,” Image Processing, IEEE Transactions on 20(2), 570–585 (2011).

[16] Lewis, D. P., Jebara, T., and Noble, W. S., “Support vector machine learning from heterogeneous data: an empirical analysis using protein sequence and structure,” Bioinformatics 22(22), 2753–2760 (2006).

[17] Golugula, A., Lee, G., Master, S. R., Feldman, M. D., Tomaszewski, J. E., Speicher, D. W., and Madabhushi, A., “Supervised Regularized Canonical Correlation Analysis: integrating histologic and proteomic measurements for predicting biochemical recurrence following prostate surgery,” BMC Bioinformatics 12, 483 (2011).

[18] Hotelling, H., “Relations between two sets of variates,” Biometrika 28(3/4), 321–377 (1936).

[19] Hardoon, D. R., Szedmak, S., and Shawe-Taylor, J., “Canonical correlation analysis: An overview with application to learning methods,” Neural Computation 16(12), 2639–2664 (2004).

[20] Hel-Or, Y., “The canonical correlations of color images and their use for demosaicing,” HP Laboratories Israel, Tech. Rep. HPL-2003-164R1 (2004).

[21] Kettenring, J. R., “Canonical analysis of several sets of variables,” Biometrika 58(3), 433–451 (1971).

[22] Sun, T. and Chen, S., “Class label versus sample label-based cca,” Applied Mathematics and computation 185(1), 272–283 (2007).

[23] Fisher, R. A., “The use of multiple measurements in taxonomic problems,” Annals of eugenics 7(2), 179–188 (1936).

[24] Bartlett, M. S., “Further aspects of the theory of multiple regression,” in [Proceedings of the Cambridge Philosophical Society ], 34, 33–40, Cambridge Univ Press (1938).

[25] Jack, C. R., Bernstein, M. A., Fox, N. C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P. J., L Whitwell, J., Ward, C., et al., “The alzheimer’s disease neuroimaging initiative (adni): Mri methods,” Journal of Magnetic Resonance Imaging 27(4), 685–691 (2008).

[26] J., M., Z., T., L., A., A., G., C., A., S., M., N., P., X., H., A., T., C., J., M., W., and P., T., “Validation of a fully automated 3d hippocampal segmentation method using subjects with alzheimer’s disease mild cognitive impairment, and elderly controls.,” Neuroimage 43, 59–68 (2008).

[27] Dickerson, B. C., Feczko, E., Augustinack, J. C., Pacheco, J., Morris, J. C., Fischl, B., and Buckner, R. L., “Differential effects of aging and alzheimer’s disease on medial temporal lobe cortical thickness and surface area,” Neurobiology of aging 30(3), 432–440 (2009).

[28] Shaw, L. M., Vanderstichele, H., Knapik-Czajka, M., Clark, C. M., Aisen, P. S., Petersen, R. C., Blennow, K., Soares, H., Simon, A., Lewczuk, P., et al., “Cerebrospinal fluid biomarker signature in alzheimer’s disease neuroimaging initiative subjects,” Annals of neurology 65(4), 403–413 (2009).

[29] Singanamalli, A., Sparks, R., Rusu, M., Shih, N., Ziober, A., Tomaszewski, J., Rosen, M., Feldman, M., and Madabhushi, A., “Identifying in vivo dce mri parameters correlated with ex vivo quantitative microvessel architecture: A radiohistomorphometric approach,” in [SPIE Medical Imaging ], 867604–867604, International Society for Optics and Photonics (2013).

[30] Vos, E. K., Litjens, G., Kobus, T., Hambrock, T., Kaa, C. A., Barentsz, J. O., Huisman, H., and Scheenen, T. W., “Assessment of prostate cancer aggressiveness using dynamic contrast-enhanced magnetic resonance imaging at 3 t,” European urology (2013).

[31] Lee, G., Sparks, R., Ali, S., Madabhushi, A., Feldman, M. D., Master, S., Shih, N., and Tomaszewski, J., “Co-occurring gland tensors in localized cluster graphs: Quantitative histomorphometry for predicting biochemical recurrence for intermediate grade prostate cancer,” in [Biomedical Imaging (ISBI), 2013 IEEE 10th International Symposium on ], 113–116, IEEE (2013).

[32] Lee, G., Ali, S., Veltri, R., Epstein, J. I., Christudass, C., and Madabhushi, A., “Cell orientation entropy (core): Predicting biochemical recurrence from prostate cancer tissue microarrays,” in [Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013 ], 396–403, Springer (2013).

[33] Sima, C. and Dougherty, E. R., “What should be expected from feature selection in small-sample settings,” Bioinformatics 22(19), 2430–2436 (2006).


