Decoding visual brain states from fMRI using an ensemble of classifiers


Pattern Recognition 45 (2012) 2064–2074

Contents lists available at ScienceDirect

Pattern Recognition

0031-3203/$ - see front matter © 2011 Elsevier Ltd. All rights reserved.

doi:10.1016/j.patcog.2011.04.015

⁎ Corresponding author at: Instituto Superior Tecnico, Technical University of Lisbon, Lisbon, Portugal.

E-mail address: [email protected] (M. Silveira).

journal homepage: www.elsevier.com/locate/pr

Decoding visual brain states from fMRI using an ensemble of classifiers

Carlos Cabral a, Margarida Silveira a,b,⁎, Patricia Figueiredo a,b

a Instituto Superior Tecnico, Technical University of Lisbon, Lisbon, Portugal
b Institute for Systems and Robotics, Lisbon, Portugal

Article info

Available online 7 May 2011

Keywords:

fMRI

Retinotopic mapping

Visual localizer

Brain decoding

Machine learning

Ensemble of classifiers


Abstract

Decoding perceptual or cognitive states based on brain activity measured using functional magnetic resonance imaging (fMRI) can be achieved using machine learning algorithms to train classifiers of specific stimuli. However, the high dimensionality and intrinsically low signal-to-noise ratio (SNR) of fMRI data pose great challenges to such techniques. The problem is aggravated in the case of multiple-subject experiments because of the high inter-subject variability in brain function. To address these difficulties, the majority of current approaches use a single classifier. Since, in many cases, different stimuli activate different brain areas, it makes sense to use a set of classifiers, each specialized in a different stimulus. Therefore, we propose in this paper using an ensemble of classifiers for decoding fMRI data. Each classifier in the ensemble has a favorite class or stimulus and uses an optimized feature set for that particular stimulus. The output for each individual stimulus is therefore obtained from the corresponding classifier, and the final classification is achieved by simply selecting the best score. The method was applied to three empirical fMRI datasets from multiple subjects performing visual tasks with four classes of stimuli. Ensembles of GNB and k-NN base classifiers were tested. The ensemble of classifiers systematically outperformed a single classifier for the two most challenging datasets. In the remaining dataset, a ceiling effect was observed which probably precluded a clear distinction between the two classification approaches. Our results may be explained by the fact that different visual stimuli elicit specific patterns of brain activation, and indicate that an ensemble of classifiers provides an advantageous alternative to commonly used single classifiers, particularly when decoding stimuli associated with specific brain areas.

© 2011 Elsevier Ltd. All rights reserved.

1. Introduction

Brain mapping refers to the association of perceptual or cognitive states with specific patterns of brain activity, and it is central to human neuroscience. The neuroimaging techniques used to measure brain activity include the electroencephalogram (EEG), magnetoencephalogram (MEG), positron emission tomography (PET) and, most commonly, functional magnetic resonance imaging (fMRI). Because of the advantageous compromise between spatial and temporal resolutions, together with its non-invasiveness, fMRI has become the method of choice in human brain mapping experiments. In fMRI, a blood oxygen level dependent (BOLD) signal is recorded across the brain while subjects undergo an experimental manipulation, such as a visual stimulus or a motor task [1].

Typically, the brain regions activated in association with the stimulus or task are then identified by detecting voxels where


BOLD signal changes are significantly correlated with the experimental paradigm. The statistical analysis of fMRI data is most often carried out through a massively univariate approach. In this approach, a general linear model (GLM) describing the experimental manipulation, as well as any confound variables, is adjusted to the time series of each voxel in order to yield a 3D map of model parameter estimates [2]. Following the model estimation, a pattern of activated regions is identified by using an appropriate inference approach. Multivariate methods have also been proposed for the analysis of fMRI brain activation data, including parametric [3] and non-parametric approaches such as clustering [4], independent component analysis (ICA) [5] and self-organizing mapping [6].

In recent years, a great effort has been devoted to the inverse problem of identifying the stimulus/task, or brain state, associated with measured brain activity patterns [7]. A number of pattern recognition tools have been explored to address this brain decoding problem. In particular, it is now popular to use machine learning techniques to train classifiers to decode brain states or stimuli from fMRI data [8].

Although brain decoding encompasses several challenging problems in the analysis of fMRI, in this paper we will focus on


the problem of decoding the stimuli presented to the subject at each time point during a visual experiment. In fact, it is well known that the visual cortex exhibits functional specialization, such that different visual stimuli elicit specific patterns of brain activation [9]. In particular, the primary visual cortex exhibits a retinotopic organization such that each area of the visual field (and hence the retina) is represented by a well-defined region within the cortex. The left visual field is represented on the right hemisphere and vice-versa [10]. Moreover, it has been shown that the human brain presents well-defined distributed activation patterns in response to different categories of visual stimuli, such as faces, houses or tools [11]. Such patterns involve a network of brain regions, including specialized cortical areas, which are more active for the corresponding stimulus category than any other. In particular, it is now a well-established finding that a small area within the fusiform cortex responds more to images of human faces than to images of any other body part or other kinds of objects; this is the so-called fusiform face area (FFA). It is also a consistent finding that pictures of houses activate the parahippocampal cortex more than pictures of other objects, defining the parahippocampal place area (PPA). Finally, when recognizable objects are compared with unrecognizable ones, i.e. scrambled images of the same objects, a large area in the lateral occipital cortex (LOC) exhibits greater activation. The alternating presentation of images belonging to these different visual categories is commonly used to localize these specialized brain areas and associated networks; these are therefore called localizer experiments. In this paper, we tackle the inverse problem: identifying the presented stimuli from a set of possible categories by analyzing the patterns of brain activity elicited throughout the experiment.

Brain decoding of fMRI data using machine learning techniques involves representing the BOLD volumes as a spatial pattern and using a classification algorithm to learn from a number of examples. The classification method will then generalize from the given examples, in order to be able to predict a brain state for new, unseen cases. Since each brain volume contains thousands of voxels, i.e. features, the feature vectors live in a very high dimensional space. Naturally, the number of subjects is generally much smaller than the dimension of this space, thus the classification task suffers from the 'curse of dimensionality'. Therefore, feature selection and/or dimensionality reduction are critical steps in order to achieve good classification performance.

Several feature selection methods have been used successfully with fMRI data. The most common include selecting the n most active voxels [12–14], selecting the n most active voxels per region of interest (ROI) [12–14], or averaging the mean activity over all the voxels of each ROI [13]. Other possibilities have also been explored, such as selecting the n voxels which scored best when used for training single-voxel classifiers [12], or using PCA either directly on the BOLD signal [14,15] or on the probability distribution of fMRI intensity calculated within each ROI [16].

On the other hand, the choice of classifier is also of great importance. Methodologies based on a single classifier have traditionally been used for brain decoding. The classifiers that have most often been employed in previous works are Gaussian Naive Bayes (GNB), k-nearest neighbor (k-NN), Fisher linear discriminant (FLD) and support vector machines (SVM) [15,13,12,8].

Interestingly, it has been observed in many different contexts [17,18] that a combination or ensemble of classifiers achieves better performance than any of the individual classifiers. This is particularly true when, as with fMRI, the classification task suffers from the 'curse of dimensionality'. Some experiments with ensembles have already been performed on fMRI data. In fact, [19] used

decision tree ensembles for brain decoding of fMRI connectivity, and [14] used the AdaBoost classifier for separating drug-addicted subjects from healthy non-drug-using controls. In [20], a large number of ensembles were used to identify brain patterns corresponding to different visual stimuli, although only for single-subject data. The ensembles used in that work included combinations where the base classifiers were trained on different subsets of the training data (e.g. Bagging, AdaBoost, Random Forest), as well as combinations where the base classifiers were trained on different subsets of the input features (Random Subspace). In [21], Random Forests were also used to decode visual stimuli using different feature subsets, but in this case the feature subsets were selected using Gini Contrast.

Motivated by the success of these experiments, we propose in this paper a combination of classifiers for decoding visual stimuli from multi-subject fMRI data. We will focus on training base classifiers on different subsets of the input features. Usually, the different subsets are chosen randomly (Random Subspace), as in [20]. In our case, we propose to optimize each feature subset for each visual stimulus or class. This is known as the favorite class ensemble method [22] which, to the best of the authors' knowledge, has never been applied to fMRI data. Using different feature subsets creates diversity in the ensemble, which is a necessary component of ensembles [18], but using class-specific features has additional advantages. Firstly, it allows for the interpretation of the specific feature patterns associated with each class. Secondly, it has advantages in terms of classification accuracy. Since features are not equally relevant for different classes (some may be relevant for a particular class and only 'noise' for another), using the smallest subset of features relevant for each class helps to improve accuracy for that particular class. Furthermore, if a certain class is highly dominant, a joint feature selection might not include features relevant for the non-dominant classes, thereby reducing accuracy in the classification of patterns belonging to those non-dominant classes. In addition to optimizing the feature subset for each class, we also choose the optimal feature set size in each case. We expect such an approach to be particularly well suited to the problem of decoding visual stimuli, since they are known to be associated with specific brain areas.

The remainder of this paper is organized as follows: the datasets used in the experiments and the proposed methods are described in Section 2, the results obtained are presented in Section 3, and we conclude with Section 4.

2. Materials and methods

2.1. Datasets

Three empirical fMRI datasets from multiple subjects performing visual tasks with four classes of stimuli were used in this study.

The first dataset corresponds to the mapping of each of the four quadrants of the primary visual cortex, in order to assess the retinotopic organization of this structure (mapping experiment). Each individual stimulus comprised a black and white checkerboard wedge flashing at 8 Hz, as represented in Fig. 1, and the paradigm consisted of a block design alternating 16 s periods of stimulation and fixation, in the order shown in Fig. 1. This dataset was obtained from four healthy subjects in two sessions each, on a 1.5 T Philips system, using BOLD imaging to collect 672 brain volumes with TR = 2000 ms and a voxel resolution of 3.750 × 3.750 × 5.000 mm³, yielding an image size of 64 × 64 × 24.

Fig. 1. Stimuli and paradigm used in the mapping experiment: flashing checkerboard on the four visual field quadrants, Q1, Q2, Q3 and Q4 (top), and sequence of stimuli, interleaved with fixation (fix) in the block design paradigm (bottom).

The second and third datasets correspond to a visual localizer experiment, aimed at identifying the visual brain areas specialized in the recognition of objects, faces and houses (localizer experiment1 and localizer experiment2). The paradigm consisted of a block design alternating 18 s periods of faces, houses, objects and scrambled (noisy) objects, as well as fixation periods, as illustrated in Fig. 2. Localizer experiment1 was obtained from 10 healthy subjects, on a 3.0 T Philips system, using BOLD imaging to collect 118 brain volumes with TR = 3000 ms, 38 slices and a voxel resolution of 2.875 × 2.875 × 3.200 mm³, yielding an image size of 80 × 80 × 38. Localizer experiment2 was obtained from eight healthy subjects, on a 7.0 T Siemens system, using BOLD imaging to collect 112 brain volumes with TR = 3000 ms, 40 slices and a voxel resolution of 2.019 × 2.019 × 2.000 mm³, yielding an image size of 104 × 104 × 40. The two datasets therefore correspond to the same paradigm, but were acquired at different field strengths, with different spatial resolutions and a different total number of brain volumes, which yields different SNRs.

Fig. 2. Stimuli and paradigm used in the localizer experiments: examples of pictures of faces, houses, objects and scrambled objects shown to the subjects (top), and sequence of stimuli, interleaved with fixation (Fix) in the block design paradigm (bottom). In localizer experiment1, the complete sequence was used, while in localizer experiment2 only the first row of the sequence was used.

In all experiments, a high-resolution structural image was collected from each subject using a T1-weighted imaging sequence, for anatomical reference and co-registration purposes.

2.2. Pre-processing

Several pre-processing operations are necessary before the data are fit to be fed to a classifier. Firstly, the datasets were independently pre-processed and analyzed using the FSL software package (http://www.fmrib.ox.ac.uk/fsl). The following pre-processing steps were performed on each BOLD time series: motion correction [23]; non-brain removal [24]; mean-based intensity normalization of all volumes by the same factor; spatial smoothing (Gaussian kernel, 5 mm FWHM); and high-pass temporal filtering (Gaussian-weighted least squares straight line fitting, 50 s cut-off). Secondly, since the datasets are multi-subject, the images were registered to the MNI standard space [25] using the FLIRT tool from the FSL software [23]. The functional images were first registered to the corresponding high-resolution structural image of the same subject using a rigid body transformation (six degrees of freedom). This structural image was then registered to the MNI standard space using a full affine transformation (12 degrees of freedom). Finally, the composition of the two transformation matrices was applied to register the functional images to the MNI space. For the mapping experiment and localizer experiment1 datasets, the high-resolution image of the MNI standard brain available in the FSL package was re-sampled to the resolution of the dataset prior to registration.

2.3. Feature extraction and selection

Our features are based on the percentage signal change (PSC) between the fMRI BOLD signal and its mean value during the baseline condition, corresponding to the fixation periods. Features are extracted by averaging the PSC volumes in the temporal domain, after removing the first volume of each block. One feature per paradigm block is thus obtained. This is a way to increase SNR and simultaneously reduce the impact of the hemodynamic delay.
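As a rough illustration, the block-averaged PSC feature extraction described above can be sketched as follows. This is not the authors' code: the function name, the (volumes × voxels) array layout and the argument conventions are our own assumptions.

```python
import numpy as np

def psc_block_features(bold, block_onsets, block_len, baseline_mask, skip=1):
    """Percent-signal-change features, one vector per paradigm block (sketch).

    bold          : (T, V) array, T volumes x V voxels (assumed layout)
    block_onsets  : first volume index of each stimulation block
    block_len     : number of volumes per block
    baseline_mask : boolean (T,) selecting the fixation volumes
    skip          : volumes dropped at the start of each block, to reduce
                    the impact of the hemodynamic delay
    """
    baseline = bold[baseline_mask].mean(axis=0)           # (V,) mean fixation signal
    psc = 100.0 * (bold - baseline) / baseline            # percent signal change
    feats = [psc[on + skip: on + block_len].mean(axis=0)  # temporal average per block
             for on in block_onsets]
    return np.vstack(feats)                               # (n_blocks, V)
```

For example, a voxel at 110 during a block against a baseline of 100 yields a feature value of 10 (percent).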

After the features have been extracted, we are left with an extremely high dimensional dataset (27 851 features for the mapping experiment, 60 922 for localizer experiment1 and 123 494 for localizer experiment2). Therefore, feature selection is required in order to obtain a more compact dataset containing the most informative features, and also to reduce the otherwise prohibitive computation time.

We considered two different feature selection methods: selecting the most active voxels in the brain, and selecting the most discriminant voxels. In the first case, which we denote by MA, the activity of a voxel is measured for each visual stimulus relative to baseline by a t-test. Since our data are normalized to PSC, the t-test tests the hypothesis that the mean is different from 0 [8]. A separate t-test is performed for each class, the absolute value of the statistic is used for ranking, and the final score of a voxel or feature is its best position in any class. In the second case, our criterion to find the most discriminant features was based on the mutual information (MI) between each feature and the class label variable. The mutual information between two discrete variables x and y is defined as

$$\mathrm{MI}(x, y) = \sum_{x \in \chi} \sum_{y \in \gamma} p(x, y) \log \frac{p(x, y)}{p(x)\,p(y)} \qquad (1)$$

In our case, the class variable y is discrete but the features are continuous variables. Therefore, we divide the continuous input feature space into N = 10 partitions and calculate the mutual information between each feature and the label using the expression given above for the discrete case. The probability density functions involved in the MI calculation are approximated using histograms. Although there is an error inherent to discretization, we do not have to take it into account because we are not interested in the true estimate values but only in how they are sorted.
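The histogram-based MI ranking can be sketched in a few lines. The function names, the plug-in histogram estimator and the top-n helper below are our own assumptions, intended only to make Eq. (1) concrete for one continuous feature against a discrete label.

```python
import numpy as np

def mi_score(feature, labels, n_bins=10):
    """Plug-in MI estimate between one continuous feature and the class
    label: the feature is discretized into n_bins partitions and the
    probabilities in Eq. (1) are replaced by histogram frequencies."""
    edges = np.histogram_bin_edges(feature, bins=n_bins)
    fb = np.clip(np.digitize(feature, edges[1:-1]), 0, n_bins - 1)
    mi = 0.0
    for b in range(n_bins):
        for c in np.unique(labels):
            pxy = np.mean((fb == b) & (labels == c))   # joint frequency
            if pxy > 0:
                px = np.mean(fb == b)
                py = np.mean(labels == c)
                mi += pxy * np.log(pxy / (px * py))
    return mi

def top_n_features(X, labels, n):
    """Rank all features by MI and keep the indices of the n best."""
    scores = np.array([mi_score(X[:, j], labels) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:n]
```

Since only the ranking matters, the discretization bias mentioned above cancels out across features.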

Subsequently, we either select a fixed number of features having the highest score (activity or MI), or a variable number of features yielding a score value above a pre-defined threshold. When a variable number of features is used in the ensemble, the base classifiers may use different numbers of features.

Feature selection is applied both in the case of a single classifier and, in the case of an ensemble, for each individual classifier of the ensemble. In this second case, for classifier Dj, optimized for class wj, the class label variable will have value 1 for all the fMRI data in class wj and value 0 for all the fMRI examples not belonging to class wj.

This method greedily selects the features that are individually most informative. Although multivariate feature selection might be more effective, the drawbacks of such an approach are its extremely high computational cost and sensitivity to overfitting.
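The MA (most active) ranking described earlier in this section can be sketched similarly. The helper below is hypothetical: it runs a one-sample t-test against 0 on each class separately and scores each voxel by its best rank position in any class, as the text specifies.

```python
import numpy as np
from scipy.stats import ttest_1samp

def most_active_ranks(X, labels):
    """MA ranking sketch: per class, rank voxels by |t| of a one-sample
    t-test of the PSC features against 0; a voxel's final score is its
    best (lowest) rank position obtained in any class."""
    n_feat = X.shape[1]
    best_rank = np.full(n_feat, n_feat, dtype=int)
    for c in np.unique(labels):
        t, _ = ttest_1samp(X[labels == c], popmean=0.0, axis=0)
        order = np.argsort(-np.abs(t))      # most active voxels first
        ranks = np.empty(n_feat, dtype=int)
        ranks[order] = np.arange(n_feat)    # this class's rank of each voxel
        best_rank = np.minimum(best_rank, ranks)
    return best_rank                        # lower = more active overall
```

A voxel strongly activated by even a single class therefore ends up near the top of the overall ranking.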

2.4. Ensemble of classifiers

When we are dealing with a high dimensional feature space and the number of examples is small, as in the case of fMRI, accuracy depends crucially on the features that are used. Therefore, we used a favorite class ensemble [22], where each base classifier in the ensemble has a favorite class and uses an optimized feature set for that particular class. Consequently, our ensemble consists of L = c classifiers, where c is the number of classes. All of the individual base classifiers use the same training examples but different feature subsets, which may be overlapping or disjoint, and were allowed to have different sizes. To find the feature subset for classifier Dj, with favorite class wj, we find the features that best discriminate class wj from all the other classes. This procedure is detailed in Section 2.3.

2.4.1. Individual base classifiers

The individual base classifiers used in this study were the Gaussian Naive Bayes (GNB) and the k-nearest neighbor (k-NN).

Let x = (x1, …, xn) denote an fMRI pattern and wj, j = 1, …, c, denote the different visual stimuli or classes.

The GNB classifier estimates the probability of stimulus wj given the fMRI pattern x using Bayes' rule and assuming that all the features xi are conditionally independent:

$$P(w_j \mid x) = \frac{P(w_j)\,\prod_i P(x_i \mid w_j)}{\sum_k P(w_k)\,\prod_i P(x_i \mid w_k)} \qquad (2)$$

The probabilities P(xi|wj) are estimated from the training data. Since they are obtained separately for each feature, GNB is particularly suitable for high dimensional data.
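A minimal numpy sketch of Eq. (2), with per-class, per-feature univariate Gaussians (as used later in Section 3), is given below. The class name and interface are ours, not the authors'; computations are done in log space for numerical stability.

```python
import numpy as np

class GNB:
    """Gaussian Naive Bayes sketch: univariate Gaussian P(x_i|w_j) per
    class and feature, combined by Bayes' rule as in Eq. (2)."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.prior = np.array([np.mean(y == c) for c in self.classes])
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def posterior(self, X):
        # log of the numerator of Eq. (2), one column per class
        log_lik = -0.5 * (np.log(2 * np.pi * self.var)[None]
                          + (X[:, None, :] - self.mu[None]) ** 2
                          / self.var[None]).sum(-1)
        joint = log_lik + np.log(self.prior)
        joint -= joint.max(axis=1, keepdims=True)   # stabilize the exponentials
        p = np.exp(joint)
        return p / p.sum(axis=1, keepdims=True)     # normalized P(w_j | x)
```

Because each feature contributes an independent one-dimensional term, fitting scales linearly in the number of voxels.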

The k-NN classifier is a simple non-parametric method that considers the fMRI pattern x as a point in $\mathbb{R}^n$ and defines the neighbors of x in terms of a distance measure. The class chosen by the k-NN classifier is the most common class among the k training examples nearest to pattern x. A well-known limitation of the k-NN classifier is its sensitivity to noise and irrelevant features, which results in high values of the distance measure even


if the distance between the relevant features is small. In our approach this limitation is alleviated by the selection of the most relevant features (see Section 2.3).

Since this classifier, unlike GNB, does not output posterior probability estimates but only a class label, and the final classifier requires these values (see Section 2.4.2), we estimate P(wj|x) using the distances between x and its neighbors from class wj, as suggested in [18]. Let xi be the ith nearest neighbor of x, and d(x, xi) be the distance between x and xi. Then,

$$P(w_j \mid x) = \frac{\sum_{x_j \in w_j} 1/d(x, x_j)}{\sum_{i=1}^{k} 1/d(x, x_i)} \qquad (3)$$
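Eq. (3) can be sketched directly: among the k nearest training patterns, each class receives the fraction of total inverse distance contributed by its members. The function below is a hypothetical illustration with Euclidean distance, not the authors' implementation.

```python
import numpy as np

def knn_posterior(x, X_train, y_train, classes, k=9):
    """Distance-weighted k-NN posterior estimate of Eq. (3)."""
    d = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nn = np.argsort(d)[:k]                    # indices of the k nearest
    inv = 1.0 / (d[nn] + 1e-12)               # guard against zero distance
    return np.array([inv[y_train[nn] == c].sum() for c in classes]) / inv.sum()
```

The returned vector sums to one, so it can be compared across classifiers in the ensemble like the GNB posterior.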

2.4.2. Final classifier

After the individual base classifiers have been trained, the final classification is obtained without further training. Different combination methods could be used, either based on selection or on fusion of the individual outputs. We assume that classifier Dj, which has favorite class wj, is the best to classify patterns of class wj, so we chose selection. Therefore, in order to classify pattern x we select the classifier Dj with the highest posterior P(wj|x). However, since our assumption that classifier Dj is the best for patterns of class wj may not hold, for instance if base classifiers use overlapping feature subsets, a fusion method, namely the average rule, was also tested.

Fig. 3. Testing accuracy obtained with the ensemble of classifiers and the single classifier, as a function of the number of features, for all experiments, using the GNB and k-NN classifiers and the MI and MA feature selection methods.
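The selection rule just described (pick the favorite-class classifier reporting the highest posterior) can be sketched as follows. The `(feature_idx, posterior_fn)` pairing is a hypothetical interface of our own: each entry stands for one base classifier Dj, restricted to its optimized feature subset and reporting P(wj|x).

```python
import numpy as np

def favorite_class_predict(x, ensemble):
    """Favorite-class ensemble by selection (sketch).

    ensemble : list of (feature_idx, posterior_fn) pairs, one per class;
               classifier D_j sees only x[feature_idx] and returns its
               estimate of P(w_j | x).  The class whose classifier gives
               the highest score wins.  (The average rule mentioned in
               the text would instead combine the scores.)
    """
    scores = [post(x[feat]) for feat, post in ensemble]  # one P(w_j|x) per D_j
    return int(np.argmax(scores))
```

Any base classifier exposing a posterior (the GNB and distance-weighted k-NN of Section 2.4.1) can be plugged in as `posterior_fn`.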

2.5. Evaluation

In order to evaluate the generalization performance of the methods, we used leave-one-subject-out cross-validation, where in each round of cross-validation the training folds are created using data from all but one of the subjects, and the corresponding testing folds are created with data from the subject left out. The testing set accuracy is then averaged over all the folds.
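A minimal sketch of this evaluation loop, assuming a hypothetical `fit`/`predict` interface and a `subjects` array marking which subject produced each example:

```python
import numpy as np

def leave_one_subject_out(X, y, subjects, fit, predict):
    """Leave-one-subject-out cross-validation sketch: each fold trains on
    all subjects but one, tests on the subject left out, and the fold
    accuracies are averaged."""
    accs = []
    for s in np.unique(subjects):
        test = subjects == s
        model = fit(X[~test], y[~test])                          # train without subject s
        accs.append(np.mean(predict(model, X[test]) == y[test])) # test on subject s
    return float(np.mean(accs))
```

Note that, as the next paragraph requires, any feature selection must be re-run inside `fit` on each training fold to avoid leaking the test subject's data.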

Since feature selection is performed within the cross-validation, there will be different feature selection patterns in each training fold. In order to analyze the stability of this pattern, we calculated the overlap between the features selected in all the folds. Commonly used overlap measures only compare pairs of binary images. For instance, the Tanimoto coefficient (TC) calculates the ratio between the intersection and the union of two image patterns. In our case, the number of folds ranges from 4 in the mapping experiment to 10 in localizer experiment1. Therefore, we have to measure overlap between multiple images. For this purpose, we use the metric proposed in [26], which generalizes TC to multiple images by calculating the ratio of the total intersection between all pairs of images to the total union between the pairs. Let Ak and Bk, k = 1, …, K, denote sets of binary images, and let Aki (Bki) denote the value of image Ak (Bk) at voxel i. Then, the generalized Tanimoto coefficient (GTC) is defined as

$$\mathrm{GTC} = \frac{\sum_{\mathrm{pairs},k} b_k \sum_{\mathrm{voxels},i} \min(A_{ki}, B_{ki})}{\sum_{\mathrm{pairs},k} b_k \sum_{\mathrm{voxels},i} \max(A_{ki}, B_{ki})} \qquad (4)$$

where bk is a pair-specific weighting factor, in our case bk = 1. Note that this overlap measure is not as conservative as calculating the ratio of the joint intersection over the joint union for all the images in the set.
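Under these definitions, Eq. (4) can be sketched as below; the `images` argument stands for the stack of binary feature-selection maps (one per fold), and the function name is ours.

```python
import numpy as np

def gtc(images, b=None):
    """Generalized Tanimoto coefficient of Eq. (4): summed pairwise
    intersections over summed pairwise unions of binary maps, with
    optional pair weights b (default b_k = 1, as in the text)."""
    images = np.asarray(images, dtype=float)
    pairs = [(i, j) for i in range(len(images)) for j in range(i + 1, len(images))]
    if b is None:
        b = np.ones(len(pairs))
    inter = sum(bk * np.minimum(images[i], images[j]).sum()
                for bk, (i, j) in zip(b, pairs))
    union = sum(bk * np.maximum(images[i], images[j]).sum()
                for bk, (i, j) in zip(b, pairs))
    return inter / union
```

Identical maps give GTC = 1, while fully disjoint pairs pull the score toward 0 without zeroing it out for the whole set, which is the sense in which the measure is less conservative than a joint intersection-over-union.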

3. Results

This section presents the results of the application of the proposed ensemble of classifiers to empirical fMRI data collected from multiple subjects in three different experiments, each using paradigms with four classes of visual stimuli. Two types of classifiers (GNB and k-NN) and two methods of feature selection (MA and MI) were tested. The classification was performed under the following conditions: (1) for the GNB classifier, we modeled P(xi|wj) as univariate Gaussian distributions, and therefore the mean μi and variance σi are estimated for each feature; (2) for the k-NN classifier, we used the Euclidean distance and varied the number of neighbors, k, between 5 and 10. Since the differences were not significant, only the results for k = 9 are shown. Firstly, a fixed number of features was considered. According to the results obtained, a varying number of features was then considered, based on selecting the features above a pre-specified MI/MA threshold. For each condition, the ensemble of classifiers was compared with the corresponding single classifier. The performance of each classification approach was evaluated in terms of the classification accuracy achieved (Subsection 3.1) and the patterns of selected features (Subsection 3.2).

Fig. 4. Testing accuracy obtained with each of the classifiers in the ensemble, as a function of the number of features, for all experiments, using the GNB and k-NN classifiers and the MI feature selection method (legend: Q1/Faces, Q2/Houses, Q3/Objects, Q4/Scramble).

3.1. Accuracy results

For each condition (fixed or varying number of features) andfor each dataset (mapping experiment, localizer experiment1, loca-

lizer experiment2), classification accuracy differences were testedfor statistical significance by analysis of variance (ANOVA) overboth classifier types (k-NN, GNB), feature selection methods (MI,MA), classification approaches (single, ensemble) and all levelsof the number of features (fixed number of features) or MI/MAthreshold (varying number of features). The accuracy resultsobtained by using fixed numbers of features will first be pre-sented, which will motivate the presentation of the resultsobtained using varying numbers of features. Finally, the accuracyresults will be described in detail for each experiment.

3.1.1. Results for fixed number of features

The testing accuracies obtained by ensemble and correspond-ing single classifiers, as a function of the number of features, areshown in Fig. 3, for each experiment and type of classifier, as well asfor both feature selection methods. In general, the ensemble ofclassifiers yields improved overall accuracy relative to the singleclassifier, for both types of classifiers and feature selection methods.

A significant main effect of the classification approach wasfound ðpo0:001Þ, with the ensemble clearly outperforming thesingle classifier. Main effects were also found for classifier typeðpo0:001Þ, but not for feature selection method. As expected,significant interactions were found between the classificationapproach and the number of features ðpo0:050Þ, with greaterdifferences between ensemble and single classifiers beingobserved for smaller numbers of features. Significant interactionswere also found between the classification approach and the

0 500 1000 15000.85

0.9

0.95

1kNN classifier

0 500 1000 15000.5

0.6

0.7

0.8

0.9

0 500 1000 15000.4

0.6

0.8

cted Features

Q3/Objects Q4/Scramble

nction of the number of features, for all experiments, using the GNB and k-NN

Fig. 5. Number of selected features as a function of the MI feature selection threshold, for each classifier in the ensemble (Q1/Faces, Q2/Houses, Q3/Objects, Q4/Scramble) as well as the single classifier (All Classes), for all experiments.

C. Cabral et al. / Pattern Recognition 45 (2012) 2064–2074

classifier type (p < 0.05), for the mapping experiment but not the other experiments. The results for the mapping experiment do not clearly show an advantage for the classifier ensemble, probably due to the fact that the classification accuracy is in this case very close to 100%, producing a ceiling effect. Nevertheless, all datasets, including this one, show a particularly large improvement with ensembles relative to single classifiers for very small numbers of features.
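To make the significance testing concrete, a one-factor check of the classification-approach effect can be sketched as below with hypothetical per-fold accuracies; the paper's actual ANOVA crossed approach, classifier type, feature selection method and number of features, which calls for a full factorial model rather than this simplified illustration:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-fold accuracies for the two classification approaches
# (illustrative numbers only, not the paper's data).
rng = np.random.default_rng(1)
acc_single = rng.normal(0.65, 0.03, 30)
acc_ensemble = rng.normal(0.75, 0.03, 30)

# One-way ANOVA on the single factor "classification approach".
f_stat, p_value = f_oneway(acc_single, acc_ensemble)
print(p_value < 0.001)  # a clear main effect of classification approach
```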

The previous figure showed the average performance of the classifiers considering all the stimuli. However, if we observe the performance obtained by each distinct classifier in the ensemble, shown in Fig. 4 for both feature selection methods, we can see that the accuracy varies considerably with the stimulus type associated with the classifier. In addition, the maximum accuracy for each stimulus is obtained for different numbers of features. This finding suggests that the ensemble of classifiers can be improved by selecting a different number of features to be used by each base classifier. One way to allow different base classifiers to use different numbers of features is to select all the features yielding a MI/MA value above a pre-defined threshold. The results obtained in these experiments are presented in the next section.

3.1.2. Results for varying number of features

Different sized sets of features were identified for each individual classifier, by selecting the voxels yielding a MI/MA value above a pre-defined threshold. The dependence of the classification accuracy on the threshold used was investigated by systematically varying this value between 10% and 95% of its maximum value, in steps of 5%. The number of features selected as a function of this threshold, for each classifier in the ensemble, as well as the single classifier, is shown in Fig. 5, for each experiment, using the MI feature selection method (similar behavior was observed using the MA method). As expected, the number of features decreases with the threshold used in all cases. The corresponding classification accuracies are shown in Fig. 6, for both classifiers and feature selection methods. The accuracy levels achieved by using a varying number of features are not, overall, different from the ones obtained using a fixed number of features. However, a dependence on the feature selection method can now be observed.
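The thresholding scheme can be sketched as follows, assuming precomputed per-voxel MI or MA scores (names are illustrative, not from the paper):

```python
import numpy as np

def select_features(scores, frac):
    """Indices of voxels whose MI/MA score is at least `frac` of the maximum."""
    return np.flatnonzero(scores >= frac * scores.max())

# Toy relevance scores for four voxels.
scores = np.array([0.90, 0.05, 0.50, 0.30])
print(select_features(scores, 0.5))  # -> [0 2]

# Sweep the threshold from 10% to 95% of the maximum in steps of 5%,
# as in the experiments; higher thresholds keep fewer voxels.
counts = [select_features(scores, frac).size
          for frac in np.arange(0.10, 0.951, 0.05)]
```

Because each base classifier applies the same fractional threshold to its own one-vs-rest score map, the resulting feature set size automatically adapts to how concentrated the evidence for that stimulus is.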

Again, the ensemble of classifiers yields improved overall accuracy relative to the single classifier, for both types of classifiers and feature selection methods. A significant main effect of the classification approach was found (p < 0.001), with the ensemble clearly outperforming the single classifier. Main effects were also found for the classifier type (p < 0.05) and the feature selection method (p < 0.001). As expected, significant interactions were found between the classification approach and the feature selection threshold (p < 0.001), with greater differences between ensemble and single classifiers being observed for higher thresholds. Significant interactions were also found between the classification approach and the classifier type (p < 0.05) and the feature selection method (p < 0.001), for the mapping experiment but not the other experiments. Again, the results obtained for the mapping experiment do not show exactly the same trend as for the other experiments, due to a ceiling effect. Nevertheless, the advantage of the ensemble clearly appears for high thresholds.

3.1.3. Combination of ensemble classifiers

The two different methods tested for the combination of the classifiers in the ensemble were evaluated in terms of the posterior probabilities obtained for each class, for the mapping experiment and the localizer experiment1. In Fig. 7, the posterior probabilities obtained for each stimulus with the proposed combination are plotted against the ones obtained with the average rule. The superiority of the proposed rule is clear since, for the large majority of the samples, the posterior probabilities of each class are much higher for the patterns of that class than for all the others, and are well above the straight line where both combination methods are equal. This indicates that the individual classifiers are not equally able to classify all classes, but are instead best at classifying the class that they were trained for. This is particularly evident for the mapping experiment, where each stimulus is spatially separate from the others, in contrast with the localizer experiments, which exhibit some degree of overlap of the brain areas activated by each stimulus. This observation suggests that our proposed method for the combination of classifiers into an ensemble provides insights into the distributed nature of the visual representations of objects in the brain.
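The two combination rules compared in Fig. 7 can be sketched as follows, where base classifier k outputs a posterior matrix and is specialized for class k (a sketch of our reading of the rules, with illustrative names):

```python
import numpy as np

def combine_proposed(posteriors):
    """Proposed rule: class c is scored by the posterior that classifier c
    (trained one-vs-rest for class c) assigns to its own favourite class;
    the final label is the class with the best score."""
    scores = np.stack([p[:, c] for c, p in enumerate(posteriors)], axis=1)
    return scores.argmax(axis=1)

def combine_average(posteriors):
    """Average rule: average the posterior matrices over all base
    classifiers and take the most probable class."""
    return np.mean(posteriors, axis=0).argmax(axis=1)

# Toy example, 3 classes and 2 samples: classifier k is confident
# mainly about its own class.
posteriors = [
    np.array([[0.9, 0.05, 0.05], [0.2, 0.4, 0.4]]),  # classifier for class 0
    np.array([[0.3, 0.4, 0.3], [0.1, 0.8, 0.1]]),    # classifier for class 1
    np.array([[0.3, 0.3, 0.4], [0.3, 0.3, 0.4]]),    # classifier for class 2
]
print(combine_proposed(posteriors))  # -> [0 1]
```

The proposed rule discards each classifier's opinion about classes it was not specialized for, which is exactly what makes it robust when the base classifiers are good only at their own class.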

3.1.4. Results by experiment

In the mapping experiment, the testing accuracy obtained by the single classifiers and by the ensembles is close to 100%. This probably precludes the detection of an improvement using the ensemble of classifiers, as a result of a ceiling effect. Possible explanations for the fact that the results are best for this dataset, compared to the two other datasets analyzed in this study, are its higher SNR, the smaller number of subjects yielding reduced variability, and the clearer distinction between the brain activation patterns produced by the four types of stimuli in this case. Nevertheless, a reduction of approximately 350 was observed in the number of selected features providing maximum classification accuracy between the single and ensemble classification approaches.

Fig. 6. Testing accuracy obtained with the ensemble of classifiers and the single classifier, as a function of the feature selection threshold, for all experiments, using the GNB and k-NN classifiers and the MI and MA feature selection methods.


In the localizer experiment1, the ensembles outperformed the single classifiers, with differences in accuracy of up to approximately 20%. The best performance was obtained by the ensemble of k-NN classifiers (83%). On average, a reduction of 175 was observed in the number of selected features providing maximum classification accuracy from the single to the ensemble classification approach. We can also notice that the performance of the classifiers is not the same for predicting different stimuli. In fact, for the GNB ensemble, the greatest increase in performance was achieved for the faces stimulus and the smallest for the objects.

As in the previous experiment, in the localizer experiment2 the ensembles outperformed the single classifiers, with differences in accuracy of up to approximately 20%. The best performance was again obtained by the ensemble of k-NN classifiers (81%). In this case, an average reduction of 2365 was observed in the number of selected features providing maximum classification accuracy from the single to the ensemble classification approach. Comparing these classification results with those obtained with the previous dataset at 3 T, it can be noticed that they are slightly poorer. This could be explained by the reduced SNR of this dataset. In fact, although a higher field strength was used, which would increase the SNR by more than a factor of two, the voxel size was reduced by a factor of approximately 3 and the number of repetitions was halved, which is expected to reduce the SNR overall.
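The rough SNR argument above can be made concrete under the common first-order approximation that SNR scales with field strength, voxel volume, and the square root of the number of repetitions (the true field dependence is more complicated; the numbers below are purely illustrative):

```python
import math

# Illustrative first-order SNR scaling: SNR ~ B0 * V_voxel * sqrt(N_rep).
field_gain = 7.0 / 3.0                    # 7 T vs 3 T: a bit more than twofold
voxel_factor = 1.0 / 3.0                  # voxel size reduced by approximately 3
repetition_factor = 1.0 / math.sqrt(2.0)  # half the number of repetitions

relative_snr = field_gain * voxel_factor * repetition_factor
print(round(relative_snr, 2))  # net SNR reduction despite the higher field
```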

3.2. Feature selection results

In this subsection, we first analyze the patterns of selected features in terms of their stability, by comparing their overlaps across folds. These patterns are then further analyzed and interpreted in terms of their neuronal significance, i.e. in terms of their relation to the localization of the relevant brain functional anatomy, for each experiment.

3.2.1. Feature selection overlap

The overlap of the selected features across folds is shown in Fig. 8, for all experiments, using the MI feature selection method (similar results were obtained using the MA method). We observed maximum overlaps of approximately 80% for the mapping experiment and localizer experiment1, and approximately 70% for the localizer experiment2, which indicates good stability of the brain activity patterns selected with our approach. As expected, the overlap decreased with the feature selection threshold, and hence with decreasing numbers of selected features. The slightly smaller overlap observed for localizer experiment2 may be explained by the smaller voxel size and reduced SNR in this case, leading to noisier maps. The degree of overlap also varied with the stimulus type, depending on the size of the associated brain areas. In fact, the four quadrants in the mapping experiment have

Fig. 7. Scatterplots of the posterior probabilities obtained for each class (Faces, Houses, Objects, Scrambled; Q1–Q4), as a function of two methods of combination of the classifiers in the ensemble, our proposed rule and the average rule, for the mapping experiment and localizer experiment1.

Fig. 8. Overlap of selected features across folds, as a function of the feature selection threshold, using the MI method, for all experiments, for the single classifier and for each classifier in the ensemble (Faces, Houses, Objects, Scrambled).


similar sizes and all exhibit the same overlap, while the four objects in the localizer experiments are represented in networks of brain areas with considerably different extents and exhibit different degrees of overlap. In particular, the highest overlap is achieved for the face stimuli, which is consistent with the small and well-localized brain area associated with the response to faces, the FFA.

3.2.2. Feature selection pattern analysis

In Fig. 9, the features selected for the single classifier and the ensemble of classifiers are shown, for each experiment, using the optimal feature selection threshold in each case, for the MI method and the GNB classifier (the results using the MA feature selection method and the k-NN classifier were similar in this respect).

For the mapping experiment, we can see that the voxels selected as features for the single classifier cover the primary visual cortex, as expected. When examining the voxels selected for each individual classifier, however, slightly distinct sets of voxels are found. In each case, the selected voxels are mostly located in the region of the primary visual cortex associated with the corresponding visual field quadrant, according to its retinotopic organization. In particular, the inferior right and left quadrants (quadrants 1 and 2 in Fig. 1) are known to activate the left and right hemispheres, respectively. Accordingly, the features selected for the corresponding classifiers are biased towards the left and right primary visual cortex (blue and light blue in Fig. 9).

For both localizer experiment1 and localizer experiment2, the selected features are mainly localized in the visual cortex and are spread out into distributed representations of the visual stimuli, as expected. It can be observed that the feature maps for the ensemble are more compact, which attenuates overfitting and explains the improvements in performance. The features selected for the individual classifiers exhibit distributed patterns revealing the specific network of brain areas associated with each visual stimulus. In particular, the FFA, PPA and LOC brain areas can be identified. For the localizer experiment2, the features selected for classification exhibit a noisier pattern than those obtained for the localizer experiment1. This results from the lower SNR, as well as the smaller voxel size used. Nevertheless, a similar trend is observed in terms of the distinction across the four classifiers.

Fig. 9. Features selected for each experiment, for an optimal threshold using the MI feature selection method and the GNB classifier. Classes: Q1/faces (blue), Q2/houses (red), Q3/objects (green) and Q4/scrambled objects (yellow). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)

4. Conclusions

This paper analyzed an ensemble of classifiers for decoding visual stimuli from fMRI data, where each base classifier in the ensemble has a favorite class or stimulus and uses an optimized feature set for that particular stimulus. This optimization was based on selecting, for each base classifier, the subset of the input features that best discriminates its stimulus from all of the others, and also on choosing the optimal feature set size. The method was applied to three fMRI datasets of multiple subjects performing visual tasks with four classes of stimuli. Despite our greedy univariate feature selection method, the ensembles outperformed the single classifiers. Our results indicate that this kind of ensemble of classifiers may provide an advantageous alternative to commonly used single classifiers, particularly when decoding stimuli associated with specific brain areas.

Acknowledgments

This work was supported by FCT (ISR/IST plurianual funding) through the PIDDAC Program funds and Project PTDC-SAU-BEB-65977-2006. We acknowledge support by GinoEco for collection of the 1.5 T data, support by Sociedade Portuguesa de Ressonancia Magnetica (SPRM) for collection of the 3.0 T data, and support by the Centre d'imagerie biomedicale (CIBM) of the UNIL, EPFL, UNIGE, CHUV and HUG and the Leenaards and Jeantet Foundations for collection of the 7.0 T data. We are grateful to the anonymous reviewers for their helpful comments and suggestions.

References

[1] P. Jezzard, P.M. Matthews, S.M. Smith, Functional Magnetic Resonance Imaging: An Introduction to Methods, Oxford Medical Publications, 2006.

[2] K.J. Friston, A.P. Holmes, K.J. Worsley, J.B. Poline, C. Frith, R.S.J. Frackowiak, Statistical parametric maps in functional imaging: a general linear approach, Human Brain Mapping 2 (1995) 189–210.

[3] M. Silveira, P. Figueiredo, Joint fMRI brain activation detection and segmentation using level sets, in: 32nd IEEE EMBS Annual International Conference, 2010.

[4] C. Goutte, P. Toft, E. Rostrup, F.A. Nielsen, L.K. Hansen, On clustering fMRI time series, NeuroImage 9 (3) (1999) 298–310.

[5] M.J. McKeown, S. Makeig, G.G. Brown, T.-P. Jung, S.S. Kindermann, A.J. Bell, T.J. Sejnowski, Analysis of fMRI data by blind separation into independent spatial components, Human Brain Mapping 6 (3) (1998) 160–188.

[6] S.-C. Ngan, X. Hu, Analysis of functional magnetic resonance imaging data using self-organizing mapping with spatial connectivity, Magnetic Resonance in Medicine 41 (5) (1999) 939–946.

[7] K.A. Norman, S.M. Polyn, G.J. Detre, J.V. Haxby, Beyond mind-reading: multi-voxel pattern analysis of fMRI data, Trends in Cognitive Sciences 10 (9) (2006) 424–430.

[8] F. Pereira, T. Mitchell, M. Botvinick, Machine learning classifiers and fMRI: a tutorial overview, NeuroImage 45 (Suppl. 1) (2009) S199–S209.

[9] K. Grill-Spector, R. Malach, The human visual cortex, Annual Review of Neuroscience 27 (1) (2004) 649–677.

[10] R.B.H. Tootell, N.K. Hadjikhani, W. Vanduffel, A.K. Liu, J.D. Mendola, M.I. Sereno, A.M. Dale, Functional analysis of primary visual cortex in humans, Proceedings of the National Academy of Sciences of the United States of America 95 (3) (1998) 811–817.

[11] J.V. Haxby, M. Ida Gobbini, M.L. Furey, A. Ishai, J.L. Schouten, P. Pietrini, Distributed and overlapping representations of faces and objects in ventral temporal cortex, Science 293 (5539) (2001) 2425–2430.

[12] T.M. Mitchell, R. Hutchinson, R.S. Niculescu, F. Pereira, X. Wang, M. Just, S. Newman, Learning to decode cognitive states from brain images, Machine Learning 57 (1) (2004) 145–175.

[13] X. Wang, R. Hutchinson, T.M. Mitchell, Training fMRI classifiers to detect cognitive states across multiple human subjects, in: NIPS03, 2003.

[14] L. Zhang, D. Samaras, D. Tomasi, N. Volkow, R. Goldstein, Machine learning for clinical diagnosis from functional magnetic resonance imaging, in: IEEE International Conference on Computer Vision and Pattern Recognition, 2005, pp. I:1211–I:1217.

[15] J.M. Miranda, A.L.W. Bokde, C. Born, H. Hampel, M. Stetter, Classifying brain states and determining the discriminating activation patterns: support vector machine on functional MRI data, NeuroImage 28 (4) (2005) 980–995.

[16] Y. Fan, D. Shen, C. Davatzikos, Detecting cognitive states from fMRI images by machine learning and multivariate classification, in: Proceedings of the Conference on Computer Vision and Pattern Recognition Workshop, 2006.

[17] T.G. Dietterich, Ensemble methods in machine learning, in: MCS '00: Proceedings of the First International Workshop on Multiple Classifier Systems, 2000, pp. 1–15.

[18] L.I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms, Wiley-Interscience, 2004.

[19] J. Richiardi, H. Eryilmaz, S. Schwartz, P. Vuilleumier, D. Van De Ville, Decoding brain states from fMRI connectivity graphs, NeuroImage 56 (2) (2011) 616–626.

[20] L.I. Kuncheva, J.J. Rodríguez, Classifier ensembles for fMRI data analysis: an experiment, Magnetic Resonance Imaging 28 (4) (2010) 583–593.

[21] G. Langs, B.H. Menze, D. Lashkari, P. Golland, Detecting stable distributed patterns of brain activation using Gini contrast, NeuroImage 56 (2) (2010) 497–507.

[22] N.C. Oza, K. Tumer, Input decimation ensembles: decorrelation through dimensionality reduction, 2001, pp. 238–247.

[23] M. Jenkinson, P. Bannister, M. Brady, S. Smith, Improved optimization for the robust and accurate linear registration and motion correction of brain images, NeuroImage 17 (2) (2002) 825–841.

[24] S.M. Smith, Fast robust automated brain extraction, Human Brain Mapping 17 (3) (2002) 143–155.

[25] J.L. Lancaster, D. Tordesillas-Gutierrez, M. Martinez, F. Salinas, A. Evans, K. Zilles, J.C. Mazziotta, P.T. Fox, Bias between MNI and Talairach coordinates analyzed using the ICBM-152 brain template, Human Brain Mapping 28 (11) (2007) 1194–1205.

[26] W.R. Crum, O. Camara, D.L.G. Hill, Generalized overlap measures for evaluation and validation in medical image analysis, IEEE Transactions on Medical Imaging 25 (11) (2006) 1451–1461.

Carlos Cabral received the master's degree in Biomedical Engineering from the Technical University of Lisbon, Portugal, in 2010. His research interests are in the area of pattern recognition and machine learning.

Margarida Silveira received the E.E. and Ph.D. degrees from the Technical University of Lisbon, Portugal, in 1994 and 2004, respectively. Currently, she is an Assistant Professor with the Electrical Engineering Department, Instituto Superior Tecnico, and a Researcher at the Institute for Systems and Robotics, Lisbon, Portugal. Her research interests are in the areas of image processing, computer vision and pattern recognition.

Patricia Figueiredo graduated in Physics and Engineering from the Technical University of Lisbon, Portugal, in 1997, and received the D.Phil. degree in Clinical Medicine from the University of Oxford, U.K., in 2003. She then joined the Faculty of Medicine of the University of Coimbra, Portugal, as a postdoctoral researcher and, since 2007, she is an Assistant Professor with the Department of Bioengineering, Instituto Superior Tecnico, and a Researcher at the Institute for Systems and Robotics, Lisbon, Portugal. Her research interests are in the areas of brain mapping, magnetic resonance imaging and biophysics.