
J Med Syst (2012) 36:2675–2688. DOI 10.1007/s10916-011-9742-x

ORIGINAL PAPER

Decision Support Algorithm for Diagnosis of ADHD Using Electroencephalograms

Berdakh Abibullaev · Jinung An

Received: 22 February 2011 / Accepted: 23 May 2011 / Published online: 15 June 2011
© Springer Science+Business Media, LLC 2011

Abstract Attention deficit hyperactivity disorder (ADHD) is a complex brain disorder which is usually difficult to diagnose. As a result, many reports in the literature describe an increasing rate of misdiagnosis of ADHD as other types of brain disorder. There is also a risk of normal children being labeled as ADHD if practical diagnostic criteria are not supported. To this end, we propose a decision support system for diagnosing ADHD through brain electroencephalographic signals. Ten children participated in this study; 7 of them were diagnosed with ADHD and the remaining 3 children formed the normal group. The main goal of this study is to present a supporting diagnostic tool that uses signal processing for feature selection and machine learning algorithms for diagnosis. In particular, for feature selection we propose an information theoretic approach based on entropy and mutual information measures. We propose a maximal discrepancy criterion for selecting the most distinguishing features of the two groups, as well as a semi-supervised formulation for efficiently updating the training set. Further, a support vector machine classifier is trained and tested to identify robust markers in the EEG patterns for accurate diagnosis of the ADHD group. We demonstrate that the proposed approach

B. Abibullaev · J. An (B)
Daegu Gyeongbuk Institute of Science and Technology, Sangri 50-1 Hyonpung Dalseon-Gun, Daegu, 711-873, Korea
e-mail: [email protected]

B. Abibullaev, e-mail: [email protected]

provides higher accuracy in the diagnostic process of ADHD than the few currently available methods.

Keywords EEG · ADHD · Mutual information · Shannon theory · SVM · ADHD diagnosis

Introduction

Attention-deficit hyperactivity disorder (ADHD) is one of the most common neurological disorders, affecting 3–7% of school-age children and characterized by developmentally inappropriate levels of inattention, impulsivity, and/or hyperactivity. It is considered to be a chronic condition, with 3 to 5% of diagnosed individuals continuing to have symptoms into adulthood [1, 2]. It is also considered a developmental disorder accompanied by learning disabilities, depression, anxiety and conduct disorder. The etiology of ADHD is still unknown, and the disorder may have several different causes. Scientists have studied, for example, the relation of ADHD to abnormal brain function, morphologic brain differences, and electroencephalographic (EEG) patterns [3–5]. The diagnosis of ADHD is based on the presence of particular behavioral symptoms that are judged to cause significant impairment in an individual's functioning, and not on the results of a specific task. Unfortunately, there are no objective laboratory tests (urine, blood, x-ray or psychological analysis) that can support the classification of children as ADHD or not [6]. Therefore, the behavioral symptoms of ADHD can easily be confused, even by mental health professionals, with the routine actions of children not diagnosed as ADHD. Recently, one study reported that approximately one million children


in the USA are misdiagnosed when they receive an ADHD diagnosis [7]. Quantitative EEG studies have been conducted to address the problems mentioned and to support the diagnosis of ADHD. These studies often analyze EEG abnormalities such as slow wave activity, epileptiform activity, specific EEG frequency bands, event related potentials and coherence measures. They are based mainly on the finding that individuals with ADHD have a distinctive pattern of brain electrical activity, characterized by changes in the low/high frequency waves of the EEG; these changes in EEG frequency content are then associated with the ADHD diagnosis. Accordingly, many studies have been published. For example, recent research findings report that the EEG of ADHD children shows fairly consistent differences in brain electrical activity when compared to normal children, particularly regarding frontal and central theta activity, which is associated with underarousal and indicative of decreased cortical activity [3, 5]. Most studies find excess slow brain activity (theta) and decreased fast brain activity (beta). Theta EEG activity is often associated with an "inattentive" or dreamy state, whereas beta activity is often seen when the brain is busy, for instance solving a cognitive task [8–13]. It was found that children with ADHD showed increased theta power, slight elevations in frontal alpha power, and diffuse decreases in beta mean frequency. Increased theta power is among the most consistent findings in the ADHD EEG literature, indicating that cortical hypo-arousal is a common neuropathological mechanism in ADHD. All of these studies require long term analysis of EEG signals (e.g. visual inspection and analysis) and are subject to variation. Although machine learning methods have been used in the analysis of EEG signals for decades, only a few studies with practical application have been reported to support the diagnosis of ADHD. For instance, Mueller, A. et al. analyzed event related potentials (ERP) of the EEG signal for the automatic discrimination of an ADHD group from normal group subjects. The authors proposed an independent component analysis based feature extraction method and a support vector classification algorithm for discrimination. The study reported that the classification accuracy of their approach reaches 92% on average, with 90% sensitivity and 94% specificity [14]. Another study, conducted by Anduradha, J. et al., also reported on the applicability of the SVM method for the automatic diagnosis of ADHD. They used the same classification scheme with a different preprocessing and feature extraction stage [15]. Recently, Ahmadlou, M. et al. proposed a new approach using wavelet neural networks. The study demonstrates the efficiency of using wavelet transforms for feature extraction and artificial neural networks for

classification between the ADHD and normal groups [16]. Most of these studies use a fixed number of features from a limited number of ADHD or normal subjects. As can be noticed, these studies generally employ four stages: (1) data acquisition, (2) preprocessing (denoising, filtering, etc.), (3) feature extraction and finally (4) classification. The third step is perhaps the most significant, since it determines to a high degree the overall performance of the classification algorithm. A feature extraction method can be considered successful if the resulting features describe the object in the analyzed signal uniquely; as a result one can achieve efficiency in classification accuracy. It is usually hard to find the most informative or discriminative features of ADHD children due to the variability of the disorder and the limited number of available subjects in a study. Besides, one should decide which type of features to seek (such as abnormal EEG waves, ERPs, or frequency content), which is usually done by an expert in the domain. Sometimes even a human expert may not be able to construct features that distinguish ADHD from the normal group. There is a need for an adaptive algorithm that is partially informed with the available information (features) about ADHD and then tries to find a good choice of features based on the information available. In this paper we propose a novel adaptive algorithm using information theory and statistical learning theory in order to detect robust EEG features and classify subjects accordingly into the ADHD or normal group. In particular, we use a mutual information measure to extract the dominant features of the ADHD and normal groups, and we further extend our algorithm to a semi-supervised feature selection method. In the semi-supervised algorithm, the previously selected training set can be updated provided new useful EEG characteristics (not included in the training set) become available. The other advantage is that we try to reduce the redundant EEG features associated with both groups and minimize the relevance between the ADHD and normal groups. In the final step, support vector machines are implemented for classification of the two groups. To our knowledge this is the first study to report such an intelligent algorithm. All of the EEG recordings were obtained from 10 children: 7 of them were diagnosed with ADHD according to the Diagnostic and Statistical Manual of Mental Disorders (DSM-IV; APA, 1994), and 3 were normal children. Our ultimate goal is to explore robust predictive features of EEG that will minimize the misdiagnosis of ADHD with other types of disorder or with normal children. Specifically, we report good sensitivity and specificity measures. This is our preliminary attempt to use EEG to support the diagnosis of ADHD.


Fig. 1 Flowchart of the methodology

Methods and materials

The methodology proposed in this work is summarized in Fig. 1. It consists of selecting features and defining maximal discrepancy subsets of the two-class data set, namely ADHD and Normal. Further, semi-supervised feature selection is performed on the data introduced to the training set; for this step we propose a new algorithm which updates the previously selected training set with new, unknown data. In the final step, a support vector machine classifier is implemented to classify the data into Normal and ADHD groups.

Participants

In this study, 10 children aged between 7 and 12 participated. Participants had a full-scale WISC-III IQ score of 85 or higher. The decision to include children in the ADHD group was based on a clinical evaluation by pediatricians and psychologists and their mutual agreement on the diagnosis. Clinical interviews incorporated information from as many sources as were available. These included a history given by a parent or school reports for the past 12 months, reports from any other health professionals, and behavioral observations during the assessment. Children were excluded from the ADHD group if they had a history of a problematic prenatal, perinatal or neonatal period, a disorder of consciousness, a head injury resulting in cognitive deficits, a history of central nervous system diseases, convulsions or a history of convulsive disorders, paroxysmal headaches or tics, or an anxiety or depressive disorder. All children were required to meet the diagnostic criteria for ADHD of the Diagnostic and Statistical Manual of Mental Disorders [17], including current symptoms and a retrospective diagnosis of

childhood ADHD. In addition, we used the Korean version of the Child Behavior Checklist (K-CBCL) [18] and WISC-III evaluation methods [19]. Participants took the K-CBCL to check for externalized and internalized behavior problems. We set the cut-off score to 70T in all subtests and excluded participants who scored over 70T on the delinquent rule-breaking behavior, somatic complaints, withdrawn, or anxious/depressed subscales of the K-CBCL. The control group consisted of children from local schools and community groups. Control participants took part in a clinical interview and completed a self-report rating on ADHD and an interview with a parent and teacher similar to that of the ADHD group.

Data acquisition

EEG measurements of the participants were obtained using a multichannel g.tec EEG acquisition system. The sampling rate of the acquired data is 256 Hz and the data is digitized using a 16-bit A/D converter. The 9-mm thin disk Ag/AgCl electrodes were mounted inside the cap with bipolar references behind the ears. Impedance levels were kept below 5 kOhm. Each signal was amplified and filtered using a 1–40 Hz band-pass filter. A four-pole Butterworth filter was used as a low-pass filter and as an anti-aliasing scheme. The recordings were obtained from the frontal region according to the 10–20 international system. The frontal sites were the left frontal (Fp1, F3, F7), midline frontal (Fpz, Fz), right frontal (Fp2, F4, F8), and midline central (Cz) electrodes.
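As a rough offline counterpart to the filtering described above, the following is a minimal sketch of a fourth-order (four-pole) Butterworth band-pass at 1–40 Hz for a 256 Hz channel, written with SciPy. The function name and the zero-phase filtering choice are illustrative assumptions, not the authors' acquisition pipeline (their anti-aliasing was done in hardware).

```python
# Minimal sketch: digital 1-40 Hz Butterworth band-pass for one EEG channel.
import numpy as np
from scipy.signal import butter, filtfilt

def bandpass_eeg(signal, fs=256.0, low=1.0, high=40.0, order=4):
    """Zero-phase band-pass filtering of a single-channel EEG trace."""
    nyq = fs / 2.0
    b, a = butter(order, [low / nyq, high / nyq], btype="band")
    return filtfilt(b, a, signal)   # forward-backward filtering avoids phase shift

# Example: filter two minutes of a synthetic channel sampled at 256 Hz
raw = np.random.randn(int(256 * 120))
clean = bandpass_eeg(raw)
```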

Experimental tasks consisted of performing a cognitive task to evaluate focused attention. In particular, each child performed a focused attention task to select one object among multiple choices. Our tasks were designed to evaluate the ability of the participants to discriminate relevant from irrelevant information (i.e.,


the ability to focus attention). The task is considered a variant of the widely used CPT. During the task, four objects are presented on the screen. The objects are pictures of fruits, such as an apple, watermelon, strawberry and banana. The tasks are presented in three different forms: (1) a voice is activated randomly, providing the name of a fruit (while the fruits appear on the screen), and the participant should select it accordingly; this task is designed to evaluate visual focused attention. (2) The same procedure is performed, the difference being that the figures appear on the screen after the name of the fruit is asked. (3) This cognitive task is designed to evaluate the focused selective attention of participants: participants hear the names of the fruits and, in four blocks, the names of the fruits are pronounced again without providing any particular figure; the participant should then select the correct block for the given task. The task timing consists of 12 seconds of pre-task, and the task (stimulus) duration was 1,200 ms. Each task was performed 10 times on different days. This is part of our long-term research, which lasted 6 months. Our cognitive tasks include other types of tasks, such as focused, selective and sustained tasks; however, we selected only the data from the focused attention tasks for this study.

Feature extraction

For feature selection, our problem can be formalized as follows. Let {(x1, y1), (x2, y2), ..., (xn, yn)} be a set of training patterns, where x is the supervised selected input feature vector and y is the random pattern, x, y ∈ X, both taken from a nonempty set. Our goal is to find the maximally relevant (predictive) features of the y-patterns with respect to the x-patterns and to assign them to the labeled class accordingly: for example, EEG feature patterns of predefined ADHD/NORMAL children versus randomly recorded EEG patterns. This should improve the classification accuracy of the classifier used. Our approach to feature selection is based on power spectrum features, since EEG power based features are the most informative and the most commonly used in clinical ADHD diagnosis. Specifically, for the training set, we computed power spectral density features of each EEG channel, obtained by the Fast Fourier Transform (FFT) with a Hann window using 256 data points with 50% overlap. The EEG signal length is 2-minute epochs from each channel. The resulting power spectral density P(f) is then divided into four frequency bands: delta (0.5–3.5 Hz), theta (3.5–7.5 Hz), alpha (7.5–13 Hz) and beta (13–30 Hz), for both

absolute and relative power, as well as the total power of the EEG (1.5–30 Hz). Ratios were also calculated between frequency bands by dividing the power of the slower frequency band by the power of the faster frequency band; these were calculated for the theta/alpha and theta/beta frequencies. Based on the studies reported in the literature, Table 1 was compiled. We selected each frequency band, and the goal is to find the most dominant signal features in ADHD children or vice versa. Therefore, our analysis should employ a hybrid method that extracts only the most informative and relevant features with minimum redundancy with respect to the known dataset.
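A minimal sketch of this feature computation is given below, assuming Welch's method as the practical form of Hann-windowed, 256-point, 50%-overlap FFT averaging. The band edges follow the text; the function and dictionary names are illustrative, not taken from the paper.

```python
# Sketch: band powers, relative powers and the two slow/fast ratios for one channel.
import numpy as np
from scipy.signal import welch

BANDS = {"delta": (0.5, 3.5), "theta": (3.5, 7.5),
         "alpha": (7.5, 13.0), "beta": (13.0, 30.0)}

def band_power_features(x, fs=256.0):
    # Power spectral density: Hann window, 256 samples per segment, 50% overlap
    f, pxx = welch(x, fs=fs, window="hann", nperseg=256, noverlap=128)
    total_mask = (f >= 1.5) & (f <= 30.0)
    total = np.trapz(pxx[total_mask], f[total_mask])        # total EEG power (1.5-30 Hz)
    feats = {}
    for name, (lo, hi) in BANDS.items():
        sel = (f >= lo) & (f < hi)
        feats[name] = np.trapz(pxx[sel], f[sel])             # absolute band power
        feats["rel_" + name] = feats[name] / total            # relative band power
    feats["theta_alpha"] = feats["theta"] / feats["alpha"]    # slower / faster ratios
    feats["theta_beta"] = feats["theta"] / feats["beta"]
    return feats
```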

In the next section, we introduce the proposed mutual information technique to extract important features of the EEG signal.

Mutual information feature selection

Mutual information measures the statistical dependency between two random variables [20]. The dependency can be related to an information measure of the relevance of the random variables. Here we apply the mutual information method for efficient feature selection. In particular, we aim to build a semi-supervised algorithm based on an information theoretic approach that measures the information exchange, dependency or relevance between EEG patterns.

Shannon's entropy provides a powerful formalization of the uncertainty of a random variable. Let X be a random variable; its entropy H(X) is defined as

H(X) = -\int_X p(x) \log p(x)\, dx, \quad (1)

whereas the conditional entropy H(X|Y) is defined as

H(X|Y) = -\int_X \int_Y p(x, y) \log p(x|y)\, dx\, dy = -\int_Y p(y) \Big( \int_X p(x|y) \log p(x|y)\, dx \Big)\, dy. \quad (2)

Note that here Y is a discrete binary random variable representing the class labels and X is a particular feature x_j corresponding to a dimension of the input vector. As a result, the conditional entropy H(X|Y) can be expressed as

H(x_j|y) = -\sum_{y \in \{0,1\}} p(y) \int_{z \in \mathrm{range}(x_j)} p(z|y) \log p(z|y)\, dz \quad (3)

where P_0 and P_1 are the class priors. From the equation above we can measure the dependency of two random


Table 1 Selected predictive power features based on the literature and the major differences found between the ADHD and normal groups

Frequency band (Hz)   | Associated cognitive states     | Findings in ADHD
Delta (0.5–3.5 Hz)    | Deep sleep                      | Increased and reduced delta [4]
Theta (3.5–7.5 Hz)    | Unfocused, drowsiness           | Increase in absolute and relative theta [3]
Alpha (7.5–13 Hz)     | Eyes closed, alert restfulness  | Reduced relative alpha [8], increased or reduced [4]
Beta (13–30 Hz)       | Mental activity, concentration  | Lower relative beta [3]
Theta/alpha           | Good indicator                  | Barry et al. [3]
Theta/beta            | Good indicator                  | Increased [9]

variables X and Y. If X depends on Y, the uncertainty about X is reduced when Y is known, and this formalization is obtained through the conditional entropy. The mutual information can then be defined in terms of both the joint and the conditional entropy (see also Fig. 2):

I(X, Y) = H(X) - H(X|Y) = H(Y) - H(Y|X) = H(X) + H(Y) - H(X, Y) \quad (4)

Note that in order to compute the entropy and mutual information we need to evaluate the probability distributions. One can compute the entropy of a random variable in closed form in some cases. In particular, for a Gaussian random variable X ~ N(μ, σ²) the entropy is H(X) = 1/2 (1 + log 2πσ²). In the more general case, even for a mixture of Gaussians, we can no longer compute the entropy in closed form. If p(x) can be calculated in closed form, e.g., if we make certain assumptions about the parametric form of p(x) and estimate the parameters, then we can still evaluate H(X) fairly easily,

Fig. 2 Mutual information and entropy relationships in a Venn diagram

by numerically approximating the integral in the entropy equation. If no parametric assumptions are made regarding the form of p(x) but we have a set of observations drawn from it, we can estimate the density at any given value of X using the kernel density estimate:

p(x) = \frac{1}{N} \sum_{i=1}^{N} K(x, x_i) \quad (5)

where K(x, x_i) is the kernel function, which is itself a valid probability density function, and the x_i are observations drawn from P(X). In this problem, rather than assuming a parametric form for the class-conditional densities, we apply a Gaussian kernel density estimator to model the distributions directly from the observations, which is given by

p(x) = \frac{1}{N \sqrt{2\pi}\, \sigma} \sum_{i=1}^{N} \exp\!\Big( -\frac{(x - x_i)^2}{2\sigma^2} \Big) \quad (6)

where σ is the user-defined standard deviation of the Gaussian kernel function; in our case σ = 0.4. At this point, let us recall that the mutual information between random variables is given by I(X; Y) and formulate it for our feature selection purpose. The mutual information measures the reduction in uncertainty about X resulting from the knowledge of Y. The mutual information satisfies the bound 0 ≤ I(X, Y) ≤ H(X). The lower bound is reached if and only if X and Y are independent, hence H(X|Y) = H(X). The upper bound is achieved when X and Y are fully dependent, i.e., P(X|Y) = 1, H(X|Y) = 0 and I(X, Y) = H(X), which means we can obtain all the information about X from Y. Therefore we can consider the mutual information measure as a relevance or dependency measure between two random variables. Moreover, I(X, Y) can be used as a similarity measure or a likelihood function of two or more random variables, based on their dependency. We therefore intuitively formalize the applicability of entropy and mutual information in our feature selection problem. Let (x_i, y_i), i = 1, ..., n, be two random variables which can be described with p(x, y) and H(x, y), where x_i is a random EEG pattern and y_i is a known


feature class for ADHD and NORMAL children. If x and y have relevant patterns, the mutual information between them tends toward the larger values of 0 ≤ I(X, Y) ≤ H(X); if they are irrelevant (or independent), the mutual information tends to I(X, Y) ≈ 0. The mutual information between a given random variable and a known variable can then be taken as an estimate of a similarity measure or relevance. Our formulation is therefore as follows: given (x_i), i = 1, ..., n, a random EEG signal, and (y_i), i ∈ {0, 1}, where y = 0 denotes known NORMAL and y = 1 denotes ADHD signal patterns, the task of MI feature selection is to find the x_i most relevant to (y_i), i ∈ {0, 1}; in other words, to find k_j = argmax{I(x_i, y_i)}.
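The relevance score I(x_j; y) above can be estimated, for example, with the Gaussian kernel density estimate of Eq. 6 (σ = 0.4) and numerical integration of the entropies. The sketch below illustrates one such estimator; it is an illustration under these assumptions, not the authors' exact implementation.

```python
# Sketch: mutual information I(x_j; y) between a continuous feature and a binary label.
import numpy as np

SIGMA = 0.4  # kernel width from the text

def kde(grid, samples, sigma=SIGMA):
    """Gaussian kernel density estimate evaluated on `grid` (Eq. 6)."""
    d = grid[:, None] - samples[None, :]
    return np.exp(-d**2 / (2 * sigma**2)).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)

def entropy(p, grid):
    """Differential entropy -integral p log p dx via the trapezoidal rule (Eq. 1)."""
    p = np.clip(p, 1e-12, None)
    return -np.trapz(p * np.log(p), grid)

def mutual_information(x, y):
    """I(X;Y) = H(X) - sum_y p(y) H(X|Y=y) for an integer label y in {0, 1} (Eqs. 3-4)."""
    grid = np.linspace(x.min() - 3 * SIGMA, x.max() + 3 * SIGMA, 512)
    h_x = entropy(kde(grid, x), grid)
    h_x_given_y = sum((y == c).mean() * entropy(kde(grid, x[y == c]), grid)
                      for c in (0, 1))
    return h_x - h_x_given_y

# Ranking: pick the feature column with the largest I(x_j; y)
# best = max(range(X.shape[1]), key=lambda j: mutual_information(X[:, j], labels))
```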

Maximal discrepancy criterion

We introduce another formulation of mutual information in which we work on the training dataset (the previously defined set). Given a dataset x_i with m features and n instances, F_{0,1} = (f_1, ..., f_m) and D = (d_1, ..., d_n) are the sets of features and instances, respectively. Here F corresponds to the selected features given in Table 1, while D is every instance in a single feature. We have two classes of dataset (y_i), i ∈ {0, 1}, as mentioned already. It is necessary to find subsets of these two class datasets with maximum discrepancy in order to achieve higher decision accuracy. As noted earlier, maximizing the mutual information maximizes the relevance between variables; here we do the reverse. We try to extract the subset of F with the minimum information measure between F_0 and F_1 for maximal discrepancy. The set F_0 can be considered as having two parts, F'_0 and F*_0, where F'_0 is the subset relevant to F_1 and F*_0 is the subset irrelevant to F_1 (the non-overlapping part, see Figs. 2 and 3); then F_0 = F'_0 + F*_0. We minimize the relevant part F'_0 as follows:

I(F_0, F_1) = H(F_0) + H(F_1) - H(F_0, F_1)
            = H(F'_0, F*_0) + H(F_1) - H(F'_0, F*_0, F_1)
            = H(F'_0) + H(F*_0) + H(F_1) - H(F'_0) - H(F*_0, F_1)
            = H(F*_0) + H(F_1) - H(F*_0, F_1)
            = I(F*_0, F_1) ≈ 0 \quad (7)

By eliminating F'_0 we minimize the dependence, X ∩ Y ≈ null; as a result we reduce the number of feature instances that are most likely to degrade the classifier's accuracy.
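One plausible way to realize this criterion in code is to score each feature by how much its ADHD and Normal class-conditional densities overlap and to keep the least overlapping (most discrepant) ones. The overlap proxy and the helper names below are assumptions made for illustration, not the authors' exact procedure.

```python
# Sketch: keep the features whose ADHD/Normal class-conditional densities overlap least.
import numpy as np

def _kde(grid, samples, sigma=0.4):
    """Gaussian KDE with fixed bandwidth sigma, evaluated on `grid`."""
    d = grid[:, None] - samples[None, :]
    return np.exp(-d**2 / (2 * sigma**2)).mean(axis=1) / (np.sqrt(2 * np.pi) * sigma)

def class_overlap(x, y, sigma=0.4, n_grid=512):
    """Overlap  integral of min(p(x|y=0), p(x|y=1)) dx  of the two class KDEs."""
    grid = np.linspace(x.min() - 3 * sigma, x.max() + 3 * sigma, n_grid)
    p0, p1 = _kde(grid, x[y == 0], sigma), _kde(grid, x[y == 1], sigma)
    return np.trapz(np.minimum(p0, p1), grid)

def maximal_discrepancy_subset(X, y, keep=10):
    """Indices of the `keep` features with the smallest class overlap."""
    scores = [class_overlap(X[:, j], y) for j in range(X.shape[1])]
    return np.argsort(scores)[:keep]   # small overlap -> large discrepancy
```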

Fig. 3 Feature selection problem: we observe two random variables, maximize the discrepancy of the variables by finding the minimal mutual information measure, and in the third stage a semi-supervised algorithm updates the training set

Semi-supervised feature selection

After selecting the set of features with maximum discrepancy, we proceed to semi-supervised feature selection. The features selected from the training dataset may not contain the ideal (marker) informative feature sets of the ADHD group. This is usually the case when the amount of EEG data and the number of subjects are limited, and when there are subject variability constraints. In order to make our approach more adaptive beyond the scope of the available training dataset (EEG data from 10 subjects), we propose the following algorithm. Our training dataset was initially defined as F_{0,1} for the NORMAL and ADHD cases. However, when new subjects (possibly with different symptoms) are tested with the proposed method, the algorithm may fail (or output low accuracy) when deciding whether the subject belongs to the ADHD or NORMAL group. Then the only thing to be done is to update the training set F_{0,1} by providing new information N_{0,1} related to both groups, where N ∈ (y_i), i ∈ {0, 1}. When new information is available, there is a possibility that a subset of N is already in F; in the other case it is a completely new feature set (information) not included in F. Hence we propose the following formulation, which measures the interactions between the previously selected training set and the new input information. We now have three feature sets (variables), (F_0, F_1, N_{0,1}), and we can ask how we learn about N, a


random input feature, by observing the related class y_{0,1} (ADHD/NORMAL) when we already have the feature set F_{0,1}; the answer can be explained by the conditional mutual information. If the new input feature N is independent of F_1 and y_1 we have I(y_1, N|F) = I(y_1, F). In other words, we can ask how much information N contains about the group y_{0,1} after we condition on F:

I(y_1, N|F) ≡ H(y_1, F) - H(y_1|N, F) \quad (8)

Notice that this is the average of

H(y_1, F = f) - H(y_1|N, F = f) ≡ I(y_1, N|F = f) \quad (9)

over the possible feature values f. For each f we have an ordinary mutual information, which is non-negative, so the average is also non-negative. In fact, I(y_1, N|F) = 0 if and only if y and N are conditionally independent given F. Although the output of I(y_1, N|F) is non-negative, it can be bigger than, smaller than, or equal to I(y_1, N). When it is not equal, we can say there is an interaction (relevance) between F and N as far as their information about y is concerned. The interaction is positive if I(y_1, N|F) > I(y_1, N), and negative when the inequality goes the other way. If the interaction is negative, then some of the information in N about y is redundant given F. We can see how this connects to our feature selection: we will select feature sets containing non-redundant information about y.

We conclude this section with the following semi-supervised feature selection formulation:

n(i) = argmax I(y_{(0,1)}, N_i) \quad (10)

k(i) = argmax [ H(y_{(0,1)}, F_i) - H(y_{(0,1)}|N_i, F_i) ] \quad (11)

We select those features which satisfy the inequality I(y_1, N|F) > I(y_1, N), i.e., those that are non-redundant and relevant with respect to F and y.
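A compact sketch of the update rule in Eqs. 10 and 11 follows: a candidate feature N is accepted only if its information about the class label is non-redundant given the existing feature F, i.e., I(y, N | F) > I(y, N). The quantization into a few bins and the helper names are illustrative assumptions, not the authors' estimator.

```python
# Sketch: accept a new feature N only if it is non-redundant about y given F.
import numpy as np

def _discrete_mi(a, b):
    """Mutual information of two integer-valued arrays (natural log)."""
    joint = np.zeros((a.max() + 1, b.max() + 1))
    for ai, bi in zip(a, b):
        joint[ai, bi] += 1
    joint /= joint.sum()
    pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log(joint[nz] / (pa @ pb)[nz])).sum())

def conditional_mi(y, n, f):
    """I(y; n | f) = sum_f p(f) I(y; n | F=f) for discretised variables."""
    return sum((f == val).mean() * _discrete_mi(y[f == val], n[f == val])
               for val in np.unique(f))

def accept_new_feature(y, n, f, bins=4):
    """y: 0/1 integer labels; n, f: continuous feature values, quantised to `bins` levels."""
    qn = np.digitize(n, np.quantile(n, np.linspace(0, 1, bins + 1)[1:-1]))
    qf = np.digitize(f, np.quantile(f, np.linspace(0, 1, bins + 1)[1:-1]))
    return conditional_mi(y, qn, qf) > _discrete_mi(y, qn)
```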

Classification

The final step of our approach is to use a supervised learning method to classify the ADHD and Normal groups based on the EEG patterns extracted in the previous sections. In particular, we use a support vector machine (SVM) classifier, a learning machine that can be used for classification problems as well as for regression and novelty detection. Important features of SVMs are the absence of local minima, the well-controlled capacity of the solution, and the ability to handle high-dimensional input data efficiently. It is conceptually quite simple but also very powerful: since its introduction it has performed well against other popular classifiers and has been applied to problems in several fields. The EEG patterns can be thought of as points in an n-dimensional space. The SVM is then trained to classify the data points of the different classes. In particular, the SVM chooses the hyperplane that provides the maximum margin between the plane surface and the positive and negative points. The separating hyperplane becomes optimal when the distance from the closest data points is maximized; these data points are called the support vectors. We briefly review the SVM algorithm here; more detailed information can be found in the literature, for example [21, 22]. For the non-separable case, the SVM is constructed by solving the following dual optimization problem:

\arg\max_{\alpha} \Big( \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) \Big) \quad (12)

subject to

\sum_{i=1}^{N} \alpha_i y_i = 0, \quad 0 \le \alpha_i \le C, \quad \forall i = 1, \ldots, N.

where the Lagrangian multipliers are given by α = (α_1, ..., α_N)^T, the training samples are (x_i, y_i) with their respective labels y = (y_1, ..., y_N), and C is the penalty parameter for the slack variables that should be minimized. In the equation above, K(x_i, x_j) is the kernel function that is used to embed the training samples into an n-dimensional space. The accuracy of the SVM classifier also strongly depends on the type of kernel function used; several kernel functions are available for non-linear mapping of the input patterns. In this research we use the radial basis Gaussian kernel function to train and test the SVM. It is given as

K(x_i, x_j) = \exp\!\Big( -\frac{|x_i - x_j|^2}{2\sigma^2} \Big) \quad (13)

There are free parameters of the SVM, namely the kernel width σ and the margin-loss trade-off C, which should be determined to find the optimal solution. The objective is to obtain the best C and σ so that the classifier can accurately predict unknown (testing) data. In our case, the optimum values of the parameters are obtained with 5-fold cross-validation using a grid search algorithm. Here the data is partitioned into 5 equally sized subsets. Subsequently, 5 iterations of training and validation are performed such that within each iteration a different subset of the data is held out for validation while the remaining 4 subsets are used for learning. We rearranged the data to ensure that each subset is a good representative of the whole data. In other words, for classification of ADHD and Normal each


subset contains 50% of the feature patterns, with each class comprising around half of the instances.
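A minimal sketch of this training stage with scikit-learn is shown below, assuming an RBF-kernel SVM and a 5-fold cross-validated grid search. The parameter grids are illustrative, not the values reported in Table 3; note that scikit-learn parameterizes the kernel width as gamma = 1/(2σ²).

```python
# Sketch: RBF-kernel SVM with (C, kernel width) chosen by 5-fold cross-validation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV, StratifiedKFold

def train_svm(X, y):
    # Growing (logarithmic) sequences of C and gamma, as suggested in the text
    grid = {"C": np.logspace(-1, 2, 8), "gamma": np.logspace(-2, 1, 8)}
    search = GridSearchCV(SVC(kernel="rbf"), grid,
                          cv=StratifiedKFold(n_splits=5, shuffle=True),
                          scoring="accuracy")
    search.fit(X, y)
    return search.best_estimator_, search.best_params_
```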

After finding the optimal parameters, the classification performance of the SVM was measured using standard criteria as follows: a true positive (TP) outcome was registered when both the SVM and the physicians classified an EEG pattern as ADHD; false positive (FP), true negative (TN) and false negative (FN) outcomes were defined similarly [Ref]. Using these definitions, the True Positive Rate (TPrate) and False Positive Rate (FPrate) were calculated using the formulas below:

1. True positive rate (also referred to as sensitivity) is the percentage of positive examples which are correctly classified:

TPrate = TP / (TP + FN) \quad (14)

2. False positive rate is the percentage of negative examples which are misclassified:

FPrate = FP / (TN + FP) \quad (15)

We analyzed the classification performance using Receiver Operating Characteristic (ROC) curves, which plot sensitivity (true positive rate) versus one minus specificity (false positive rate). The area under the ROC curve (AUC) serves as a measure of the discriminatory power of a classifier that is insensitive to class distributions and to the costs of misclassification; for instance, AUC = 1 indicates perfect classification, while AUC = 0.5 means that the classifier does not perform better than random guessing.
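The ROC analysis described here can be reproduced, for instance, with scikit-learn's roc_curve and roc_auc_score applied to the SVM decision values; the short sketch below assumes the trained model from the previous step and is only an illustration.

```python
# Sketch: ROC curve and AUC from the SVM decision values (Eqs. 14-15).
from sklearn.metrics import roc_curve, roc_auc_score

def evaluate_roc(model, X_test, y_test):
    scores = model.decision_function(X_test)          # signed distance to the hyperplane
    fpr, tpr, thresholds = roc_curve(y_test, scores)  # FP rate vs TP rate over thresholds
    return fpr, tpr, roc_auc_score(y_test, scores)    # AUC = 1 perfect, 0.5 chance level
```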

Experimental results

For feature selection we tried various combinations of the feature sets given in Table 1 and compared the robustness of each feature type for our problem. As stated earlier, our feature selection steps are: (1) finding the maximal mutual information (see Section "Mutual information feature selection"); (2) maximizing the discrepancy of the two groups (Section "Maximal discrepancy criterion"); and finally (3) semi-supervised feature selection (Section "Semi-supervised feature selection"). The goal

Fig. 4 Comparison of the mutual information measure of the whole data from all participants for the specific frequency bands: a delta, b theta, c alpha and d beta. The theta and beta frequency bands are found to provide more predictive features for ADHD or normal


is to find the most predictive, most distinguishing and robust features that can be used to recognize the ADHD group. We organized the data of the 10 children from all 8 channels and computed the respective power spectral densities. The power spectra were then divided into the specific frequency bands. We applied the proposed mutual information measure to each frequency band to explore how good a predictor each band is across all children. Figure 4 demonstrates the obtained results and a comparative view of the delta, theta, alpha and beta frequency bands. It was noticed that theta features were the most consistent among all children (see Fig. 4b); the higher the mutual information, the better the feature for the specific problem. In contrast, the delta and alpha frequency bands were found to be less informative than the remaining bands. One can notice that delta features are absent in some children, reaching the minimum mutual information value. Similarly, alpha features have a lower mutual information measure in most cases. However, on average, theta feature sets are a good indicator of both the ADHD and NORMAL classes.

In our next step, we demonstrate the comparative results of the power ratio features for each child. As shown in Fig. 5, we evaluated four types of power ratios: theta/alpha, theta/beta, relative theta and relative beta. Many studies have reported that power ratio features are the best indicators of ADHD, found by analyzing the variability of selected power ratios. In our study we do not analyze the variability, but instead explore the most robust features for both groups. First, it was found that relative theta provides the maximal mutual information on average among all children, compared to the other ratio features (see Fig. 5a). Second, the relative beta features, shown in panel (b) of Fig. 5, were slightly better than theta/beta, but in both cases some children showed weaker features in these ratios. The information measure for the last feature set, theta/alpha, is less relevant but common to all children. Here, the feature index represents the number of children, while in Fig. 4 it represents the number of features in rows (8 channels × 10 children).

Significant findings were obtained when evaluating the three different approaches to feature selection. For example, Fig. 6 shows the results of selected features when using the mutual information of the power features, the power ratios with maximal discrepancy features, and the semi-supervisedly selected features of the same power

Fig. 5 Mutual information measures of the power ratios: a theta/alpha, b theta/beta, c relative theta and d relative beta. Most of the children showed similar distinguishing features for relative theta and relative beta


features. The power features consist of all combinations of the features, and the results show the best candidate feature among the selected features. In particular, as shown in plot (b) of Fig. 6, the features consist of the following sets: [α, β, θ, θ/α, θ/β, relative θ, relative β]. The results actually differ from those analyzed in Figs. 4 and 5. Here, for example, in panel (a) one can notice the minimum values for delta and theta, in contrast to the results in Fig. 4, where they were found to be good features. This is due to the fact that the mutual information changes as the amount of information about the feature increases, because the mutual information measures the interaction information between random variables. From Fig. 6a we can then conclude that, when all information is considered, the feature sets most relevant to both groups are the ratio coefficients {θ/β, Rel. θ, Rel. β}. Subsequently, the same feature sets are selected by the maximal discrepancy criterion between each feature subset. The output is shown in Fig. 6b; in this case one can notice the appearance of the {β, δ, θ} feature sets. However, the most predictive ones with this criterion are found to be the power ratios together with the β feature set. The final case demonstrates the more promising approach of semi-supervised feature selection. In this step, we analyze the same data with the maximal discrepancy criterion and randomly provide various subsets of the features. Then, as explained earlier, the semi-supervised algorithm updates the feature set based on the available information. In addition to the dataset of the 10 children, we provided another data collection obtained from the same children but from different EEG recording sessions. It is noticeable that in this feature selection method the relative power ratios remain stable; in other words, there is no update in that subset. However, the other specific bands also increase in mutual information measure. One thing to notice here is that this result is related to the results obtained in Fig. 4 earlier, where we assumed that the good predictive features are theta and beta, followed by delta. Compared to plot (b) of Fig. 6, we notice exactly the update of the feature set in these frequency bands. The question, however, is why the alpha set increases (or updates), since the cognitive state associated with alertness is usually reduced in ADHD children. One answer may be found in reference [8], where the authors reported a difference in the alpha state between the ADHD and normal groups. In any case, our goal is not to analyze the difference, increase or decrease of powers; instead we are trying to find the most distinguishing features of ADHD from normal children (or vice versa) for classification purposes.

Subsequently, the performance of the proposed classification method is validated for the selected feature sets. The dataset contains a variety of feature sets

Fig. 6 Overall findings and comparison of the most predictive power among the 8 analyzed features (see Table 1) and relative powers. Three types of feature selection methods are used and compared: a raw power features, b maximal discrepancy criterion based, and c semi-supervisedly selected features


Table 2 Types of input features provided to the SVM classifier

Input features        | No. of features (ADHD) | No. of features (NORMAL)
Raw power spectrum    | 223,560                | 3,160
Maximal discrepancy   | 120,246                | 4,620
Semi-supervised       | 902,462                | 11,354
Theta/alpha           | 2,530                  | 365
Theta/beta            | 3,105                  | 201
Relative powers       | 5,462                  | 232

(considered as predictive powers). We tested the SVM on various combinations of input features (training sets), organized as follows (also shown in Table 2): (1) raw power spectrum feature sets of each class (ADHD or not); (2) mutual information based maximal discrepancy feature sets; (3) semi-supervisedly selected feature sets; (4) power ratios; and (5) relative powers. The performance of the SVM was obtained considering all frequency features and their combinations.

Our first step is the exhaustive search for the optimal parameters of the classifier, particularly C and σ. For each type of feature set we performed a grid search on C and σ using cross-validation. Various pairs of (C, σ) values were tried and the one with the best cross-validation accuracy was selected. We noticed that trying growing sequences of C and σ is a practical method to identify good parameters. Experimental results show that the minimum test error rate for the classification is acquired with the parameters given in Table 3.

One can see that the training error rates are minimized for some parameters but that they vary. For instance, the difference in test error among the selected feature types can be seen. In particular, when using semi-supervised features the classifier achieves the minimum error rate on the test data. Conversely, raw power features provide less accuracy, or a higher error rate, when classifying the two groups.

It was also found that theta/alpha ratios demonstrate more stable features; however, the accuracy is degraded. Through our approach we achieved the best minimum error rate (5.2%) when using the semi-supervised algorithm.

Figure 7 demonstrates the overall classification results in terms of cross-validation error, minimum cross-validation error and, most importantly, minimum test error. Using the maximal discrepancy criterion, the minimum test error rate (5.2%) was also relatively good compared to using raw power features. Specific feature sets such as the theta/alpha and theta/beta ratios provided on average 8% test error, while the minimum training error was below 8%. Many studies report that relative theta ratios are the best marker of ADHD children. In this study we obtained the best cross-validation error, 4.6%, using the raw relative theta features; however, the minimum training error was comparatively lower than that obtained through other features. Our goal is to find features that minimize the cross-validation error along with the test error. In the next step we demonstrate the classification performance under the ROC curve for the various types of input features to the SVM.

Figure 8 shows the ROC curves of the SVM classifier together with the area under the ROC curve (AUC) measure. For optimal classification we should obtain the maximal TP rate and minimum FP rate, which occur at various instances of the threshold. Moreover, the classification power of a classifier is measured by the area under the ROC curve: AUC = 1 indicates perfect classification, while AUC = 0.5 means that the classifier does not perform better than a random classifier. Figure 8a shows the SVM performance on the test set for two types of feature selection methods as well as for the raw sets consisting of power spectral feature sets. We note that the maximal classification performance was obtained when using semi-supervisedly selected features, AUC = 0.954. The curve here stretches almost into the top left corner, which represents the

Table 3 Cross-validation results with the optimal parameters of the SVM for the five types of feature data set (Gaussian RBF kernel in all cases)

Input features       | C     | σ      | Min train error | Test error
Raw power spectrum   | 8.20  | 0.2727 | 0.1096          | 0.1163
                     | 10.0  | 0.1194 | 0.1194          | 0.1163
Maximal discrepancy  | 0.10  | 0.4010 | 0.4233          | 0.5104
                     | 0.63  | 0.6589 | 0.1656          | 0.5768
Semi-supervised      | 4.98  | 2.8070 | 0.0532          | 0.05104
                     | 10.9  | 0.8369 | 0.0459          | 0.5268
Theta/alpha          | 6.72  | 0.3340 | 0.0813          | 0.0856
                     | 100   | 0.3340 | 0.4452          | 0.8063
Relative ratio       | 55.0  | 0.4503 | 0.05785         | 0.0747
                     | 70.0  | 0.4503 | 0.03254         | 0.0456


Fig. 7 SVM cross-validation results using different types of EEG input features: a raw power spectral features, b maximal discrepancy, c theta/alpha ratios, d theta/beta ratios, e relative theta, and f semi-supervisedly selected features consisting of all power spectral and ratio features

[Figure 7: six panels (a–f) plotting Error Rate versus Number of Trials, with curves for the CV error, minimum CV error, test error, and test error at the minimum-CV point; the minimum test errors reported in the panels are 11.6%, 5.7%, 8.5%, 8.1%, 6.3%, and 5.2%.]

performance of the superior model (this model commits virtually no false positives). One can see that the TP rate at (x = 0.13, y = 1) achieves 100% classification, but with the trade-off of a 13% FP rate. The ideal case would be to obtain (x = 0, y = 1) for perfect classification. When using raw features without any feature selection method, we noted that the classification performance equals AUC = 0.70, which is better than a random classifier. Lastly, the maximal discrepancy feature selection approach results in a superior AUC = 0.90 compared to selecting raw features. The remaining parts of the figure analyze the power ratio input features and their effect on classifier performance. For instance, in Fig. 8b, when using only power ratio features as input to the SVM, one can find better features by analyzing the ROC curve and AUC. In this case, relative beta was found to be more robust in discriminating between ADHD and Normal than the other features plotted in the figure. Simply put, if a classifier has a greater area (AUC value), it has a better average performance. The SVM with theta/alpha features is inferior to the remaining input features. Going further, Fig. 8c shows a more liberal performance when using maximally discrepant features. In this case we noted that all of the curves are virtually identical, illustrating no large discrepancy in classifier performance; however, the maximum AUC value is again obtained using relative beta input features, as in Fig. 8b. Finally, in Fig. 8d we can observe the classifier performance when using semi-supervisedly selected features. It is interesting to observe that the performance is better given power ratio features, where the maximal AUC = 0.97 is achieved


[Figure 8: four ROC panels plotting true positive rate versus false positive rate. Panel a: Raw features AUC = 0.703, Maximal discrepancy AUC = 0.904, Semi-supervised AUC = 0.954. Panel b: Theta/Beta AUC = 0.814, Theta/Alpha AUC = 0.718, Rel. Theta AUC = 0.873, Rel. Beta AUC = 0.912. Panel c: Theta/Beta AUC = 0.897, Theta/Alpha AUC = 0.908, Rel. Theta AUC = 0.923, Rel. Beta AUC = 0.933. Panel d: Theta/Beta AUC = 0.919, Theta/Alpha AUC = 0.957, Rel. Theta AUC = 0.972, Rel. Beta AUC = 0.933.]

Fig. 8 ROC curves of the SVM performance when the input instances consisted of different combinations of EEG features: a power spectral features using the three methods, b raw power ratio features, c power ratio features obtained through the maximal discrepancy criterion, d power ratio features using the semi-supervised method

using relative theta. This means that the classification is correct almost 97% of the time. Besides, we see that the SVM does not begin to commit false positive errors until it has almost reached a true positive rate of 30–50%. The curve is close to a perfect classification model; however, this does not happen until the curve has almost reached the 'perfect performance' point. Compared to the other methods (Fig. 8a, b, c), the model in Fig. 8d is superior not only because its curve is the closest to the 'perfect performance' curve, but also because, for large ranges of the ranking values, the model achieves more true positive classifications than false positive classifications. We have tried to demonstrate a classifier performance assessment depending on the type of input features; that is, given two or more input feature sets, we need to pick one to be deployed. The following criteria to validate one classifier with given feature sets over the other(s) are considered: (a) the classifier should be general enough to describe its performance over a broad range of possible feature sets and (b) it should be possible to discern whether the performance differences between input features are statistically significant. It turns out that ROC curve analysis provides validation for both of these criteria in a highly visual manner.


Conclusion

We have demonstrated a decision support system for the diagnosis of ADHD. The contributions of the current study include developing an adaptive feature selection method, in particular a semi-supervised feature selection algorithm based on mutual information theory. We have shown that the potential use of the algorithm lies in situations where the available input data are limited and new features can be defined, based on which the algorithm updates its training set features. Compared to other studies that use a learning algorithm, this study is a first step in formulating the semi-supervised approach for ADHD feature selection. Through this approach it was possible to define new EEG signal features which provide important information for discriminating between the ADHD and normal groups. The maximal accuracy of the SVM classifier was above 97% when using the semi-supervisedly selected features. In particular, we found that power ratio features are the dominant EEG features for ADHD, though other features also provided relatively good accuracies. We believe the current method could assist physicians in the diagnostic process of ADHD as well as lessen their workload. In further work, we plan to increase the computational efficiency of the algorithm and test it on larger numbers of datasets.

Acknowledgements We thank our colleagues who provided assistance in the data acquisition sessions. We especially appreciate the efforts of Dr. Yunhee Shin and her research group (Daegu Univ., ROK) during the research collaboration and their fruitful discussions about the ADHD children. We are also immensely grateful to the reviewers for their useful comments on an earlier version of the manuscript. This work was supported by the Daegu Gyeongbuk Institute of Science and Technology and funded by the Ministry of Education, Science and Technology of the Republic of Korea.

References

1. Biederman, J., and Faraone, S. V., Attention-deficit hyperactivity disorder. Lancet 366:237–248, 2005.

2. Castellanos, F. X., Toward a pathophysiology of attention-deficit/hyperactivity disorder. Clin. Pediatr. 36(7):381–393, 1997.

3. Barry, R. J., Clarke, A. R., and Johnstone, S. J., A review of electrophysiology in attention-deficit/hyperactivity disorder: I. Qualitative and quantitative electroencephalography. Clin. Neurophysiol. 114:171–183, 2003.

4. Loo, S. K., and Barkley, R. A., Clinical utility of EEG in attention deficit hyperactivity disorder. Appl. Neuropsychol. 12(2):64–76, 2005.

5. Barry, R. J., Johnstone, S. J., and Clarke, A. R., A review of electrophysiology in attention-deficit/hyperactivity disorder: II. Event-related potentials. Clin. Neurophysiol. 114(2):184–198, 2003.

6. Barkley, R. A., Attention deficit hyperactivity disorder: A handbook for diagnosis and treatment, 2nd edn. Guilford Press, 1999.

7. Elder, T. E., The importance of relative standards in ADHD diagnoses: Evidence based on exact birth dates. J. Health Econ. 29:642–656, 2010.

8. Barry, R., Kirkaikul, S., and Hodder, D., EEG alpha activity and the ERP to target stimuli in an auditory oddball paradigm. Int. J. Psychophysiol. 39:39–50, 2000.

9. Chabot, R. J., and Serfontein, G., Quantitative electroencephalographic profiles of children with attention deficit disorder. Biol. Psychiatry 40:951–963, 1996.

10. Clarke, A. R., Barry, R. J., McCarthy, R., and Selikowitz, M., EEG analysis in attention-deficit/hyperactivity disorder: A comparative study of two subtypes. Psychiatry Res. 81:19–29, 1998.

11. Clarke, A. R., Barry, R. J., McCarthy, R., and Selikowitz, M., EEG-defined subtypes of children with attention-deficit/hyperactivity disorder. Clin. Neurophysiol. 112:2098–2105, 2001.

12. Lazzaro, I., Gordon, E., Whitmont, S., Plahn, M., Li, W., Clarke, S., et al., Quantified EEG activity in adolescent attention deficit hyperactivity disorder. Clin. Electroencephalogr. 29:37–42, 1998.

13. Mann, C. A., Lubar, J. F., Zimmerman, A. W., Miller, C. A., and Muenchen, R. A., Quantitative analysis of EEG in boys with attention-deficit-hyperactivity disorder: Controlled study with clinical implications. Pediatr. Neurol. 8(1):30–36, 1992.

14. Mueller, A., Candrian, G., Kroptov, J. D., Ponomarev, V., and Baschera, G. M., Classification of ADHD patients using a machine learning system. Nonlinear Biomed. Phys. 4(1):1–12, 2010.

15. Anduradha, J., Tisha, Ramachandran, V., Arulalan, K. V., and Tripathy, B. K., Diagnosis of ADHD using SVM algorithm. In: COMPUTE 2010, ACM Proceedings. Bangalore, India, January 22–23, 2010.

16. Ahmadlou, M., and Adeli, H., Wavelet-synchronization methodology: A new approach for EEG-based diagnosis of ADHD. Clin. EEG Neurosci. 41(1):1–10, 2010.

17. American Psychiatric Association, Diagnostic and statistical manual of mental disorders (DSM-IV), 4th edn., 1994.

18. Oh, K., Hong, K. M., Lee, H., and Ha, E., K-CBCL. Seoul, Korea: Chung Ang Aptitude Publishing Co., 1997.

19. Anastopoulos, A. D., Spisto, M., and Maher, M. C., The WISC-III freedom from distractibility factor: Its utility in identifying children with attention deficit hyperactivity disorder. Psychol. Assess. 6:368–371, 1994.

20. Cover, T. M., and Thomas, J. A., Elements of information theory. Wiley: Chichester, 2006.

21. Cristianini, N., and Shawe-Taylor, J., Support vector machines and other kernel based learning methods. Cambridge: Cambridge University Press, 2000.

22. Cortes, C., and Vapnik, V. N., Support-vector networks. Mach. Learn. 20:273–297, 1995.