Significance testing of single class discrimination models




Chemometrics and Intelligent Laboratory Systems 23 (1994) 205-212


John Wood *, Valerie S. Rose

Wellcome Research Laboratories, South Eden Park Road, Beckenham, Kent BR3 3BS, UK

Halliday J.H. MacFie, AFRC Institute of Food Research, Earley Gate, Whiteknights Road, Reading RG6 2EF, UK

(Received 9 August 1993; accepted 22 November 1993)

Abstract

Single class discrimination (SCD) has recently been described for the analysis of multivariate embedded data. It is a method for determining informative axes in the data space which promote clustering of the embedded, or principal, class about the model origin and dispersal of the non-embedded class. Significance testing of the eigenvalues obtained in a model has been carried out by randomizing the class membership vector and recalculating the SCD model 500 times. These random simulations enable the determination of the permutation distribution under the null hypothesis of no association, and hence can be used to determine the significance of the first eigenvalue. A method is described to estimate the permutation distribution of the second and subsequent eigenvalues, conditional on the previous eigenvectors in the SCD model having been accepted as significant.

1. Introduction

Recently [1,2], we described a set of supervised, multivariate statistical analysis methods for modelling embedded data, generically termed single class discrimination (SCD). Embedded data occur where members of a class of interest are clustered within a more diverse class. SCD was originally developed for the discrimination of two classes, where members of the embedded class are similar to each other with respect to certain characteristics, whilst members of the non-embedded class lack this characteristic pattern of similarity for one of a variety of reasons. Subsequently, the methods have been generalized to accommodate a continuous measure of class membership, enabling a gradual transition in class membership to be incorporated into the algorithm.

* Corresponding author.

In general, current discrimination techniques do not deal well with an embedded situation, often being based on differences between class means with a common variance matrix assumed. SCD, in contrast, works from a common mean and bases discrimination on ratios of variances.

0169-7439/94/$07.00 © 1994 Elsevier Science B.V. All rights reserved. SSDI 0169-7439(93)E0074-E


Formally speaking, the aim of SCD is to efficiently represent a given 'principal' class of units within its complement, where class membership is given on a sliding scale from 0 to 1, and many variates (some with little relevance to class membership) are measured on each unit. The method was developed for the analysis of quantitative structure-activity relationship (QSAR) data, where the units are compounds, the variates physicochemical properties and the principal class is usually the biologically active class. Embedded activity data arise in QSAR analysis, for example, when activity is dependent on compounds lying within a limited range of values for certain discriminatory properties, such as size and pKa, whilst compounds outside this range are inactive.

In SCD, new axes are determined in descriptor space as linear combinations of the original properties which maximise the focussing of actives about the model origin with consequent dispersal of the inactives. This is achieved by first centring the data to the actives mean and then determining axes which maximise the ratio of inactives to actives variance. Several closely related approaches were suggested to fulfil this criterion, including three methods based on principal component analysis (PCA) algorithms, namely SCD-PCA I, II and III, and one based on the canonical variate analysis (CVA) algorithm, SCD-CVA. As SCD-PCA methods II and III give identical eigenvectors to SCD-CVA for well-conditioned problems (where there are many more samples than properties), they are not considered here, as they are more complex to program and automate than the SCD-CVA approach.

First attempts [1] at determining the significance of a component were based on cross-validation. The Euclidean distance of a sample from the model origin, termed the model vector length (MVL), can be calculated to give a measure of class membership, active compounds being closer to the origin and hence having a smaller MVL. A list of sample MVLs sorted in ascending order should place the active compounds at the top of the list and inactives at the bottom. This simple relationship can be tested by cross-validation to obtain a predicted class membership for each compound in a model generated in its absence.

This can be used to provide a measure of the predictive power of the model and help select an appropriate number of components (eigenvectors) to include. In practice, we have observed that the MVL may not be wholly suitable as an indicator of activity in data with SARs containing both linear and embedded relationships.
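The MVL ranking described above is straightforward to express in code. The sketch below (NumPy, not the authors' implementation) assumes a hypothetical `scores` matrix holding each sample's coordinates on the retained SCD components:

```python
import numpy as np

def model_vector_length(scores):
    """Euclidean distance of each sample from the model origin,
    computed over the retained SCD components (columns of `scores`)."""
    return np.sqrt((scores ** 2).sum(axis=1))

# Hypothetical scores for six samples on two retained components;
# rows 0, 1 and 4 mimic actives lying close to the origin.
scores = np.array([[0.1, -0.2],
                   [0.0,  0.3],
                   [2.5, -1.8],
                   [-3.1, 0.9],
                   [0.2,  0.1],
                   [4.0,  2.2]])
mvl = model_vector_length(scores)
ranking = np.argsort(mvl)  # ascending MVL: actives should head the list
```

Sorting by MVL then reproduces the ranked list used for cross-validated class prediction.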

Since cross-validation can only be used to as- sess the performance of a predictive model, there is value in having other methods available to validate the model. These can then be used either where there is doubt about the best way to pre- dict activity, or indeed where it is preferred to construct a descriptive, rather than predictive, model. This is a perfectly sensible aim: poten- tially informative axes are identified by the model and initial interest may centre round trying to attach scientific meaning to these axes rather than making formal predictions of activity.

One such alternative to cross-validation is a significance test of an SCD component based on the size of its associated eigenvalue, which directly measures the 'discriminatory power' of the axis. To carry this out, the distribution of the eigenvalue under the null hypothesis of no association is required. An approach to this through the permutation distribution of the eigenvalues is described here, based on repeatedly randomizing the class membership (or activity) vector and recalculating the SCD model a large number of times. This leads directly to a significance test for the first eigenvalue and is similar to the approach described for significance testing in multiple linear regression and variable selection by Klopman and Kalos [3] and discussed for significance testing in PLS models by Wakeling and Morris [4]. However, the permutation distribution of the second and subsequent eigenvalues cannot be used directly for significance tests of the second and subsequent components, as the null hypothesis has already been rejected if the first component is accepted. What is required is their distribution conditional on having accepted the components that precede them in the full SCD model.

The algorithms and significance testing procedures are described below with reference to the SCD-PCA I and SCD-CVA methods, using classified activity data.


2. The QSAR data set

A QSAR data set was constructed by using computational chemistry methods to graphically model 130 variously substituted phenols in the uncharged state using CONCORD [5]. The graphical molecular structures were then energy minimized and orientated with the phenyl-hydroxyl bond along the X axis and the phenyl ring in the X-Y plane using SYBYL [5]. Ten computational chemistry properties were calculated using PROFILES [6]. They were: partial atom charge on the phenolic oxygen, partial atom charge on the phenolic hydrogen, dipole moment, the X component of the dipole vector, the absolute value of the Y component of the dipole vector, the energy of the highest occupied molecular orbital (HOMO), molar refraction (CMR), log P (CLOGP), molecular weight, and planarity, expressed as the molecular dimension in the Z axis.

The electronic properties were calculated using MNDO in MOPAC [7]. CLOGP and CMR were calculated using MEDCHEM software [8]. This is a reasonably typical QSAR descriptor set, consisting of electronic, steric and distributive properties. It contains collinearity and non-ideal data distributions, and was thus deemed more appropriate than using Monte Carlo data for the simulations.

Fig. 1. Plot of CLOGP against molecular weight showing the location of the active compounds. (X) Inactives; (O) actives.

A simple, known structure to the activity was favoured to demonstrate the method, so the activity vector was artificially constructed, with compounds classified as active or inactive depending on two simple rules. Compounds were defined as 'active' if they had a CLOGP in the range 1.57 to 2.60 and a molecular weight less than 180. This resulted in 43 actives and 87 inactives. CLOGP has an embedded relationship with activity, while molecular weight has a non-embedded one. Both types of relationship were included to demonstrate the ability of the method to cope with mixed relationships. A plot of CLOGP against molecular weight is given in Fig. 1, showing the location of the active cluster.
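The two cut-off rules are simple to apply in code. This sketch uses hypothetical surrogate CLOGP and molecular weight values (the paper's descriptors came from MEDCHEM and PROFILES and are not reproduced here); only the rule logic is taken from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 130
# Surrogate descriptor values, stand-ins for the real calculated data:
clogp = rng.uniform(-1.0, 5.0, n)
molwt = rng.uniform(94.0, 400.0, n)

# The paper's two classification rules:
active = (clogp >= 1.57) & (clogp <= 2.60) & (molwt < 180.0)
activity = active.astype(int)  # 1 = active (embedded class), 0 = inactive
```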

3. Methods

For clarity we have used classified activity data (i.e., active or inactive) to exemplify the approach to significance testing. Whilst the same general principles may be expected to operate for continuous activity data, there could be differences of detail for this case induced by the differences in the algorithms. This possibility has not yet been investigated.

The methodology for the classified case is given below with reference to a QSAR problem.

3.1. SCD-PCA I

A matrix X consists of n samples, characterized by p descriptors. n_a of the samples are active, with the remaining n_i being inactive. Thus X can be sub-divided into two sub-matrices, X_a and X_i, where X_a is of dimension n_a × p and contains property data on the active compounds, and X_i is of dimension n_i × p and contains the inactive compound set. The column means and standard deviations are determined for X_a. Both X_a and X_i are centred to the means of X_a and divided by the standard deviations of X_a to give X_as and X_is, respectively. X_as^T X_as is thus proportional to the correlation matrix of the active compounds.


Principal component analysis is then used to determine axes which maximise the variance in X_is, by solving the eigenvalue equation

(X_is^T X_is − λ_i I) g_i = 0   for i = 1, …, p

where λ_i are the eigenvalues ordered from largest to smallest, g_i are the corresponding eigenvectors and I is the identity matrix. When X_as is mapped to this space it clusters about the origin.
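The SCD-PCA I recipe (centre and scale to the actives, then eigendecompose the scaled inactive cross-product matrix) can be sketched as follows; this is an illustrative NumPy implementation, not the authors' code:

```python
import numpy as np

def scd_pca_1(Xa, Xi):
    """SCD-PCA I sketch: autoscale both classes to the actives' column
    means and standard deviations, then return the eigenvalues and
    eigenvectors of Xis' Xis, largest eigenvalue first."""
    mu = Xa.mean(axis=0)
    sd = Xa.std(axis=0, ddof=1)
    Xas = (Xa - mu) / sd
    Xis = (Xi - mu) / sd
    evals, evecs = np.linalg.eigh(Xis.T @ Xis)  # ascending order
    order = np.argsort(evals)[::-1]
    return evals[order], evecs[:, order], Xas, Xis
```

Mapping X_as onto the leading eigenvectors then gives active scores clustered about the origin, with the inactive scores dispersed.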

3.2. SCD-CVA

As in SCD-PCA I, the matrix X is divided into two sub-matrices X_a and X_i. The column means of X_a are determined and both X_a and X_i are centred using these means to give X_ac and X_ic, respectively. Canonical variate analysis can then be used to determine axes, g_i, which maximise λ_i, the ratio of the sums of squares of X_ic to X_ac:

λ_i = (g_i^T X_ic^T X_ic g_i) / (g_i^T X_ac^T X_ac g_i)   for i = 1, …, p

This is done by solving the eigenvalue equation

(X_ic^T X_ic − λ_i X_ac^T X_ac) g_i = 0   for i = 1, …, p

SCD-PCA I can be regarded as a modification of SCD-CVA with the off-diagonal terms of X_ac^T X_ac set to zero, and thus the difference between the algorithms is the way they are affected by correlation in X.
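One convenient way to solve the SCD-CVA generalized eigenproblem is to whiten with the actives' cross-product matrix; the sketch below is an illustrative choice, not necessarily the authors' implementation, and assumes a well-conditioned problem so that X_ac^T X_ac is positive definite:

```python
import numpy as np

def scd_cva(Xa, Xi):
    """SCD-CVA sketch: centre both classes to the actives' column means,
    then solve (Xic' Xic - lambda Xac' Xac) g = 0 by whitening with
    (Xac' Xac)^(-1/2). Returns eigenvalues (largest first) and the
    generalized eigenvectors as columns."""
    mu = Xa.mean(axis=0)
    Xac, Xic = Xa - mu, Xi - mu
    A = Xic.T @ Xic
    B = Xac.T @ Xac
    w, V = np.linalg.eigh(B)  # B symmetric positive definite by assumption
    B_inv_half = V @ np.diag(1.0 / np.sqrt(w)) @ V.T
    evals, U = np.linalg.eigh(B_inv_half @ A @ B_inv_half)
    order = np.argsort(evals)[::-1]
    return evals[order], B_inv_half @ U[:, order]
```

Each eigenvalue then equals the ratio of inactive to active sums of squares along its axis, matching the criterion above.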

3.3. Significance testing by random simulations

A simple procedure was used to generate the permutation distribution and hence test the significance of the first axis in SCD-PCA I and SCD-CVA. The rows of X were randomised to give Z. Z was divided into two sub-matrices Z_a and Z_i, where Z_a contained the top n_a rows of Z and Z_i contained the bottom n_i rows of Z. SCD-PCA I and SCD-CVA were carried out as described above, but substituting Z_a for X_a and Z_i for X_i. The p eigenvalues produced were recorded. The entire procedure was repeated 500 times, giving samples of size 500 from the permutation distribution of each eigenvalue, from the first to the pth. For each sample, the mean and the 25th largest value were determined, the latter being an estimate of the critical point for a significance test at the 5% level for each eigenvalue.

This simple approach gives a valid test for the first eigenvalue but produces too many false significances for the second and subsequent eigenvalues. This is because the samples were generated assuming the null hypothesis of no association between activity and descriptors. In practice, one would only test lower eigenvalues if all previous eigenvalues had been accepted as statistically significant. Rather than simply extracting the eigenvalues in order, as described above, it is better to condition on the previous eigenvectors from the 'real' SCD model. This is done by orthogonalizing the columns of the matrix Z_i, associated with a given permutation, to the (sets of) scores associated with the previous 'real' eigenvectors at each stage. This procedure will necessarily result in larger eigenvalues at all positions bar the first than the simple extraction of the eigenvalues in order. This can easily be seen if the latter process is also regarded as a sequence of orthogonalisations, only this time to the eigenvectors actually associated with Z rather than the set of fixed directions (the eigenvectors associated with X). The whole scheme is summarised below:

1. Calculate the full set of p eigenvectors, g_i, of length p, using SCD-PCA I or SCD-CVA analysis on the real data, X.
2. Randomise the rows of X to give Z.
3. Divide Z into the two sub-matrices Z_a and Z_i.
4. Carry out SCD-PCA I or SCD-CVA, substituting Z_a for X_a and Z_i for X_i.
5. Save the first eigenvalue.
6. Calculate the p vectors of scores, sc_i, of length n_i, as sc_i = Z_i g_i for i = 1, …, p.
7. Model each column of Z_i, independently, by multiple linear regression omitting the intercept, using sc_1 as the independent variable, to obtain estimates for the fitted values of Z_i, Z_i,fitted.


8. Calculate the new value of Z_i as the residuals of Z_i, where Z_i = Z_i − Z_i,fitted. This step orthogonalizes the data to the previous eigenvector of the real analysis.
9. Carry out SCD-PCA I or SCD-CVA using the new Z_i and save the first eigenvalue. This gives the second eigenvalue for this simulation.
10. Model each column of Z_i, independently, by multiple linear regression omitting the intercept as in step 7, but use the first and second scores vectors, sc_1 and sc_2, as the independent variables, to obtain estimates for the fitted values of Z_i, Z_i,fitted.
11. Calculate the new value of Z_i as the residuals of Z_i, where Z_i = Z_i − Z_i,fitted.
12. Carry out SCD-PCA I or SCD-CVA using this new Z_i and save the first eigenvalue. This gives the third eigenvalue for this simulation.
13. Keep cycling through these three steps (10 to 12), including an additional scores vector in the multiple linear regression each cycle. Stop when all eigenvalues have been extracted.
14. Repeat the entire simulation (steps 2 to 13) 500 times, to enable calculation of the mean eigenvalue for each eigenvector and the upper 95% confidence limit.
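The fourteen steps above can be sketched as a single routine. This is an illustrative NumPy version, not the authors' code: `scd_fit(Za, Zi)` is a hypothetical callback returning (eigenvalues, eigenvectors) of an SCD model with the largest eigenvalue first, and the real actives are assumed to occupy the first `n_act` rows of X:

```python
import numpy as np

def permutation_eigenvalues(X, n_act, scd_fit, n_perm=500, seed=0):
    """Conditional permutation null distribution of the SCD eigenvalues."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    _, G = scd_fit(X[:n_act], X[n_act:])      # step 1: real eigenvectors
    null_evals = np.empty((n_perm, p))
    for k in range(n_perm):
        Z = X[rng.permutation(n)]             # step 2
        Za, Zi = Z[:n_act], Z[n_act:]         # step 3
        ev, _ = scd_fit(Za, Zi)               # step 4
        null_evals[k, 0] = ev[0]              # step 5
        SC = Zi @ G                           # step 6: scores on real axes
        for j in range(1, p):                 # steps 7-13
            S = SC[:, :j]
            # regression through the origin of each column of Zi on the
            # first j real scores vectors
            beta, *_ = np.linalg.lstsq(S, Zi, rcond=None)
            Zi = Zi - S @ beta                # orthogonalize to real scores
            ev, _ = scd_fit(Za, Zi)
            null_evals[k, j] = ev[0]
    # step 14: estimated 5% critical value per position (25th largest of 500)
    crit = np.sort(null_evals, axis=0)[-max(1, int(0.05 * n_perm))]
    return null_evals, crit
```

The returned critical values estimate the 5% significance points for each eigenvalue, conditional on the axes that precede it in the real model.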

4. Results and discussion

Fig. 2 shows the results of carrying out SCD-PCA I on the QSAR data set. Close clustering of the actives about the model origin is apparent, with dispersal of inactives. The similarity to the plot of the two informative properties in Fig. 1 is marked when the signs of the eigenvectors are reversed. The property vectors are shown on the biplot [9], and the long vectors associated with CLOGP and molecular weight clearly depict their importance to the model. Projection of the compounds onto the CLOGP vector portrays the embedded relationship of activity with this property. The molecular weight axis shows clustering of actives towards one end of the vector, which is consistent with the cut-off rule which defined a maximum allowed weight for active compounds.

Fig. 2. Biplot depicting the results of the SCD-PCA I model, showing clustering of the active compounds about the model origin and the 10 property vectors. Only the CLOGP and molecular weight (MOLWT) vectors are labelled; the short vectors have not been labelled to improve the clarity of the plot. (X) Inactive; (O) active; (→) property vectors.

In Fig. 3, the clustering of active compounds achieved by the SCD-CVA algorithm is shown. A biplot representation showing the contribution of individual properties to the axes has not been used for this method, as the axes are not orthogonal with respect to the underlying properties. Interpretation of the model with respect to the physicochemical properties is best achieved in practice by fitting each of the original properties (the dependent variables) to the scores vectors (the independent variables) using multiple linear regression omitting the intercept, to determine the percentage variance explained by the inclusion of subsequent axes. The plot showing the variance explained for the individual properties with increasing components is shown in Fig. 4. The first two axes explain 94% and 93% of CLOGP and molecular weight respectively, and these are clearly the most relevant variables to the model.

Fig. 3. Results of the SCD-CVA model showing clustering of the active compounds about the origin. (X) Inactive; (O) active.

Both SCD algorithms have therefore identified the important discriminatory properties and give a good visual representation of clustering in the scores plots, which closely resemble the variable-variable plot in Fig. 1.
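The variance-explained calculation used to interpret the SCD-CVA axes can be sketched as follows (illustrative NumPy, regressions through the origin on a growing set of scores vectors, as described in the text):

```python
import numpy as np

def variance_explained(X0, scores):
    """For each property (column of X0, centred), regress on the first k
    scores vectors without an intercept, k = 1..K, and return the
    percentage of that property's sum of squares explained at each k."""
    Xc = X0 - X0.mean(axis=0)
    total_ss = (Xc ** 2).sum(axis=0)
    K = scores.shape[1]
    out = np.empty((K, X0.shape[1]))
    for k in range(1, K + 1):
        S = scores[:, :k]
        beta, *_ = np.linalg.lstsq(S, Xc, rcond=None)  # no intercept
        resid = Xc - S @ beta
        out[k - 1] = 100.0 * (1.0 - (resid ** 2).sum(axis=0) / total_ss)
    return out  # rows: number of axes included; columns: properties
```

Because the regressions are nested, the explained percentage for each property is non-decreasing as axes are added, which is what a plot like Fig. 4 displays.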

Fig. 5. Plot showing eigenvalue against axis number for the real data and the results for the mean and upper 95% confidence limit for the simple randomised simulations and the re-orthogonalised simulations for SCD-CVA. The eigenvalue axis has been displayed on a log scale to improve clarity at higher axis numbers. (■) Results on real data; (▲) mean from simulations; (···) upper 95% confidence limit; (▼) mean of re-orthogonalised data; (–·–·–) upper 95% confidence limit of re-orthogonalised data.

Fig. 4. Plot showing the cumulative increase in variation of the properties explained by the increasing number of axes in the SCD-CVA model. (■) Partial atom charge on the oxygen; (▲) partial atom charge on the hydrogen; (▼) X component of the dipole vector; (△) absolute Y component of the dipole; (▽) dipole moment; (○) energy of the HOMO; (●) CLOGP; (◇) CMR; (+) molecular weight; (□) Z dimension.

Fig. 5 shows the eigenvalues generated by the randomized activity simulations on the SCD-CVA model compared with those of the real model. A log scale has been used for the y axis to improve the clarity of the plot. The values from the set of 500 simulations where the data were orthogonalised to the preceding eigenvectors generated by the real model (hereinafter referred to as the re-orthogonalised set for brevity) show quite a flat profile. This is in contrast to those obtained by simple extraction, which would clearly lead to inappropriate significance tests, as they suggest six axes are significant. Even with re-orthogonalisation, the indication is that three axes are significant, which might appear strange since only two properties were used to define activity. A possible explanation is the existence of non-linear associations between the 'important' properties and the others. SCD suggests linear combinations of properties that may be informative and does not guarantee to produce the most parsimonious model possible. Thus 'statistical significance' is (as is generally the case) not the only consideration in choosing the number of components to include in the SCD model.

The flatness of the curves for the re-orthogonalised set suggests that the estimated critical value for the permutation test of the first eigenvalue can be used as a simple, conservative approximation for testing the lower eigenvalues if desired.

Interestingly, the situation is different for the SCD-PCA I algorithm. This is because the discriminatory axes determined by this method are also influenced by correlation between the descriptors. Since the eigenvalues determined in the simple manner during the permutation runs reflect correlation based on random samples of the whole data matrix, X, they are likely to be closer to those determined in the unpermuted 'real' analysis than their counterparts in SCD-CVA. Thus the effect of the re-orthogonalisation will be less here. In Fig. 6, the eigenvalues for the real model and the simulations are compared. Again, a log scale has been used for the y axis for clarity. It can be seen that the eigenvalues extracted simply during the permutations are indeed closer to the re-orthogonalised ones than was the case in Fig. 5, and that the simple approximation suggested for SCD-CVA, i.e., using the permutation distribution of the first eigenvalue for testing all components, is inappropriate here.

Fig. 6. Plot showing eigenvalue against axis number for the real data and the results for the mean and upper 95% confidence limit for the simple randomised simulations and the re-orthogonalised simulations for SCD-PCA I. The eigenvalue axis has been displayed on a log scale to improve clarity at higher axis numbers. (■) Results on real data; (▲) mean from simulations; (···) upper 95% confidence limit; (▼) mean of re-orthogonalised data; (–·–·–) upper 95% confidence limit of re-orthogonalised data.

Fig. 7. Plot showing the similarity of curve shape for eigenvalue plotted against axis number for the upper 95% confidence limit in the re-orthogonalised simulation set and the PCA on the correlation matrix of X. Note: the eigenvalue axis is on a linear scale in this plot. (–·–·–) Upper 95% confidence limit of re-orthogonalised data; (···) PCA on correlation matrix of X.

To confirm that this is a consequence of correlation between the descriptors, we carried out a PCA on the entire autoscaled descriptor matrix (actives and inactives combined), including the re-orthogonalisation steps as applied to the randomized SCD simulations. Fig. 7 compares these PCA eigenvalues with those obtained from the upper 95% confidence limits of the SCD-PCA I re-orthogonalised simulation, after having scaled the former so the first eigenvalues matched. The correspondence is excellent, supporting the hypothesis that the observed pattern of the re-orthogonalised simulations is a result of descriptor covariance.
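The eigenvalue-profile comparison behind Fig. 7 can be approximated with a short sketch. This simplified version omits the re-orthogonalisation steps and only shows the eigenvalues of the autoscaled descriptor matrix (autoscaling is implicit in using the correlation matrix) rescaled so that the first eigenvalue matches a reference profile:

```python
import numpy as np

def scaled_pca_profile(X, reference):
    """Eigenvalues of the correlation matrix of the descriptor matrix,
    descending, rescaled so the first eigenvalue equals reference[0]
    for profile-shape comparison."""
    evals = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))[::-1]
    return evals * (reference[0] / evals[0])
```

Plotting this profile against a permutation-derived confidence-limit curve then allows the kind of shape comparison shown in Fig. 7.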


5. Conclusions

1. Both SCD-PCA I and SCD-CVA produced good two-dimensional plots showing clustering of the active compounds.
2. A biplot aided the identification of important discriminatory properties for the SCD-PCA I model.
3. Multiple linear regression analysis could be used to identify important properties in the SCD-CVA model.
4. Randomising the activity vector and running simulations is proposed as a suitable method for estimating the permutation distribution and hence the significance of the first SCD axis.
5. By re-orthogonalising to the previous SCD model axes during the randomised simulations, it is possible to estimate the permutation distribution of the second and subsequent eigenvalues, conditional on the previous dimensions extracted by the 'real' model.
6. A simple approximation for testing the second and lower axes has been proposed for the SCD-CVA algorithm, i.e., just use the critical value obtained for the first eigenvalue.
7. Correlation between the descriptors limits the use of a similar approximation for SCD-PCA I.

References

[1] V.S. Rose, J. Wood and H.J.H. MacFie, Single class discrimination using principal component analysis (SCD-PCA), Quantitative Structure-Activity Relationships, 10 (1991) 359-368.
[2] V.S. Rose, J. Wood and H.J.H. MacFie, Generalized single class discrimination (GSCD). A new method for the analysis of embedded structure-activity relationships, Quantitative Structure-Activity Relationships, 11 (1992) 492-504.
[3] G. Klopman and A.N. Kalos, Causality in structure-activity studies, Journal of Computational Chemistry, 6 (1985) 492-506.
[4] I.N. Wakeling and J.J. Morris, A test of significance for partial least squares regression, Journal of Chemometrics, 7 (1993) 291-304.
[5] Tripos Associates Inc., 1699 South Hanley Road, Suite 303, St. Louis, MO 63144, USA.
[6] R.C. Glen and V.S. Rose, Computer program suite for the calculation, storage and manipulation of molecular property and activity descriptors, Journal of Molecular Graphics, 5 (1987) 79-86.
[7] QCPE, University of Indiana, Bloomington, IN 47405, USA.
[8] Daylight Chemical Information Systems, 3951 Claremont Street, Irvine, CA 92714, USA.
[9] K.R. Gabriel, The biplot display of matrices with application to principal component analysis, Biometrika, 58 (1971) 453-467.