Paola Gramatica, Elena Bonfanti, Manuela Pavan and Federica Consolaro QSAR Research Unit, Department of Structural and Functional Biology, University of Insubria, Varese, Italy.

E-mail: [email protected] Web: http://fisio.varbio2.unimi.it/dbsf/home.html

INTRODUCTION

Phenols are widespread in the environment and are widely used as precursors for many products. It is well known that phenols affect human health at concentrations commonly encountered in the environment. For this reason, their toxicity has been studied extensively for different end points, but data are not available for all phenols and all organisms, so reliable estimation methods are required. QSAR studies offer a simple and fast way to predict such data.

DATA SET

The compounds used in this work are the 109 phenols described by Schultz [2]. Toxicity data, available for only 103 of these chemicals, are expressed in mmol/l and on a logarithmic scale as the logarithm of the inverse of the IGC50 (50% inhibitory growth concentration) for Tetrahymena pyriformis. Three phenols (2-aminophenol, catechol and 4-nitrophenol), identified as outliers by several models, were excluded from the data set, leaving 100 compounds for modelling.
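The response transformation described above can be sketched as follows. This is a minimal illustration that assumes a base-10 logarithm and concentrations in mmol/l; the function name and the example concentrations are hypothetical and are not taken from the Schultz data set.

```python
import math


def log_inverse_igc50(igc50_mmol_per_l: float) -> float:
    """Return log10(1 / IGC50) for a concentration given in mmol/l."""
    if igc50_mmol_per_l <= 0:
        raise ValueError("IGC50 must be a positive concentration")
    # log10(1 / c) is the same as -log10(c)
    return -math.log10(igc50_mmol_per_l)


# Hypothetical concentrations, chosen only to show the direction of the scale.
for igc50 in (10.0, 1.0, 0.01):
    print(igc50, log_inverse_igc50(igc50))
```

On this scale a lower IGC50 (a more toxic compound) gives a higher response value.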

[2] T.W. Schultz et al., Quantitative structure-activity relationships for the Tetrahymena pyriformis population growth end-point: a mechanism of action approach, Practical Applications of Quantitative Structure-Activity Relationships (QSAR) in Environmental Chemistry and Toxicology, 241-262 (1990).

CHEMOMETRIC METHODS

Several chemometric analyses were applied to the compounds (represented by molecular descriptors) for the selection of an optimal training set for the QSAR models. The analyses performed are:

Principal Component Analysis (PCA): this analysis was used to condense a large number of variables into a few components. These components highlight how the compounds are distributed according to their structure; only the significant components were used in Cluster Analysis and Kohonen Maps, to avoid redundant information.
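A minimal sketch of this step, assuming autoscaled descriptors and a cumulative explained-variance threshold as the criterion for "significant" components (the poster does not state which criterion was used). The function name, the 90% threshold and the random demonstration matrix are illustrative assumptions.

```python
import numpy as np


def pca_scores(X, var_threshold=0.90):
    """Autoscale X (compounds x descriptors) and return the leading PC scores."""
    X = np.asarray(X, dtype=float)
    std = X.std(axis=0, ddof=1)
    keep = std > 0  # constant descriptors carry no information
    Z = (X[:, keep] - X[:, keep].mean(axis=0)) / std[keep]
    # SVD of the autoscaled matrix: scores = U * s, loadings = rows of Vt
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    explained = s**2 / np.sum(s**2)
    # smallest number of components whose cumulative variance reaches the threshold
    n_comp = int(np.searchsorted(np.cumsum(explained), var_threshold) + 1)
    n_comp = min(n_comp, len(s))
    return U[:, :n_comp] * s[:n_comp], explained[:n_comp]


# Random numbers standing in for a 100-compound descriptor matrix.
rng = np.random.default_rng(0)
demo_X = rng.normal(size=(100, 20))
demo_scores, demo_explained = pca_scores(demo_X)
print(demo_scores.shape, round(float(demo_explained.sum()), 3))
```

The returned score matrix is what would be passed on to the clustering and mapping steps in place of the full descriptor set.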

Hierarchical Cluster Analysis: hierarchical clustering was performed using the significant components of the molecular descriptors as variables. Different distance metrics (Euclidean and Manhattan) and different linkages (complete, average, etc.) were compared to find the best way to cluster these compounds.
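The comparison of metrics and linkages can be sketched with SciPy as below. The poster does not say how the "best" clustering was judged; the cophenetic correlation used here is an assumption made for illustration, as are the function name, the number of clusters and the random demonstration scores.

```python
import numpy as np
from scipy.cluster.hierarchy import cophenet, fcluster, linkage
from scipy.spatial.distance import pdist


def compare_clusterings(scores, n_clusters=5):
    """Cluster PC scores with several metric/linkage pairs and summarise each."""
    results = {}
    for metric in ("euclidean", "cityblock"):  # "cityblock" is the Manhattan distance
        distances = pdist(scores, metric=metric)
        for method in ("complete", "average"):
            Z = linkage(distances, method=method)
            # cophenetic correlation: agreement between dendrogram and raw distances
            coph_corr, _ = cophenet(Z, distances)
            labels = fcluster(Z, t=n_clusters, criterion="maxclust")
            results[(metric, method)] = (float(coph_corr), labels)
    return results


# Random numbers standing in for the significant PC scores of 30 compounds.
rng = np.random.default_rng(1)
demo_results = compare_clusterings(rng.normal(size=(30, 3)))
for key, (corr, labels) in demo_results.items():
    print(key, round(corr, 3), len(set(labels)))
```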

Kohonen Maps: this additional method maps similar compounds close together by using so-called “self-organised topological feature maps”, which preserve the topology of a multidimensional representation in a new two-dimensional representation. The position of the compounds in the cells of the map shows how similar the structures of the studied phenols are. The centroid of each cell was selected as its most representative compound, in order to build a training set made up of the most diverse phenols.
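A small self-organising map and the selection of one representative per cell can be sketched as follows. This is not the software used for the poster: the grid size, learning-rate schedule and neighbourhood function are assumptions, and "centroid of each cell" is interpreted here as the compound closest to the weight vector of the cell it falls in.

```python
import numpy as np


def train_som(data, rows=4, cols=4, epochs=200, seed=0):
    """Train a rectangular Kohonen map; returns weights of shape (rows*cols, n_features)."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    # grid coordinates of every cell, used for the neighbourhood function
    grid = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # initialise the cell weights from randomly chosen compounds
    weights = data[rng.integers(0, len(data), size=rows * cols)].copy()
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        frac = epoch / epochs
        lr = 0.5 * (1.0 - frac)              # linearly decaying learning rate
        sigma = sigma0 * (1.0 - frac) + 0.5  # shrinking neighbourhood radius
        for x in data[rng.permutation(len(data))]:
            winner = np.argmin(np.sum((weights - x) ** 2, axis=1))
            grid_dist2 = np.sum((grid - grid[winner]) ** 2, axis=1)
            h = np.exp(-grid_dist2 / (2.0 * sigma**2))
            weights += lr * h[:, None] * (x - weights)
    return weights


def select_representatives(data, weights):
    """For each occupied cell, return the index of the compound closest to its weights."""
    data = np.asarray(data, dtype=float)
    d2 = ((data[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    cells = d2.argmin(axis=1)  # cell assigned to each compound
    reps = {}
    for cell in np.unique(cells):
        members = np.flatnonzero(cells == cell)
        reps[int(cell)] = int(members[np.argmin(d2[members, cell])])
    return reps


# Random numbers standing in for the significant PC scores of 60 compounds.
rng = np.random.default_rng(2)
demo_data = rng.normal(size=(60, 3))
demo_weights = train_som(demo_data, epochs=20)
demo_reps = select_representatives(demo_data, demo_weights)
print(sorted(demo_reps.values()))
```

The selected indices would form the training set, and the remaining compounds the test set.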

CONCLUSION

The present investigation confirms that the toxic response of phenols in the Tetrahymena system can be modelled by a logKow-dependent QSAR. The models, developed from a wide set of molecular descriptors, identify hydrophobicity as the single most important variable: logKow alone gives a reasonably good predictive model, with Q2(LOO) = 72.14. Other structural parameters, such as electronic and connectivity descriptors, play a secondary but useful role, at least for this set of compounds.

Moreover, this study demonstrates that theoretical molecular descriptors are an effective and useful alternative to logKow.

The internal and external validation procedures have confirmed the high predictive capability of the models developed.

[Figure: Phenols toxicity distribution. Histogram of the number of observations (y axis, 0 to 15) in each toxicity interval (x axis: <= -1, (-1;-.5], (-.5;0], (0;.5], (.5;1], (1;1.5], (1.5;2], > 2). Legend: Training, Test.]

[Figure: calculated response (y axis) versus experimental response (x axis) for the model log(IGC50^-1) = -1.66 + 0.34 nX - 0.74 nHD + 0.65 CHI1v + 2.07 Gs. Both axes run from -1.5 to 3.0. Legend: training set (56 compounds), test set. Numbered compounds: 31, 65, 67, 104.]

[Figure: calculated response (y axis) versus experimental response (x axis) for the model log(IGC50^-1) = -1.06 + 0.2 nX - 1.46 ACMO2m + 0.58 logKow. Both axes run from -1.5 to 2.5. Legend: training set (56 compounds), test set. Numbered compounds: 17, 58, 65, 66, 67.]
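The two regression equations shown in the plots above can be evaluated directly. The sketch below only encodes the published coefficients; the function names are hypothetical, and the descriptor values in the example are made up rather than computed for any real phenol.

```python
def model_four_descriptor(nX, nHD, CHI1v, Gs):
    """log(IGC50^-1) from the nX / nHD / CHI1v / Gs model."""
    return -1.66 + 0.34 * nX - 0.74 * nHD + 0.65 * CHI1v + 2.07 * Gs


def model_with_logkow(nX, ACMO2m, logKow):
    """log(IGC50^-1) from the nX / ACMO2m / logKow model."""
    return -1.06 + 0.2 * nX - 1.46 * ACMO2m + 0.58 * logKow


# Made-up descriptor values, used only to show how the functions are called.
print(model_four_descriptor(nX=1, nHD=1, CHI1v=2.5, Gs=0.3))
print(model_with_logkow(nX=1, ACMO2m=0.1, logKow=2.4))
```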

OBJ N.  VAR N.  VARIABLES                     Q2LOO  Q2LMO  Q2ext  R2    SDEP  SDEC
100     4       nX, nHD, CHI1v, ACMO1p        82.8   82.6   -      84.7  0.31  0.29
100     4       nOH, Mor04u, ACMB2v, LogKow   83.0   82.7   -      85.2  0.31  0.29
100     1       LogKow                        70.8   71.1   -      72.3  0.41  0.39
56      4       nX, nHD, CHI1v, Gs            84.4   84.1   83.8   88.0  0.32  0.28
56      3       nX, ACMO2m, LogKow            82.7   82.6   82.2   84.8  0.34  0.32
56      1       LogKow                        72.1   71.6   78.3   74.4  0.43  0.41

REGRESSION MODELS

The best subsets of variables for modelling toxicity were selected by a Genetic Algorithm (GA-VSS) approach, in which the response is obtained by ordinary least squares (OLS) regression. The models were validated by the leave-one-out (LOO) and leave-more-out (LMO) procedures and by scrambling the responses.
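The leave-one-out and response-scrambling checks can be sketched for an OLS model as follows. The sketch assumes the common definition Q2 = 1 - PRESS/TSS, which the poster does not spell out, and it runs on synthetic data; the function names, the number of scrambling rounds and the data are illustrative. The genetic algorithm itself is not reproduced here.

```python
import numpy as np


def fit_ols(X, y):
    """Least-squares coefficients, with the intercept as the first element."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef


def r2(X, y):
    """Fraction of variance explained in fitting."""
    resid = y - predict(fit_ols(X, y), X)
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model fitted without it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        coef = fit_ols(X[mask], y[mask])
        press += (y[i] - predict(coef, X[i:i + 1])[0]) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)


def scrambled_q2(X, y, n_rounds=20, seed=0):
    """Q2(LOO) after randomly permuting the responses; should collapse for a real model."""
    rng = np.random.default_rng(seed)
    return np.array([q2_loo(X, rng.permutation(y)) for _ in range(n_rounds)])


# Synthetic data: 56 "compounds", 3 descriptors, a known linear response plus noise.
rng = np.random.default_rng(3)
demo_X = rng.normal(size=(56, 3))
demo_y = demo_X @ np.array([0.5, -1.0, 0.8]) + rng.normal(scale=0.3, size=56)
demo_r2 = r2(demo_X, demo_y)
demo_q2 = q2_loo(demo_X, demo_y)
demo_scrambled = scrambled_q2(demo_X, demo_y)
print(round(100 * demo_r2, 1), round(100 * demo_q2, 1), round(100 * demo_scrambled.mean(), 1))
```

The values are multiplied by 100 when printed because the table reports Q2 and R2 as percentages. A model whose Q2 stays high after scrambling would point to a chance correlation.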

QSAR MODELLING AND PREDICTION OF PHENOL TOXICITY

MOLECULAR DESCRIPTORS

The molecular structures of the studied compounds were described by several molecular descriptors, calculated with software developed by R. Todeschini ([email protected]; http://www.disat.unimib.it/chm). The descriptor classes, with the number of descriptors in each, are:

Sum of atomic properties descriptors (6)

Count descriptors (45)

Empirical descriptors (2)

Information indices (16)

Autocorrelation descriptors (252)

Directional WHIM descriptors (66) [1]

Non-directional WHIM descriptors (33) [1]

Topological descriptors (58)

Topographic descriptors (7)

Geometric descriptors (170)

Quantum-chemical descriptors (6)

[1] R. Todeschini and P. Gramatica, 3D-modelling and prediction by WHIM descriptors. Part 5. Theory development and chemical meaning of the WHIM descriptors, Quant. Struct.-Act. Relat., 16 (1997) 113-119.

[Figure: Selection of training set. Legend: test set, training set.]

Note: in the plots of calculated versus experimental response, the numbered compounds are outliers.