Solvents

  • TRAC 2552 25-6-99

    Classification of organic solvents and modelling of their physico-chemical properties by chemometric methods using different sets of molecular descriptors

    P. Gramatica*
    QSAR Research Unit, Department of Structural and Functional Biology, University of Insubria, Via Dunant 3, I-21100 Varese, Italy

    N. Navas
    Department of Analytical Chemistry, University of Granada, E-18071 Granada, Spain

    R. Todeschini
    Milan Chemometric Research Group, Department of Environmental Sciences, University of Milano-Bicocca, Via Emanueli 15, I-20126 Milan, Italy

    Different sets of molecular descriptors were used with the k-nearest neighbor classification method to make a general classification of 152 organic solvents; the classification was further improved by applying a counter-propagation artificial neural network. An extensive investigation was made of the physico-chemical properties of the 152 solvents in a search for quantitative structure–property relationships (QSPR). Wide sets of molecular descriptors were tested, and regression models were obtained by selecting the best descriptor subset with a genetic algorithm in order to optimize their predictive power. © 1999 Elsevier Science B.V. All rights reserved.

    Keywords: Organic solvents; Molecular descriptors; Classification; K-nearest neighbor; Counter-propagation artificial neural network; Genetic algorithm-variable subset selection

    1. Introduction

    The choice of a good solvent is important in chemistry because of the solvent's crucial role in many chemical processes. The complexity of solute–solvent interactions and the huge number of solvents available often make this selection difficult. Usually, it is the physico-chemical properties, such as melting and boiling points, vapor pressure, heat of vaporization, index of refraction, density, viscosity, surface tension, dipole moment, dielectric constant, polarizability, specific conductivity, etc., that dictate the choice. Physico-chemical properties can be used to characterize solvents, and classification is commonly made according to these properties. Unfortunately, many such classifications take only a few of these properties into account, resulting in a very poor classification considering the large number of properties that can be used to characterize a solvent. With chemometric tools it is possible to consider many of these properties simultaneously.

    In recent years multivariate statistical methods have been applied to classify and select organic solvents. Martin et al. [1] first applied factor analysis to solvent classification, and found a classification similar to that of Parker [2], which was mainly based on chemical intuition. Carlson et al. [3], using a two-principal-components model, proposed different strategies for systematic solvent selection for chemical reactions, and Snyder [4] proposed one of the most popular and widely used solvent classifications, well known in the chromatographic field. More recently, Poole and Poole [5] proposed a chemometric classification of the solvent properties of commonly used gas chromatography stationary phases. Molinero et al. [6] classified organic solvents into different groups (characterized according to tolerance in plasma operation) and applied multivariate regression to obtain predictive equations of the limiting aspiration rate in terms of the main physical variables. A solvent classification based on solvatochromic parameters was recently proposed by de Juan et al. [7], who made a comparison with the Snyder approach [4]. However, the most ambitious approach to a general classification of solvents using chemometric tools is that proposed by Chastrette et al. [8]. Their classification is based on the representation of 83 solvents as points in an eight-dimensional property space, using the Kirkwood function, molar refraction, Hildebrand's δ parameter, refraction index, boiling point, dipole moment and the HOMO and LUMO energies as solvent descriptors. By principal component analysis the original eight-dimensional space is reduced to a three-dimensional space defined by the first three principal components; the 83 organic solvents are grouped into nine classes by clustering principal component values, using a non-hierarchical multivariate taxonomy to progressively classify solvents by means of the discriminating power of the eight descriptors.

    0165-9936/99/$ – see front matter © 1999 Elsevier Science B.V. All rights reserved. PII: S0165-9936(99)00115-6

    *Corresponding author. Tel.: +39 (332) 421 573; Fax: +39 (332) 421 554. E-mail: [email protected]

    Trends in Analytical Chemistry, vol. 18, no. 7, 1999, p. 461

    Although this partition is the best general classification at the moment, it has certain evident limitations [9]: some solvents are unsoundly clustered, and the absence of an apolar, aprotic and non-aromatic class results in some alkanes being included in the wrong groups. Moreover, the unsupervised classification proposed by Chastrette et al. [8] is evidently not a classification model; it is a deep and extensive study of the property space of the 83 solvents, carried out with a clustering technique.

    The goal of the present article is to propose a general classification model for organic solvents that improves on the approach proposed by Chastrette et al. [8]; our model is based on theoretical molecular descriptors instead of experimental physico-chemical properties. Using a modified and corrected Chastrette training set, the classification performance of the various methods was checked with a wide set of molecular descriptors: structural (1D), topological (2D) and empirical. The general classification proposed here was obtained iteratively by first applying k-nearest neighbor (KNN) classification and then improving the results with counter-propagation artificial neural networks (CP-ANN) on a wider data set of 152 solvents.

    In addition, regression models able to predict several important physico-chemical properties of organic solvents were obtained in this study. The use of the genetic algorithm allows the selection of the best subset of descriptors from among a wide set of molecular descriptors: structural, empirical, topological and weighted holistic invariant molecular (3D-WHIM).

    2. Experimental data

    The studied physico-chemical properties of 152 organic solvents (Table 1) are taken from references [8,10–19]. In Table 1 DEGDEE is the acronym for diethylene glycol diethyl ether, DEGDME for diethylene glycol dimethyl ether (diglyme), DMEU for N,N′-dimethyl ethylene urea, DMPU for N,N′-dimethyl propylene urea, HMPTA for hexamethyl phosphoric triamide and TEGDME for triethylene glycol dimethyl ether (triglyme).

    3. Methods

    The minimum energy conformations of all 152 solvents were obtained by the molecular mechanics method of Allinger (MM+), using the package HyperChem [20].

    The structural, empirical, topological and WHIM descriptors were calculated using our package WHIM-3D/QSAR for Windows/PC [21].

    The classification models were obtained using the package SCAN [22] for Windows/PC. The best classification method was found to be KNN. This method was applied to autoscaled data with a priori probability proportional to the size of the classes; the predictive power of the model was checked for K values between 1 and 10. The cross-validated non-error rate (NERCV %) was compared to the no-model error rate (NOMER %).
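
    The KNN procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the SCAN implementation: the function names, the Euclidean distance, the single autoscaling pass before cross-validation and the toy descriptor values are all assumptions made for the example.

```python
import math
from collections import Counter

def autoscale(rows):
    """Autoscale each column: subtract its mean and divide by its standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training objects (Euclidean distance, assumed)."""
    nearest = sorted(zip(train_x, train_y), key=lambda t: math.dist(t[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def loo_error_rate(x, y, k):
    """Leave-one-out cross-validated error rate, in percent."""
    errors = 0
    for i in range(len(x)):
        rest_x = x[:i] + x[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        if knn_predict(rest_x, rest_y, x[i], k) != y[i]:
            errors += 1
    return 100.0 * errors / len(x)

def no_model_error_rate(y):
    """Error rate obtained by assigning every object to the largest class, in percent."""
    return 100.0 * (1 - max(Counter(y).values()) / len(y))

# Hypothetical two-descriptor toy data (not taken from the paper).
toy_x = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
         [5.0, 5.1], [5.2, 4.9], [4.9, 5.0], [5.1, 5.2]]
toy_y = [1, 1, 1, 4, 4, 4, 4]
scaled = autoscale(toy_x)
for k in range(1, 4):  # the paper scans K from 1 to 10 on the real data
    print(k, loo_error_rate(scaled, toy_y, k), no_model_error_rate(toy_y))
```

    A model is only useful when its cross-validated error rate is well below the no-model error rate, which is why the two figures are reported side by side.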

    CP-ANN was performed with the software Koala, version 1.0, recently developed by our research group [23].

    The CP-ANN method is a classification method and consists of two blocks: the first is a Kohonen artificial neural network (K-ANN), which works on the input variables, and the second is the output block for classification. The membership of each object in one of G classes is represented by a G-dimensional vector of 0 values, except at the position of the true class, which is represented by a value equal to 1. This procedure is called class unfolding.
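
    As a small illustration of class unfolding, the sketch below turns class labels numbered 1 to G into G-dimensional 0/1 target vectors. The function name is hypothetical and chosen only for this example.

```python
def unfold_classes(labels, n_classes):
    """Return one G-dimensional 0/1 target vector per object (classes numbered 1..G)."""
    return [[1 if g == label else 0 for g in range(1, n_classes + 1)]
            for label in labels]

# Three objects belonging to classes 1, 4 and 5 of a five-class problem.
targets = unfold_classes([1, 4, 5], n_classes=5)
```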

    Table 1. Solvents used for multivariate classification

    ID Solvent A B C | ID Solvent A B C | ID Solvent A B C

    1* acetic acid 4 4 4 | 52* diethylene glycol 4 4 4 | 103* nitrobenzene 1 1 1
    2* acetic anhydride 1 1 1 | 53* diethyl ether 3 3/5 3 | 104 nitroethane 1 1 1
    3* acetone 1 1 1 | 54* di-isopropyl ether 5 5 3 | 105* nitromethane 1 1 1
    4* acetonitrile 1 1 1 | 55 1,2-dimethoxyethane 3 1/3 3 | 106* n-octane 5 5 5
    5* acetophenone 2 2 2 | 56* N,N-dimethylacetamide 1 1 1 | 107* 1-octanol 4 4 4
    6 acetylacetone 2 1 1 | 57 N,N-dimethylaniline 1 1 1 | 108* n-pentane 5 5 5
    7 2-aminoethanol 4 4 4 | 58 3,3-dimethyl-2-butanone 1 1 1 | 109* 1-pentanol 4 4 4
    8 aniline 1 1 1 | 59* N,N-dimethylformamide 1 1 1 | 110* 2-pentanol 4 4 4
    9* anisole 2 2 2 | 60 2,4-dimethylpyridine 1 1 1 | 111* 3-pentanol 4 4 4
    10 benzaldehyde 2 2 2 | 61 2,6-dimethylpyridine 1 1 1 | 112 2-pentanone 1 1 1
    11* benzene 2 2 2 | 62* dimethylsulfoxide 1 1 1 | 113* 3-pentanone 1 1 1
    12* benzonitrile 1 1 1 | 63 2,6-dimethyl-4-heptanone 1 1 1 | 114 pentyl acetate 1 1 1
    13* benzyl alcohol 2 2 2 | 64 2,4-dimethyl-3-pentanone 1 1 1 | 115 phenetole 2 2 2
    14* bromobenzene 2 2 2 | 65* 1,4-dioxane 1 1 3 | 116 phenol 2 2 2
    15 1-bromobutane 3 3 3 | 66* diphenyl ether 2 2 2 | 117 3-picoline 1 1 1
    16 bromoethane 3 1/3 3 | 67 di-n-propyl ether 5 5 3 | 118 4-picoline 1 1 1
    17* 1-butanol 4 4 4 | 68 DMEU 1 1 1 | 119* piperidine 4 4 4
    18* 2-butanol 4 4 4 | 69 DMPU 1 1 1 | 120* 1-propanol 4 4 4
    19* 2-butanone 1 1 1 | 70* ethanol 4 4 4 | 121* 2-propanol 4 4 4
    20 n-butyl acetate 1 1 1 | 71* ethyl acetate 1 1 1 | 122* n-propylamine 4 4 4
    21 n-butylamine 4 4 4 | 72* ethyl benzoate 2 2 2 | 123 propyl formate 1 1 1
    22* butyronitrile 1 1 1 | 73 ethyl formate 1 1 1 | 124* propylene carbonate 1 1 1
    23 carbon disulfide 2 2/5 5 | 74 ethyl propionate 1 1 1 | 125* propionitrile 1 1 1
    24 carbon tetrachloride 5 4 4 | 75 ethylenediamine 4 4 4 | 126* pyridine 1 1 1
    25* chlorobenzene 2 2 2 | 76* ethylene glycol 4 4 4 | 127 pyrrolidine 4 4 4
    26 1-chlorobutane 3 3 3 | 77* fluorobenzene 2 2 2 | 128 quinoline 1 1 1
    27 chloroform 4 4/1/3 3 | 78* formamide 4 4 4 | 129* styrene 2 2 2
    28 1-chloropropane 3 3/1 3 | 79 furfuryl alcohol 1 2/4 2 | 130* sulfolane 1 1 1
    29 2-chloropropane 3 3/1 3 | 80 glycerol 4 4 4 | 131* tert.-butyl alcohol 4 4 4
    30 m-cresol 2 2 2 | 81* n-heptane 5 5 5 | 132 tert.-butyl methyl ether 3 5/3 3
    31* cyclohexane 5 5 5 | 82* HMPTA 1 1 1 | 133 TEGDME 1 1/3 3
    32* cyclohexanol 4 4 4 | 83* n-hexane 5 5 5 | 134 1,1,2,2-tetrachloroethane 1 3 1
    33* cyclohexanone 1 1 1 | 84 1-hexanol 4 4 4 | 135 tetrachloroethylene 5 4/2 5
    34 cyclohexene 2 5/2 5 | 85* iodobenzene 2 2 2 | 136 tetraethylene glycol 4 4 4
    35* cyclopentane 5 5 5 | 86 iodoethane 3 1/3 3 | 137* tetrahydrofuran 3 3 3
    36 cyclopentanone 1 1 1 | 87* isobutyl alcohol 4 4 4 | 138 1,1,3,3-tetramethyl urea 1 1 1
    37 cis-decalin 5 5 5 | 88* iso-octane 5 5 5 | 139* toluene 2 2 2
    38* n-decane 5 5 5 | 89* mesitylene 2 2 2 | 140 tributylamine 3 3 3
    39 DEGDEE 3 3/1 3 | 90* methanol 4 4 4 | 141 1,1,1-trichloroethane 1 1 1
    40 DEGDME 1 1/3 3 | 91* methyl acetate 1 1 1 | 142 trichloroethylene 1 1 1
    41 dibenzyl ether 2 2 2 | 92 methyl benzoate 2 2 2 | 143* triethylamine 3 3 3
    42* di-n-butyl ether 5 5/3 3 | 93* 2-methyl-2-butanol 4 4 4 | 144 triethylene glycol 4 4 4
    43* m-dichlorobenzene 2 2 2 | 94* 3-methyl-1-butanol 4 4 4 | 145* trifluoroacetic acid 4 4 4
    44* o-dichlorobenzene 2 2 2 | 95 3-methyl-2-butanone 1 1 1 | 146* 2,2,2-trifluoroethanol 4 4 4
    45* 1,2-dichloroethane 1 1 1 | 96 methyl formate 1 1 1 | 147 trimethylene glycol 4 4 4
    46* 1,1-dichloroethane 1 1 1 | 97 4-methyl-2-pentanone 1 1 1 | 148 2,4,6-trimethylpyridine 1 1 1
    47 1,1-dichloroethylene 1 1 1 | 98* 2-methoxyethanol 4 4 4 | 149 m-xylene 2 2 2
    48 Z-1,2-dichloroethylene 1 1 1 | 99 N-methylacetamide 1 1 1 | 150* o-xylene 2 2 2
    49* dichloromethane 1 1 1 | 100* N-methylformamide 1 1 1 | 151* p-xylene 2 2 2
    50* diethylamine 4 4 4 | 101* N-methyl-pyrrolidin-2-one 1 1 1 | 152* water 4 4 4
    51 diethyl carbonate 1 1 1 | 102 morpholine 1 4/1 4 |

    *Solvent training set for KNN. A: KNN classification. B: CP-ANN-1 classification. C: CP-ANN-2 classification. 1: Class 1, aprotic polar (AP); 2: Class 2, aromatic apolar or lightly polar (AALP); 3: Class 3, electron pair donors (EPD); 4: Class 4, hydrogen bond donors (HBD); 5: Class 5, aliphatic aprotic apolar (AAA).

    K-ANN consists of N×N×p weights, where p is the number of input variables and each p-dimensional vector is a neuron. N×N is the number of cells of a square map: the values of each input variable are distributed in the map, i.e. each input variable is represented by a layer.

    Due to class unfolding, the output block consists of a number of layers equal to the number of classes (G), where each layer corresponds to a class and the layer geometry is also an N×N square map. The number of weights of the output block is N×N×G. All the weights are between 0 and 1.

    During training, the n objects are presented to the net, one at a time, a fixed number of times (epochs); each object is assigned to the cell for which the distance between the object vector and the neuron is minimal; the weights of the cell to which an object is assigned and of the topologically nearest cells are modified in such a way as to reproduce the object profile [24].

    Once an object is assigned to a cell in K-ANN, the weights of the corresponding cells of the classification layers are modified in such a way that the weight of the cell of the layer corresponding to the true object class is pushed toward 1 and those of the remaining layers are pushed toward 0. Once the training is finished, each object is assigned to a cell of the network and each cell is assigned to the class having a weight closest to 1 among the G weights of the output layers.

    Following a standard procedure, the maximum and minimum correction parameters for the weights were chosen as 0.5 and 0.05, respectively. Moreover, to avoid extreme values before starting the training, all ANN weights were initialized by randomization in the range 0.2–0.8. After trials to evaluate the stability of the classification results, the number of iterations during the training was fixed at 500 epochs in all cases.
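
    The training scheme described above can be sketched as follows. This is a simplified illustration and not the Koala software: the linear decrease of the correction parameter from 0.5 to 0.05, the shrinking square neighborhood on a toroidal map, the triangular neighborhood weighting, the requirement that inputs are range-scaled to 0–1, and all function names are assumptions made for the example.

```python
import random

def winner(kohonen, xi):
    """Cell whose neuron is closest (squared Euclidean distance) to the object xi."""
    return min(kohonen, key=lambda cell: sum((w - v) ** 2
                                             for w, v in zip(kohonen[cell], xi)))

def train_cpann(x, targets, n=6, epochs=50, a_max=0.5, a_min=0.05, seed=0):
    """Train an n x n toroidal counter-propagation map.

    x: objects with inputs scaled to [0, 1] (assumed); targets: unfolded class vectors.
    Returns the Kohonen block (n*n*p weights) and the output block (n*n*G weights).
    """
    rng = random.Random(seed)
    p, g = len(x[0]), len(targets[0])
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Random initialization in the range 0.2-0.8, as in the text.
    kohonen = {cell: [rng.uniform(0.2, 0.8) for _ in range(p)] for cell in cells}
    output = {cell: [rng.uniform(0.2, 0.8) for _ in range(g)] for cell in cells}

    def torus_dist(a, b):
        # Ring distance on a map whose opposite edges are joined (toroid).
        dr, dc = abs(a[0] - b[0]), abs(a[1] - b[1])
        return max(min(dr, n - dr), min(dc, n - dc))

    for epoch in range(epochs):
        frac = epoch / max(epochs - 1, 1)
        rate = a_max - (a_max - a_min) * frac      # correction parameter (assumed linear)
        radius = round((n // 2) * (1 - frac))      # shrinking neighborhood (assumed)
        for xi, ti in zip(x, targets):
            win = winner(kohonen, xi)
            for cell in cells:
                d = torus_dist(cell, win)
                if d <= radius:
                    step = rate * (1 - d / (radius + 1))
                    # Input weights move toward the object profile ...
                    kohonen[cell] = [w + step * (v - w)
                                     for w, v in zip(kohonen[cell], xi)]
                    # ... and output weights toward 1 (true class) or 0 (other classes).
                    output[cell] = [w + step * (t - w)
                                    for w, t in zip(output[cell], ti)]
    return kohonen, output

def predict_class(kohonen, output, xi):
    """Class (numbered from 1) with the largest output weight in the winning cell."""
    out = output[winner(kohonen, xi)]
    return out.index(max(out)) + 1

# Hypothetical toy data: two descriptors scaled to [0, 1], two classes.
toy_x = [[0.1, 0.1], [0.15, 0.2], [0.9, 0.85], [0.8, 0.9]]
toy_t = [[1, 0], [1, 0], [0, 1], [0, 1]]
koh, out = train_cpann(toy_x, toy_t, n=4, epochs=20)
```

    For a 20×20 map with p = 4 input variables and G = 5 classes, the two blocks hold 20·20·4 = 1600 and 20·20·5 = 2000 weights, respectively.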

    The selection of the best subset of variables (VSS method) for modelling the selected properties was done using the genetic algorithm (GA-VSS) approach [25], where the response is obtained by ordinary least squares regression (OLS), using our package MobyDigs for variable selection for Windows/PC [26]. All the calculations were performed using the leave-one-out procedure of cross-validation, maximizing the cross-validated R² (Q²LOO). To avoid multicollinearity without prediction power, the QUIK rule was adopted (see Glossary). Each model in the final population was further validated by the leave-more-out (Q²LMO) procedure with 30% perturbation randomly repeated 5000 times. For other information regarding chemometric terms, see the Glossary.
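
    The quantity maximized by the variable selection, the leave-one-out cross-validated explained variance, can be sketched as below. The sketch assumes the usual definition Q² = 1 − PRESS/TSS and solves the OLS problem through the normal equations; the genetic algorithm and the QUIK rule are not reproduced, and the function names and toy data are hypothetical.

```python
def ols_fit(x, y):
    """Ordinary least squares with intercept, solved via the normal equations."""
    rows = [[1.0] + list(r) for r in x]
    m = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(m)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(m)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(m):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][m] / a[i][i] for i in range(m)]

def ols_predict(coef, xi):
    """Predicted response: intercept plus the weighted sum of the descriptors."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], xi))

def q2_loo(x, y):
    """Leave-one-out cross-validated explained variance, in percent."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(y)):
        coef = ols_fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - ols_predict(coef, x[i])) ** 2
    tss = sum((yi - mean_y) ** 2 for yi in y)
    return 100.0 * (1 - press / tss)

# Hypothetical one-descriptor data lying exactly on y = 2 + 3x.
toy_x = [[0.0], [1.0], [2.0], [3.0], [4.0]]
toy_y = [2.0, 5.0, 8.0, 11.0, 14.0]
```

    In a GA-VSS run, a function of this kind would serve as the fitness evaluated for every candidate descriptor subset; Q²LMO follows the same pattern with larger groups of objects left out at each step.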

    4. Molecular descriptors

    For the one-, two- and three-dimensional representation of the solvent molecules, empirical, structural, topological and WHIM descriptors were used.

    Empirical descriptors are descriptors that account for a particular aspect of the molecule. Two of these descriptors, the unsaturation index (UI) and the hydrophilicity factor (Hy), recently proposed by our research group, were selected to classify and also to model some physico-chemical properties.

    The UI is calculated as an information index by the equation:

    $$\mathrm{UI} = \log_2(1 + b), \qquad b = \frac{2C + 2 - H - X + N + P + 2\,OS - SO_3}{2} - n_R$$

    C, H, X, N, P, OS, and SO3 are the numbers of carbon, hydrogen, halogen, nitrogen, phosphorus, sulfur-bonded oxygen atoms and sulfate groups, respectively, and nR is the number of rings. UI is 0 for saturated compounds and increases with the unsaturation level.
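
    A minimal sketch of the UI calculation follows. It assumes the index has the form UI = log2(1 + b) with b = (2C + 2 − H − X + N + P + 2·OS − SO3)/2 − nR, where the signs are taken so that saturated compounds give UI = 0, as stated above; the function name and argument names are hypothetical.

```python
import math

def unsaturation_index(C, H, X=0, N=0, P=0, OS=0, SO3=0, nR=0):
    """UI = log2(1 + b), with b the number of unsaturations not due to rings (assumed form)."""
    b = (2 * C + 2 - H - X + N + P + 2 * OS - SO3) / 2 - nR
    return math.log2(1 + b)

ui_hexane = unsaturation_index(C=6, H=14)         # saturated alkane
ui_acetone = unsaturation_index(C=3, H=6)         # one C=O double bond
ui_benzene = unsaturation_index(C=6, H=6, nR=1)   # three double bonds, one ring
```

    Under this reading n-hexane gives 0, acetone gives 1 and benzene gives 2, in line with the statement that the index grows with the unsaturation level.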

    Hy is the hydrophilicity index [27] based on atom/group counting. It is an information index calculated by the following expression:

    $$\mathrm{Hy} = \frac{(1 + n_{Hy}) \log_2(1 + n_{Hy}) + n_C \left( \frac{1}{A} \log_2 \frac{1}{A} \right) + \sqrt{n_{Hy}/A^2}}{\log_2(1 + A)}$$

    where nHy, nC and A are the number of hydrophilic groups (-OH, -NH, -SH, etc.), the number of carbon atoms and the total number of non-hydrogen atoms, respectively. The minimum Hy value tends to −1 for large aliphatic compounds, and the maximum values are those for H2O2 (3.64) and H2O (3.44).
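
    The Hy expression can be evaluated with the short sketch below. It assumes the formula Hy = [(1 + nHy)·log2(1 + nHy) + nC·(1/A)·log2(1/A) + sqrt(nHy/A²)] / log2(1 + A); the function name is hypothetical, and the way hydrophilic groups are counted for a given molecule is left to the caller.

```python
import math

def hydrophilicity_factor(n_hy, n_c, a):
    """Hy from the number of hydrophilic groups, carbon atoms and non-hydrogen atoms."""
    numerator = ((1 + n_hy) * math.log2(1 + n_hy)
                 + n_c * (1 / a) * math.log2(1 / a)
                 + math.sqrt(n_hy / a ** 2))
    return numerator / math.log2(1 + a)

hy_decane = hydrophilicity_factor(n_hy=0, n_c=10, a=10)      # about -0.96
hy_methanol = hydrophilicity_factor(n_hy=1, n_c=1, a=2)      # about 1.26
hy_long_alkane = hydrophilicity_factor(n_hy=0, n_c=1000, a=1000)
```

    For a hydrocarbon the expression reduces to −log2(A)/log2(1 + A), which approaches −1 as the chain grows, consistent with the limiting value quoted above.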

    Structural descriptors are defined by simply counting the atoms, functional groups or characteristic fragments of the studied structure, i.e. the number of different kinds of atoms (for example, nH, nN, nX are the numbers of hydrogen, nitrogen and halogen atoms, respectively) and the number of some functional groups (for example, nOH, nSO, nNO are the numbers of OH, SO and NO groups, respectively). Moreover, the numbers of acceptor and donor atoms of H bonds (nHA and nHD) are also considered.

    Topological descriptors, derived from the two-dimensional representation of the molecular structure, are those most widely used in the literature (information and connectivity indices) [28]. The chemical meaning of this kind of descriptor was investigated in a previous paper [29].

    WHIM descriptors are three-dimensional molecular indices calculated from the (x, y, z) atomic coordinates that represent different sources of chemical information; they have been widely and successfully applied in QSAR modelling [30–32]. A detailed description of their chemical meaning and of the WHIM theory is reported in [30].

    Thus the solvents are represented by a wide set of 174 molecular descriptors in total: 31 structural, 9 empirical, 35 topological, 66 directional WHIM and 33 non-directional WHIM.

    5. Results and discussion

    5.1. Building of the training set

    Starting from the Chastrette classification [8], where 83 solvents were divided into nine classes (Table 2), a new training set was constructed to obtain classes with a larger number of solvents and a more homogeneous number of data, which permits easier modelling and corrects the more evident misclassification errors. In our training set, the nine groups proposed initially by Chastrette [8] have been reduced to four classes and a new class has been added:

    Class 1: The aprotic dipolar (AD), aprotic highly dipolar (AHD) and aprotic highly dipolar and polarizable (AHDP) classes have been regrouped into a single class, named aprotic polar class (AP) (25).

    Class 2: The aromatic apolar (ARA) and the aromatic relatively polar (ARP) classes have been grouped into a class named aromatic apolar or lightly polar (AALP) class (20).

    Class 3: The electron pair donors (EPD) class remains as in Chastrette's classification (10).

    Class 4: The hydrogen bonding strongly associated (HBSA) class has been joined to the class of the hydrogen bonding donors (HBD) (23).

    Class 5: This is a new class that has been added to the training set. This class consists of the alkanes selected from the overall data set: n-pentane, cyclopentane, n-heptane, n-octane, iso-octane and n-decane, and is called aliphatic aprotic apolar (AAA).

    The class named MISC (for miscellaneous solvents) in Chastrette's classification, which consisted of only four solvents of high polarizability, was eliminated from the training set but was used in the test set.
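
    The regrouping can be checked with a few lines of Python. The sketch below takes the Chastrette class sizes from Table 2 and sums them according to the merging rules just listed; the dictionary and variable names are hypothetical, and the resulting sizes are those before the individual corrections described in the next paragraph.

```python
# Chastrette class sizes as listed in Table 2.
chastrette_counts = {"AD": 14, "AHD": 9, "AHDP": 2, "ARA": 8, "ARP": 12,
                     "EPD": 10, "HB": 18, "HBSA": 5, "MISC": 4}

# Merging rules for classes 1-4; MISC is left out of the training set.
merge_map = {"AD": "AP", "AHD": "AP", "AHDP": "AP",
             "ARA": "AALP", "ARP": "AALP",
             "EPD": "EPD",
             "HB": "HBD", "HBSA": "HBD"}

merged_counts = {}
for old_class, size in chastrette_counts.items():
    new_class = merge_map.get(old_class)
    if new_class is not None:
        merged_counts[new_class] = merged_counts.get(new_class, 0) + size

print(merged_counts)
```

    The sums reproduce the figures given in parentheses for classes 1 to 4 (25, 20, 10 and 23).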

    In addition, the solvents classified wrongly in Chastrette's classification, cited by Reichardt [9], have been corrected in our training set: tetrahydrofuran has been shifted from class 4 to class 3; n-octanol and benzyl alcohol from 2 to 4; trifluoroacetic acid from 1 to 4; cyclohexane and n-hexane from 3 to the new class 5. Carbon tetrachloride and trichloroethylene (erroneously in class 2) have not been considered in the training set.

    Table 2. Classes of the training set

    Chastrette classification                              n    This work                                     n

    1: aprotic dipolar (AD)                               14    1: aprotic polar (AP)                        24
    2: aprotic highly dipolar (AHD)                        9
    3: aprotic highly dipolar and polarizable (AHDP)       2
    4: aromatic apolar (ARA)                               8    2: aromatic apolar or lightly polar (AALP)   16
    5: aromatic relatively polar (ARP)                    12
    6: electron pair donors (EPD)                         10    3: electron pair donors (EPD)                 9
    7: hydrogen bonding (HB)                              18    4: hydrogen bonding donors (HBD)             25
    8: hydrogen bonding strongly associated (HBSA)         5
    9: miscellaneous (MISC)                                4    5: aliphatic aprotic apolar (AAA)             8
    Total solvents                                        83    Total solvents                               82

    Our training set consists of 82 solvents (indicated by asterisks in Table 1) divided into five classes (Table 2). The other 70 solvents from our data set were used for prediction purposes.

    5.2. Classification performance of KNN on the training set

    Wide sets of structural, empirical and topological molecular descriptors have been used as independent variables to check the performance of different classification methods. After an extensive comparison, the KNN method was found to be the most appropriate method to classify the solvents. Stepwise LDA was also used to select the subset of the most relevant molecular descriptors.

    The best model, obtained using autoscaled data, with K values between 1 and 10, is shown in Table 3.

    The model, obtained as described above, takes into account four molecular descriptors: one topological, the average atomic composition (AAC); one structural, the number of nitrogen atoms in the molecular structure (nN); and the two empirical descriptors (UI and Hy) described above. The discriminant power of these variables has also been highlighted by other classification methods.

    The study of the confusion matrix for this model showed only two solvents classified wrongly: dioxane was included in class 1 instead of class 3, and benzyl alcohol was included in class 2 instead of class 4. The first error is a serious mistake because dioxane is a compound of very low polarity (μ = 0.45 debye). The second classification error could be considered a formal error, as benzyl alcohol has both a hydroxyl group and an aromatic ring in its structure.
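
    The two percentages reported for the KNN model in Table 3 can be recovered from the class sizes of the training set (Table 2) and the two misclassified solvents. The short check below assumes that the no-model error rate corresponds to assigning every solvent to the largest class; the variable names are hypothetical.

```python
# Training-set class sizes from Table 2 (this work).
class_sizes = {1: 24, 2: 16, 3: 9, 4: 25, 5: 8}
n_total = sum(class_sizes.values())

# No-model error rate: every solvent assigned to the largest class.
nomer = 100.0 * (1 - max(class_sizes.values()) / n_total)

# Cross-validated error rate: two misclassified solvents (dioxane, benzyl alcohol).
error_cv = 100.0 * 2 / n_total

print(n_total, round(nomer, 1), round(error_cv, 1))
```

    With 82 solvents this gives 69.5% and 2.4%, the values listed in Table 3.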

    5.3. Prediction of KNN for test solvents

    The model shown in Table 3 has been applied to classify all the solvents from the data set. The results are collected in column A of Table 1. Although the classification was obtained from a theoretical description of the molecular structure, it is in good accordance with the physico-chemical properties of the solvents. A study of this classification highlights the following.

    The model assigns to class 1 (previously the class of the aprotic polar solvents) ketones, aliphatic esters, nitro compounds, nitriles, pyridine bases, N-substituted ureas, dimethylsulfoxide, sulfolane, HMPTA, and some polyethers. Most of the chlorinated solvents are also included in this class, while trichloroethylene represents an outlier as it is an apolar solvent (0.8 debye). Thus, this class can correctly be named as defined in the training set (AP).

    Class 2 (previously the class of the aromatic apolar or lightly polar solvents) (AALP) collects exclusively such solvents.

    Class 3 consists mainly of ethers and tertiary amines, in addition to the outlier halogenated alkanes. In any case, this class can be named the class of electron pair donor solvents (EPD).

    Class 4 contains the hydrogen bond donor solvents: alcohols, water, glycols, aminoglycols, aliphatic primary and secondary amines, acetic acid and trifluoroacetic acid. Its correct name is the class of the hydrogen bond donor solvents (HBD).

    The model includes in class 5, in addition to the alkanes, carbon tetrachloride and tetrachloroethylene. This class is now named the class of the aliphatic apolar and aprotic solvents (AAA).

    In the proposed classification only a few solvents appear not clearly, or wrongly, classified.

    Class 1 had wrongly assigned to it trichloroethylene (μ = 0.8 debye), furfuryl alcohol (μ = 1.92 debye) and aniline (μ = 1.51 debye). These compounds have low but significant dipole moment values and would have been better classified in other classes closer to their structural properties (furfuryl alcohol among the HBD solvents and aniline among the EPD or aromatic solvents). Perhaps N,N-dimethylaniline should also be included in the EPD or the aromatic solvent class.

    Unsaturated non-aromatic compounds such as acetylacetone, carbon disulfide and cyclohexene were wrongly assigned to class 2. Acetylacetone would, perhaps, be better classified in class 1, as its structure has two carbonyl groups, making it a polar compound, but the proton between the two carbonyl groups has an acidic property and so it is not completely an aprotic solvent. The other two solvents should have been classified as aliphatic apolar aprotic solvents (class 5). Benzaldehyde has a high dipole moment value (3.02 debye) and because of its polarity should have been included in class 1, but it was assigned to class 2 because of the presence of the aromatic ring.

    Some halogenated alkanes, which should have been classified in class 1 with the aprotic polar compounds, were included in class 3.

    Chloroform was curiously assigned to class 4. The anesthetic properties of this solvent are due to its capacity to form a hydrogen bond [33].

    Some long-chain ethers were assigned to class 5. Perhaps these compounds would have been more correctly included among the EPD solvents, but the long aliphatic chain is dominant over the oxygen atoms.

    Table 3. KNN classification model

    K    Descriptors         NOMER (%)    NMERCV (%)
    4    UI, Hy, AAC, nN     69.5         2.4

    5.4. Improving the classification by CP-ANN

    Using the training set described above (five classes) and the molecular descriptors selected to classify the solvents by the KNN method (UI, Hy, AAC and nN), several CP-ANN architectures were tested. A good separation of the five classes was obtained by net architecture-1 (20×20×4, 200 iterations), with NER (%) = 95.6 and NERCV (%) = 88.6.

    Fig. 1 shows the distribution of the 152 solvents studied in this work together with rough class boundaries; it must be remembered that the map has the form of a toroid. Column B of Table 1 reports the class assignments of this net, with multiple assignments for compounds not sharply assigned.

    Fig. 1. Kohonen 20×20 toroidal top map of CP-ANN-1. Contour lines are drawn for each of the five classes, following the class assignment from the counter-propagation procedure-1. The solvents are labelled as previously assigned by the KNN classification method.

    Some anomalous classifications of KNN are explained and corrected by observing the relative positions of the classes in the space defined by this net. Note that the overlapping of class 3 with other classes and the relative positions of the classes within the classification space explain some of the solvents not clearly classified by the KNN method.

    As shown in Fig. 1, classes 1, 2 and 4 are clearly separate. Class 3 is surrounded by classes 1, 4, and 5 and there is a logical overlapping of classes 3 and 1, and classes 3 and 4, because the compounds with electron pairs (class 3) can also act in hydrogen bonding processes (class 4) and be polar compounds (class 1). The overlapping of classes 1, 3 and 5 produces a place in the classification space where ethers and halogenated alkanes can be found. The right class for ethers would be class 3, and it is evident in this map that ethers with a small aliphatic chain are found near class 1, also including dioxane (65), DEGDME (40) and TEGDME (133), while ethers with a longer aliphatic chain (e.g., di-n-butyl ether, 42) are near the class 5 border. The two tertiary amines (140 and 143) are isolated in the same neuron, far from the other compounds of class 3, due to their strong structural dissimilarity from the others in the class. Cyclohexene (34) and carbon disulfide (23) are now at the border of the correct class 5, and furfuryl alcohol (79) is now correctly classified in class 2 near class 4. Acetylacetone (6) is at the border of the correct class 1, near acetic anhydride (2). Tetrachloroethylene (135) and carbon tetrachloride (24) are wrongly classified.

    Fig. 2. Kohonen 20×20 toroidal top map of CP-ANN-2. Contour lines are drawn for each of the five classes, following the class assignment from the counter-propagation procedure-2. The training solvents are labelled as classified by CP-ANN-1; the solvents of unknown classes are labelled with an asterisk.

    A new CP-ANN (called 2) was finally performed to verify the fitting for our proposal of classification: all the solvents were put into the classes defined by the previous net (CP-ANN-1), while ethers were put into class 3 and halogenated alkanes and carbon disulfide were considered to be in an unknown class. The new net CP-ANN-2 (20×20×4, 500 iterations) (Fig. 2) clearly defines class 3, where the tertiary amines are now collected near the other compounds of this class: ethers and halogenated alkanes. Cyclohexene (34), carbon disulfide (23) and tetrachloroethylene (135), not clearly classified by CP-ANN-1, are collected together between classes 2 and 5. Carbon tetrachloride (24) still remains together with the compounds of class 4, where it was wrongly assigned by the previous CP-ANN-1. Column C of Table 1 reports the final classification for all 152 solvents of our data set.

    5.5. Regression models

    Fourteen important physico-chemical properties were selected to search for quantitative structure–property relationships. The properties are the eight used by Chastrette for his classification and seven other physico-chemical properties of interest for solvents. A wide set of molecular descriptors was used as independent variables for multiple linear regression models. The genetic algorithm was applied to select the best descriptor subset from among the 174 previously defined descriptors.

    The regression models selected as the best for each property are the following.

    Boiling point ( bp, C, 1 atm. ):bp =329.29+0.45 MW+46.81 nHD+16.34nHA+11.22 GSI+36.90 E1s3138.00 E3sn = 152, p = 6, Q2LOO = 79.3, Q

    2LMO = 78.7,

    R2 = 81.5.Density (d, g /cm3, 20C):

    d = 0.869+0.008 MW30.078 Sv30.315 E3pn = 152, p = 3, Q2LOO = 90.6, Q

    2LMO = 90.4,

    R2 = 91.2.Refraction index (n, 20C):

    n = 1.383+0.001 MW+0.09 nS30.05 nF+0.06nR06+0.02 Hy30.22 E3p

    n = 152, p = 6, Q2LOO = 82.3, Q2LMO = 80.0,

    R2 = 84.6.Molar refractivity (MR, cm3 / mol ):

    MR =30.65+0.09 MW+2.17 Svn = 152, p = 2, Q2LOO = 94.7, Q

    2LMO = 94.7,

    R2 = 94.9.Specic thermal capacity at constant pressure (Cp,

    cal / molWK, 25C):Cp = 3.40+11.47 nHD+3.64 K1+0.92 Se37.26 Hyn = 126, p = 4, Q2LOO = 92.35, Q

    2LMO = 92.14,

    R2 = 93.05.Vapor enthalpy (Hv, kcal / mol, at bp):

    Hv = 9.50+1.21 nN+2.66 nOH+2.78 nSO+0.12ISIZ34.99 CHI0A+0.86 E1sn = 140, p = 6, Q2LOO = 77.9, Q

    2LMO = 74.6,

    R2 = 79.9.Dipolar moment (W, debye):W =30.46+0.73 nN31.33 AAC30.26 K1+0.61MAXDP+0.30 E1m+0.15 Tsn = 45, p = 6, Q2LOO = 77.6, Q

    2LMO = 77.0,

    R2 = 79.4.Reichardt-Dimroth polarity parameter (NT, 25C

    and 1 bar ):NT = 0.54+0.20 nHD30.24 IDDM+0.02 Ss+0.04MAXDP30.11 G1mn = 130, p = 5, Q2LOO = 79.0, Q

    2LMO = 78.5,

    R2 = 81.1.Kamlet-Taft polarity parameter:

    KT = 1.19+0.07 IAC30.38 CHI1A30.14BAL30.08 Se30.31 E3en = 104, p = 5, Q2LOO = 77.3, Q

    2LMO = 76.4,

    R2 = 79.9.Hildebrandt solubility parameter (H,

    {kcal /dm3}1=2):H = 22.83+0.49 nHA39.07 CHI0A+0.32GSI33.21 IDDM+1.85 SIC+2.06 Hyn = 135, p = 6, Q2LOO = 77.2, Q

    2LMO = 76.0, R

    2 = 79.7.Aqueous solubility (S, mol / l at 25C):

    Log S = 1.36+0.72 nH+0.81 nOH31.84 nNO+1.57nHA31.92 WIAn = 88, p = 5, Q2LOO = 80.1, Q

    2LMO = 77.0, R

    2 = 82.7.Hydrophobicity ( log Kow ):

    Log Kow = 1.03+0.01 MW+1.06 nNO30.98nHA31.35 AAC+0.74 WIA30.40 Hyn = 145, p = 6, Q2LOO = 85.9, Q

    2LMO = 84.9, R

    2 = 87.9.Henry coefcient ( log H):

    Log H = 5.7931.36 nN+1.41 nNO31.86nHD34.56 AAC30.59 MAXDP33.02 P2mn = 71, p = 6, Q2LOO = 88.0, Q

    2LMO = 84.9, R

    2 = 89.9.Flash point (Fp, C, method `closed cup'):

    Fp =3291.57329.54 nX+25.42 nOH+186.93AAC311.93 Se+30.22 Sp+29.07 Hy

    trends in analytical chemistry, vol. 18, no. 7, 1999 469

  • TRAC 2552 25-6-99

    n = 136, p = 6, Q2LOO = 78.7, Q2LMO = 77.9, R

    2 = 81.3.All the reported models have good or acceptable

    predictive powers and conrm their stability by vali-dation with 30% perturbation. Only models for aque-ous solubility, vapor enthalpy and Henry coefcientshow some instability, but their performances appearsatisfactory.
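    To illustrate how one of these models is applied, the following minimal sketch evaluates the two-descriptor molar refractivity model. The coefficients are read from the equation as printed (the intercept is taken to be negative); the function name `predict` and the descriptor values in the usage lines are hypothetical and serve only to show the arithmetic.

```python
# Coefficients of the molar refractivity (MR) model as read from the text:
#   MR = -0.65 + 0.09 MW + 2.17 Sv   (cm3/mol)
# The negative intercept is an assumption about the printed sign.
MR_MODEL = {"intercept": -0.65, "MW": 0.09, "Sv": 2.17}


def predict(model, descriptors):
    """Evaluate a linear QSPR model: intercept + sum(coefficient * descriptor)."""
    value = model["intercept"]
    for name, coef in model.items():
        if name != "intercept":
            value += coef * descriptors[name]
    return value


# Hypothetical descriptor values, for illustration only (not a real solvent).
example = {"MW": 100.0, "Sv": 5.0}
mr = predict(MR_MODEL, example)
print(f"predicted MR = {mr:.2f} cm3/mol")
```

    The same function works for any of the models listed, provided the dictionary holds the corresponding coefficients and the descriptor values are computed with the same software and scaling as in the original study.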

    From all the models several kinds of descriptors were selected, belonging to every group: structural, empirical, topological and 3D-WHIM descriptors. It is interesting to note that in all the models, in addition to dimensional descriptors, the selected descriptors are often meaningful. For instance, MAXDP (maximum separation of electrotopological positive charges) is selected for modelling properties related to molecular charge distribution (dipolar moment, Reichardt-Dimroth polarity parameter), while the hydrophilicity factor (Hy) and hydrogen donors or acceptors (nHD, nHA) are selected to model solubility, hydrophobicity and, generally, properties related to molecule bonding.

    Topological descriptors selected by GA to model physico-chemical properties are the Gordon-Scantlebury index (GSI), the average Wiener index (WIA), the information content on size (ISIZ), the information content on magnitude of distance degrees (IDDM), the topological shape (K1), the average connectivity indices of order 0 and 1 (CHI0A and CHI1A), the Balaban J index (BAL), the structural information content (SIC), the atomic composition index (IAC) and the average atomic composition index (AAC). GSI, WIA, ISIZ and IDDM are mainly related to molecular size, CHI0A and CHI1A are related to molecular branching, while IAC and AAC are related to molecular complexity in terms of atom types. The selected WHIM descriptors are related to the global size (T and S), to the shape (P2), to the axial atom distribution (E1, E3) and to the axial symmetry (G1). The indices Sv, Sp, Se, and Ss are the sums of the atomic properties as defined in the WHIM approach [30–32] (v = van der Waals volumes, p = atomic polarizabilities, e = Mulliken atomic electronegativities, s = electrotopological indices of Kier and Hall), and are thus independent of the conformation.

    6. Conclusions

    A new general solvent classification has been proposed. A large number of solvents have been classified into five classes by a simple model with remarkable chemical meaning. The proposed solvent classification has been established by taking into account four molecular descriptors (which are theoretical descriptions of the molecular structure) that are in good accordance with the physico-chemical properties of the solvents. The use of various CP-ANNs improves the results previously obtained by the KNN classification method, and has made these results more chemically interpretable.

    Fourteen physico-chemical properties were also modelled. Using the genetic algorithm to select the best subsets of molecular descriptors from a wide set, it was possible to model the physico-chemical properties of a data set as heterogeneous as the solvents studied. The prediction powers of all the regression models obtained remain high, even when estimated by the more severe leave-more-out procedure.

    7. Glossary

    Genetic algorithm–variable subset selection (GA-VSS): this is a strategy for the selection of subsets of variables based on genetic algorithms. Each variable is denoted by a bit equal to 1 if present in the regression model or equal to 0 if excluded. A population of 0/1 bit strings (each of length equal to the total number of variables) is evolved following the rules of genetic algorithms, maximizing the predictive power of the models. Ordinary least squares (OLS) regression is used as the regression tool. Finally, the best model for each dimension is selected. To avoid multicollinearity without prediction power, the QUIK rule (Q under the influence of K) [34] is adopted, which excludes models with a K multivariate correlation index of the [X] variable block greater than the correlation within the [X+y] block variables, where y is the response variable.
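    The bit-string scheme described in this entry can be sketched as follows. This is a minimal illustration under stated assumptions, not the Moby Digs implementation: the function names (`q2_loo`, `ga_vss`), the population size, the uniform crossover, the mutation rate and the cap on model size are all hypothetical choices, the fitness is the leave-one-out Q2 of an OLS model, and the QUIK rule is not implemented. The data are synthetic.

```python
import numpy as np


def q2_loo(X, y):
    """Leave-one-out Q2 of an OLS model with intercept.

    Uses the exact OLS identity e_loo_i = e_i / (1 - h_ii), where h_ii is
    the i-th diagonal element of the hat matrix.
    """
    A = np.column_stack([np.ones(len(y)), X])
    hat = A @ np.linalg.pinv(A)
    loo_resid = (y - hat @ y) / (1.0 - np.diag(hat))
    return 1.0 - np.sum(loo_resid ** 2) / np.sum((y - y.mean()) ** 2)


def ga_vss(X, y, max_vars=3, pop_size=30, generations=40, p_mut=0.05, seed=0):
    """Evolve 0/1 bit strings (True = variable in the model) to maximize Q2 LOO."""
    rng = np.random.default_rng(seed)
    n_vars = X.shape[1]

    def fitness(mask):
        # Empty models and models larger than max_vars are rejected.
        if not 0 < mask.sum() <= max_vars:
            return -np.inf
        return q2_loo(X[:, mask], y)

    pop = rng.random((pop_size, n_vars)) < 0.25  # random initial bit strings
    for _ in range(generations):
        scores = np.array([fitness(m) for m in pop])
        # Keep the better half unchanged (elitism) and breed the rest from it.
        parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(n_vars) < 0.5, a, b)  # uniform crossover
            child = child ^ (rng.random(n_vars) < p_mut)      # bit-flip mutation
            children.append(child)
        pop = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(m) for m in pop])
    best = int(np.argmax(scores))
    return pop[best], scores[best]


# Synthetic data: only columns 0 and 2 carry signal.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(40, 6))
y = 2.0 * X[:, 0] - 3.0 * X[:, 2] + 0.1 * data_rng.normal(size=40)
best_mask, best_q2 = ga_vss(X, y)
print("selected variables:", np.flatnonzero(best_mask), "Q2 LOO:", round(best_q2, 3))
```

    Selecting "the best model for each dimension", as the entry describes, would amount to running the search (or filtering the final population) separately for each number of variables.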

    K-nearest neighbor (KNN): a classification method that searches for the k nearest neighbors of each object in the data set and assigns the object to the class to which the majority of its k nearest neighbors belong.
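    A minimal sketch of this majority-vote rule is given below. The Euclidean distance, the function name `knn_classify` and the toy data are assumptions made for illustration; in practice the descriptors would normally be scaled before distances are computed.

```python
from collections import Counter

import numpy as np


def knn_classify(X_train, y_train, x, k=3):
    """Assign x to the majority class among its k nearest training objects."""
    dists = np.linalg.norm(X_train - x, axis=1)  # Euclidean distances to x
    nearest = np.argsort(dists)[:k]              # indices of the k closest objects
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Toy data: two well-separated classes in a two-descriptor space.
X_train = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                    [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_train = np.array(["A", "A", "A", "B", "B", "B"])

label_low = knn_classify(X_train, y_train, np.array([0.5, 0.5]))
label_high = knn_classify(X_train, y_train, np.array([5.5, 5.5]))
print(label_low, label_high)
```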

    No-model error rate (NOMER): this is a reference measure for classification without any classification model, i.e. all the objects are considered to belong to the most numerous class and the error rate is calculated as the ratio between the number of objects that do not belong to that class and the total number of objects.
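    As a small worked example, the sketch below computes this reference error rate, reading the definition as "every object outside the most numerous class counts as an error". The function name `nomer` and the class labels are hypothetical.

```python
from collections import Counter


def nomer(labels):
    """No-model error rate: fraction of objects outside the largest class."""
    counts = Counter(labels)
    largest = max(counts.values())
    return (len(labels) - largest) / len(labels)


# Hypothetical class memberships: 5 objects in class 1, 3 in class 2, 2 in class 3.
labels = [1] * 5 + [2] * 3 + [3] * 2
rate = nomer(labels)
print(f"NOMER = {100 * rate:.1f}%")
```

    A classification model is only useful if its cross-validated error rate is clearly lower than this value.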

    Cross-validated non-error rate (NERCV): this is the non-error rate, i.e. the fraction of correctly classified objects, calculated for classification methods with a cross-validation procedure.


    Cross-validated $R^2$ ($Q^2_{CV}$): this is a measure of the predictive power of a regression model. In particular, $Q^2_{LOO}$ and $Q^2_{LMO}$ denote predictive power calculated leaving out from the regression model one object at a time (leave-one-out procedure, LOO) and leaving out more than one object at a time (leave-more-out procedure, LMO), respectively.
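    The two procedures can be sketched for an OLS model as follows. This is an illustrative sketch only: the function names are hypothetical, the data are synthetic, and the leave-more-out variant assumes repeated random removal of 30% of the objects, which may differ from the exact protocol used in the paper. The functions return Q2 as a fraction; multiply by 100 to compare with values quoted as percentages.

```python
import numpy as np


def ols_predict(X_train, y_train, X_test):
    """Fit OLS with intercept on the training objects and predict the test objects."""
    A_train = np.column_stack([np.ones(len(X_train)), X_train])
    A_test = np.column_stack([np.ones(len(X_test)), X_test])
    coef, *_ = np.linalg.lstsq(A_train, y_train, rcond=None)
    return A_test @ coef


def q2_from_splits(X, y, test_sets):
    """Q2 = 1 - (mean squared prediction error) / (variance of y about its mean)."""
    n = len(y)
    sq_errors = []
    for test_idx in test_sets:
        train = np.ones(n, dtype=bool)
        train[test_idx] = False  # objects left out of the fit
        pred = ols_predict(X[train], y[train], X[test_idx])
        sq_errors.extend((y[test_idx] - pred) ** 2)
    return 1.0 - np.mean(sq_errors) / np.mean((y - y.mean()) ** 2)


def q2_loo(X, y):
    """Leave-one-out: each object is predicted by a model fitted without it."""
    return q2_from_splits(X, y, [[i] for i in range(len(y))])


def q2_lmo(X, y, fraction=0.3, repeats=200, seed=0):
    """Leave-more-out: repeatedly leave out a random fraction of the objects."""
    rng = np.random.default_rng(seed)
    n = len(y)
    m = max(1, int(round(fraction * n)))
    splits = [rng.choice(n, size=m, replace=False) for _ in range(repeats)]
    return q2_from_splits(X, y, splits)


# Synthetic data with a strong linear relationship.
data_rng = np.random.default_rng(2)
X = data_rng.normal(size=(50, 2))
y = 1.0 + 2.0 * X[:, 0] - X[:, 1] + 0.1 * data_rng.normal(size=50)
loo, lmo = q2_loo(X, y), q2_lmo(X, y)
print(f"Q2 LOO = {100 * loo:.1f}%, Q2 LMO = {100 * lmo:.1f}%")
```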

    References

    [1] M. Bohole, W. Kollecker, D. Martin, Z. Chem. 17 (1977) 161.
    [2] A.J. Parker, Chem. Rev. 69 (1969) 1.
    [3] R. Carlson, T. Lundstedt, C. Albano, Acta Chem. Scand. B 39 (1985) 79.
    [4] L.R. Snyder, J. Chromatogr. 92 (1974) 223.
    [5] S.K. Poole, C.F. Poole, J. Chromatogr. 697 (1995) 415.
    [6] A.L. Molinero, J.R. Castllo, P. Chamorro, J.M. Muniozguren, Spectrochim. Acta B At. Spectrosc. 52 (1997) 103.
    [7] A. de Juan, G. Fonroda, E. Casassas, Trends Anal. Chem. 16 (1997) 52.
    [8] M. Chastrette, M. Rajzmann, M. Chanon, K.F. Purcell, J. Am. Chem. Soc. 107 (1985) 1.
    [9] C. Reichardt, Solvents and Solvent Effects in Organic Chemistry, 2nd revised and enlarged edition, VCH, Weinheim, 1990, p. 75.
    [10] J. Riddick, W.B. Bunger, T. Sakano, Organic Solvents: Physical Properties and Methods of Purification, 4th edn., Wiley, New York, 1986.
    [11] Handbook of Chemistry and Physics, CRC Press, Boca Raton, FL, 1995.
    [12] C. Reichardt, Solvents and Solvent Effects in Organic Chemistry, 2nd revised and enlarged edition, VCH, Weinheim, 1990, p. 365.
    [13] T. Suzuki, K. Ohtaguchi, K. Koide, Comput. Chem. 16 (1992) 41.
    [14] Handbook of Fine Chemicals, ACROS-Chimica, Geel, 1994–1995.
    [15] C. Hansch, A. Leo, D. Hoekman, Hydrophobic, Electronic and Steric Constants, A.C.S. Professional Reference Book, Washington, DC, 1995.
    [16] R. Kune, R.U. Ebert, F. Kleint, G. Schmidt, G. Scuurmann, Chemosphere 30 (1995) 2061.
    [17] M.J. Kamlet, J.L.M. Abboud, M.H. Abraham, R.W. Taft, J. Org. Chem. 48 (1983) 2877.
    [18] M.H. Abraham, P.L. Grellier, J.L.M. Abboud, R.M. Doherty, R.W. Taft, Can. J. Chem. 66 (1988) 2673.
    [19] M.J. Kamlet, R.W. Taft, J. Chem. Perkin Trans. 2 (1985) 1583.
    [20] Hyperchem, Rel. 4 for Windows, Autodesk, Sausalito, CA, 1995.
    [21] R. Todeschini, WHIM-3D/QSAR software for the calculation of the WHIM descriptors, Rel. 2.1 for Windows, Talete, Milan, 1996.
    [22] R. Todeschini, U. Consentino, I.E. Frank, G. Moro, Scan software for chemometric analysis, Rel. 1 for Windows, Jerll, Standard, CA, 1992.
    [23] R. Todeschini, Koala – Kohonen Artificial Layers, Rel. 1.0, Milan Research Group of Chemometrics, Milan, 1998.
    [24] J. Zupan, M. Novic, I. Ruisanchez, Chem. Intell. Lab. Syst. 38 (1997) 1.
    [25] R. Leardi, R. Boggia, M. Terrile, J. Chemometr. 6 (1992) 267.
    [26] R. Todeschini, Moby Digs software for variable subset selection by genetic algorithms, Rel. 1.0 for Windows, Talete, Milan, 1997.
    [27] R. Todeschini, M. Vighi, A. Finizio, P. Gramatica, SAR QSAR Environ. Res. 7 (1997) 173.
    [28] L.B. Kier, L.H. Hall, Molecular Connectivity in Structure–Activity Analysis, Research Studies Press, Letchworth/Wiley, New York, 1986.
    [29] R. Todeschini, R. Cazar, E. Collina, Chem. Intell. Lab. Syst. 15 (1992) 51.
    [30] R. Todeschini, P. Gramatica, Quant. Struct.-Activ. Relat. 16 (1997) 113.
    [31] R. Todeschini, P. Gramatica, Quant. Struct.-Activ. Relat. 16 (1997) 120.
    [32] R. Todeschini, P. Gramatica, SAR QSAR Environ. Res. 7 (1997) 89.
    [33] C. Reichardt, Solvents and Solvent Effects in Organic Chemistry, 2nd revised and enlarged edition, VCH, Weinheim, 1990, p. 75.
    [34] R. Todeschini, V. Consonni, A. Maiocchi, Chemometr. Int. Lab. Syst. (in press).

    Prof. Paola Gramatica has a degree in organic chemistry. She has been involved in studies on natural organic compounds since 1995, when she started her research on QSAR applications of molecular descriptors and chemometric methods. She is now Professor of Organic Chemistry and also of Environmental Chemistry at the University of Insubria. She is the chief of the QSAR Research Unit at the Department of Structural and Functional Biology of the University of Insubria, Varese, Italy.

    Prof. Dr. Roberto Todeschini has a degree in physical chemistry. A Professor of Physical Chemistry, since 1983 his main research interest has been chemometrics. He teaches Chemometrics at the University of Milan, is the chief of the Milan Chemometric Research Group at the Department of Environmental Sciences of the University of Milan-Bicocca, and is a member of the editorial board of Chemometrics and Intelligent Laboratory Systems.

    Dr. Natalia Navas is Professor of Analytical Chemistry at the Department of Analytical Chemistry, University of Granada, Spain.

    trends in analytical chemistry, vol. 18, no. 7, 1999 471