CHEMISTRY

Pseudofragmental Descriptors Based on Combinations of Atomic Properties for Prediction of Physical Properties of Polymers in Quantitative Structure–Property Relationship Studies

N. I. Zhokhova, I. I. Baskin, A. N. Zefirov, V. A. Palyulin, and Academician N. S. Zefirov

Moscow State University, Moscow, 119991 Russia

Received July 30, 2009

DOI: 10.1134/S0012500810020023

ISSN 0012-5008, Doklady Chemistry, 2010, Vol. 430, Part 2, pp. 39–42. © Pleiades Publishing, Ltd., 2010. Original Russian Text © N.I. Zhokhova, I.I. Baskin, A.N. Zefirov, V.A. Palyulin, N.S. Zefirov, 2010, published in Doklady Akademii Nauk, 2010, Vol. 430, No. 5, pp. 635–638.

Among the great number of QSPR/QSAR approaches to the prediction of physical and chemical properties and biological activity of organic compounds, methods using fragmental descriptors play a special role [1, 2]. The values of such descriptors can be either the occurrence numbers of certain fragments in the structures of chemical compounds or indicators of their presence. The advantages of these descriptors are their transparent meaning and the possibility of fast automatic generation from the structural formula alone. Fragmental descriptors can be calculated without knowledge of the 3D structure or electronic structure of molecules and, therefore, are easily applied to large databases.

One of the disadvantages of fragmental descriptors is the problem of rare fragments, which can be absent from the training set but present in the compounds for which the prediction is performed. Since the contributions of rare fragments cannot be determined from the training set, considerable prediction errors are expected for compounds containing such fragments. We suggest solving this problem by introducing additional descriptors whose values are to some extent related to the contributions of fragments to the predicted property. For this purpose, we suggest using special fragmental descriptors whose values are calculated by combining the properties of the atoms that constitute the fragments. Such descriptors are referred to as pseudofragmental descriptors in order to distinguish them from “proper” fragmental descriptors, whose values are the occurrence numbers or indicators of the presence of certain fragments in the structures of chemical compounds. Atomic properties that are believed to influence the contributions of fragmental descriptors to the predicted property, for example, the atomic weight, number of electrons, covalent radius, electronegativity, and ionization potential, can be used for predicting physical and chemical properties of organic molecules. It is also important that the combinations of properties used have a clear physical meaning, since this improves the chance that their values correlate with fragmental contributions. If such a correlation exists, a small number of pseudofragmental descriptors enter into statistical models instead of numerous proper fragmental descriptors, including potentially rare ones, thus acting as a compressed generalization of the latter. This largely solves the problem of rare fragments if the pseudofragmental descriptors are constructed on the basis of frequently encountered fragments consisting of separate atoms or short chains of arbitrary atoms, which are present in almost all molecules.

As the first example of a pseudofragmental descriptor, let us consider the construction p1_AR3 = $\frac{1}{N_a}\sum_{i=1}^{N_a} R_i^3$. Here, the atomic radius acts as the atomic property. It is evident that the cubed atomic radius is proportional to the atomic “volume.” Inasmuch as the summation is over atoms, they act as the basic fragments for calculating the descriptor. The physical meaning of this descriptor is the average specific volume of an atom. We can assume that this descriptor will play a significant role in predicting the volumetric properties of substances, for example, the density. Even if it is necessary to predict such a property for a chemical compound containing a rare element absent from the training set, a reasonable approximation of its contribution to the predicted property will be obtained.

The pseudofragmental descriptors under consideration can be used for constructing statistical models in combination with proper fragmental descriptors [3],
which we have previously used successfully for predicting physical and chemical properties of organic compounds of different classes [4–6]. We have also demonstrated [7, 8] the efficiency of individual combinations of descriptors of this type with fragmental descriptors.

In the present work, we studied descriptors based on combinations of properties of atoms in fragments for the prediction of three key physical properties of polymers: the refractive index (n, 298 K), glass transition temperature (T, K), and density in the amorphous state (ρ, g/cm³; 298 K). Previously, these properties were modeled using the Van Krevelen group contribution method [9] and the Askadskii schemes [10]. These methods are not actually statistical and, therefore, the statistical characteristics of the corresponding models have not been estimated. QSPR models for the calculation of polymer properties were described by Bicerano [11]; however, the predictive power of these models has not been determined by means of cross-validation or an independent external set, which makes a direct comparison of their statistical characteristics impossible.

Table. Formulas for the calculation of combinations of atomic properties within fragments, and the codes of the descriptors most frequently encountered in the QSPR models obtained for predicting polymer properties

| No. | Descriptor description | Code and formula |
|---|---|---|
| | Density in the amorphous state | |
| 1 | Ratio of the number of electrons to the number of atoms in the molecule, or the mean number of electrons per atom | p1_ANe = $N_e/N_a$ |
| 2 | Mean product of the atom electronegativities for all bonds in the molecule | p2_APE = $\frac{1}{N_b}\sum_{p_2}\chi(a_1)\chi(a_2)$ |
| 3 | Maximum, over all bonds in the molecule, of the product of the absolute electronegativity difference and the order of the corresponding bond | p2_HDE = $\max_{p_2}\left(\lvert\chi(a_1) - \chi(a_2)\rvert\, n_b\right)$ |
| 4 | Sum of the products of the electronegativity differences of the atoms in the 1–2 and 5–4 positions for all five-atom chains | p5_SPDE = $\sum_{p_5}(\chi(a_1) - \chi(a_2))(\chi(a_5) - \chi(a_4))$ |
| 5 | Ratio of the sum of cubed atomic radii to the number of atoms in the molecule | p1_AR3 = $\frac{1}{N_a}\sum_{i=1}^{N_a} R_i^3$ |
| 6 | Average product of the differences between the electronegativities of the atoms in the 1–2 and 5–4 positions for all five-atom chains | p5_APDE = $\frac{1}{N_{p_5}}\sum_{p_5}(\chi(a_1) - \chi(a_2))(\chi(a_5) - \chi(a_4))$ |
| | Glass transition temperature | |
| 7 | Average product of the atomic radii in the 1–4 positions for all four-atom chains | p4_APR = $\frac{1}{N_{p_4}}\sum_{p_4} R(a_1)R(a_4)$ |
| 8 | Number of π electrons in the molecule | p1_Npi = $N_\pi$ |
| 9 | Sum of the differences between the electronegativities for all X–H bonds in the molecule, where X is a heteroatom | p2_SDEHnc = $\sum_{p_2,\ a_1 \neq \mathrm{C}}(\chi(a_1) - \chi(\mathrm{H}))$ |
| | Polarizability | |
| 10 | Mean atomic ionization potential of the molecule | p1_AIP = $\frac{1}{N_a}\sum_{i=1}^{N_a} I_i$ |
| 11 | See descriptor 7 | p4_APR |
| 12 | Minimal electronegativity of an atom in the molecule | p1_LE = $\min(\chi_i)$ |
| 13 | See descriptor 9 | p2_SDEHnc |
| 14 | Mean product of the differences between the electronegativities of the atoms in the 1–2 and 3–2 positions for all three-atom bonded fragments, excluding the bonds to hydrogen atoms | p3_APDEh = $\frac{1}{N_{p_3}}\sum_{p_3}(\chi(a_1) - \chi(a_2))(\chi(a_3) - \chi(a_2))$ |
| 15 | Sum of the products of the differences between the electronegativities of the atoms in the 1–2 and 4–3 positions for all four-atom chains | p4_SPDE = $\sum_{p_4}(\chi(a_1) - \chi(a_2))(\chi(a_4) - \chi(a_3))$ |

Note: Descriptors are arranged in order of decreasing occurrence number in the individual models for the corresponding property. Designations: χ is the electronegativity; R is the covalent atomic radius; I is the atom ionization potential; $N_a$ is the number of atoms; $N_e$ is the number of electrons in the molecule; $N_\pi$ is the number of π electrons in the molecule; $N_{p_n}$ is the number of chains of length n; $a_n$, $b_n$, and $p_n$ stand for atoms, bonds, and chains of atoms, respectively, which are defined as follows:

[Scheme: fragments $p_1$ to $p_5$, drawn as chains of one to five bonded atoms $a_1$, ..., $a_5$, with bond $b_1$ joining atoms $a_1$ and $a_2$.]
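As an illustration of how several of the tabulated combinations could be evaluated, the following minimal Python sketch computes p1_ANe, p1_AR3, p2_APE, p2_HDE, p4_APR, and p4_SPDE for a molecule given as a list of element symbols and a list of bonds. It is not the FRAGPROP implementation. The atomic data (Pauling electronegativities, covalent radii in angstroms), the use of explicit hydrogen atoms, the convention of counting each chain once, and all function and variable names are assumptions made for this example; the source does not state which scales or conventions the software uses.

```python
# Assumed atomic data for illustration only: Pauling electronegativity,
# covalent radius (angstrom), and atomic number (electrons per neutral atom).
CHI = {"H": 2.20, "C": 2.55, "N": 3.04, "O": 3.44}
RADIUS = {"H": 0.31, "C": 0.76, "N": 0.71, "O": 0.66}
Z = {"H": 1, "C": 6, "N": 7, "O": 8}


def p1_ANe(atoms):
    """Mean number of electrons per atom, N_e / N_a."""
    return sum(Z[a] for a in atoms) / len(atoms)


def p1_AR3(atoms):
    """Mean cubed atomic radius, (1/N_a) * sum of R_i^3."""
    return sum(RADIUS[a] ** 3 for a in atoms) / len(atoms)


def p2_APE(atoms, bonds):
    """Mean product of electronegativities over all bonds."""
    return sum(CHI[atoms[i]] * CHI[atoms[j]] for i, j, _ in bonds) / len(bonds)


def p2_HDE(atoms, bonds):
    """Largest |electronegativity difference| times bond order over all bonds."""
    return max(abs(CHI[atoms[i]] - CHI[atoms[j]]) * order for i, j, order in bonds)


def chains(n_atoms, bonds, length):
    """List simple chains of `length` bonded atoms; each chain is reported once."""
    adjacency = {i: set() for i in range(n_atoms)}
    for i, j, _ in bonds:
        adjacency[i].add(j)
        adjacency[j].add(i)
    found = []

    def extend(path):
        if len(path) == length:
            # Keep one orientation only, so a chain and its reverse count once.
            if length == 1 or path[0] < path[-1]:
                found.append(tuple(path))
            return
        for nxt in sorted(adjacency[path[-1]]):
            if nxt not in path:
                extend(path + [nxt])

    for start in range(n_atoms):
        extend([start])
    return found


def p4_APR(atoms, bonds):
    """Mean product of the radii of the terminal atoms over all four-atom chains."""
    p4 = chains(len(atoms), bonds, 4)
    return sum(RADIUS[atoms[c[0]]] * RADIUS[atoms[c[3]]] for c in p4) / len(p4)


def p4_SPDE(atoms, bonds):
    """Sum over four-atom chains of (chi(a1) - chi(a2)) * (chi(a4) - chi(a3))."""
    return sum(
        (CHI[atoms[c[0]]] - CHI[atoms[c[1]]]) * (CHI[atoms[c[3]]] - CHI[atoms[c[2]]])
        for c in chains(len(atoms), bonds, 4)
    )


# Hypothetical test molecule: methanol, CH3OH, with explicit hydrogens.
# Each bond is (atom index, atom index, bond order).
methanol_atoms = ["C", "O", "H", "H", "H", "H"]
methanol_bonds = [(0, 1, 1), (0, 2, 1), (0, 3, 1), (0, 4, 1), (1, 5, 1)]

print(p1_ANe(methanol_atoms))                   # 18 electrons over 6 atoms
print(p1_AR3(methanol_atoms))
print(p2_HDE(methanol_atoms, methanol_bonds))   # set by the O-H bond
print(p4_APR(methanol_atoms, methanol_bonds))   # three H-C-O-H chains
```

Because every function looks up properties by element symbol, an element missing from the training set still yields a descriptor value as soon as its atomic data are added to the lookup tables, which is the mechanism the text relies on for handling rare fragments.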

Working sets including the experimental refractive indices, glass transition temperatures, and densities in the amorphous state, as well as the structures of 182, 314, and 152 corresponding monomers, were formed on the basis of monograph [11].

The fast stepwise multiple linear regression (FSMLR) and three-layer feedforward artificial neural network (ANN) methods implemented in the NASAWIN software suite [12] were used for calculating fragmental descriptors and constructing QSPR models. Sets of fragments containing one to five non-hydrogen atoms were generated. Combinations of atomic properties within fragments were calculated with the FRAGPROP descriptor block built into the EMMA [13] and NASAWIN [12] software. This descriptor block makes it possible to calculate 50 combinations of atomic properties (FRAGPROP descriptors) for fragments containing up to five atoms. The predictive power of the models was estimated by means of 5 × 4-fold double cross-validation [14, 15]. The calculated statistical characteristics are (i) $Q^2_{DCV}$, the $Q^2$ parameter ($Q^2 = (SS - PSS)/SS$, where PSS is the sum of the squared prediction errors for a property and SS is the sum of the squared deviations of the property from its mean value) for the averaged predicted values; (ii) $RMSE_{DCV}$, the root-mean-square error of prediction; and (iii) $MAE_{DCV}$, the mean absolute error of prediction.
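The three statistics defined above can be sketched in a few lines of Python. The index generator below assumes the common nested reading of "5 × 4-fold": five outer test folds, with the remaining data split into four inner folds that each serve once as a validation set. The exact protocol of [14, 15] is not reproduced here, and the function names and the toy numbers are hypothetical.

```python
import math
import random


def double_cv_splits(n_samples, n_outer=5, n_inner=4, seed=0):
    """Yield (train, validation, test) index lists for a nested 5 x 4 scheme (assumed)."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    outer_folds = [indices[k::n_outer] for k in range(n_outer)]
    for k, test in enumerate(outer_folds):
        # Everything outside the current outer fold is available for model building.
        rest = [i for j, fold in enumerate(outer_folds) if j != k for i in fold]
        inner_folds = [rest[m::n_inner] for m in range(n_inner)]
        for m, validation in enumerate(inner_folds):
            train = [i for j, fold in enumerate(inner_folds) if j != m for i in fold]
            yield train, validation, test


def prediction_metrics(y_true, y_pred):
    """Return (Q2, RMSE, MAE) for out-of-sample predictions."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss = sum((y - mean_y) ** 2 for y in y_true)               # SS
    pss = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))   # PSS
    q2 = (ss - pss) / ss
    rmse = math.sqrt(pss / n)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n
    return q2, rmse, mae


# Toy values, not data from the paper.
q2, rmse, mae = prediction_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(q2, rmse, mae)
```

In such a scheme the test fold is never used for fitting or model selection, so the metrics are computed only on predictions for held-out compounds. For any set of predictions the mean absolute error cannot exceed the root-mean-square error, which gives a quick consistency check on reported values.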

Our calculations showed that the quality of the QSPR models, both linear regression and neural network ones, obtained for all three polymer characteristics (the refractive index, density in the amorphous state, and glass transition temperature) is considerably improved when the model contains, in addition to fragmental descriptors, descriptors describing combinations of atomic properties within the fragments. This is observed for the entire range of fragment sizes, from one to five non-hydrogen atoms. In particular, the best QSPR model for the refractive index was obtained by the FSMLR method on the basis of fragmental descriptors containing from one to four non-hydrogen atoms and had the following statistical characteristics: $Q^2_{DCV}$ = 0.782, $RMSE_{DCV}$ = 0.033, and $MAE_{DCV}$ = 0.021. When descriptors describing the properties of the atoms in fragments (see the table) are introduced into the model, these characteristics are improved to 0.872, 0.026, and 0.015, respectively. In the case of the glass transition temperature, the introduction of FRAGPROP descriptors into the best FSMLR model constructed with the use of fragmental descriptors including from one to five non-hydrogen atoms also improves its statistical parameters: from 0.849 to 0.864 ($Q^2_{DCV}$), from 45.0 to 42.7 ($RMSE_{DCV}$), and from 32.0 to 28.0 ($MAE_{DCV}$). The strongest increase in the predictive power is observed for the QSPR models constructed for the calculation of the density of polymers in the amorphous state. For example, the statistical characteristics of the best FSMLR model constructed with the use of fragments containing one or two non-hydrogen atoms ($Q^2_{DCV}$ = 0.474, $RMSE_{DCV}$ = 0.159, and $MAE_{DCV}$ = 0.959) become equal to 0.910, 0.066, and 0.043, respectively, when fragmental descriptors are combined with FRAGPROP descriptors. Combinations of atomic properties within fragments that are most significant for describing the above properties are presented in the table.
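To put the three improvements on a common scale, the relative reduction in $RMSE_{DCV}$ can be computed from the values quoted above. The short Python sketch below does only this arithmetic; the dictionary keys and the function name are hypothetical labels chosen for the example.

```python
# RMSE_DCV before and after adding FRAGPROP descriptors, as quoted in the text.
rmse_dcv = {
    "refractive index": (0.033, 0.026),
    "glass transition temperature, K": (45.0, 42.7),
    "density in the amorphous state, g/cm^3": (0.159, 0.066),
}


def relative_reduction(before, after):
    """Fraction by which the error decreased."""
    return (before - after) / before


reductions = {
    name: relative_reduction(before, after)
    for name, (before, after) in rmse_dcv.items()
}

for name, value in reductions.items():
    print(f"{name}: RMSE_DCV reduced by {100 * value:.1f}%")
```

The reductions come out at roughly 21% for the refractive index, 5% for the glass transition temperature, and 58% for the density, in line with the statement that the density models benefit most.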

Thus, pseudofragmental descriptors make it possible to considerably improve the quality of the models using fragmental descriptors, due to the solution of the rare fragment problem. It is worth noting that, although pseudofragmental descriptors per se can be used for constructing QSPR models, the best ones are always the result of their combination with proper fragmental descriptors. Therefore, the use of pseudofragmental descriptors should be treated as a way of improving the models constructed on the basis of fragmental descriptors.

REFERENCES

1. Zefirov, N.S. and Palyulin, V.A., J. Chem. Inf. Comput. Sci., 2002, vol. 42, pp. 1112–1122.

2. Baskin, I. and Varnek, A., in Chemoinformatics Approaches to Virtual Screening, Varnek, A. and Tropsha, A., Eds., Cambridge: RSC Publ., 2008, pp. 1–43.

3. Artemenko, N.V., Baskin, I.I., Palyulin, V.A., and Zefirov, N.S., Dokl. Chem., 2001, vol. 381, nos. 1–3, pp. 317–320 [Dokl. Akad. Nauk, 2001, vol. 381, no. 2, pp. 203–206].

4. Zhokhova, N.I., Baskin, I.I., Palyulin, V.A., et al., Zh. Strukt. Khim., 2004, vol. 45, no. 4, pp. 660–669.

5. Zhokhova, N.I., Palyulin, V.A., Baskin, I.I., et al., Izv. Akad. Nauk, Ser. Khim., 2003, no. 5, pp. 1005–1009.

6. Zhokhova, N.I., Baskin, I.I., Palyulin, V.A., et al., Izv. Akad. Nauk, Ser. Khim., 2003, no. 9, pp. 1787–1793.

7. Zhokhova, N.I., Bobkov, E.V., Baskin, I.I., et al., Vestn. Mosk. Univ., Ser. 2: Khim., 2007, vol. 48, no. 5, pp. 329–332.

8. Varnek, A., Kireeva, N., Tetko, I.V., et al., J. Chem. Inf. Model., 2007, vol. 47, pp. 1111–1122.

9. Van Krevelen, D.W., Properties of Polymers, 3rd ed., Amsterdam: Elsevier, 1990.

10. Askadskii, A.A., Computational Materials Science of Polymers, Cambridge: Cambridge Int. Sci. Publ., 2003.

11. Bicerano, J., Prediction of Polymer Properties, 2nd ed., New York: M. Dekker, 1996.

12. Baskin, I.I., Halberstam, N.M., Artemenko, N.V., et al., in EuroQSAR 2002. Designing Drugs and Crop Protectants: Processes, Problems and Solutions, Melbourne: Blackwell, 2003, pp. 260–263.

13. Sukhachev, D.V., Pivina, T.S., Shlyapochnikov, V.A., et al., Dokl. Akad. Nauk, 1993, vol. 328, no. 2, pp. 188–189.

14. Zhokhova, N.I., Baskin, I.I., Palyulin, V.A., et al., Abstracts of XVI European Symposium on Quantitative Structure–Activity Relationships and Molecular Modeling, Mediterranean Sea, Italy, 2006, p. 206.

15. Zhokhova, N.I., Baskin, I.I., Palyulin, V.A., et al., Dokl. Chem., 2007, vol. 417, part 2, pp. 282–284 [Dokl. Akad. Nauk, 2007, vol. 417, no. 5, pp. 639–614].