Variable selection for discriminating herbal medicines with chromatographic fingerprints

Analytica Chimica Acta 572 (2006) 265–271


Fan Gong a,∗, Bo-Tang Wang a, Yi-Zeng Liang a, Foo-Tim Chau b, Ying-Sing Fung c

a Research Center of Modernization of Chinese Herbal Medicines, Institute of Chemometrics & Intelligent Analytical Instruments, College of Chemistry and Chemical Engineering, Central South University, Changsha 410083, China

b Department of Applied Biology and Chemical Technology, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong, China

c Department of Chemistry, The University of Hong Kong, Pokfulam Road, Hong Kong, China

Received 11 January 2006; received in revised form 31 March 2006; accepted 10 May 2006

Available online 16 May 2006

Abstract

When discriminating herbal medicines with pattern recognition based on chromatographic fingerprints, typically, the majority of variables/data points contain no discrimination information. In this paper, chemometric approaches concerning forward selection and key set factor analysis using principal component analysis (PCA), unweighted and weighted methods based on the inner- and outer-variances, and the Fisher coefficient from the between- and within-class variations were investigated to extract representative variables. The number of variables retained was determined based on the cumulative variance percent of principal components, the ratio of observations to variables and the factor indicative function (IND). In order to assess the methods for variable selection and criteria levels to determine the number of variables retained, the original and reduced datasets were compared with Procrustes analysis and a weighted measure of similarity. Moreover, the tri-variate plots of the first three PCA scores were used to visually examine the reduced datasets in low dimensional space. Herbal samples were finally discriminated by use of Bayes discrimination analysis with the reduced subsets. The case study for 79 herbal samples showed that the methods of forward selection associating the variables with the loadings closest to 0 and key set factor analysis were preferable to determine the representative variables. Procrustes analysis and the weighted measure were not indicative to extract representative variables. High matching between the original and reduced datasets did not suggest high prediction accuracy. Visually examining the PC1–PC2–PC3 scores projection plots with the reduced subsets, not all the herb samples could be separated due to the complexity of chromatographic fingerprints.

© 2006 Elsevier B.V. All rights reserved.

Keywords: Variable selection; Herbal medicine; Chromatographic fingerprint; Bayes discrimination analysis

1. Introduction

For herbal medicine, there is a need to assure stringent quality by chemical assay and standardization. Recently, the fingerprinting approach with chromatography has been recommended for the identification and assay of chemical components in complex pharmaceutical systems [1–8]. In most studies, chromatographic fingerprints are constructed with the retention time-peak area data matrix including only the selected peaks, whether the identities of the peaks are known or not. However, as the determination of an optimal set of integration parameters for peak detection is not a trivial task, the total chromatogram has been widely used as a chromatographic fingerprint [1–4,9].

∗ Corresponding author. Tel.: +86 731 8830824. E-mail address: gongfan [email protected] (F. Gong).

In general, there are a large number of data points recorded in each chromatographic fingerprint due to the complexity of herbal medicines. However, only a few of them are of significance, as most data points contain no discrimination information or approach the noise level [10–13]. On the other hand, useful information will be contaminated with noise if all data points are used for data analysis. As a result, representative data points/variables should be identified before the discrimination of herbal medicines with pattern recognition in fingerprint analysis. So far, there are many classical approaches for variable selection [14–20]. Moreover, the methods for variable selection and criteria levels to determine the number of variables retained were assessed in [17].

0003-2670/$ – see front matter © 2006 Elsevier B.V. All rights reserved. doi:10.1016/j.aca.2006.05.032


Table 1
Methods for variable selection and criteria to determine the number of variables retained

| Variable selection method | Method of selecting p variables from N original variables |
|---|---|
| Forward selection (B4) | Associate p variables with each of the first p components and retain these variables: (1) select the variables associated with the highest loading for each of the first p components; (2) select the variables associated with the loadings closest to 0 for each of the first p components |
| Key set factor analysis | (1) Select the first variable associated with the loading closest to 0; (2) select the first variable associated with the highest loading; (3) select the first variable to be most orthogonal to the mean variable |
| Unweighted w | Calculate the ratio of the mean of the standard deviation of variables in each class to the standard deviation of variables of all samples |
| Weighted w | Calculate the ratio of the mean of the weighted standard deviation of variables in each class to the weighted standard deviation of variables of all samples |
| Fisher coefficient | Calculate the ratio of the between- to the within-class variances |

Criteria for deciding on the value of p:
1. The number of principal components required to account for some proportion (α, here α = 90%) of the total variance
2. Arbitrarily select p such that the ratio of number of observations to p is 3:1
3. The number of chemical factors with the factor indicator function (IND)

In this study, herbal samples are analyzed with high performance liquid chromatography-diode array detector (HPLC-DAD). Similar to total ionic currents (TICs) in gas chromatography–mass spectrometry (GC–MS), the chromatographic fingerprint here is constructed by summing each total chromatogram at each wavelength [3]. As the data sizes of the chromatographic fingerprints obtained are very large, significant data points or representative variables with discrimination information are selected with forward selection [14–17,19], key set factor analysis [19–22], unweighted/weighted methods and the Fisher coefficient approach [18]. The number of variables retained is determined based on the cumulative variance percent of principal components, the ratio of observations to variables and the factor indicative function (IND) [17,23]. In order to assess the approaches employed here, Procrustes analysis [13,16,17,19,20,24,25] and a measure of similarity [17] are used to compare the original data and the reduced datasets. Furthermore, the subset retained is visually examined with the tri-variate plots of PCA scores. Herbal samples are finally discriminated by use of Bayes discrimination analysis with the subsets selected. In this work, 79 real herbs covering 50 Rhizoma chuanxiong, 10 Radix angelicae, 2 Cortex cinnamomi and 17 Herba menthae samples from different sources are investigated. Details of the referred experiments are given in [3].

2. Methodology investigated

2.1. Methods of variable selection

In this paper, five approaches for variable selection (forward selection (B4), key set factor analysis, unweighted w, weighted w and Fisher coefficient) are used. Among them, both the methods of forward selection and key set factor analysis are based on PCA in chemometrics. For the three other approaches of unweighted w, weighted w and Fisher coefficient, the inner- and outer-variances and the within- and between-group variations of variables are considered. As the basic principles of these methods have been extensively described in Refs. [14–20], only a simple introduction is restated in this study. All the approaches for variable selection and the criteria to determine the number of representative variables are summarized in Table 1. Here, the size of original data X is M × N with M samples and N variables. If p variables are retained, the size of the reduced data Y is M × p.

2.1.1. Method forward selection (B4)

At first, PCA is conducted on the preprocessed data. Method B4 retains variables by starting with the first component and retaining the variable with the highest loading or with the loading closest to 0 for each of the first p components.

2.1.2. Method key set factor analysis

After PCA is conducted on the preprocessed data, the variable corresponding to the highest loading or the loading closest to 0 is selected as the first key variable. Or, among all the variables, the one which is most orthogonal to the mean variable is taken as the first one. The second variable is the one which is most orthogonal to the first variable extracted. The third is the most orthogonal to the plane defined by the first two variables kept. In the same way, other representative variables are determined.
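The two PCA-based selection schemes of Sections 2.1.1 and 2.1.2 can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, loadings are compared by absolute value, a variable is never selected twice, and the key set step works on the raw columns of X scaled to unit length. The first key variable is passed in by the caller, chosen by whichever of the three rules in Table 1 applies.

```python
import numpy as np

def pca_loadings(X):
    """Column-center X (M samples x N variables) and return the loadings
    as an array whose j-th column belongs to the j-th principal component."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt.T

def forward_selection_b4(X, p, rule="highest"):
    """For each of the first p PCs, keep one not-yet-selected variable.
    rule="highest": largest absolute loading; rule="zero": loading closest to 0."""
    V = np.abs(pca_loadings(X))
    selected = []
    for j in range(p):
        order = np.argsort(V[:, j])              # ascending |loading|
        if rule == "highest":
            order = order[::-1]
        # assumption: skip variables already taken by an earlier component
        pick = next(int(i) for i in order if int(i) not in selected)
        selected.append(pick)
    return selected

def key_set_orthogonal(X, p, first):
    """Greedy key-set sketch: starting from variable `first`, repeatedly add
    the column that is most orthogonal to the span of the columns kept so far."""
    R = X / np.linalg.norm(X, axis=0)            # assumes no all-zero column
    selected = [int(first)]
    for _ in range(p - 1):
        Q, _ = np.linalg.qr(R[:, selected])      # orthonormal basis of kept columns
        resid = R - Q @ (Q.T @ R)                # component outside that span
        score = np.linalg.norm(resid, axis=0)
        score[selected] = -1.0                   # never re-pick a kept variable
        selected.append(int(np.argmax(score)))
    return selected
```

Both functions return column indices of X, so the reduced data Y would be `X[:, selected]`.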

2.1.3. Methods unweighted and weighted w

For the methods unweighted and weighted w, the inner- and outer-variances are used. The coefficient of unweighted w of the ith variable can be calculated based on the following equation:

$$\mathrm{un\ w}(i) = \frac{\mathrm{average}[\mathrm{std}(i_1), \mathrm{std}(i_2), \ldots, \mathrm{std}(i_n)]}{\mathrm{std}(i_{\mathrm{all}})} \tag{1}$$


Here, std(i_n) and std(i_all) are the standard deviation of the ith variable in the nth group and of all the samples, respectively. Average is the mean value.

In most cases, the numbers of samples in each group are different. The coefficient of weighted w of the ith variable is:

$$w(i) = \frac{\mathrm{std}(i_1) \times g_1 + \mathrm{std}(i_2) \times g_2 + \cdots + \mathrm{std}(i_n) \times g_n}{\mathrm{std}(i_{\mathrm{all}}) \times g_{\mathrm{all}}} \tag{2}$$

where g_n is the number of the samples in the nth group and g_all is the number of all the samples. For the method of weighted w, the standard deviation is weighted with the number of samples in each group. The lower the coefficient of un w or w, the better the variable.
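A minimal sketch of the per-variable coefficients follows, covering Eqs. (1) and (2) and, for comparison, the Fisher coefficient of Section 2.1.4 (Eqs. (3)–(5)), since all three are computed from the same class labels. The function names are hypothetical, and the use of the sample standard deviation (`ddof=1`) is an assumption, as the text does not state which estimator is used.

```python
import numpy as np

def unweighted_w(X, labels):
    """Eq. (1): mean of the per-class standard deviations over the overall one."""
    classes = np.unique(labels)
    class_std = np.array([X[labels == c].std(axis=0, ddof=1) for c in classes])
    return class_std.mean(axis=0) / X.std(axis=0, ddof=1)

def weighted_w(X, labels):
    """Eq. (2): per-class standard deviations weighted by the class sizes g_n."""
    classes, counts = np.unique(labels, return_counts=True)
    class_std = np.array([X[labels == c].std(axis=0, ddof=1) for c in classes])
    numerator = (counts[:, None] * class_std).sum(axis=0)
    return numerator / (X.std(axis=0, ddof=1) * len(labels))

def fisher_coefficient(X, labels):
    """Eqs. (3)-(5): between-group over within-group mean squares."""
    classes, counts = np.unique(labels, return_counts=True)
    grand_mean = X.mean(axis=0)
    means = np.array([X[labels == c].mean(axis=0) for c in classes])
    ssb = (counts[:, None] * (means - grand_mean) ** 2).sum(axis=0)
    ssw = sum(((X[labels == c] - m) ** 2).sum(axis=0)
              for c, m in zip(classes, means))
    g, n = len(classes), len(labels)
    return (ssb / (g - 1)) / (ssw / (n - g))
```

To retain p variables one would keep the p smallest values of un w or w (for example `np.argsort(unweighted_w(X, labels))[:p]`) or the p largest Fisher coefficients. With `ddof=1`, a class containing a single sample gives an undefined standard deviation, so every class needs at least two samples.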

2.1.4. Method Fisher coefficient

For the Fisher method, a coefficient based on the between- and within-group variations is calculated as follows:

• between-group variation:

$$\mathrm{SSB}(i) = \sum_{k=1}^{g} n_k \left[\mathrm{i\_average}(y_k) - \mathrm{i\_average}(y)\right]^2 \quad (k = 1, 2, \ldots, g) \tag{3}$$

• within-group variation:

$$\mathrm{SSW}(i) = \sum_{k=1}^{g} \sum_{j=1}^{n_k} \left[y_{kj} - \mathrm{i\_average}(y_k)\right]^2 \quad (k = 1, 2, \ldots, g,\; j = 1, 2, \ldots, n_k) \tag{4}$$

• Fisher coefficient:

$$F(i) = \frac{[1/(g-1)]\,\mathrm{SSB}(i)}{[1/(n-g)]\,\mathrm{SSW}(i)} \tag{5}$$

where g is the number of groups, n the total number of samples, n_k the number of samples in the kth group, i_average(y_k) the mean of the ith variable in the kth group, i_average(y) the total mean of the ith variable, and y_kj the value of the jth sample in the kth group. The higher the value of the Fisher coefficient, the better the variable.

2.2. Determination of number of variables selected

In this study, three criteria are used to determine the number of variables retained (see Table 1). For the first criterion, p is set to be the number of PCs required to account for no less than 90% of the total variance with PCA. For the second criterion, we arbitrarily select p such that the ratio of the number of samples to p is 3:1, as in [17]. Finally, IND, which has been widely used to determine the number of the chemical ranks in chemometrics, is used here. For the third criterion, p is equal to the number of the principal factors [23].

2.3. Evaluation of the reduced subsets

2.3.1. Procrustes analysis for configuration comparison of two datasets

Procrustes analysis is generally used to compare the configurations of two datasets [13,16,17,19,20,24,25]. In this study, the configuration difference (D) between the original data X and the reduced subset Y can be calculated as follows.

At first, singular value decomposition is conducted on X and Y,

$$X = U_X \times S_X \times V_X^{t} \tag{6}$$

$$Y = U_Y \times S_Y \times V_Y^{t} \tag{7}$$

Taking the first p columns of U_X and U_Y to obtain U_1 and U_2,

$$U_1 = U_X(:, 1:p) \tag{8}$$

$$U_2 = U_Y(:, 1:p) \tag{9}$$

Then,

$$D = \mathrm{trace}\left(U_1^{t} \times U_1 + U_2^{t} \times U_2 - 2\Sigma\right) \tag{10}$$

where Σ is the diagonal singular-value matrix of U_1^t × U_2 and "trace" means the trace of the matrix.
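The configuration difference D of Eqs. (6)–(10) can be sketched as below. The function name is hypothetical, and the sketch follows the equations literally; how the configuration consensus percentages reported later are derived from D is not stated in the text, so no such conversion is attempted here.

```python
import numpy as np

def procrustes_difference(X, Y, p):
    """Configuration difference D between X (M x N) and Y (M x p), Eqs. (6)-(10)."""
    # Eqs. (6) and (7): singular value decomposition of both matrices
    Ux, _, _ = np.linalg.svd(X, full_matrices=False)
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    # Eqs. (8) and (9): first p left singular vectors
    U1, U2 = Ux[:, :p], Uy[:, :p]
    # Singular values of U1^t U2 form the diagonal matrix Sigma of Eq. (10)
    sigma = np.linalg.svd(U1.T @ U2, compute_uv=False)
    return float(np.trace(U1.T @ U1) + np.trace(U2.T @ U2) - 2.0 * sigma.sum())
```

Because the columns of U_1 and U_2 are orthonormal, D lies between 0 (the two p-dimensional configurations span the same subspace) and 2p (the subspaces are mutually orthogonal).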

2.3.2. Weighted similarity of two data matrices

In [17], a weighted measure of similarity (Q) between two datasets is used.

$$Q = \frac{\sum_{i=1}^{p} v_i r_i}{\sum_{i=1}^{p} v_i} \quad (i = 1, 2, \ldots, p) \tag{11}$$

where p is the number of useful variables retained, v_i the proportion of the total variance explained by the ith principal component of the original data, and r_i the correlation coefficient between the ith principal components of the original and reduced datasets.
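Eq. (11) can be sketched as follows. The function names are hypothetical. Two assumptions are made that the text does not spell out: the principal components are compared through their score vectors after column centering, and the absolute value of each correlation is taken because the sign of a principal component is arbitrary.

```python
import numpy as np

def pca_scores(X):
    """Return the PCA scores of column-centered X and the variance fraction of each PC."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U * s, s**2 / np.sum(s**2)

def weighted_similarity(X, Y, p):
    """Weighted similarity Q of Eq. (11) between original X and reduced Y."""
    Tx, v = pca_scores(X)
    Ty, _ = pca_scores(Y)
    # assumption: |r_i| is used since PC signs are arbitrary
    r = np.array([abs(np.corrcoef(Tx[:, i], Ty[:, i])[0, 1]) for i in range(p)])
    return float(np.sum(v[:p] * r) / np.sum(v[:p]))
```

Under these assumptions Q equals 1 when the first p score vectors of the two datasets coincide up to sign, and it decreases towards 0 as the leading components of Y stop tracking those of X.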

2.3.3. Tri-variate plot of PCA scores

The relationship between the samples is visually examined in low dimensional space with PCA. If the grouping of herbal samples can be clearly demonstrated in the first three PC scores plot, the selected subset possibly contains much discriminative information.

3. Results and discussion

3.1. Selection of representative variables and determination of number of variables retained

Fig. 1 shows four preprocessed fingerprints of Rhizoma chuanxiong, Radix angelicae, Cortex cinnamomi and Herba menthae, respectively. Here, the baselines and the retention time shift have been corrected. As the analysis is conducted with HPLC-DAD and a data matrix can be obtained for each sample, the zero-component regions of a peak cluster could be determined with local factor analysis [3,26,27]. Here, the zero-component region is defined as the region where no chemical components elute (rank zero). With the information on the zero-component regions, the baseline can then be corrected. For the peak alignment, some marker components in all samples are identified based on their retention times and UV spectra. The marker peaks determined are then aligned to the corresponding peaks found in the target with linear and spline interpolation [3].

Fig. 1. Four chromatographic fingerprints of Rhizoma chuanxiong, Radix angelicae, Cortex cinnamomi and Herba menthae.

For each chromatographic fingerprint here, there are 11,050 data points recorded. Thus, the size of the data matrix is 79 × 11,050 (79 herbal samples and 11,050 data points for each fingerprint). Among the 11,050 variables/data points, some representative ones are to be selected.

As shown in Table 1, five approaches are used for variable selection in this study. Moreover, for the method of forward selection (B4), we can select the variables associated with the highest loading or with the loading closest to 0 for each of the first p components. For key set factor analysis, three ways are employed to determine the first variable: (1) the first variable associated with the loading closest to 0; (2) the first variable associated with the highest loading; (3) the first variable being most orthogonal to the mean of all original variables.

In order to determine the number of representative variables retained (p), three criteria are used here. For the first criterion, p is equal to the number of PCs required to account for no less than 90% of the total variance with PCA. The second criterion is that we just arbitrarily select p such that the ratio of number of observations to p is 3:1. Finally, IND is used to aid the determination of p. In this paper, p is determined to be 25, 27 or 28 with these three criteria, respectively (see Table 2).
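The three criteria can be sketched as below, taking the PCA eigenvalues in descending order as input. The function names are hypothetical. Two points are assumptions rather than statements from the text: rounding up in the 3:1 rule (which happens to give p = 27 for 79 samples), and the form of the indicator function, taken here as the expression commonly attributed to Malinowski [23], IND(k) = RE(k)/(c − k)² with RE(k) = [Σ_{j>k} λ_j / (r(c − k))]^{1/2}, where c is the number of eigenvalues and r the larger dimension of the data matrix.

```python
import math
import numpy as np

def p_by_cumulative_variance(eigvals, alpha=0.90):
    """Smallest p whose leading eigenvalues explain no less than alpha of the variance."""
    ratio = np.cumsum(eigvals) / np.sum(eigvals)
    return int(np.searchsorted(ratio, alpha) + 1)

def p_by_sample_ratio(n_samples, ratio=3):
    """p such that samples : p is about ratio : 1 (rounding up is an assumption)."""
    return math.ceil(n_samples / ratio)

def p_by_ind(eigvals, n_rows):
    """Number of factors k at which the assumed IND function reaches its minimum."""
    lam = np.asarray(eigvals, dtype=float)
    c = lam.size
    k = np.arange(1, c)                       # candidate numbers of factors
    tail = np.cumsum(lam[::-1])[::-1][1:]     # tail[k-1] = sum of eigenvalues after the k-th
    ind = np.sqrt(tail / (n_rows * (c - k))) / (c - k) ** 2
    return int(k[np.argmin(ind)])
```

For instance, with the illustrative eigenvalues 100, 50, 25, 1, 1, 1, 1, 1 the first three components explain 175/180 of the variance, so the 90% criterion gives p = 3, and the assumed IND function is also minimal at k = 3.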

Different variables are retained if various approaches for variable selection are used even if p is kept unchanged. For example, Table 3 shows the original serial numbers of the variables retained with the methods of forward selection (B4 with the variables associated with the highest loading for each of the first p components, p = 28) and key set factor analysis (the first variable associated with the loading closest to 0, p = 28). As seen from Table 3, the variables selected are different. Fig. 2a–h represents the variables retained for the same samples of Rhizoma chuanxiong (a) and (e), Radix angelicae (b) and (f), Cortex cinnamomi (c) and (g) and Herba menthae (d) and (h) with these two methods, respectively. Clearly, Fig. 2 also demonstrates the differences of the variables selected by various methods.

Table 2
Number of the variables retained, configuration consensus, weighted measure of similarity between the original and reduced datasets and prediction accuracy with Bayes discrimination analysis for 79 herbal samples

| Method of variable selection | Criteria | p | Configuration consensus (%) | Weighted similarity (%) | Prediction accuracy (%) |
|---|---|---|---|---|---|
| B4: the p variables associated with the highest loadings | α = 90% | 25 | 71.62 | 54.89 | 86.08 |
| | 3:1 | 27 | 73.80 | 46.78 | 89.87 |
| | IND | 28 | 74.18 | 45.73 | 91.14 |
| B4: the p variables associated with the loadings closest to 0 | α = 90% | 25 | 68.27 | 53.27 | 97.47 |
| | 3:1 | 27 | 70.05 | 53.48 | 94.94 |
| | IND | 28 | 70.89 | 53.61 | 96.20 |
| Key set factor analysis: the first variable associated with the loading closest to 0 | α = 90% | 25 | 74.78 | 1.53 | 96.20 |
| | 3:1 | 27 | 75.27 | 3.37 | 96.20 |
| | IND | 28 | 76.53 | 3.56 | 94.94 |
| Key set factor analysis: the first variable associated with the highest loadings | α = 90% | 25 | 72.82 | 30.76 | 96.20 |
| | 3:1 | 27 | 74.85 | 30.40 | 97.47 |
| | IND | 28 | 76.83 | 30.65 | 94.94 |
| Key set factor analysis: the first variable most orthogonal to the mean variable | α = 90% | 25 | 73.05 | 0.55 | 96.20 |
| | 3:1 | 27 | 75.25 | 0.08 | 97.47 |
| | IND | 28 | 76.41 | 0.07 | 96.20 |
| Unweighted w | α = 90% | 25 | 55.76 | 3.23 | 83.54 |
| | 3:1 | 27 | 56.54 | 2.71 | 83.54 |
| | IND | 28 | 57.36 | 2.14 | 79.75 |
| Weighted w | α = 90% | 25 | 57.20 | 1.69 | 83.54 |
| | 3:1 | 27 | 59.66 | 1.59 | 88.61 |
| | IND | 28 | 60.07 | 0.49 | 87.34 |
| Fisher coefficient | α = 90% | 25 | 57.17 | 2.87 | 87.34 |
| | 3:1 | 27 | 57.60 | 2.16 | 88.61 |
| | IND | 28 | 58.21 | 2.97 | 89.87 |

Table 3
Serial number of the original variables retained with two different approaches for variable selection

| Variable selection method | p | Serial number of original variable retained |
|---|---|---|
| B4: the p variables associated with the highest loadings | 28 | 3542, 3544, 3541, 3525, 3543, 3215, 1971, 2855, 2505, 3214, 3333, 1333, 3567, 1332, 4708, 1331, 1334, 2088, 188, 2083, 3517, 2090, 2124, 2127, 4710, 25, 189, 214 |
| Key set factor analysis: the first variable associated with the loading closest to 0 | 28 | 5, 667, 2708, 2847, 3118, 2132, 2800, 619, 803, 995, 2286, 4714, 2072, 3189, 2229, 5435, 1396, 1928, 862, 3307, 4494, 3448, 5123, 111, 2104, 291, 700, 2479 |

Fig. 2. Representative variables retained with two different approaches for variable selection: (a and e) Rhizoma chuanxiong, (b and f) Radix angelicae, (c and g) Cortex cinnamomi and (d and h) Herba menthae.

3.2. Evaluation of the reduced datasets containing variables selected

After the representative variables are selected, the original data and reduced subsets are compared with Procrustes analysis and a measure of similarity. Also, the PC1–PC2–PC3 scores plots are used to visually examine the samples with PCA on the representative variables retained.

Table 2 also shows the configuration consensus with Procrustes analysis and the weighted similarity between the original and reduced datasets. As seen from Table 2, high configuration consensus does not indicate high weighted similarity.

Fig. 3a and b shows the PC1–PC2–PC3 scores plots of the two reduced data matrices associated with the highest configuration consensus and weighted similarity (76.86% and 54.89%, see Table 2), respectively. As seen from Fig. 3a and b, not all the herbal samples can be separated successfully with the first three PCs of the reduced data sets. Thus, the variables retained are possibly not informative to characterize the PCA model and the number of principal factors of the subsets might be over three here.

Fig. 3. PC1–PC2–PC3 scores plots with the highest values of: configuration consensus (a), weighted similarity (b), prediction accuracy, p = 25 (c) and prediction accuracy, p = 27 (d), respectively.

3.3. Discrimination of herbal medicines with reduced subsets by use of Bayes discrimination analysis

In order to further explain the methods of variable selection and the criteria used to determine the number of representative variables selected, 79 real herbs are discriminated with Bayes discrimination analysis in this study. Here, two-thirds of the herbs are taken as the training samples and the others as the prediction ones. The prediction accuracy is also listed in Table 2. As seen from Table 2, for the methods of forward selection (B4) associating the variables with the loadings closest to 0 and key set factor analysis, the prediction accuracy is no less than 94.94% whether p is selected as 25, 27 or 28. However, except for the method of forward selection (B4) associating the variables with the highest loadings and p = 28, the prediction accuracy is lower than 90.00% for the other methods. Thus, the forward selection (B4) associating the variables with the loadings closest to 0 and key set factor analysis are preferable for variable selection in this work.

Fig. 3c and d shows the PC1–PC2–PC3 scores plots of the reduced subsets with B4 (the p variables associated with the loadings closest to 0, p = 25) and key set factor analysis (the first variable associated with the highest loadings, p = 27), respectively. For both methods, the prediction accuracy reaches the highest value (97.47%). However, as shown in Fig. 3c and d, not all the samples could be separated satisfactorily.

As seen from Table 2, the configuration comparison with Procrustes analysis and a similarity measure between the original matrix and the reduced subsets are not indicative of the discrimination of herbal samples with Bayes discrimination analysis. High matching between the original and reduced datasets does not suggest high prediction accuracy. For example, for the method of key set factor analysis associating the first variable with the highest loadings, the values of configuration consensus, weighted similarity and prediction accuracy are 72.82%, 30.76% and 96.20% if p = 25. For p = 27 and 28, they are 74.85%, 30.40% and 97.47%, and 76.83%, 30.65% and 94.94%, respectively.
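The discrimination step can be sketched as follows. The text does not specify the form of Bayes discrimination analysis used, so this sketch assumes Gaussian classes sharing one pooled covariance matrix, prior probabilities equal to the class frequencies in the training set, and a random two-thirds/one-third split. All function names and the small `ridge` term (added for numerical stability) are hypothetical.

```python
import numpy as np

def split_two_thirds(n, seed=0):
    """Random split of n sample indices into two-thirds training, one-third prediction."""
    idx = np.random.default_rng(seed).permutation(n)
    cut = (2 * n) // 3
    return idx[:cut], idx[cut:]

def fit_bayes(X, y, ridge=1e-6):
    """Estimate class means, pooled covariance and log priors (assumed model)."""
    classes, counts = np.unique(y, return_counts=True)
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    centered = X - means[np.searchsorted(classes, y)]
    cov = centered.T @ centered / (len(y) - len(classes))
    cov += ridge * np.eye(X.shape[1])            # keeps the matrix invertible
    return classes, means, np.linalg.inv(cov), np.log(counts / len(y))

def predict_bayes(model, X):
    """Assign each sample to the class with the largest posterior score."""
    classes, means, prec, log_prior = model
    # linear discriminant score: x' P m_k - 0.5 m_k' P m_k + log(prior_k)
    quad = np.sum(means @ prec * means, axis=1)
    scores = X @ prec @ means.T - 0.5 * quad + log_prior
    return classes[np.argmax(scores, axis=1)]
```

Here the reduced matrix Y (M × p) would take the place of `X`, and the prediction accuracy is the fraction of prediction samples whose assigned class matches the known herb.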

4. Conclusion

In this study, five approaches are investigated for variable selection and three criteria are employed to determine the number of representative variables retained. Moreover, Procrustes analysis, weighted similarity and tri-variate plots of PCA scores are used to assess the reduced subsets.

In view of the prediction accuracy with Bayes discrimination analysis, the methods of forward selection (B4) with the variables associated with the loadings closest to 0 and key set factor analysis are advantageous over other approaches for variable selection. On the other hand, Procrustes analysis and a similarity measure are not indicative to variable selection. However, even if there are only 25, 27 or 28 variables retained and thus the PC1–PC2–PC3 scores plots are not satisfactory to separate all the herbal samples, the prediction accuracy can be still high if suitable approaches are employed for variable selection. Possibly, these variables selected are informative for Bayes discrimination analysis although they are not for the PCA model.

As chromatographic fingerprint with a large number of data points is very complex, it is not a trivial task to select some representative variables for herbal discrimination. This paper provides some potential tools to address such a difficult problem.

Acknowledgements

This research work was financially supported by Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry of China, Key Laboratory Foundation of State Key Laboratory of Chemo/Biosensing and Chemometrics (no. 200408), National Natural Science Foundation of China (20175036 and 20235020), University Grants Council of Hong Kong SAR via the Area of Excellence Project “Chinese Medicine Research and Further Development” (AoE/B-101) in Hong Kong, Hong Kong Research Grants Council (HKU 7089/00P).

References

[1] W.J. Welsh, W.K. Lin, S.H. Tersigni, E. Collantes, R. Duta, M.S. Carey, W.L. Zielinski, J. Brower, J.A. Spencer, T.P. Layloff, Anal. Chem. 68 (1996) 3473.

[2] State Food Drug Administration of China, Chinese Trad. Pat. Med. 22 (2000) 671.

[3] F. Gong, Y.Z. Liang, Y.S. Fung, F.T. Chau, J. Chromatogr. A 1029 (2004) 173.

[4] F. Gong, Y.Z. Liang, P.S. Xie, F.T. Chau, J. Chromatogr. A 1002 (2003) 25.

[5] E.S. Ong, J. Sep. Sci. 25 (2002) 825.

[6] FDA Guidance for Industry: Botanical Drug Products (Draft Guidance), 2000, VIII, B, 2e, 3e.

[7] R. Upton, International Symposium on Quality of Traditional Chinese Medicine with Chromatographic Fingerprint, Guangzhou, China, 2001.

[8] X. Di, K.K.C. Chan, H.W. Leung, C.W. Huie, J. Chromatogr. A 1018 (2003) 85.

[9] N.P.V. Nielsen, J.M. Carstensen, J. Smedsgarrd, J. Chromatogr. A 805 (1998) 17.

[10] F. Gong, Y.Z. Liang, Q.S. Xu, F.T. Chau, A.K.M. Leung, J. Chromatogr. A 905 (2001) 193.

[11] F. Gong, Y.Z. Liang, F.T. Chau, J. Sep. Sci. 26 (2003) 112.

[12] F. Gong, Y.Z. Liang, H. Cui, F.T. Chau, B.T.P. Chan, J. Chromatogr. A 909 (2001) 237.

[13] C. Demir, P. Hindmarch, R.G. Brereton, Analyst 121 (1996) 1443.

[14] I.T. Jolliffe, Appl. Statist. 21 (1972) 160.

[15] I.T. Jolliffe, Appl. Statist. 22 (1973) 21.

[16] W.J. Krzanowski, Appl. Statist. 36 (1987) 22.

[17] J.R. King, D.A. Jackson, Environmetrics 10 (1999) 67.

[18] A.D. Shaw, A.D. Camillo, G. Vlahov, A. Jones, G. Bianchi, J. Rowland, D.B. Kell, Anal. Chim. Acta 348 (1997) 357.

[19] Q. Guo, W. Wu, D.L. Massart, C. Boucon, S.F. Jong, Chemom. Intell. Lab. Syst. 61 (2002) 123.

[20] Q. Guo, W. Wu, D.L. Massart, C. Boucon, S.D. Jong, Anal. Chim. Acta 446 (2001) 85.

[21] E.R. Malinowski, Anal. Chim. Acta 134 (1982) 129.

[22] F. Gong, Y.Z. Liang, Q.S. Xu, F.T. Chau, K.M. Ng, Anal. Chim. Acta 450 (2001) 99.

[23] E.R. Malinowski, Factor Analysis in Chemistry, Wiley-Interscience, New York, 2002.

[24] M. Kubista, Chemom. Intell. Lab. Syst. 7 (1990) 273.

[25] W.J. Krzanowski, Principles of Multivariate Analysis: a user’s Perspective, Clarendon Press, Oxford, 1988.

[26] O.M. Kvalheim, Y.Z. Liang, Anal. Chem. 64 (1992) 936.

[27] Y.Z. Liang, O.M. Kvalheim, H.R. Keller, D.L. Massart, P. Kiechle, F. Erni, Anal. Chem. 64 (1992) 946.
