Chemometrics and Intelligent Laboratory Systems 33 (1996) 35-46
Artificial neural networks in classification of NIR spectral data:
Design of the training set
W. Wu a, B. Walczak a,1, D.L. Massart a,*, S. Heuerding b, F. Erni b, I.R. Last c, K.A. Prebble c

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
b Sandoz Pharma AG, Analytical Research and Development, CH-4002 Basle, Switzerland
c Analytical Department Laboratories, The Wellcome Foundation Ltd, Dartford, Kent DA1 5AH, UK

* Corresponding author.
1 On leave from Silesian University, Katowice, Poland.
Received 24 May 1995; accepted 18 September 1995
Abstract

Artificial neural networks (NN) with back-error propagation were used for classification with NIR spectra and applied to the classification of different strengths of drugs. Four training set selection methods were compared by applying each of them to three different data sets. The NN architecture was selected through a pruning method, and batching operation, adaptive learning rate and momentum were used to train the NN. The presented results demonstrate that the selection methods based on the Kennard-Stone and D-optimal designs are better than those based on the Kohonen self-organized mapping and on random selection, and allow 100% correct classification for both recognition and prediction. The Kennard-Stone design is more practical than the D-optimal design. The Kohonen self-organized mapping method is better than the random selection method.

Keywords: Drug analysis; Neural network; NIR; Pattern recognition
1. Introduction

One observes an increasing interest in the application of neural networks (NNs) in chemical calibration and pattern recognition problems [1-13]. Although NNs do not require any assumptions about the data distribution, they can be successfully applied only to sufficiently large and representative data sets.
The term 'sufficiently large' is relative. The important factor is the ratio of the number of samples to the number of weights considered in the net architecture. Widrow [14] suggests as a rule of thumb that the training set size should be about 10 times the number of weights in a network. According to other authors [15], the maximum number of nodes in the hidden layer should be of the order g(m + 1), where m and g denote the numbers of input and output units, respectively. Although these suggestions differ to some extent, all NN users agree that the higher the ratio of the number of samples to the number of weights, the
better the generalization ability of the NN. For a given number of samples this ratio can be maximized by minimising the net architecture (reducing input data, pruning redundant weights, etc.).
The second requirement, data representativity, means that the samples in the data set should be (evenly) spread over the expected range of data variability. In some cases it may be possible to generate such samples for the training set using experimental design techniques. However, in most cases, such as in the analysis of food samples, one does not have this possibility [16,17]. Usually, one needs to select the training (model) samples from a larger set of samples. The other samples can then be used to test the net. However, using all samples to train the net may lead to overfitting and to large prediction errors for the test set. To avoid this, the net training must be monitored. This means that, apart from the training set, one needs two other data sets, the monitoring and test sets. In industrial practice, the sample size is not very large. One can use the same data set to monitor training and later evaluate the NN. Hence, at least two data sets (the training and test sets) are required. The principles for the design of these two sets are the same as the principles of design of any model set. Our study aims to evaluate different strategies of training set design, namely random selection, the Kohonen self-organising map approach [18], and two new approaches proposed by us, namely the Kennard and Stone design [19] and the D-optimal design [20,21].
2. Theory
2.1. Notation
m      number of input variables of the NN
g      number of classes (i.e. number of output variables of the NN)
n      number of objects in the data set
t      number of objects in the training set
z      number of variables in the training set
X      matrix of the training set (n × m)
N      number of objects in the training set or in the test set
Y      target matrix (N × g)
out    output matrix of the NN (N × g)
2.2. Design of the training set

2.2.1. Random selection
There are several ways of selecting the training set. The simplest one is random selection, which means that no explicit selection criterion is applied. There is a risk that objects of some class are not selected into the training set. To avoid this risk, we select 3/4 of the objects separately from each class, and put them together as the training set. If 3/4 of the number of objects is not an integer, the number is rounded down to the nearest integer.
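A minimal sketch of this stratified random selection, assuming the class assignments are stored in an array labels (the function and parameter names are illustrative, not from the original paper):

```python
import numpy as np

def random_selection(labels, fraction=0.75, seed=None):
    """Stratified random selection: draw floor(fraction * n_c) objects
    from each class c and pool them as the training set indices."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        n_c = int(np.floor(fraction * members.size))   # round down
        train_idx.extend(rng.choice(members, size=n_c, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx
```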
2.2.2. Kohonen self-organising maps [18,6]
Another possible procedure for selecting the training set is to apply clustering techniques. The Kohonen network can be applied as such [6]. Zupan et al. compared three kinds of methods on an example concerning the reactivity of chemical bonds, and found that the Kohonen self-organising map performed best [6]. The main goal of the Kohonen neural network is to map objects from m-dimensional space into a two-dimensional space. When objects have similar properties in the original space, they map to the same node. In this study, a (3 × 3) Kohonen network is chosen, containing 9 nodes. The learning rate is 0.1 at the beginning and is decreased linearly, so that it reaches 0 at the last training cycle. The neighbourhood size is also decreased linearly, but reaches a minimum of 1 after one-quarter of the training cycles and remains 1 for the rest of the training. The network is stabilised after each pattern has been presented to the network about 500 times. Then 3/4 of the objects mapping to the same node are randomly selected. If 3/4 of the number of objects is not an integer, it is rounded up to the nearest integer, since otherwise some nodes would contribute no objects after rounding. This procedure is applied to each class separately. All selected objects are put together as the training set.
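A rough sketch of this scheme, using a minimal (3 × 3) Kohonen map with the linear learning-rate and neighbourhood schedules described above; in line with the text it would be applied to the objects of one class at a time. All names are illustrative, and the update is a plain winner-plus-neighbourhood correction rather than any particular toolbox implementation:

```python
import numpy as np

def train_som(X, grid=(3, 3), cycles=500, lr0=0.1, seed=None):
    """Minimal Kohonen map. The learning rate decays linearly from lr0
    to 0; the (Chebyshev) neighbourhood radius decays linearly to 1
    during the first quarter of training and then stays at 1."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    W = X[rng.choice(len(X), rows * cols)].astype(float)   # init from data
    total = cycles * len(X)
    r0 = max(rows, cols) - 1
    step = 0
    for _ in range(cycles):
        for x in X[rng.permutation(len(X))]:
            lr = lr0 * (1.0 - step / total)
            quarter = step / (total / 4.0)
            radius = max(1, round(r0 * (1.0 - quarter))) if quarter < 1 else 1
            winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            near = np.abs(coords - coords[winner]).max(axis=1) <= radius
            W[near] += lr * (x - W[near])
            step += 1
    return W

def kohonen_selection(X, fraction=0.75, seed=None):
    """Map the objects of one class, then draw ceil(fraction * n) objects
    from every occupied node (rounding up, so no node is emptied)."""
    rng = np.random.default_rng(seed)
    W = train_som(X, seed=seed)
    nodes = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
    train_idx = []
    for node in np.unique(nodes):
        members = np.flatnonzero(nodes == node)
        n_sel = int(np.ceil(fraction * members.size))   # round up
        train_idx.extend(rng.choice(members, size=n_sel, replace=False))
    return np.sort(np.array(train_idx))
```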
2.2.3. Kennard-Stone design [19,25,29]
The Kennard-Stone algorithm was originally used to produce a design when no standard experimental design can be applied. With this technique, all objects are considered as candidates for the training set. The design objects are chosen sequentially. At each stage, the aim is to select the objects
so that they are uniformly spaced over the object
space. The first two objects are selected by choosing
the two objects that are farthest apart. The third ob-
ject selected is the one farthest from the first two ob-
jects, etc. Let d_ij denote the squared Euclidean distance from the ith object to the jth object. Suppose k objects have already been selected, where k < t. The next object added to the design is the candidate whose smallest squared distance to the k objects already selected is largest, i.e. the object that maximises min_j d_ij over the selected objects j. This max-min step is repeated until the required number of training objects has been obtained.
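A compact sketch of this max-min selection, assuming the objects are stored row-wise in an array X (names illustrative):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone selection: start from the two most distant objects,
    then repeatedly add the candidate whose smallest squared distance to
    the already selected objects is largest (max-min criterion)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # all d_ij
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    selected = [int(i), int(j)]
    candidates = set(range(len(X))) - set(selected)
    while len(selected) < n_train:
        cand = np.array(sorted(candidates))
        min_d = d2[np.ix_(cand, selected)].min(axis=1)   # nearest selected
        best = int(cand[np.argmax(min_d)])
        selected.append(best)
        candidates.remove(best)
    return np.array(selected)
```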
3. Experimental

Data set 1 contains 140 NIR spectra of tablets containing drugs of four different dosages and three kinds of placebo, i.e. seven classes in all. Twenty different tablets of each dosage form (active and placebo) are measured four times through a glass plate on which the tablets are positioned. The average spectra of the four measurements were collected.

Data set 2 contains 160 NIR spectra (10001-4000 cm-1; 779 wavelengths) of capsules containing drugs
of different dosages (0.1, 0.25, 0.5, 1.0 and 2.5 mg)
and three kinds of placebo. Twenty different cap-
sules of each dosage form (active and placebo) are
measured four times through the glass plate.
Data set 3 contains 135 NIR spectra (1100-2500 nm; 700 wavelengths) of tablets containing different dosages (20, 50, 100 and 200 mg) of the experimental active ingredient, a placebo and a clinical comparator. There are respectively 15, 17, 15 and 21 spectra in the classes of the different dosages, 47 spectra in the placebo class and 20 spectra in the comparator class. Spectra are measured through the blister package, which contributes to the spectrum at around 1700 nm.
3.1. Data preprocessing

The pre-processing step consists of trimming, data transformation, and training set selection. The first and last 15 wavelengths were trimmed from each of the spectra to remove edge effects. In this study, the standard normal variate (SNV) transformation [24] was applied to reduce the effects of scatter, particle size, etc. After transformation, each data set was divided into the training set and test set by the techniques described. The data of the training set were subjected to a principal component analysis (PCA). The first 10 principal components were taken into consideration. The scores of the objects from the test set in the PC space were calculated using the loadings obtained from the training set.
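The trimming, SNV and PCA steps can be sketched as follows; the helper names are ours, and the PCA projection uses an SVD of the mean-centred training data so that test-set scores are computed from training-set loadings only, as the text requires:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum by its
    own mean and standard deviation to reduce scatter effects."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mu) / sd

def pca_scores(X_train, X_test, n_pc=10):
    """Fit PCA on the training set only and project both sets onto the
    first n_pc training-set loadings."""
    centre = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - centre, full_matrices=False)
    P = Vt[:n_pc].T                                  # loadings, m x n_pc
    return (X_train - centre) @ P, (X_test - centre) @ P

# usage sketch: trim the first and last 15 wavelengths, apply SNV, then PCA
# T_train, T_test = pca_scores(snv(A_train[:, 15:-15]), snv(A_test[:, 15:-15]))
```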
4. Neural network parameters

The multilayer feedforward network trained with the backpropagation learning algorithm was applied [22]. The goal of net training is to minimize the root mean square error (RMS):

RMS = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{g} ( y_{ij} - \mathrm{out}_{ij} )^{2} }{ N g } }

where y_ij is the element of the target matrix Y (N × g) for the data considered (training set or test set), and out_ij is the element of the output matrix out (N × g) of the NN.
To make backpropagation faster, the following three techniques were used: batching operation, adaptive learning rate and momentum [23]. In batching operation, one applies multiple input vectors simultaneously, instead of one input vector at a time, and obtains the network's response to each of them. Adding an adaptive learning rate can also decrease training time. This procedure increases the training speed, but only to the extent that the net can learn without large error increases. At each iteration, new weights and biases are calculated using the current learning rate, and the new output of the net and the error term are then calculated. If the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights, biases, output and error are discarded and, in addition, the learning rate is decreased (typically multiplied by 0.7). Otherwise the new weights, etc., are kept; if the new error is less than the old error, the learning rate is increased (typically multiplied by 1.05) [23]. Momentum decreases backpropagation's sensitivity to small details in the error surface and helps the net to avoid getting stuck in shallow minima.
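The adaptive learning rate rule, combined with momentum, can be sketched as a single batched update step; grad_fn and loss_fn are assumed helpers returning the error gradient and the RMS for a given weight vector, and the ratios 1.04, 0.7 and 1.05 are the "typical" values quoted above:

```python
def adaptive_step(w, velocity, lr, old_err, grad_fn, loss_fn,
                  momentum=0.3, err_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """One batched weight update with momentum and the adaptive
    learning-rate rule: reject steps that raise the error by more than
    err_ratio (and lower the rate); raise the rate when the error drops."""
    step = momentum * velocity - lr * grad_fn(w)   # momentum + gradient term
    new_w = w + step
    new_err = loss_fn(new_w)
    if new_err > old_err * err_ratio:
        return w, velocity, lr * lr_dec, old_err   # discard the step
    if new_err < old_err:
        lr *= lr_inc                               # learning went well
    return new_w, step, lr, new_err                # keep the new weights

# usage sketch: start with velocity = 0 and iterate adaptive_step per epoch
```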
To avoid overfitting, the performance of the network is tested every hundred or thousand epochs during the training, and the weights for which the minimal RMS for the test set is observed are recorded.
The target vector describing the membership of an object in a class was set to binary values: 1 for the corresponding class and 0 for the other classes. The final output of the net can be evaluated in two different ways: the object can be considered correctly classified if the largest output, regardless of its absolute value, is observed on the node signalling the correct class; or the object can be considered correctly classified if the largest output is observed on the node signalling the correct class and its value is higher than 0.5. The second criterion is stricter and was chosen in this study to evaluate NN performance. It allows soft modelling of the data, i.e. it can happen that an object is not classified into any of the predefined classes.

The performance of a classification system is additionally expressed as the percentage of correctly classified objects, i.e. the number of correctly classified objects of the training and test sets divided by the total number of objects present in these sets.
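Under the stricter criterion, the correct classification rate (CCR) can be computed as in this sketch (names illustrative; out and Y are the N × g output and target matrices):

```python
import numpy as np

def correct_classification_rate(out, y, threshold=0.5):
    """CCR under the stricter criterion: the winning output node must
    both signal the correct class and exceed the 0.5 threshold;
    otherwise the object stays unclassified (soft modelling)."""
    winners = out.argmax(axis=1)
    targets = y.argmax(axis=1)
    confident = out[np.arange(len(out)), winners] > threshold
    return 100.0 * ((winners == targets) & confident).mean()
```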
5. Results and discussion
5.1. Selection of NN architecture
To compare different techniques of training set se-
lection, we can compare them for the optimal struc-
ture obtained with each training set selection tech-
nique. As described by Zupan and Gasteiger [6], an-
other way is to compare the performance using a
fixed structure of NN. We chose the latter procedure.
The architecture of the network was first optimised
for the data which were divided into the training and
test sets by the Kennard-Stone algorithm. Then we
used this structure as the fixed structure. The input
and output values were range-scaled between 0.1 and 0.9, variable by variable. The backpropagation learning rule with adaptive learning rate and momentum was used. The initial values of the learning rate and momentum were fixed at 0.1 and 0.3, respectively.
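The range scaling used here amounts to the linear map x -> 0.1 + 0.8 (x - x_min)/(x_max - x_min) applied column-wise; a one-function sketch (names ours):

```python
import numpy as np

def range_scale(X, lo=0.1, hi=0.9):
    """Scale each variable (column) linearly into [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - xmin) / (xmax - xmin)
```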
An effort was made to check the influence of the random initialisation of the net weights upon the final classification results. For instance, with data set 1, rerunning a neural net (4 nodes in the hidden layer) 10 times with randomised initial weights results in correct classification rates (CCRs) for the training and test sets each time equal to 100%, while the mean values of the RMSs for the training and test sets are equal to 0.0565 and 0.0619, with standard deviations of 0.0061 and 0.0059, respectively. This demonstrates that the CCRs for the training and test sets are stable with different seeds of the random generator, although the RMSs change. The adaptive learning rate makes the results more independent of the initial values of the weights.
NN utilised in this study consisted of two active lay-
ers of nodes with a sigmoidal transfer function. The
number of nodes in the output layer is determined by
the number of classes. Normally the number of nodes
in the input layer is also determined by the structure
of the data. As already explained, for NIR data the
number of variables is much larger than the number
of objects and the variables are highly correlated. The
data can be orthogonalized and reduced by principal
component analysis, but then the number of input PCs
should be optimised. According to Widrow's suggestion (see Section 1), the number of objects ought to be about 10 times the number of weights. However, in practical use, the number of objects is limited and there are seldom so many available. Therefore, we relaxed this condition during the optimisation of the NN architecture: the ratio of the number of objects to the number of weights ought to be more than 1. If the
numbers of input and output nodes are fixed, the maximum number of hidden nodes can be estimated using this rule. For instance, if there are 60 objects in the training set, we never train an NN having more than 60 weights. If we want to use 10 input nodes and 6 output nodes, then the number of hidden nodes cannot exceed 3. The NN with 3 hidden nodes, 10 input nodes and 6 output nodes has altogether 57 (11 × 3 + 4 × 6) weights. For the net with 4 hidden nodes, the number of weights (11 × 4 + 5 × 6 = 74) is already larger than 60.
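Counting the weights, biases included, follows (m + 1) h + (h + 1) g for m input, h hidden and g output nodes; a small helper makes the arithmetic explicit (the function name is ours):

```python
def n_weights(n_in, n_hidden, n_out):
    """Adjustable parameters of a one-hidden-layer net, biases included:
    (n_in + 1) * n_hidden + (n_hidden + 1) * n_out."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

assert n_weights(10, 3, 6) == 57   # the 3-hidden-node example above
assert n_weights(10, 4, 6) == 74   # the 4-hidden-node example above
```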
There is no standard way to optimise the architecture of an NN. The simplest way is to try systematically all combinations of nodes to find the optimal numbers of nodes in the input and the hidden layer.

Table 1
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt) for all combinations within 10 input nodes and 4 hidden nodes; maximum number of epochs 5000

Input   1 hidden node    2 hidden nodes   3 hidden nodes   4 hidden nodes
nodes   CCR    CCRt      CCR    CCRt      CCR    CCRt      CCR    CCRt
 2      28.6   28.6      59.1   65.7      69.5   77.1      80.0   77.1
 3      28.6   28.6      66.7   71.4      91.4   97.1      100    100
 4      28.6   28.6      71.4   71.4      96.2   97.1      100    100
 5      28.6   28.6      69.5   71.4      100    100       100    100
 6      28.6   28.6      71.4   71.4      85.7   85.7      100    100
 7      28.6   28.6      75.2   77.1      100    100       100    100
 8      28.6   28.6      79.1   80.0      98.1   97.1      100    100
 9      28.6   28.6      71.4   68.6      100    94.3      100    97.1
10      28.6   28.6      71.4   68.6      99.1   97.1      100    100

Data set 1
is used as an example. We consider all combinations
of the number of input nodes varying from 2 to 10
(i.e. the first two to the first 10 PCs) and the number
of hidden nodes varying from 1 to 4. The results of
all these nets are shown in Table 1. There are eight
NNs which give 100% classification for both recog-
nition and prediction. However, this approach re-
quires a lot of trials, and could be even more time
consuming if we would like to take into account all
possible combinations of two, three, etc., PCs.
A more efficient approach to optimise the number
of nodes in the input and the hidden layers is as fol-
lows: first the number of input nodes is fixed at the
maximal number of PC factors, and the number of
hidden nodes is increased from small to large until the
performance of the network does not improve any
more or both the recognition and prediction percent-
ages are 100%.
The maximum number of PCs to be entered can be
decided by the variance explained (for instance, the
number of PCs needed to explain 99% variance) or
by the results of pilot experiments. In the latter case,
we train the net using the first 10 PCs as input and
the maximum number of hidden nodes which is esti-
mated by the above rule. If the NN performs well, 10
can be used as the maximum number of PCs. If the
performance is not satisfying, more PCs will be taken
into account.
When the number of nodes in the hidden layer has
been fixed and the maximum number of input nodes
has been decided, the number of input nodes is pruned
according to the value of the weights. If the weights
connected to one input node are all large, this indi-
cates that the variable corresponding to the input node
plays an important role in NN. If the weights con-
nected to one input node are all small, this indicates
that the variable corresponding to the input node plays
a small role in the NN and can be pruned off. If the
weights are intermediate, one can try to prune the
variable and if the NN still performs well one can
decide to prune it definitively. A hidden node can also
be pruned if the weights connecting the hidden node
to the input nodes, and the weights between the hid-
den node and the output nodes, are small.
Fig. 1. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (10 × 4 × 7); data set 1.
Fig. 2. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the network (10 × 4 × 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.
Fig. 3. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (3 × 4 × 7); data set 1.
The magnitude of the weights can be easily displayed in a Hinton diagram. This diagram displays the elements of the weight matrix as squares whose areas are proportional to their magnitudes. The bias vector is separated from the other weights by a solid vertical line. The largest square corresponds to the weight with the largest magnitude and all others are drawn with sizes relative to the largest square [23]. The sum of the absolute values of the weights connected to a node can be used to estimate the importance of the role played by that node. This pruning is repeated until the performance of the network degrades.
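The pruning loop just described can be sketched as follows; train_fn is an assumed helper that trains the net on the given input columns and returns the input-to-hidden weight matrix (bias row excluded) together with the training and test CCRs:

```python
import numpy as np

def prune_inputs(train_fn, scores, y, n_hidden, min_inputs=1):
    """Rank input nodes by the sum of absolute connected weights, drop
    the weakest, retrain, and stop as soon as performance degrades."""
    keep = list(range(scores.shape[1]))
    W, ccr, ccrt = train_fn(scores[:, keep], y, n_hidden)
    while len(keep) > min_inputs:
        importance = np.abs(W).sum(axis=1)        # one value per input node
        weakest = int(np.argmin(importance))
        trial = keep[:weakest] + keep[weakest + 1:]
        W2, c2, ct2 = train_fn(scores[:, trial], y, n_hidden)
        if c2 < ccr or ct2 < ccrt:
            break                                  # performance degraded
        keep, W, ccr, ccrt = trial, W2, c2, ct2
    return keep
```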
Table 2 demonstrates the results of classification for the sequence of steps in the optimisation of the net architecture for data set 1. As one can see, a 100% correct classification is observed for the NN with the first 10 PCs as input variables and 4 nodes in the hidden layer.

Table 2
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 5000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            3              99.1      97.1       568
10            4              100       100        534
 3            4              100       100        531
 2            4              80        77.1       627

Fig. 1 demonstrates the performance of the network with 10 input and 4 hidden nodes during the training. Fig. 2 shows the Hinton diagram. The weights of input nodes 4 to 10 are much smaller than those of the first three nodes. This suggests that the
PCs 4 to 10 do not contribute significantly to the network performance, while the first three PC factors play an important role in classification. After pruning PCs 4 to 10, the network performance does not decrease (Fig. 3). However, the recognition and prediction percentages become worse when the number of input nodes is reduced to 2 (PCs 3 to 10 rejected). Therefore, the optimal structure of the network for data set 1 is 3 input nodes and 4 nodes in the hidden layer. The final weights of the optimal network are shown in the Hinton diagrams (Figs. 4 and 5).
Fig. 4. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.
Fig. 5. (a) Hinton diagram of the weights between the nodes of the hidden layer and the nodes of the output layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of the node in the hidden layer; (c) sum of the absolute values of the weights of the node in the output layer; data set 1.
Fig. 6. The design of the training set by (a) random selection, (b) Kohonen self-organising map, (c) Kennard-Stone algorithm and (d) D-optimal design with a simulated data set; (*) objects of the training set; (·) objects of the test set.
Table 3
Data set 2: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 15000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            4              99.2      97.5       1670
10            5              100       100        2182
 9            5              100       100        1781
 8            5              99.2      100        1763
This indicates that the Hinton diagram can be used as a visual tool to reduce the number of input scores and to select the input variables. The optimal architecture is obtained after we train the NN four times, as described in Table 2, while for the same data we need 36 trials with the systematic trial method. This kind of pruning method is therefore much faster than the systematic trial method. However, the pruning method can be effectively applied only when the performance of the NN is very good. The idea of the method is to reduce the size of the architecture without changing the performance of the NN. If the performance of the NN is bad, there is no sense in trying to improve it by pruning.

Using the pruning method, the optimal architecture for data set 2 is 5 hidden nodes with 9 input nodes (Table 3); for data set 3 it is 3 hidden nodes with 6 input nodes (Table 4). These architectures and parameters were used in the following experiments.
Table 4
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
7             2              76.8      69.4       1699
7             3              100       100        1521
6             3              100       100        1867
5             3              99.0      100        1845

5.2. Comparison of the four techniques of training set selection

In order to visually compare the four techniques, two-dimensional data of 40 objects were first simulated. In this case, half of the objects were selected by
the studied methods except for the Kohonen method.
For the Kohonen method, 21 objects were selected,
because sometimes half of the objects mapping in the
same node was not an integer, and this was rounded
as described earlier.
Fig. 6 shows that the objects selected by the Kennard-Stone and D-optimal methods cover the whole data domain, while the objects selected by the random selection and Kohonen methods do not. The Kennard-Stone procedure selects the objects so that they are distributed evenly, while the D-optimal design selects the extreme objects. The number of objects falling outside the range of the selected objects is lower for the Kohonen method than for random selection. The Kennard-Stone and D-optimal methods seem to select objects that are more appropriate in the sense that they are more representative for building the class borders than the other methods.
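As a rough illustration of how D-optimal selection favours extreme objects, the following naive greedy sketch grows the training set by maximising det(Ts' Ts) over a low-dimensional score matrix T (e.g. PC scores). It is an illustration only, not the exchange algorithms of refs. [26,27]:

```python
import numpy as np

def d_optimal_greedy(T, n_train):
    """Naive greedy D-optimal selection on a score matrix T: start from
    the most extreme object and keep adding the candidate that maximises
    the determinant of the information matrix Ts' Ts."""
    selected = [int(np.argmax((T ** 2).sum(axis=1)))]   # extreme start
    while len(selected) < n_train:
        best, best_det = -1, -np.inf
        for i in range(len(T)):
            if i in selected:
                continue
            Ts = T[selected + [i]]
            det = np.linalg.det(Ts.T @ Ts)
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return np.array(selected)
```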
Further, we studied the effect of the four tech-
niques on the performance of NN by keeping the ar-
chitecture and parameters of the network constant. In
the methods of random selection, the training set ob-
jects are randomly selected and in the case of Koho-
nen self-organising maps, they are randomly selected
from each cluster. This random selection step leads to
the possibility that the selection is sometimes very
good and sometimes very bad. To overcome such a
drawback, the methods of random selection and Ko-
honen self-organising maps were repeated three
times. The results are shown in Tables 5-7. To train
the NN one time, it takes about 10 min for data set 1,
and half an hour for data sets 2 and 3.
Table 5
Data set 1: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)          CCRt (%)        Time (s)
Random          100 (105/105)    97.1 (34/35)    624
Random          100 (105/105)    100 (35/35)     634
Random          100 (105/105)    94.3 (33/35)    636
Kohonen         100 (113/113)    96.3 (26/27)    566
Kohonen         100 (116/116)    100 (24/24)     577
Kohonen         100 (116/116)    100 (24/24)     684
Kennard-Stone   100 (105/105)    100 (35/35)     531
D-optimal       100 (105/105)    100 (35/35)     516
-
7/24/2019 Walczak_Artificial Neural Networks in Classification of NIR Spectral Data
11/12
W. Wu et al./ Chemometrics and Intelligent Laboratory Systems 33 1996) 35-46
45
Table 6
Data set 2: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          99.2 (119/120)    92.5 (37/40)    2218
Random          100 (120/120)     97.5 (39/40)    2213
Random          99.2 (119/120)    97.5 (39/40)    2212
Kohonen         100 (132/132)     96.4 (27/28)    1942
Kohonen         100 (130/130)     100 (30/30)     1956
Kohonen         99.2 (130/131)    96.6 (28/29)    2309
Kennard-Stone   100 (120/120)     100 (40/40)     1781
D-optimal       100 (120/120)     100 (40/40)     2246

Table 8
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by D-optimal design; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
6             3              100       100        1881
5             3              100       100        1916
4             3              88.9      88.9       1915

For data set 1, there are no differences in the performance of recognition, the recognition percentages
being 100% for all methods. There are differences
though in the performance of prediction. The predic-
tion percentages with the Kennard-Stone and D-op-
timal training sets are 100%. For the random selec-
tion method, one of the three replicates gives 100%
prediction percentage. For the Kohonen method, two
of the three replicates perform 100% in prediction.
For data set 2, Kennard-Stone and D-optimal
training sets lead to perfect performance (100% cor-
rect classification) for both recognition and predic-
tion. The results of the random selection are not sat-
isfying. None of the replicates allow 100% predic-
tion, and only one of the three replicates gives 100%
of recognition. With the Kohonen method, the results
of one replicate are good (100% recognition and
100% prediction success), and the results of the other
two replicates are bad.
For data set 3, the Kennard-Stone and D-optimal training sets give the same perfect performance.

Table 7
Data set 3: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          100 (99/99)       94.4 (34/36)    1863
Random          100 (99/99)       88.9 (32/36)    1868
Random          100 (99/99)       100 (36/36)     1884
Kohonen         97.2 (103/106)    96.6 (28/29)    1667
Kohonen         100 (107/107)     100 (28/28)     1685
Kohonen         100 (110/110)     100 (25/25)     1632
Kennard-Stone   100 (99/99)       100 (36/36)     1867
D-optimal       100 (99/99)       100 (36/36)     1881

With the random selection and the Kohonen methods, the
performances of the three replicates are sometimes good and sometimes bad. One of the three replicates gives good results for the random selection method, and so do two of the three replicates for the Kohonen method.

In order to compare the Kennard-Stone design and the D-optimal design, we tried to optimise the architecture of the NN again for the D-optimal design. The architecture for data set 3 can be further improved using the pruning method (Table 8). The architectures for the other data sets cannot be pruned further. This suggests that the D-optimal selection might sometimes be slightly better than the Kennard-Stone selection. D-optimal selection selects the training set objects which describe the whole information as well as possible. There are more extreme objects selected by this method than by the other methods (Fig. 6). For classification, the aim is to derive the border of every class, and therefore the extreme objects are more useful than others during training.
6. Conclusion
Artificial NN are shown to be useful pattern
recognition tools for the classification of NIR spec-
tral data of drugs when the training sets are correctly
selected. Comparing the four training set selection
methods, the Kennard-Stone and D-optimal proce-
dures are better than the random selection and Koho-
nen methods. The results of the D-optimal design may
be slightly better than those of the Kennard-Stone
design. However, the computing time of the D-opti-
mal design (using Kennard-Stone design as the ini-
tial points) is larger than that of the Kennard-Stone
procedure. The random selection and Kohonen meth-
ods do not allow good performance in our study.
The number of data sets studied is not sufficiently
large to prove that these conclusions are always valid
for any data set. However, they allow us at least to state
that the Kennard-Stone procedure will be a useful
approach in certain instances, and according to us, in
most instances.
References
[1] X.H. Song and R.Q. Yu, Chemom. Intell. Lab. Syst., 19 (1993) 101-109.
[2] C. Borggaard and H.H. Thodberg, Anal. Chem., 64 (1992) 545-551.
[3] Y.W. Li and P.V. Espen, Chemom. Intell. Lab. Syst., 25 (1994) 241-248.
[4] D. Wienke and G. Kateman, Chemom. Intell. Lab. Syst., 23 (1994) 309-329.
[5] T.B. Blank and S.D. Brown, J. Chemom., 8 (1994) 391-407.
[6] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.
[7] T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1 (1993) 1-11.
[8] P. de B. Harrington, Chemom. Intell. Lab. Syst., 19 (1993) 143-154.
[9] B.J. Wythoff, Chemom. Intell. Lab. Syst., 20 (1993) 129-148.
[10] G. Kateman, Chemom. Intell. Lab. Syst., 19 (1993) 135-142.
[11] J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen and G. Kateman, Anal. Chim. Acta, 258 (1992) 1-25.
[12] A.P. Weijer, L. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 16 (1992) 77-86.
[13] J. Zupan and J. Gasteiger, Anal. Chim. Acta, 248 (1991) 1-30.
[14] B. Widrow, Adaline and Madaline, in: Proceedings of the IEEE 1st International Conference on Neural Networks, 1987, pp. 143-158.
[15] A. Maren, C. Harston and R. Pap, Handbook of Neural Computing Applications, Academic Press, San Diego, 1990.
[16] T. Naes, J. Chemom., 1 (1987) 121-134.
[17] T. Naes and T. Isaksson, Appl. Spectrosc., 43 (1989) 328-335.
[18] T. Kohonen, Self-Organisation and Associative Memory, Springer, Heidelberg, 1984.
[19] R.W. Kennard and L.A. Stone, Technometrics, 11 (1969) 137-148.
[20] R. Carlson, Design and Optimization in Organic Synthesis, Elsevier, Amsterdam, 1992.
[21] P.F. de Aguiar, B. Bourguignon, M.S. Khots and D.L. Massart, Chemom. Intell. Lab. Syst. (in press).
[22] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Biol. Cybernet., 59 (1988) 257-263.
[23] H. Demuth and M. Beale, Neural Network Toolbox User's Guide, The MathWorks, Inc., 1993.
[24] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc., 43 (1989) 772-777.
[25] B. Bourguignon, P.F. de Aguiar, K. Thorns and D.L. Massart, J. Chromatogr. Sci., 32 (1994) 144-152.
[26] T.J. Mitchell, Technometrics, 16 (1974) 203-210.
[27] V.V. Fedorov, Theory of Optimal Experiments, translated by W.J. Studden and E.M. Klimko, Academic Press, New York, 1972.
[28] A.C. Atkinson, Chemom. Intell. Lab. Syst., 28 (1995) 35-47.
[29] B. Bourguignon, P.F. de Aguiar, M.S. Khots and D.L. Massart, Anal. Chem., 66 (1994) 893-904.