Chemometrics and Intelligent Laboratory Systems 33 (1996) 35-46
Artificial neural networks in classification of NIR spectral data:
Design of the training set
W. Wu a, B. Walczak a,1, D.L. Massart a,*, S. Heuerding b, F. Erni b, I.R. Last c, K.A. Prebble c

a ChemoAC, Pharmaceutical Institute, Vrije Universiteit Brussel, Laarbeeklaan 103, B-1090 Brussel, Belgium
b Sandoz Pharma AG, Analytical Research and Development, CH-4002 Basle, Switzerland
c Analytical Department Laboratories, The Wellcome Foundation Ltd, Dartford, Kent DA1 5AH, UK

* Corresponding author.
1 On leave from Silesian University, Katowice, Poland.
Received 24 May 1995; accepted 18 September 1995
Abstract

Artificial neural networks (NN) with back-error propagation were used for classification with NIR spectra and applied to the classification of different strengths of drugs. Four training set selection methods were compared by applying each of them to three different data sets. The NN architecture was selected through a pruning method, and batching operation, adaptive learning rate and momentum were used to train the NN. The presented results demonstrate that the selection methods based on the Kennard-Stone and D-optimal designs are better than those based on the Kohonen self-organized mapping and on random selection, and allow 100% correct classification for both recognition and prediction. The Kennard-Stone design is more practical than the D-optimal design. The Kohonen self-organized mapping method is better than the random selection method.

Keywords: Drug analysis; Neural network; NIR; Pattern recognition
1. Introduction

One observes an increasing interest in the application of neural networks (NNs) in chemical calibration and pattern recognition problems [1-13]. Although NNs do not require any assumptions about the data distribution, they can be successfully applied only to sufficiently large and representative data sets.
The term 'sufficiently large' is relative. The important factor is the ratio of the number of samples to the number of weights considered in the net architecture. Widrow [14] suggests as a rule of thumb that the training set size should be about 10 times the number of weights in a network. According to other authors [15], the maximum number of nodes in the hidden layer should be of the order g(m + 1), where m and g denote the numbers of input and output units, respectively. Although these suggestions differ to some extent, all NN users agree that the higher the ratio of the number of samples to the number of weights, the
better the generalization ability of the NN. For a given number of samples this ratio can be maximized by minimising the net architecture (reducing input data, pruning redundant weights, etc.).
The second requirement, data representativity, means that the samples in the data set should be (evenly) spread over the expected range of data variability. In some cases it may be possible to generate such samples for the training set using experimental design techniques. However, in most cases, such as in the analysis of food samples, one does not have this possibility [16,17]. Usually, one needs to select the training (model) samples from a larger set of samples. The other samples can then be used to test the net. However, using all samples to train the net may lead to overfitting and to large prediction errors for the test set. To avoid this, the net training must be monitored. This means that, apart from the training set, one needs two other data sets, the monitoring and test sets. In industrial practice, the sample size is not very large. One can use the same data set to monitor training and later evaluate the NN. Hence, at least two data sets (the training and test sets) are required. The principles for the design of these two sets are the same as the principles of design of any model set. Our study aims to evaluate different strategies of training set design, namely random selection, the Kohonen self-organising map approach [18], and two new approaches proposed by us, namely the Kennard and Stone design [19] and the D-optimal design [20,21].
2. Theory
2.1. Notation
m      number of input variables of the NN
g      number of classes (i.e. number of output variables of the NN)
n      number of objects in the data set
t      number of objects in the training set
z      number of variables in the training set
X      matrix of the training set (n × m)
N      number of objects in the training set or in the test set
Y      target matrix (N × g)
out    output matrix of the NN (N × g)
2.2. Design of the training set

2.2.1. Random selection
There are several ways of selecting the training set. The simplest one is random selection, which means that no explicit selection criterion is applied. There is a risk that objects of some class are not selected into the training set. To avoid this risk, we select 3/4 of the objects separately from each class, and put them together as the training set. If 3/4 of the number of objects is not an integer, the number is rounded down to the nearest integer.
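A minimal sketch of this stratified random selection, assuming the class assignments are stored in an array labels (the function and parameter names are illustrative, not from the original paper):

```python
import numpy as np

def random_selection(labels, fraction=0.75, seed=None):
    """Stratified random selection: draw floor(fraction * n_c) objects
    from each class c and pool them as the training set indices."""
    rng = np.random.default_rng(seed)
    train_idx = []
    for c in np.unique(labels):
        members = np.flatnonzero(labels == c)
        n_c = int(np.floor(fraction * members.size))   # round down
        train_idx.extend(rng.choice(members, size=n_c, replace=False))
    train_idx = np.sort(np.array(train_idx))
    test_idx = np.setdiff1d(np.arange(len(labels)), train_idx)
    return train_idx, test_idx
```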
2.2.2. Kohonen self-organising maps [18,6]
Another possible procedure for selecting the training set is to apply clustering techniques. The Kohonen network can be applied as such [6]. Zupan et al. compared three kinds of methods on an example concerning the reactivity of chemical bonds, and found that the Kohonen self-organising map performed best [6]. The main goal of the Kohonen neural network is to map objects from m-dimensional space into a two-dimensional space. When objects have similar properties in the original space, they map to the same node. In this study, a (3 × 3) Kohonen network is chosen, containing 9 nodes. The learning rate is 0.1 at the beginning and is decreased linearly, so that it reaches 0 at the last training cycle. The neighbourhood size is also decreased linearly, but reaches a minimum of 1 after one-quarter of the training cycles and remains 1 for the rest of the training. The network is stabilised after each pattern has been presented to the network about 500 times. Then 3/4 of the objects mapping to the same node are randomly selected. If 3/4 of the number of objects is not an integer, it is rounded up to the nearest integer, since otherwise some nodes would contribute no objects after rounding. This procedure is applied to each class separately. All selected objects are put together as the training set.
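A rough sketch of this scheme, using a minimal (3 × 3) Kohonen map with the linear learning-rate and neighbourhood schedules described above; in line with the text it would be applied to the objects of one class at a time. All names are illustrative, and the update is a plain winner-plus-neighbourhood correction rather than any particular toolbox implementation:

```python
import numpy as np

def train_som(X, grid=(3, 3), cycles=500, lr0=0.1, seed=None):
    """Minimal Kohonen map. The learning rate decays linearly from lr0
    to 0; the (Chebyshev) neighbourhood radius decays linearly to 1
    during the first quarter of training and then stays at 1."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)])
    W = X[rng.choice(len(X), rows * cols)].astype(float)   # init from data
    total = cycles * len(X)
    r0 = max(rows, cols) - 1
    step = 0
    for _ in range(cycles):
        for x in X[rng.permutation(len(X))]:
            lr = lr0 * (1.0 - step / total)
            quarter = step / (total / 4.0)
            radius = max(1, round(r0 * (1.0 - quarter))) if quarter < 1 else 1
            winner = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            near = np.abs(coords - coords[winner]).max(axis=1) <= radius
            W[near] += lr * (x - W[near])
            step += 1
    return W

def kohonen_selection(X, fraction=0.75, seed=None):
    """Map the objects of one class, then draw ceil(fraction * n) objects
    from every occupied node (rounding up, so no node is emptied)."""
    rng = np.random.default_rng(seed)
    W = train_som(X, seed=seed)
    nodes = np.argmin(((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2), axis=1)
    train_idx = []
    for node in np.unique(nodes):
        members = np.flatnonzero(nodes == node)
        n_sel = int(np.ceil(fraction * members.size))   # round up
        train_idx.extend(rng.choice(members, size=n_sel, replace=False))
    return np.sort(np.array(train_idx))
```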
2.2.3. Kennard-Stone design [19,25,29]
The Kennard-Stone algorithm was originally used to produce a design when no standard experimental design can be applied. With this technique, all objects are considered as candidates for the training set. The design objects are chosen sequentially. At each stage, the aim is to select the objects
so that they are uniformly spaced over the object
space. The first two objects are selected by choosing
the two objects that are farthest apart. The third ob-
ject selected is the one farthest from the first two ob-
jects, etc. Let d_ij denote the squared Euclidean distance from the ith object to the jth object. Suppose k objects have already been selected, where k < t. The next object added to the design is the candidate whose smallest squared distance to the k objects already selected is largest, i.e. the object that maximises min_j d_ij over the selected objects j. This max-min step is repeated until the required number of training objects has been obtained.
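A compact sketch of this max-min selection, assuming the objects are stored row-wise in an array X (names illustrative):

```python
import numpy as np

def kennard_stone(X, n_train):
    """Kennard-Stone selection: start from the two most distant objects,
    then repeatedly add the candidate whose smallest squared distance to
    the already selected objects is largest (max-min criterion)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)   # all d_ij
    i, j = np.unravel_index(np.argmax(d2), d2.shape)
    selected = [int(i), int(j)]
    candidates = set(range(len(X))) - set(selected)
    while len(selected) < n_train:
        cand = np.array(sorted(candidates))
        min_d = d2[np.ix_(cand, selected)].min(axis=1)   # nearest selected
        best = int(cand[np.argmax(min_d)])
        selected.append(best)
        candidates.remove(best)
    return np.array(selected)
```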
3. Experimental

Data set 1 contains 140 NIR spectra of tablets containing drugs of four different dosages and three kinds of placebo, i.e. seven classes in all. Twenty different tablets of each dosage form (active and placebo) are measured four times through a glass plate on which the tablets are positioned. The average spectra of the four measurements were collected.

Data set 2 contains 160 NIR spectra (10001-4000 cm-1; 779 wavelengths) of capsules containing drugs
of different dosages (0.1, 0.25, 0.5, 1.0 and 2.5 mg)
and three kinds of placebo. Twenty different cap-
sules of each dosage form (active and placebo) are
measured four times through the glass plate.
Data set 3 contains 135 NIR spectra (1100-2500 nm; 700 wavelengths) of tablets containing different dosages (20, 50, 100 and 200 mg) of the experimental active ingredient, a placebo and a clinical comparator. There are respectively 15, 17, 15 and 21 spectra in the classes of the different dosages, 47 spectra in the placebo class and 20 spectra in the comparator class. Spectra are measured through the blister package, which contributes to the spectrum at around 1700 nm.
3.1. Data preprocessing

The pre-processing step consists of trimming, data transformation, and training set selection. The first and last 15 wavelengths were trimmed from each of the spectra to remove edge effects. In this study, the standard normal variate (SNV) transformation [24] was applied to reduce the effects of scatter, particle size, etc. After transformation, each data set was divided into the training set and test set by the techniques described. The data of the training set were subjected to a principal component analysis (PCA). The first 10 principal components were taken into consideration. The scores of the objects from the test set in the PC space were calculated using the loadings obtained from the training set.
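The trimming, SNV and PCA steps can be sketched as follows; the helper names are ours, and the PCA projection uses an SVD of the mean-centred training data so that test-set scores are computed from training-set loadings only, as the text requires:

```python
import numpy as np

def snv(spectra):
    """Standard normal variate: centre and scale each spectrum by its
    own mean and standard deviation to reduce scatter effects."""
    mu = spectra.mean(axis=1, keepdims=True)
    sd = spectra.std(axis=1, ddof=1, keepdims=True)
    return (spectra - mu) / sd

def pca_scores(X_train, X_test, n_pc=10):
    """Fit PCA on the training set only and project both sets onto the
    first n_pc training-set loadings."""
    centre = X_train.mean(axis=0)
    _, _, Vt = np.linalg.svd(X_train - centre, full_matrices=False)
    P = Vt[:n_pc].T                                  # loadings, m x n_pc
    return (X_train - centre) @ P, (X_test - centre) @ P

# usage sketch: trim the first and last 15 wavelengths, apply SNV, then PCA
# T_train, T_test = pca_scores(snv(A_train[:, 15:-15]), snv(A_test[:, 15:-15]))
```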
4. Neural network parameters

The multilayer feedforward network trained with the backpropagation learning algorithm was applied [22]. The goal of net training is to minimize the root mean square error (RMS):

RMS = \sqrt{ \frac{ \sum_{i=1}^{N} \sum_{j=1}^{g} ( y_{ij} - \mathrm{out}_{ij} )^{2} }{ N g } }

where y_ij is the element of the target matrix Y (N × g) for the data considered (training set or test set), and out_ij is the element of the output matrix out (N × g) of the NN.
To make backpropagation faster, the following three techniques were used: batching operation, adaptive learning rate and momentum [23]. In batching operation, one applies multiple input vectors simultaneously, instead of one input vector at a time, and obtains the network's response to each of them. Adding an adaptive learning rate can also decrease training time. This procedure increases the training speed, but only to the extent that the net can learn without large error increases. At each iteration, new weights and biases are calculated using the current learning rate, and the new output of the net and the error term are then calculated. If the new error exceeds the old error by more than a predefined ratio (typically 1.04), the new weights, biases, output and error are discarded and, in addition, the learning rate is decreased (typically multiplied by 0.7). Otherwise the new weights, etc., are kept; if the new error is less than the old error, the learning rate is increased (typically multiplied by 1.05) [23]. Momentum decreases backpropagation's sensitivity to small details in the error surface and helps the net to avoid getting stuck in shallow minima.
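The adaptive learning rate rule, combined with momentum, can be sketched as a single batched update step; grad_fn and loss_fn are assumed helpers returning the error gradient and the RMS for a given weight vector, and the ratios 1.04, 0.7 and 1.05 are the "typical" values quoted above:

```python
def adaptive_step(w, velocity, lr, old_err, grad_fn, loss_fn,
                  momentum=0.3, err_ratio=1.04, lr_dec=0.7, lr_inc=1.05):
    """One batched weight update with momentum and the adaptive
    learning-rate rule: reject steps that raise the error by more than
    err_ratio (and lower the rate); raise the rate when the error drops."""
    step = momentum * velocity - lr * grad_fn(w)   # momentum + gradient term
    new_w = w + step
    new_err = loss_fn(new_w)
    if new_err > old_err * err_ratio:
        return w, velocity, lr * lr_dec, old_err   # discard the step
    if new_err < old_err:
        lr *= lr_inc                               # learning went well
    return new_w, step, lr, new_err                # keep the new weights

# usage sketch: start with velocity = 0 and iterate adaptive_step per epoch
```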
To avoid overfitting, the performance of the network is tested every hundred or thousand epochs during the training, and the weights for which the minimal RMS for the test set is observed are recorded.
The target vector describing the membership of an object in a class was set to binary values: 1 for the corresponding class and 0 for the other classes. The final output of the net can be evaluated in two different ways: the object can be considered correctly classified if the largest output, regardless of its absolute value, is observed on the node signalling the correct class; or the object can be considered correctly classified if the largest output is observed on the node signalling the correct class and its value is higher than 0.5. The second criterion is stricter and was chosen in this study to evaluate NN performance. It allows soft modelling of the data, i.e. it can happen that an object is not classified into any of the predefined classes.

The performance of a classification system is additionally expressed as the percentage of correctly classified objects, i.e. the number of correctly classified objects of the training and test sets divided by the total number of objects present in these sets.
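Under the stricter criterion, the correct classification rate (CCR) can be computed as in this sketch (names illustrative; out and Y are the N × g output and target matrices):

```python
import numpy as np

def correct_classification_rate(out, y, threshold=0.5):
    """CCR under the stricter criterion: the winning output node must
    both signal the correct class and exceed the 0.5 threshold;
    otherwise the object stays unclassified (soft modelling)."""
    winners = out.argmax(axis=1)
    targets = y.argmax(axis=1)
    confident = out[np.arange(len(out)), winners] > threshold
    return 100.0 * ((winners == targets) & confident).mean()
```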
5. Results and discussion
5.1. Selection of NN architecture
To compare different techniques of training set se-
lection, we can compare them for the optimal struc-
ture obtained with each training set selection tech-
nique. As described by Zupan and Gasteiger [6], an-
other way is to compare the performance using a
fixed structure of NN. We chose the latter procedure.
The architecture of the network was first optimised
for the data which were divided into the training and
test sets by the Kennard-Stone algorithm. Then we
used this structure as the fixed structure. The input
and output values were range-scaled between 0.1 and 0.9, variable by variable. The backpropagation learning rule with adaptive learning rate and momentum was used. The initial values of the learning rate and momentum were fixed at 0.1 and 0.3, respectively.
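The range scaling used here amounts to the linear map x -> 0.1 + 0.8 (x - x_min)/(x_max - x_min) applied column-wise; a one-function sketch (names ours):

```python
import numpy as np

def range_scale(X, lo=0.1, hi=0.9):
    """Scale each variable (column) linearly into [lo, hi]."""
    xmin, xmax = X.min(axis=0), X.max(axis=0)
    return lo + (hi - lo) * (X - xmin) / (xmax - xmin)
```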
An effort was made to check the influence of the random initialisation of the net weights upon the final classification results. For instance, with data set 1, rerunning a neural net (4 nodes in the hidden layer) 10 times with randomised initial weights results in correct classification rates (CCRs) for the training and test sets each time equal to 100%, while the mean values of the RMSs for the training and test sets are equal to 0.0565 and 0.0619, with standard deviations of 0.0061 and 0.0059, respectively. This demonstrates that the CCRs for the training and test sets are stable with different seeds of the random generator, although the RMSs change. The adaptive learning rate makes the results more independent of the initial values of the weights.
NN utilised in this study consisted of two active lay-
ers of nodes with a sigmoidal transfer function. The
number of nodes in the output layer is determined by
the number of classes. Normally the number of nodes
in the input layer is also determined by the structure
of the data. As already explained, for NIR data the
number of variables is much larger than the number
of objects and the variables are highly correlated. The
data can be orthogonalized and reduced by principal
component analysis, but then the number of input PCs
should be optimised. According to Widrow's suggestion (see Section 1), the number of objects ought to be about 10 times the number of weights. However, in practical use, the number of objects is limited and there are seldom so many available. Therefore, we relaxed this condition during the optimisation of the NN architecture: the ratio of the number of objects to the number of weights ought to be more than 1. If the
numbers of input and output nodes are fixed, the maximum number of hidden nodes can be estimated using this rule. For instance, if there are 60 objects in the training set, we never train an NN having more than 60 weights. If we want to use 10 input nodes and 6 output nodes, then the number of hidden nodes cannot exceed 3. The NN with 3 hidden nodes, 10 input nodes and 6 output nodes has altogether 57 (11 × 3 + 4 × 6) weights. For the net with 4 hidden nodes, the number of weights (11 × 4 + 5 × 6 = 74) is already larger than 60.
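Counting the weights, biases included, follows (m + 1) h + (h + 1) g for m input, h hidden and g output nodes; a small helper makes the arithmetic explicit (the function name is ours):

```python
def n_weights(n_in, n_hidden, n_out):
    """Adjustable parameters of a one-hidden-layer net, biases included:
    (n_in + 1) * n_hidden + (n_hidden + 1) * n_out."""
    return (n_in + 1) * n_hidden + (n_hidden + 1) * n_out

assert n_weights(10, 3, 6) == 57   # the 3-hidden-node example above
assert n_weights(10, 4, 6) == 74   # the 4-hidden-node example above
```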
There is no standard way to optimise the architecture of an NN. The simplest way is to try systematically all combinations of nodes to find the optimal numbers of nodes in the input and the hidden layer.

Table 1
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt) for all combinations within 10 input nodes and 4 hidden nodes; maximum number of epochs 5000

Input   1 hidden node    2 hidden nodes   3 hidden nodes   4 hidden nodes
nodes   CCR    CCRt      CCR    CCRt      CCR    CCRt      CCR    CCRt
 2      28.6   28.6      59.1   65.7      69.5   77.1      80.0   77.1
 3      28.6   28.6      66.7   71.4      91.4   97.1      100    100
 4      28.6   28.6      71.4   71.4      96.2   97.1      100    100
 5      28.6   28.6      69.5   71.4      100    100       100    100
 6      28.6   28.6      71.4   71.4      85.7   85.7      100    100
 7      28.6   28.6      75.2   77.1      100    100       100    100
 8      28.6   28.6      79.1   80.0      98.1   97.1      100    100
 9      28.6   28.6      71.4   68.6      100    94.3      100    97.1
10      28.6   28.6      71.4   68.6      99.1   97.1      100    100

Data set 1
is used as an example. We consider all combinations
of the number of input nodes varying from 2 to 10
(i.e. the first two to the first 10 PCs) and the number
of hidden nodes varying from 1 to 4. The results of
all these nets are shown in Table 1. There are eight
NNs which give 100% classification for both recog-
nition and prediction. However, this approach re-
quires a lot of trials, and could be even more time
consuming if we would like to take into account all
possible combinations of two, three, etc., PCs.
A more efficient approach to optimise the number
of nodes in the input and the hidden layers is as fol-
lows: first the number of input nodes is fixed at the
maximal number of PC factors, and the number of
hidden nodes is increased from small to large until the
performance of the network does not improve any
more or both the recognition and prediction percent-
ages are 100%.
The maximum number of PCs to be entered can be
decided by the variance explained (for instance, the
number of PCs needed to explain 99% variance) or
by the results of pilot experiments. In the latter case,
we train the net using the first 10 PCs as input and
the maximum number of hidden nodes which is esti-
mated by the above rule. If the NN performs well, 10
can be used as the maximum number of PCs. If the
performance is not satisfying, more PCs will be taken
into account.
When the number of nodes in the hidden layer has
been fixed and the maximum number of input nodes
has been decided, the number of input nodes is pruned
according to the value of the weights. If the weights
connected to one input node are all large, this indi-
cates that the variable corresponding to the input node
plays an important role in NN. If the weights con-
nected to one input node are all small, this indicates
that the variable corresponding to the input node plays
a small role in the NN and can be pruned off. If the
weights are intermediate, one can try to prune the
variable and if the NN still performs well one can
decide to prune it definitively. A hidden node can also
be pruned if the weights connecting the hidden node
to the input nodes, and the weights between the hid-
den node and the output nodes, are small.
Fig. 1. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (10 × 4 × 7); data set 1.
Fig. 2. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the network (10 × 4 × 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.
Fig. 3. (a) The root mean square error (RMS) as a function of the number of training epochs; (b) the percentage of correctly classified objects as a function of the number of training epochs; network architecture (3 × 4 × 7); data set 1.
The magnitude of the weights can be easily displayed in a Hinton diagram. This diagram displays the elements of the weight matrix as squares whose areas are proportional to their magnitudes. The bias vector is separated from the other weights by a solid vertical line. The largest square corresponds to the weight with the largest magnitude and all others are drawn with sizes relative to the largest square [23]. The sum of the absolute values of the weights connected to a node can be used to estimate the importance of the role played by that node. This pruning is repeated until the performance of the network degrades.
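The pruning loop just described can be sketched as follows; train_fn is an assumed helper that trains the net on the given input columns and returns the input-to-hidden weight matrix (bias row excluded) together with the training and test CCRs:

```python
import numpy as np

def prune_inputs(train_fn, scores, y, n_hidden, min_inputs=1):
    """Rank input nodes by the sum of absolute connected weights, drop
    the weakest, retrain, and stop as soon as performance degrades."""
    keep = list(range(scores.shape[1]))
    W, ccr, ccrt = train_fn(scores[:, keep], y, n_hidden)
    while len(keep) > min_inputs:
        importance = np.abs(W).sum(axis=1)        # one value per input node
        weakest = int(np.argmin(importance))
        trial = keep[:weakest] + keep[weakest + 1:]
        W2, c2, ct2 = train_fn(scores[:, trial], y, n_hidden)
        if c2 < ccr or ct2 < ccrt:
            break                                  # performance degraded
        keep, W, ccr, ccrt = trial, W2, c2, ct2
    return keep
```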
Table 2 demonstrates the results of classification for the sequence of steps in the optimisation of the net architecture for data set 1. As one can see, a 100% correct classification is observed for the NN with the first 10 PCs as input variables and 4 nodes in the hidden layer.

Table 2
Data set 1: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 5000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            3              99.1      97.1       568
10            4              100       100        534
 3            4              100       100        531
 2            4              80        77.1       627

Fig. 1 demonstrates the performance of the network with 10 input and 4 hidden nodes during the training. Fig. 2 shows the Hinton diagram. The weights of input nodes 4 to 10 are much smaller than those of the first three nodes. This suggests that the
PCs 4 to 10 do not contribute significantly to the network performance, while the first three PC factors play an important role in classification. After pruning PCs 4 to 10, the network performance does not decrease (Fig. 3). However, the recognition and prediction percentages become worse when the number of input nodes is reduced to 2 (PCs 3 to 10 rejected). Therefore, the optimal structure of the network for data set 1 is 3 input nodes and 4 nodes in the hidden layer. The final weights of the optimal network are shown in the Hinton diagrams (Figs. 4 and 5).
Fig. 4. (a) Hinton diagram of the weights between the nodes of the input layer and the nodes of the hidden layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of the node in the input layer; (c) sum of the absolute values of the weights of the node in the hidden layer; data set 1.
Fig. 5. (a) Hinton diagram of the weights between the nodes of the hidden layer and the nodes of the output layer in the optimal network (3 × 4 × 7); (b) sum of the absolute values of the weights of the node in the hidden layer; (c) sum of the absolute values of the weights of the node in the output layer; data set 1.
Fig. 6. The design of the training set by (a) random selection, (b) Kohonen self-organising map, (c) Kennard-Stone algorithm and (d) D-optimal design with a simulated data set; (*) objects of the training set; (·) objects of the test set.
Table 3
Data set 2: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 15000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
10            4              99.2      97.5       1670
10            5              100       100        2182
 9            5              100       100        1781
 8            5              99.2      100        1763
This indicates that the Hinton diagram can be used as a visual tool to reduce the number of input scores and to select the input variables. The optimal architecture is obtained after we train the NN four times, as described in Table 2, while for the same data we need 36 trials with the systematic trial method. This kind of pruning method is therefore much faster than the systematic trial method. However, the pruning method can be effectively applied only when the performance of the NN is very good. The idea of the method is to reduce the size of the architecture without changing the performance of the NN. If the performance of the NN is bad, there is no sense in trying to improve it by pruning.

Using the pruning method, the optimal architecture for data set 2 is 5 hidden nodes with 9 input nodes (Table 3); for data set 3 it is 3 hidden nodes with 6 input nodes (Table 4). These architectures and parameters were used in the following experiments.
Table 4
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by the Kennard-Stone procedure; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
7             2              76.8      69.4       1699
7             3              100       100        1521
6             3              100       100        1867
5             3              99.0      100        1845

5.2. Comparison of the four techniques of training set selection

In order to visually compare the four techniques, two-dimensional data of 40 objects were first simulated. In this case, half of the objects were selected by
the studied methods except for the Kohonen method.
For the Kohonen method, 21 objects were selected,
because sometimes half of the objects mapping in the
same node was not an integer, and this was rounded
as described earlier.
Fig. 6 shows that the objects selected by the Kennard-Stone and D-optimal methods cover the whole data domain, while the objects selected by the random selection and Kohonen methods do not. The Kennard-Stone procedure selects the objects so that they are distributed evenly, while the D-optimal design selects the extreme objects. The number of objects falling outside the range of the selected objects is lower for the Kohonen method than for random selection. The Kennard-Stone and D-optimal methods seem to select objects that are more appropriate in the sense that they are more representative for building the class borders than the other methods.
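As a rough illustration of how D-optimal selection favours extreme objects, the following naive greedy sketch grows the training set by maximising det(Ts' Ts) over a low-dimensional score matrix T (e.g. PC scores). It is an illustration only, not the exchange algorithms of refs. [26,27]:

```python
import numpy as np

def d_optimal_greedy(T, n_train):
    """Naive greedy D-optimal selection on a score matrix T: start from
    the most extreme object and keep adding the candidate that maximises
    the determinant of the information matrix Ts' Ts."""
    selected = [int(np.argmax((T ** 2).sum(axis=1)))]   # extreme start
    while len(selected) < n_train:
        best, best_det = -1, -np.inf
        for i in range(len(T)):
            if i in selected:
                continue
            Ts = T[selected + [i]]
            det = np.linalg.det(Ts.T @ Ts)
            if det > best_det:
                best, best_det = i, det
        selected.append(best)
    return np.array(selected)
```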
Further, we studied the effect of the four tech-
niques on the performance of NN by keeping the ar-
chitecture and parameters of the network constant. In
the methods of random selection, the training set ob-
jects are randomly selected and in the case of Koho-
nen self-organising maps, they are randomly selected
from each cluster. This random selection step leads to
the possibility that the selection is sometimes very
good and sometimes very bad. To overcome such a
drawback, the methods of random selection and Ko-
honen self-organising maps were repeated three
times. The results are shown in Tables 5-7. To train
the NN one time, it takes about 10 min for data set 1,
and half an hour for data sets 2 and 3.
Table 5
Data set 1: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)          CCRt (%)        Time (s)
Random          100 (105/105)    97.1 (34/35)    624
Random          100 (105/105)    100 (35/35)     634
Random          100 (105/105)    94.3 (33/35)    636
Kohonen         100 (113/113)    96.3 (26/27)    566
Kohonen         100 (116/116)    100 (24/24)     577
Kohonen         100 (116/116)    100 (24/24)     684
Kennard-Stone   100 (105/105)    100 (35/35)     531
D-optimal       100 (105/105)    100 (35/35)     516
-
7/24/2019 Walczak_Artificial Neural Networks in Classification of NIR Spectral Data
11/12
W. Wu et al./ Chemometrics and Intelligent Laboratory Systems 33 1996) 35-46
45
Table 6
Data set 2: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          99.2 (119/120)    92.5 (37/40)    2218
Random          100 (120/120)     97.5 (39/40)    2213
Random          99.2 (119/120)    97.5 (39/40)    2212
Kohonen         100 (132/132)     96.4 (27/28)    1942
Kohonen         100 (130/130)     100 (30/30)     1956
Kohonen         99.2 (130/131)    96.6 (28/29)    2309
Kennard-Stone   100 (120/120)     100 (40/40)     1781
D-optimal       100 (120/120)     100 (40/40)     2246

Table 8
Data set 3: correctly classified rate of the training set (CCR) and test set (CCRt); training set selected by D-optimal design; maximum number of epochs 20000

Input nodes   Hidden nodes   CCR (%)   CCRt (%)   Time (s)
6             3              100       100        1881
5             3              100       100        1916
4             3              88.9      88.9       1915

For data set 1, there are no differences in the performance of recognition, the recognition percentages
being 100% for all methods. There are differences
though in the performance of prediction. The predic-
tion percentages with the Kennard-Stone and D-op-
timal training sets are 100%. For the random selec-
tion method, one of the three replicates gives 100%
prediction percentage. For the Kohonen method, two
of the three replicates perform 100% in prediction.
For data set 2, Kennard-Stone and D-optimal
training sets lead to perfect performance (100% cor-
rect classification) for both recognition and predic-
tion. The results of the random selection are not sat-
isfying. None of the replicates allow 100% predic-
tion, and only one of the three replicates gives 100%
of recognition. With the Kohonen method, the results
of one replicate are good (100% recognition and
100% prediction success), and the results of the other
two replicates are bad.
For data set 3, the Kennard-Stone and D-optimal training sets give the same perfect performance.

Table 7
Data set 3: comparison of the four different techniques of training set selection; number of correctly classified objects divided by the total number of objects expressed between parentheses

Method          CCR (%)           CCRt (%)        Time (s)
Random          100 (99/99)       94.4 (34/36)    1863
Random          100 (99/99)       88.9 (32/36)    1868
Random          100 (99/99)       100 (36/36)     1884
Kohonen         97.2 (103/106)    96.6 (28/29)    1667
Kohonen         100 (107/107)     100 (28/28)     1685
Kohonen         100 (110/110)     100 (25/25)     1632
Kennard-Stone   100 (99/99)       100 (36/36)     1867
D-optimal       100 (99/99)       100 (36/36)     1881

With the random selection and the Kohonen methods, the
performances of the three replicates are sometimes good and sometimes bad. One of the three replicates gives good results for the random selection method, and so do two of the three replicates for the Kohonen method.

In order to compare the Kennard-Stone design and the D-optimal design, we tried to optimise the architecture of the NN again for the D-optimal design. The architecture for data set 3 can be further improved using the pruning method (Table 8). The architectures for the other data sets cannot be pruned further. This suggests that the D-optimal selection might sometimes be slightly better than the Kennard-Stone selection. D-optimal selection selects the training set objects which describe the whole information as well as possible. There are more extreme objects selected by this method than by the other methods (Fig. 6). For classification, the aim is to derive the border of every class, and therefore the extreme objects are more useful than others during training.
6. Conclusion
Artificial NN are shown to be useful pattern
recognition tools for the classification of NIR spec-
tral data of drugs when the training sets are correctly
selected. Comparing the four training set selection
methods, the Kennard-Stone and D-optimal proce-
dures are better than the random selection and Koho-
nen methods. The results of the D-optimal design may
be slightly better than those of the Kennard-Stone
design. However, the computing time of the D-opti-
mal design (using Kennard-Stone design as the ini-
tial points) is larger than that of the Kennard-Stone
procedure. The random selection and Kohonen meth-
ods do not allow good performance in our study.
The number of data sets studied is not sufficiently
large to prove that these conclusions are always valid
for any data set. However, they allow us at least to state
that the Kennard-Stone procedure will be a useful
approach in certain instances, and according to us, in
most instances.
References
[1] X.H. Song and R.Q. Yu, Chemom. Intell. Lab. Syst., 19 (1993) 101-109.
[2] C. Borggaard and H.H. Thodberg, Anal. Chem., 64 (1992) 545-551.
[3] Y.W. Li and P.V. Espen, Chemom. Intell. Lab. Syst., 25 (1994) 241-248.
[4] D. Wienke and G. Kateman, Chemom. Intell. Lab. Syst., 23 (1994) 309-329.
[5] T.B. Blank and S.D. Brown, J. Chemom., 8 (1994) 391-407.
[6] J. Zupan and J. Gasteiger, Neural Networks for Chemists: An Introduction, VCH, Weinheim, 1993.
[7] T. Naes, K. Kvaal, T. Isaksson and C. Miller, J. Near Infrared Spectrosc., 1 (1993) 1-11.
[8] P. de B. Harrington, Chemom. Intell. Lab. Syst., 19 (1993) 143-154.
[9] B.J. Wythoff, Chemom. Intell. Lab. Syst., 20 (1993) 129-148.
[10] G. Kateman, Chemom. Intell. Lab. Syst., 19 (1993) 135-142.
[11] J.R.M. Smits, L.W. Breedveld, M.W.J. Derksen and G. Kateman, Anal. Chim. Acta, 258 (1992) 1-25.
[12] A.P. Weijer, L. Buydens and G. Kateman, Chemom. Intell. Lab. Syst., 16 (1992) 77-86.
[13] J. Zupan and J. Gasteiger, Anal. Chim. Acta, 248 (1991) 1-30.
[14] B. Widrow, Adaline and Madaline, in: Proceedings of the IEEE 1st International Conference on Neural Networks, 1987, pp. 143-158.
[15] A. Maren, C. Harston and R. Pap, Handbook of Neural Computing Applications, Academic Press, San Diego, 1990.
[16] T. Naes, J. Chemom., 1 (1987) 121-134.
[17] T. Naes and T. Isaksson, Appl. Spectrosc., 43 (1989) 328-335.
[18] T. Kohonen, Self-Organisation and Associative Memory, Springer, Heidelberg, 1984.
[19] R.W. Kennard and L.A. Stone, Technometrics, 11 (1969) 137-148.
[20] R. Carlson, Design and Optimization in Organic Synthesis, Elsevier, Amsterdam, 1992.
[21] P.F. de Aguiar, B. Bourguignon, M.S. Khots and D.L. Massart, Chemom. Intell. Lab. Syst. (in press).
[22] T.P. Vogl, J.K. Mangis, A.K. Rigler, W.T. Zink and D.L. Alkon, Biol. Cybernet., 59 (1988) 257-263.
[23] H. Demuth and M. Beale, Neural Network Toolbox User's Guide, The MathWorks, Inc., 1993.
[24] R.J. Barnes, M.S. Dhanoa and S.J. Lister, Appl. Spectrosc., 43 (1989) 772-777.
[25] B. Bourguignon, P.F. de Aguiar, K. Thorns and D.L. Massart, J. Chromatogr. Sci., 32 (1994) 144-152.
[26] T.J. Mitchell, Technometrics, 16 (1974) 203-210.
[27] V.V. Fedorov, Theory of Optimal Experiments, translated by W.J. Studden and E.M. Klimko, Academic Press, New York, 1972.
[28] A.C. Atkinson, Chemom. Intell. Lab. Syst., 28 (1995) 35-47.
[29] B. Bourguignon, P.F. de Aguiar, M.S. Khots and D.L. Massart, Anal. Chem., 66 (1994) 893-904.