Robust Single Hidden Layer Feedforward Neural Networks for Pattern Classification
Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy
Kevin (Hoe Kwang) Lee
Faculty of Engineering and Industrial Sciences Swinburne University of Technology
Melbourne, Australia
2013
Abstract

The artificial neural network (ANN) has become one of the most widely researched
artificial intelligence techniques since the 1970s. The ANN is known to be capable of
providing good performance in many practical applications such as system modelling,
forecasting, classification, and regression, using only a finite number of available
samples. Its strength resides in its capability of learning complex non-linear mappings
and performing parallel processing. Given many learning paradigms developed for the
ANNs, the generalization capability and robustness are the key design criteria to ensure
the high quality of performance in practical implementations. This thesis focuses on the
development of robust supervised training algorithms for a class of ANNs known as the
single hidden layer feedforward neural networks (SLFNs).
In conventional SLFN training algorithms, the backpropagation learning method
is widely adopted to iteratively tune the weights and biases of SLFNs. The slow
learning speed of the iterative algorithms has been a major bottleneck in practice. In
addition, the backpropagation training process has a high probability of stopping at local
minima of the cost functions in weight space, because the final training outcome is
heavily dependent on user-initialized parameters.
The extreme learning machine (ELM) has recently emerged as a new learning
technique for SLFNs. It has been shown that the ELM can overcome most of the
disadvantages of conventional training algorithms. Different from the conventional
training algorithms used for SLFNs, the ELM proposes that (i) the input weights and the
hidden layer biases can be randomly assigned without training, and (ii) the output
weights can be analytically determined using the Moore-Penrose pseudoinverse. Thus,
the ELM has prominent advantages of extremely fast learning speed, less user
intervention, and good generalization performance.
Based on the ELM, three new training algorithms are developed and verified for
SLFNs in this thesis. The first training algorithm considers the design of the input
weights of the SLFNs according to the finite-impulse-response filtering theory, such
that the outputs of the SLFN hidden layer become less sensitive to the input pattern
noise. In addition, the optimal design of the output weights of the SLFNs will also be
developed using both the regularization theory and the prediction risk minimization
theory. The proposed optimal designs of both the input and output weights will then
significantly improve the robustness of the SLFN classifiers in dealing with real-world
datasets. Specifically, the classification of bioinformatics datasets related to cancer
diagnosis will be studied in detail. The relationship between the linear separability of
the bioinformatics sample patterns and the classification performance of the above
proposed training algorithm will be investigated together with a newly developed
frequency domain feature selection method.
The second training algorithm considers the optimal design of the input weights
based on the discrete Fourier transform of feature vectors. The feature assignment
method is developed based on the pole placement theory in control systems, which
optimally determines the target features for pattern classes, such that the noise
components can be reduced, and the respective target features are maximally separated.
In the third training algorithm, unlike the one developed in the second scheme,
the desired feature vectors are defined first in the feature space, and the input weights
are then designed in the sense that, in the training phase, the feature vectors from the
hidden layer outputs of the SLFNs can be placed at the desired positions in the feature
space, specified by the desired feature vectors. The reference feature vectors, in this
algorithm, are chosen from the hidden layer outputs of the SLFN classifiers trained with
the ELM. Both the regularization theory and the prediction risk minimization theory
will be employed to balance and reduce the optimization risks when determining the
hidden layer and the output layer weights. In addition, the performance of all three
proposed training algorithms for the SLFNs will be verified using a series of sound clip
recognition and handwriting recognition experiments.
Acknowledgement

I would like first to thank my supervisors Prof. Zhihong Man and Dr. Zhenwei Cao for
their most valuable guidance and support during the past four years of my research,
which has culminated in this thesis. They have made every effort to provide
academic and financial advice to make my PhD life an interesting and rewarding one.
Especially, I would like to give my greatest gratitude to Prof. Zhihong Man for his
constant supervision of my research progress, and for making sure that I met the
requirements and deadlines throughout the course of the past 4 years. There has not
been another lecturer who is so accessible and enthusiastic in teaching me the much
needed research and writing skills that are vital to complete this thesis.
I am grateful to Swinburne University for awarding me the SUPRA scholarship
and providing me with a comfortable and conducive working environment. Special
thanks to Cathy, Melissa, and Adrianna from the research student administration and
support team, all of you have helped me settle in at my workplace and attended to my
enquiries promptly and always with a welcoming smile. I am also indebted to the senior
technical staff, Walter and Phil, who swiftly resolved any issues I had with my research
equipment and building access cards. The little things that these university staff did
made all the difference in my everyday life as a PhD candidate.
My research life would have been a little bit boring if not for the friends in my
research group. Therefore, to Fei Siang, Aiji, Sui Sin, Wang Hai, and Tuan Do, I am
very thankful to have you all as friends and fellow researchers. The countless sharing
sessions we had proved invaluable in keeping our PhD lives a little more vibrant.
Last but not least, I would like to thank my family for their unrelenting support
during my times of doubt and hardship. Without them I would not have come this far.
Declaration

This is to certify that:
1. This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the examinable outcome.
2. To the best of the candidate’s knowledge, this thesis contains no material
previously published or written by another person except where due reference is made in the text of the examinable outcome.
3. The work is based on joint research and publications; the relative
contributions of the respective authors are disclosed.

________________________
Kevin (Hoe Kwang) Lee, 2013
Contents

1.0 Introduction 1
1.1. Motivations 2
1.1.1. Selection of Robust Hidden Layer Weights 2
1.1.2. Optimal Design of Output Layer Weights 3
1.2. Objectives and Major Contributions 5
1.3. Organization of the Thesis 6
2.0 Literature Review 9
2.1. Introduction 10
2.2. Statistical Classifiers 12
2.3. Single Hidden Layer Feedforward Neural Networks 15
2.3.1. Gradient Descent Based Algorithms 17
2.3.2. Standard Optimization Method Based Algorithms 21
2.3.3. Least Squares Based Algorithms 21
2.4. Extreme Learning Machine 22
2.4.1. Learning Theories of ELM 25
2.4.2. Batch Learning ELM 30
2.4.3. Sequential Learning ELM 33
2.4.4. ELM Ensembles 33
2.5. Regularized ELM 34
2.6. ELM and SVM 38
2.7. ELM and RVFL 42
2.8. Applications of ELM 43
2.8.1. Medical Systems 43
2.8.2. Image Processing 46
2.8.3. Face Recognition 46
2.8.4. Handwritten Character Recognition 47
2.8.5. Sales Forecasting 48
2.8.6. Parameter Estimation 49
2.8.7. Information Systems 50
2.8.8. Control Systems 51
2.9. Conclusion 53
3.0 Finite Impulse Response Extreme Learning Machine 55
3.1. Introduction 55
3.2. Problem Formulation 58
3.3. Design of the Robust Input Weights of SLFNs 65
3.4. Design of the Robust Output Weight Matrix 70
3.5. Experiments and Results 73
3.6. Conclusion 84
4.0 Classification of Bioinformatics Datasets with FIR-ELM
for Cancer Diagnosis 85
4.1. Introduction 85
4.2. Time Series Analysis of Microarrays 88
4.3. Linear Separability of Microarrays 94
4.4. Outline of the FIR-ELM 96
4.4.1. Basic FIR-ELM 96
4.4.2. Frequency Domain Gene Feature Selection 99
4.5. Experiments and Results 101
4.5.1. Biomedical Datasets 101
4.5.2. Experimental Settings 102
4.5.3. Leukemia Dataset 103
4.5.4. Colon Tumor Dataset 107
4.6. Discussions 110
4.6.1. Linear Separability of the Hidden Layer Output
for SLFN 110
4.7. Conclusion 112
5.0 Frequency Spectrum Based Learning Machine 113
5.1. Introduction 114
5.2. Problem Formulation 117
5.3. Design of the Optimal Input and Output Weights 123
5.4. Experiments and Results 130
5.4.1. Example 1: Classification of Low Frequency Sound
Clips 130
5.4.2. Example 2: Classification of Handwritten Digits 138
5.5. Conclusion 148
6.0 An Optimal Weight Learning Machine for Handwritten
Digit Image Recognition 149
6.1. Introduction 149
6.2. Problem Formulation 153
6.3. Optimal Weight Learning Machine 164
6.4. Experiments and Results 172
6.5. Conclusion 181
7.0 Conclusions and Future Work 183
7.1. Summary of Contributions 183
7.2. Future Research 184
7.2.1. Ensemble Methods 184
7.2.2. Analytical Determination of Regularization Parameters 185
7.2.3. Analysis of the Effects of Non-linear Nodes 185
7.2.4. Multi-class Classification of Real-World Dataset 186
Bibliography 187
Appendix: Matlab Codes 211
List of Publications 221
List of Figures

2.1 A single hidden layer feedforward neural network 16
2.2 The backpropagation training algorithm 18
2.3 (a) Gradient descent on a convex cost function
(b) Gradient descent on a non-convex cost function with randomly
initialized weights 20
2.4 A set of handwritten digits from the MNIST database 48
3.1 A single hidden layer neural network with linear nodes 59
3.2 Output of the SLFN with the ELM 62
3.3 Output error of the SLFN with the ELM 62
3.4 Output of the SLFN with the modified ELM 63
3.5 Output error of the SLFN with the modified ELM 63
3.6 Output of the SLFN with the FIR hidden nodes 64
3.7 Output error of the SLFN with the FIR hidden nodes 64
3.8 A sound clip modulated by the envelope function 74
3.9 The disturbed sound clip 74
3.10 Signal classification with the ELM algorithm 75
3.11 The RMSE with the ELM algorithm 75
3.12 Signal classification with the modified ELM algorithm 76
3.13 The RMSE with the modified ELM algorithm 76
3.14 Signal classification with the FIR-ELM algorithm with the
rectangular window 77
3.15 The RMSE with the FIR-ELM algorithm with the rectangular
window 78
3.16 Signal classification with the FIR-ELM algorithm with the Kaiser
window 79
3.17 The RMSE with the FIR-ELM algorithm with the Kaiser window 79
3.18 The RMSE via d/γ for the FIR-ELM algorithm 81
3.19 Signal classification using the SLFN classifier with the non-linear
hidden nodes and trained with the FIR-ELM algorithm 81
3.20 The RMSE of the SLFN classifier with the non-linear hidden nodes
and trained with the FIR-ELM algorithm 82
3.21 The non-linear sigmoid function 82
3.22 Signal classification using the SLFN classifier with the non-linear
hidden nodes and trained with the FIR-ELM algorithm 83
3.23 The RMSE of the SLFN classifier with the non-linear hidden nodes
and trained with the FIR-ELM algorithm 83
3.24 The non-linear sigmoid function 84
4.1 Aggregated time series for the colon dataset 90
4.2 The filtered and detrended time series 91
4.3 Plot of the residual between the two time series 93
4.4 Overlaid plot of the two time series for genes 1000 to 1040 94
4.5 A single hidden layer feedforward neural network with linear nodes
and an input tapped delay line 98
4.6 Frequency response of a sample from the colon dataset 100
4.7 An FIR filter design search algorithm for FIR-ELM 100
4.8 Classification performance for leukemia with low pass filter 104
4.9 Classification performance for leukemia with high pass filter 105
4.10 Classification performance for leukemia with band pass filter 105
4.11 Classification performance for colon dataset with low pass filter 107
4.12 Classification performance for colon dataset with high pass filter 108
4.13 Classification performance for colon dataset with band pass filter 108
5.1 A single hidden layer network with linear nodes and an input delay
line 117
5.2 A sound clip modulated by the envelope function 131
5.3 The disturbed sound clip with SNR of 10 dB 131
5.4 Classification using R-ELM with SNR of 10 dB 133
5.5 The RMSE using R-ELM with SNR of 10 dB 133
5.6 The classification using FIR-ELM with SNR of 10 dB 134
5.7 The RMSE using FIR-ELM with SNR of 10 dB 134
5.8 Classification using DFT-ELM with SNR of 10 dB 135
5.9 The RMSE using DFT-ELM with SNR of 10 dB 135
5.10 The RMSE and classification error for the DFT-ELM 137
5.11 The RMSE and classification error via the number of hidden nodes
with the DFT-ELM 138
5.12 A set of handwritten digits from the MNIST database 139
5.13 (a) Image of digit and (b) Segmented image 139
5.14 (a)~(j) Encoded sample data set for images 0 to 9 140
5.15 Classification accuracies of the SLFN classifiers with DFT-ELM,
FIR-ELM and R-ELM via the regularization parameter ratio 147
5.16 Classification accuracies of the SLFN classifiers with DFT-ELM,
FIR-ELM and R-ELM via the number of training samples 148
6.1 A single hidden layer neural network with both random input
weights and random hidden layer biases 156
6.2 Recognition of the handwritten digits by using the SLFN classifier
with 10 hidden nodes trained with the ELM 159
6.3 Recognition of the handwritten digits by using the SLFN classifier
with 50 hidden nodes trained with the ELM 159
6.4 Recognition of the handwritten digits by using the SLFN classifier
with 100 hidden nodes trained with the ELM 160
6.5 Recognition of the handwritten digits by using the SLFN classifier
with 10 hidden nodes trained with the R-ELM 161
6.6 Recognition of the handwritten digits by using the SLFN classifier
with 50 hidden nodes trained with the R-ELM 162
6.7 Recognition of the handwritten digits by using the SLFN classifier
with 100 hidden nodes trained with the R-ELM 162
6.8 A single hidden layer neural network with linear nodes and an
input tapped delay line 164
6.9 A set of handwritten digits from the MNIST database 172
6.10 (a) Image of digit and (b) Segmented image 172
6.11 (a)~(c) Sample feature vectors for digit 0 173
6.12 Recognition of the handwritten digits by using the SLFN classifier
with 10 hidden nodes trained with the OWLM 175
6.13 Recognition of the handwritten digits by using the SLFN classifier
with 50 hidden nodes trained with the OWLM 176
6.14 Recognition of the handwritten digits by using the SLFN classifier
with 100 hidden nodes trained with the OWLM 176
6.15 Classification accuracy versus regularization ratio for the MNIST
dataset 179
6.16 Classification accuracy versus regularization ratio for the USPS
dataset 180
6.17 Classification accuracies with OWLM, R-ELM and ELM via the
number of training samples for the MNIST dataset 180
6.18 Classification accuracies with OWLM, R-ELM and ELM via the
number of training samples for the USPS dataset 181
List of Tables

2.1 Research incorporating the ELM and SVM 42
3.1 Comparison of averaged RMSE for sound clip recognition
experiment 80
4.1 Correlation coefficient of colon dataset binary classes with
different FIR filters 92
4.2 Summary of leukemia and colon datasets 95
4.3 Linearly separable gene pairs for leukemia and colon datasets 95
4.4 Selection of DCT coefficients for leukemia and colon datasets 102
4.5 Classification performance for leukemia dataset 106
4.6 Confusion matrix for classification of leukemia dataset 106
4.7 Classification performance for colon dataset 109
4.8 Confusion matrix for classification of colon dataset 109
4.9 Linearly separable gene pairs for the hidden layer output in ELM
and FIR-ELM for leukemia and colon datasets 111
5.1 Comparisons of the R-ELM, FIR-ELM and DFT-ELM 136
5.2 Classification accuracies of the handwritten digit classification with
R-ELM, FIR-ELM and DFT-ELM 146
6.1 Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the MNIST dataset 177
6.2 Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the USPS dataset 178
List of Abbreviations and Acronyms

AI – artificial intelligence
ALL – acute lymphoblastic leukemia
AML – acute myeloid leukemia
ANN – artificial neural networks
AWGN – additive white Gaussian noise
BC – Bayesian classifier
BP – backpropagation
CBP – circular backpropagation
C-ELM – circular-ELM
CO-ELM – constrained-optimization based ELM
CV – cross-validation
DCT – discrete cosine transform
DFT – discrete Fourier transform
DFT-ELM – discrete Fourier transform extreme learning machine
DTFT – discrete time Fourier transform
EEG – electroencephalogram
ELM – extreme learning machine
ERM – empirical risk minimization
FGFS – frequency domain gene feature selection
FIR – finite impulse response
FIR-ELM – finite impulse response extreme learning machine
FLDA – Fisher’s linear discriminant analysis
FNR – false negative rate
FPR – false positive rate
GA – genetic algorithm
ICA – independent component analysis
LDA – linear discriminant analysis
MIMO – multiple-inputs-multiple-outputs
MLP – multilayer perceptron
MSE – mean-squared-error
OAA – one against all
OAO – one against one
OS – online sequential
OWLM – optimal weights learning machine
PCA – principal component analysis
RBF – radial basis function
R-ELM – regularized extreme learning machine
RMSE – root-mean-square error
ROC – receiver operating characteristic
RVFL – random vector functional link
SLFN – single hidden layer feedforward neural network
SNR – signal to noise ratio
SRM – structural risk minimization
SVD – singular value decomposition
SVM – support vector machine
TER – total error rate
TNR – true negative rate
TPR – true positive rate
Chapter 1
Introduction
Artificial intelligence (AI) techniques have become increasingly popular in
recent years. The AI techniques provide scientists and engineers with alternative
methods for solving complex problems that do not have parametric solutions or for
which the parametric solutions are too difficult to express analytically. Among many AI
methods, artificial neural network (ANN) is a versatile technique and has been widely
applied to many areas such as system modelling, economic forecasting, fault diagnosis,
bioinformatics, handwriting recognition, and information management. Its strength is
largely due to its capability of learning complex non-linear mappings and performing
parallel processing [1], [2]. Given a finite set of sample data from an unknown system,
the ANN is capable of providing a practical solution for system modelling or pattern
classification in a much shorter time frame compared to traditional analytical methods.
Although the use of ANNs has already achieved great success in numerous real-
world applications, several major drawbacks have been noted in practice. For instance, a
longer training time may be required for recursive learning and the training process may
easily stop at local minima. These issues have inhibited the use of ANNs in
situations where fast online training is required, global solutions are essential,
and large amounts of data need to be processed. Recently, Huang et al. [3] proposed a
compact training algorithm known as the extreme learning machine or ELM for single
hidden layer feedforward neural networks (SLFNs). The ELM has been shown to
provide an extremely fast training for SLFNs, while achieving an optimal global
solution in the sense of least squares.
The aim of this research is to develop several new robust training algorithms for
SLFNs to function as pattern classifiers with fewer hidden nodes, achieving
excellent robustness properties and high classification accuracy. Several computer-
generated and real-world pattern classification examples and experiments are provided
to demonstrate the effectiveness of the proposed algorithms.
1.1 Motivations
It is well known that the SLFN has the capability of performing universal
approximation [4-7]. Such an advantage has made SLFNs highly adaptable to various
environments where the processes may have uncertainties and complex dynamics, and
the collected data may be degraded by disturbances and noise. However, many
existing training algorithms developed for the SLFN are based on the conventional
backpropagation (BP) [8]. Such recursive BP-based algorithms have the
disadvantages of long training times, slow convergence, and stopping at local minima,
as mentioned in the previous section.
Although the recently proposed ELM [3] has provided some partial solutions to
the issues mentioned above, there is still an urgent need to focus on the development of
robust neural networks pattern classifiers to deal with real-world data that are highly
disturbed or have non-linearly separable patterns. Therefore the research performed
herein intends to cover this important area by building upon the compact concept
introduced in the ELM. Some important issues that are revealed by our recent study on
the ELM [9-12] include:
• Non-optimal assignment of hidden layer weights for the SLFN.
• Lack of a holistic prediction risk minimization strategy in the design of the
SLFN weights and biases.
1.1.1 Selection of Robust Hidden Layer Weights
The work of Zhu et al. [37] was amongst the first to state that the randomly selected
weights and biases in the hidden layer of the ELM are not optimal, in the sense that the
projections provided by the random weights generate stochastic results for each trial. It
is then generally accepted that the ELM requires repeated testing to obtain reliable
results. Since then, many modified algorithms have been proposed to initialize the
hidden layer weights and biases of the SLFN being trained with the ELM algorithm.
The attempts in [33-38] provided some improvements over the random selection
method, either in terms of the accuracy achieved or the conditioning of the network matrices.
However, these methods approach the problem of weight selection from a
mathematical or biological computing point of view, treating it as a global
optimization problem with respect to the provided training samples. As the hidden layer
of the SLFN is often seen as a projection layer, responsible for mapping the input
space into the feature space, the selection of the weights should also take the
robustness of the generated outputs into consideration. From an engineering point of view, the
robustness refers to the consistent performance of the SLFN given a wide range of
unseen samples that are noisy. Therefore, this thesis focuses on developing robust
hidden layer weights assignment strategies.
1.1.2 Optimal Design of Output Layer Weights
In the ELM training algorithm, the pseudo-inverse provides the least squares solution for the
output layer weights of the SLFN. It is also known as an empirical risk minimization
(ERM) operation. The ERM works well for learning algorithms with a finite number of
training samples when the sample set size is large enough such that the sample points
are dense in the pattern space. However, the problem becomes complex when the
number of dimensions involved is large due to the curse of dimensionality. The curse of
dimensionality states that as the number of dimensions of a sample pattern increases, it
becomes exponentially harder to approximate the underlying function. This is because
as the number of dimensions increases, the sampled patterns have a high probability of
being far apart [83]. When learning is performed on such sample points which are far
apart, the interpolation between the points becomes hard to determine because there
may be more than one solution based on the ERM principle.
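This "far apart" effect is easy to check numerically: the mean pairwise distance between points sampled uniformly in the unit hypercube grows roughly as the square root of the dimension. The small NumPy sketch below (the thesis's own code is in Matlab; the sample sizes here are arbitrary, for illustration only) demonstrates the idea:

```python
import numpy as np

def mean_pairwise_distance(n_points, n_dims, seed=0):
    """Average Euclidean distance between points drawn uniformly in [0, 1]^n_dims."""
    rng = np.random.default_rng(seed)
    X = rng.random((n_points, n_dims))
    diffs = X[:, None, :] - X[None, :, :]      # all pairwise difference vectors
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(n_points, k=1)        # count each unordered pair once
    return dists[iu].mean()

for d in (2, 10, 100):
    print(d, mean_pairwise_distance(200, d))
```

The printed averages grow with the dimension d, illustrating why a fixed-size sample set becomes sparse, and interpolation ill-posed, in high-dimensional pattern spaces.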
In order to find an optimal solution under the ill-posed problem stated above,
some constraints need to be introduced in the learning process such that the
interpolation between the sample points is smooth. In the absence of large sample
datasets that are dense in pattern space, the alternative is to introduce a penalty term in
the cost function of the output weights. Two of the commonly used cost functions with
penalties are given in (1.1.1) and (1.1.2):

J₁ = ‖ε‖² + γ‖W‖₁   (1.1.1)

J₂ = ‖ε‖² + γ‖W‖₂²   (1.1.2)

where ε denotes the output error of the SLFN, W denotes the output layer weight
matrix, and the two penalties are known as the L1-norm and the L2-norm (Tikhonov
regularization [84], [85]) respectively. The regularization constant γ scales the
importance of the regularization term, the magnitude of the SLFN output layer
weights, with respect to the output error ε. The purpose of using the magnitude of the
output layer weights as a penalty is to ensure that the solutions of the interpolation
function have small magnitudes, thereby restricting the optimization of the cost
function to smoother solutions. This characteristic is also known as the structural risk
minimization (SRM) principle discussed by Vapnik in [86]. In the context of training
the SLFN, the prediction risk can be defined as the sum of the approximation risk and
the estimation risk, where each risk is embodied by an error term:

prediction risk = approximation risk + estimation risk   (1.1.3)
The approximation risk represents the closeness of fit with respect to the sample
targets, while the estimation risk represents the generalization ability over unseen
samples. The regularization constant γ controls the emphasis of the optimization process,
trading off the SRM, which increases the robustness and
generalization capability, against the ERM, which reduces the classifier error on the
training sets. If the focus is placed on the ERM, it is easy to over-fit the data and hence
degrade the final testing performance, while placing heavier emphasis on the SRM may
reduce the size of the feature space significantly and distort the outputs; this issue is
more formally known as the bias-variance dilemma [83], [87]. Selecting the optimal
value of γ is usually done using cross-validation in practical applications.
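As a concrete illustration, the L2-regularized output weights that minimize the squared output error plus γ times the squared weight norm have the closed form W = (HᵀH + γI)⁻¹HᵀT, where H is the hidden layer output matrix and T the target matrix. The NumPy sketch below (the thesis's own code is in Matlab; the hold-out selection of γ shown here is an illustrative stand-in for full cross-validation, not the thesis's exact procedure) shows both steps:

```python
import numpy as np

def regularized_output_weights(H, T, gamma):
    """Closed-form Tikhonov (L2) solution: W = (H'H + gamma*I)^-1 H'T."""
    n = H.shape[1]
    return np.linalg.solve(H.T @ H + gamma * np.eye(n), H.T @ T)

def select_gamma(H_tr, T_tr, H_val, T_val, candidates):
    """Pick the gamma with the smallest validation error (simple hold-out)."""
    errors = []
    for g in candidates:
        W = regularized_output_weights(H_tr, T_tr, g)
        errors.append(np.mean((H_val @ W - T_val) ** 2))
    return candidates[int(np.argmin(errors))]
```

Larger γ shrinks the weight norm (heavier SRM emphasis, smoother interpolation), while γ → 0 recovers the plain least-squares (ERM) solution.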
Three major contributions to improve the robustness of the ELM training
algorithm using regularization methods were proposed respectively by Toh [88], Deng
et al. [89], and Huang et al. [90] recently. The authors recognized that the ELM
algorithm is susceptible to perturbations and noise in the input sample sets and
introduced variants of the regularization method discussed above. However, an open
question still revolves around finding a deterministic method to estimate the optimal
value of the regularization constant γ. In this thesis, the formulation of the optimization
functions will be reconsidered to better understand the effects of regularization in
reducing the prediction risks.
All the issues above need to be addressed appropriately in the design of the new
training algorithms in this thesis.
1.2 Objectives and Major Contributions
The aim of this thesis is to develop a new breed of robust, optimized training
algorithms for SLFN pattern classifiers, with new strategies for optimizing both the
input and the output weights, so that the classifiers are capable of processing a wide
variety of real-world data with varying noise profiles and degrees of non-linear separability.
The major contributions of the thesis are outlined as follows:
• Develop robust input weight assignment strategies for the SLFN pattern
classifiers that are less sensitive to noise and undesired features.
• Optimally design the output weights of the SLFN pattern classifier such that the
sensitivity of the SLFN outputs can be significantly reduced with respect to
changes at the hidden layer outputs.
• Develop a new optimal weight design for the ELM pattern classifier by
minimizing the total prediction risk of the SLFNs.
• Develop and test the practical implementation of the robust pattern classifiers
designed in this thesis using real-world bioinformatics datasets with application
to cancer diagnosis.
In summary, the research performed in this thesis develops and implements a
group of robust neural network based pattern classifiers that are designed to handle real-
world datasets degraded by a wide range of noise and disturbances. The developed
pattern classifiers have the potential to find a vast range of applications, especially, in
the fields of bioinformatics, data mining, telecommunications, and signals and image
processing.
1.3 Organization of the Thesis
This thesis explores the design of new training algorithms for SLFN pattern classifiers
that exhibit strong robustness with respect to noise and undesired features, and
improve classification accuracy in real-world applications. The rest of the thesis
is organized as follows:
Chapter 2 begins with a brief overview of conventional pattern classifiers with a focus
on SLFNs, and the conventional training algorithms developed for SLFNs. A
comprehensive survey of ELM with emphasis on the latest technical developments and
applications is then presented to pave the background for the following main chapters of
this thesis.
Chapter 3 proposes a new finite impulse response extreme learning machine (FIR-ELM).
The new algorithm is characterized with the strategy of the input weights assignment
based on the finite impulse response digital filtering concept, which gives the SLFN
classifiers the capability of suppressing the effects of noise and reducing the
sensitivity of the hidden layer outputs. The optimal output layer weights are then
designed to balance and reduce both the empirical and structural prediction risks. The
resulting training algorithm is then verified by classifying noisy audio clips in the
simulation example.
Chapter 4 applies the FIR-ELM developed in Chapter 3 for the classification of
bioinformatics datasets for cancer diagnosis. Considering the complexity of the gene
microarray samples, a frequency domain feature selection algorithm is first developed to
determine the filtering technique used for designing the input weights. The FIR-ELM is
then implemented to design both the input and output weights. In the simulation
example, the effectiveness of the FIR-ELM for the classification of the gene microarray
samples is confirmed with high classification accuracy through comparison with several
existing neural classification algorithms.
Chapter 5 develops a new frequency spectrum based learning machine for the SLFNs
with the discrete Fourier transform (DFT) theory. In order to minimize the effects of the
input pattern disturbances and maximally separate patterns in the feature space, the
frequency spectra of the input patterns are analysed to define the desired feature
vectors in the frequency domain. The input weights are then optimized to ensure that, in
the frequency domain, the DFT of the feature vectors can be placed at their desired
positions. In addition, the output layer weights are optimally designed to balance and
reduce the empirical and structural prediction risks. The feasibility and good
performance of the new algorithm are further confirmed in the simulation examples with
both linearly separable and non-linearly separable data.
Chapter 6 investigates an optimal weight learning machine (OWLM) for handwritten
digit image recognition. Unlike the input weight learning machine developed in the
frequency domain in Chapter 5, a strategy of using the feature vectors of an ELM-based
SLFN classifier as the reference features is first formulated. The input weights are then
optimized in the sense that the real feature vectors from the SLFN’s hidden layer
outputs can be placed at the locations specified by the reference feature vectors in the
feature space. It will be seen in the simulation examples that the OWLM exhibits
strong robustness and achieves high accuracy for the handwritten digit image
recognition of both the MNIST and USPS handwriting datasets.
Chapter 7 summarises the main contributions of this thesis and presents a few open
questions for future work.
The author’s publications based on this thesis’ research are given at the end of
this thesis. In addition, the Matlab codes for each of the new learning algorithms
developed in this thesis are provided in the Appendix.
Chapter 2
Literature Review
Neural networks based pattern classifiers have been applied in many areas of
engineering and science due to their ease of use and good generalization performance.
Conventional training algorithms for neural networks, such as the backpropagation (BP)
method, are known to face the issues of tedious training time, stopping at local minima,
having many meta-parameters to tune, and having high variance (unpredictable)
solutions. Recently, Huang et al. proposed the extreme learning machine (ELM) for
single hidden layer feedforward neural networks, which is able to learn thousands of
times faster than the BP and even the support vector machine (SVM). The learning
process of the ELM involves only two main steps: (1) Compute the hidden layer outputs
with randomly generated input weights and biases; (2) Compute the output weights
using the pseudo-inverse of the hidden layer output matrix. The
classification performance of the ELM has been proven to be comparable or better than
the one of the SVM, especially in the multi-class pattern classification problems.
Research into the optimization frameworks of both the ELM and the LS-SVM has
further shown that the ELM attains the global optimal solution whereas the LS-SVM
does not; in fact, the ELM is closely related to a simplified LS-SVM optimization with
milder constraints. Most importantly, the ELM method represents a unified solution for
binary and multi-class pattern classification problems. This chapter provides a review of
the ELM and its extensions, including some issues on the ELM, and the applications of
the ELM.
2.1 Introduction
Pattern classification problems have been receiving a lot of attention from researchers in
diverse fields of study for more than half a century [178]. The main goal of the
research in this area is to develop a set of rules or a discriminating system to accurately
identify patterns and allocate them to their appropriate classes. The earliest form of
pattern classification is based on the concept of template matching [178]. It requires a
database of all possible outcomes for any new samples to be compared against in order
to determine the best match. This basic method is still used today in the form of look-up
tables for information retrieval in established and low variability situations. The main
flaw of the template matching is its susceptibility to noise and transformations within
the input patterns. In order to build more reliable pattern classifiers, intensive research
has been done to build robust classification models that make use of the meta-data of
sample datasets. Essentially, pattern classifiers are categorized as either parametric or
non-parametric models. Parametric classifiers require the statistical information of the
sample pattern variables to build the decision function, while the non-parametric
classifiers learn directly from the sample patterns to build empirical discriminators. The
discussions in this chapter are focused on the development in the field of supervised
non-parametric pattern classifiers only, specifically, the neural networks based pattern
classifiers.
Neural networks have been used extensively in pattern classification problems
ever since the 1970s. A neural network consists of highly parallel connections between
layers of nodes that function as the localized processing units. Its massive parallel
architecture allows it to learn complex nonlinear mappings that cannot be captured by
linear modelling methods. However, the main reason why neural networks have gained
so much interest from researchers is their simple implementation in real world
situations. Unlike the conventional statistical decision theory based classifiers, neural
networks belong to a class of non-parametric classifiers that learn from samples of a
system only without needing to know the statistical model of the pattern variables. After
training a neural network using some learning algorithm and a sample dataset, the
neural network is able to classify new unseen samples with good generalization and
accuracy. It was noted in [179] that neural networks hide the statistics of decision-making
theory from the user and hence allow the technique to be appreciated by a wide audience.
There exist many modern applications of neural networks based pattern classifiers that
were triggered by the increasing availability of large scale datasets and the need to make
sense of them. Successfully implemented applications include medical systems [66-70],
[158-162], image processing [163-165], face recognition [166], [167], handwriting
recognition [71], [72], sales forecasting [168-170], parameter estimation [171-173],
information systems [76], [174], [175], and control systems [176], [177].
Recently, a novel machine learning algorithm called the extreme learning
machine (ELM) was proposed by Huang et al. [3]. The ELM is a fast and highly
effective learning algorithm designed for single hidden layer feedforward neural
networks (SLFN). The uniqueness of the ELM lies in the random selection of hidden
layer weights that do not require any tuning, and the analytical solution of the output
layer weights using the pseudo-inverse. Obviously, the training procedures of the ELM
differ significantly from the conventional iterative methods used to train neural
networks such as the backpropagation (BP) method. Many researchers [3], [18], [23-25],
[90], [147] have reported that the ELM is capable of learning thousands of times faster
than the BP methods and tends to perform better in terms of classification accuracy and
generalization on unseen samples. Some researchers have also compared the ELM to
the popular support vector machine (SVM) commonly used for classification problems.
It was found in [180] that the ELM is capable of learning much faster than the SVM,
and the generalization performance of the ELM is comparable or better than the SVM
when the sample size is large. In terms of computational costs, the ELM is seen to be
the clear choice for handling large datasets when compared to the SVM [147], [180].
Furthermore, there are also an increasing number of studies done to compare the
learning theories of ELM and SVM. In [147], it was reported that the LS-SVM
optimization framework is related to the ELM and the LS-SVM optimization solutions
are actually suboptimal compared to the ELM due to the LS-SVM’s stricter
optimization conditions. The ELM has been proven multiple times to perform better
than the SVM in multi-class pattern classification [3], [147] and is recognized as the
state-of-the-art by many. A large number of researchers have proposed modified
versions of the ELM training algorithm to either further improve its performance or
customize the learning process for specific applications [23], [24], [33-65]. From the
fast growing literature on the ELM, it can be easily seen that the ELM has been
successfully implemented in a wide range of real world classification problems [66-72],
[76], [158-177].
The aim of this chapter is to provide a review of the development of ELM and
its extensions in the field of pattern classification, as well as to present some discussions
on the issues encountered within the ELM literature. The rest of this chapter is
organized as follows: Section 2.2 gives a brief overview of statistical classifiers. Section
2.3 introduces the SLFN and the conventional neural network training algorithms.
Section 2.4 provides a thorough survey of the ELM and its learning theories. Section 2.5
examines the regularized versions of the ELM. Sections 2.6 and 2.7 analyse the
relationships between the ELM and the SVM and the RVFL, respectively. Section 2.8
surveys the applications of ELM in various fields, with a focus on medical pattern
classification problems. Lastly, Section 2.9 gives the conclusions.
2.2 Statistical Classifiers
The objective of the research in statistical classifiers is to construct the optimal classifier
given the accurate statistical model of the variables that make up the sample input
pattern. Direct analytical solutions for such optimal classifiers are obtainable when the
full statistical parameters are available. What is more, the statistical classifiers allow for
generalization to new unseen samples, which is a highly desirable characteristic. For
this reason, many early applications implement statistical classifiers for complex cases
where it is impossible to catalogue all possible pattern mappings. Two commonly
known statistical classifiers are first introduced in this section, then the relationship
between the statistical classifiers and neural networks is discussed in brief to explain
why neural networks can achieve good performance as pattern classifiers.
One of the basic statistical classification algorithms is known as the Fisher’s
linear discriminant analysis (FLDA) [181]. The FLDA projects high dimensional
sample pattern vectors into a reduced dimensional space using linear transformations. It
is desired that the linear transformations chosen will minimize the intra-class separation
and maximize the inter-class separation, thereby creating a well clustered feature space
with multiple dis-joint regions for classification. The linear transformation matrix is
calculated using the first and second order statistics of the classes. In multi-class cases,
the FLDA is generally known as the linear discriminant analysis (LDA). A great
limitation of the LDA is that it only works well for linearly separable cases, where the
linear transformation creates well separated sample distributions after projection. Hence
the LDA is often used as a dimension reduction tool instead before applying other
nonlinear pattern classifiers such as the SVM [182].
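The projection step described above can be sketched in a few lines. The following Python/NumPy fragment (an illustrative sketch with synthetic two-class data, not from the thesis experiments) computes the Fisher direction from the class means and the within-class scatter, then classifies by thresholding the 1-D projections:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two synthetic 2-D classes (illustrative data only)
X1 = rng.normal([0, 0], 0.5, size=(50, 2))
X2 = rng.normal([2, 2], 0.5, size=(50, 2))

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
# Within-class scatter: sum of the (unnormalized) class covariances
Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
# Fisher direction: minimizes intra-class spread relative to inter-class separation
w = np.linalg.solve(Sw, m1 - m2)

# Project both classes onto w; a midpoint threshold separates them
p1, p2 = X1 @ w, X2 @ w
threshold = (p1.mean() + p2.mean()) / 2
accuracy = (np.sum(p1 > threshold) + np.sum(p2 < threshold)) / 100
```

On linearly separable data like this the projection yields two well-clustered, disjoint intervals; on non-linearly separable data the same linear projection fails, which is exactly the limitation noted above.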
Another well-known statistical classification algorithm is the Bayesian Classifier
(BC) [103]. Different from the LDA which only uses the prior probabilities of sample
variables, the BC uses the a posterior probabilities to arrive at a classification decision.
Given the prior probabilities P(ω₁) and P(ω₂) and the conditional probability density
functions p(x|ω₁) and p(x|ω₂) in a binary case, the a posteriori probabilities (2.2.1) and
(2.2.2) of a new sample x are interpreted as the likelihood that the sample belongs to a
specific class (ω₁ or ω₂) after the sample is collected.

P(ω₁|x) = p(x|ω₁)P(ω₁) / p(x)    (2.2.1)
P(ω₂|x) = p(x|ω₂)P(ω₂) / p(x)    (2.2.2)

where p(x) = p(x|ω₁)P(ω₁) + p(x|ω₂)P(ω₂). The maximum likelihood rule in the
Bayesian theory states that we should choose the class that yields the largest a posteriori
probability as the output class [103].

ω* = arg maxᵢ P(ωᵢ|x)    (2.2.3)
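This decision rule can be illustrated numerically. The sketch below (Python/NumPy, with hypothetical univariate Gaussian class-conditional densities and priors chosen purely for illustration) computes the a posteriori probabilities and chooses the class that maximizes them:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density used as a hypothetical p(x | class)."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical binary problem: priors P(w1), P(w2) and (mean, std) per class
prior = {1: 0.6, 2: 0.4}
params = {1: (0.0, 1.0), 2: (2.0, 1.0)}

def posterior(x):
    """P(class | x) = p(x | class) P(class) / p(x), by Bayes' rule."""
    joint = {c: gaussian_pdf(x, *params[c]) * prior[c] for c in prior}
    evidence = sum(joint.values())  # p(x), the normalizing term
    return {c: joint[c] / evidence for c in prior}

def classify(x):
    """Pick the class with the largest a posteriori probability."""
    post = posterior(x)
    return max(post, key=post.get)
```

A sample near the first class mean (e.g. x = −1) is assigned to class 1, while a sample near the second mean (e.g. x = 3) is assigned to class 2.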
It is worth noting that the BC produces an inference from the available statistics
only, without any transformation of the sample patterns. Therefore its applications are
limited to finding the optimal threshold or decision boundary in the feature space.
Both the LDA and the BC require that the statistical parameters and probability
distribution functions related to the sample variables be available and accurate. In
practical applications, these parameters are usually not known and have to be estimated
from the sample dataset. Considering the effects of heteroscedasticity in any random
sample set, it is obvious that the sample statistics may not estimate the population
statistics well. Therefore, it is hard to implement accurate statistical classifiers in real
world applications.
Neural networks are known as one of the most popular non-parametric
classifiers. However, it has been long established that the neural classifiers and
statistical classifiers share many similarities, and the link between neural classifiers and
statistical classifiers has been thoroughly investigated by researchers [145]. The simple
illustration of this link by Richard and Lippmann [183] provided great insight into the
black-box like operation of neural classifiers. It was described in [145], [183] that a
neural classifier can be viewed as a mapping function F(x): ℝᵈ → ℝᴹ, where the d-
dimensional input x is mapped to an M-dimensional output y. Typical neural network
training algorithms such as the BP perform least squares optimization to minimize the
mean-squared-error (MSE). Based on the statistical least squares estimation theory, the
mapping function F(x) which minimizes the expected squared error E[‖t − F(x)‖²] is
actually the conditional expectation of t given x, hence F(x) = E[t|x]. If the
classification targets are set as the one-of-C format (one element is unity, the other
elements are zero), the following proof [183] is then valid for the j-th element of the
output:

Fⱼ(x) = E[tⱼ|x]
      = 1 · P(tⱼ = 1|x) + 0 · P(tⱼ = 0|x)
      = P(tⱼ = 1|x)
      = P(ωⱼ|x)    (2.2.4)
It can be seen that the least squares estimate of the mapping function is exactly
the a posterior probability. This significant finding proves that the neural classifier
outputs are actually estimates of the a posteriori probabilities defined in statistical
classifiers. The reason why the neural networks based methods do not directly give
the a posteriori probability as in (2.2.1) is because most neural network training
algorithms are not optimal in the least squares sense. Training issues such as
convergence to local minima and limited sample size restrict the approximation
capabilities of the neural network. In order to obtain the best estimate, the neural
network is said to require a large network architecture and unlimited samples. When
such conditions are met, the sum of the neural classifier outputs tends to approach unity.
In a conventional multilayer perceptron (MLP) based neural classifier, the hidden
layers act as the feature mapping stage while the output layer acts as the feature space
decision function. The statistical equivalent of the neural classifier can be seen as a
combination of the LDA and BC. The hidden layers of the MLP are found to transform
sample patterns into clusters [184], which is similar to the function of the LDA.
However, the feature mapping capabilities of the LDA are limited compared to the
nonlinear mappings of the MLP hidden layers which are capable of nonlinear
discriminant analysis. The superior feature mapping capability of the neural classifiers
helps to explain why they consistently perform better than linear discriminant methods.
The output layer of the MLP, which is conventionally trained by BP is then the least
squares estimate to the optimal decision boundary in BC. Nevertheless, the limited
number of samples, the existence of noise and outliers, the lack of a large enough
architecture, and the randomness of conventional BP based neural classifiers tend to
produce results with high variance compared to the optimal BC decision boundary.
The ELM utilizes the SLFN which is a specific type of MLP with only one
hidden layer. Instead of using the BP method which produces varying models
depending on the initialization of the SLFN, the ELM uses the pseudo-inverse to
analytically determine the output layer weights. As a result of that, the output layer
weights of the ELM tend to achieve the smallest training error and the smallest norm.
Therefore it can be said that the ELM provides a better and more stable estimate of the a
posteriori probability compared to the BP in terms of optimization strategy with all
other conditions being similar.
2.3 Single Hidden Layer Feedforward Neural Networks
Neural networks belong to a class of non-parametric machine learning paradigms that
are capable of both classification and regression. One of the most researched neural
networks based pattern classifiers is the SLFN. The SLFN is the main focus of this
research, where the analysis and discussions concentrate on feedforward neural
networks that allow the flow of information in only one direction (left to right) in the
network. A SLFN with 4 input nodes {x₁, x₂, x₃, x₄}, 1 hidden layer with 4 neurons
{h₁, h₂, h₃, h₄}, and 1 output neuron {O₁} connected to the output node {y} is shown in
Figure 2.1; the bias terms are omitted here for clarity. The hidden layer weights are
represented by W and the output layer weights are represented by β.
Figure 2.1 A single hidden layer feedforward neural network
The SLFN in Figure 2.1 can be modelled by the equation in (2.3.1),
y(x) = β g(Wx)    (2.3.1)

where x = [x₁ x₂ x₃ x₄]ᵀ is the input vector, W is the hidden layer weights matrix, β is
the output layer weights matrix, N is the number of hidden neurons (here N = 4), and
g(·) is the activation function of each hidden layer neuron. The hidden layer activation
functions are usually selected as sigmoids and similar activation functions are assumed
for every neuron in a layer. The output activation functions are normally linear. For the
case where the bias is defined explicitly, the equation (2.3.2) models the same SLFN,
where b is the bias vector. In the discussions of the later chapters, the SLFN
representation in (2.3.1) is normally used, unless specifically noted.

y(x) = β g(Wx + b)    (2.3.2)
From the pattern classification point of view, the SLFN output layer usually
consists of as many neurons as the available classes (i.e. y = [y₁, …, y_C]ᵀ, C = number of
classes), where each neuron represents the “active” or “inactive” state of a class. The
corresponding training target for each sample input pattern is usually generated from the
1 of C template, which sets a 1 for the target class (output neuron) and leaves the others
as 0. With this setup, the discriminant function at the output of the neural classifier
needs to only choose the neuron with the largest output value as the predicted output
class. The scalable size of the output vector allows the neural network to be trained for
both binary and multi-class pattern classification problems using the same learning
algorithm.
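The 1-of-C target encoding and the output-layer decision rule described above can be sketched as follows (an illustrative Python/NumPy fragment; the function names are hypothetical, not from the thesis):

```python
import numpy as np

def one_of_c(labels, num_classes):
    """Encode integer class labels as 1-of-C target vectors:
    a 1 for the target class neuron, 0 for all the others."""
    T = np.zeros((len(labels), num_classes))
    T[np.arange(len(labels)), labels] = 1
    return T

def predict_class(network_outputs):
    """Discriminant: pick the output neuron with the largest value."""
    return np.argmax(network_outputs, axis=1)

T = one_of_c([0, 2, 1], 3)
# Rows of T: [1,0,0], [0,0,1], [0,1,0]
preds = predict_class(np.array([[0.9, 0.2, 0.1],
                                [0.1, 0.3, 0.8]]))
```

The same encoding and argmax decision work unchanged for binary and multi-class problems, which is what allows one learning algorithm to cover both cases.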
Many researchers have proposed novel methods to determine the ideal network
parameters and neuron weights for a given SLFN to properly learn a training sample set.
All the training algorithms can be broadly categorized into either supervised or
unsupervised learning. In supervised learning, the training dataset consists of sample
patterns and their corresponding class labels. The class labels act as the target values for
which the outputs of the trained SLFN will be compared against to determine the
training error. This error can then be utilized to further tune the SLFN weights to
produce better results. On the other hand, unsupervised learning requires no class labels
and the training algorithm works with the sample patterns alone to determine unique
structures or relationships among samples. This thesis intends to develop robust
supervised training algorithms for the SLFN. A detailed discussion on the three main
approaches [18] used in the design of supervised training algorithms for the SLFN is
given in the following sub-sections.
2.3.1 Gradient Descent Based Algorithms
The most popular training algorithms used for the neural networks in general are based
on the gradient descent method. The BP method is the most prominent representation of
this class of training algorithm to date. It was developed by Rumelhart et al. [8] in the
1980s and has since been the cornerstone of neural networks training. It is known as the
first training algorithm to iteratively update the neuron weights in the multilayer
perceptron (MLP) to reduce the output error with respect to a target. Before this, the
error correction learning algorithm for perceptrons only worked for a single processing
element. The importance of the BP algorithm lies in its ability to extend and scale the
training process to suit any neural network architecture. In order to perform training
using the BP algorithm there must exist some error term, which is usually calculated as
the difference between the network output and the intended target value. The additive
type of hidden layer neurons is most commonly used in these algorithms.
Figure 2.2 The backpropagation training algorithm [2]
Figure 2.2 shows an example of the BP training algorithm. It is seen that the BP
training algorithm described in Figure 2.2 employs the sequential update method, where
the neuron weights are modified based on the error produced by each input pattern
directly. Another popular weight update rule called batch update introduces a slight
change to the update procedure. In the batch update process, the whole set of sample
patterns are presented to the network and the errors from the respective samples are
summed together to produce an aggregated error term that is used to update the neuron
weights; the iteration through the entire set of input patterns is called an epoch.

[Figure 2.2, transcribed: Given a sample set {X, T} of size n and a SLFN with a given
number of hidden neurons and 1 output neuron — initialize the neuron weights, biases,
target error, learning rate, and maximum iterations MaxIter; for each iteration, present
an input pattern, calculate the output and the error, update the output layer neuron
weights, calculate the propagated error for the hidden layer neurons and update the
hidden layer weights, and stop when the total error for an epoch is small enough or
when MaxIter is reached. The result is a trained SLFN with its neuron weights and
biases.]

It has
been shown in [2] that for small values of the learning rate η both the sequential update
and batch update produce very similar weights updates for a complete cycle of the input
pattern set. However, in terms of computational efficiency, the sequential update is
faster and therefore is the preferred method for practical implementations.
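The sequential update procedure can be sketched for a small SLFN. The fragment below is an illustrative Python/NumPy sketch on toy XOR data (not one of the thesis algorithms); it performs per-pattern BP updates with a sigmoid hidden layer and a linear output neuron, stopping early when the epoch error is small:

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy XOR data (illustrative only)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

n_hidden, eta = 5, 0.1
W = rng.normal(0, 1, (n_hidden, 2))   # hidden layer weights
b = rng.normal(0, 1, n_hidden)        # hidden layer biases
beta = rng.normal(0, 1, n_hidden)     # output layer weights (linear output neuron)

sse_history = []
for epoch in range(3000):
    sse = 0.0
    for x, t in zip(X, T):            # sequential update: one pattern at a time
        h = sigmoid(W @ x + b)        # hidden layer outputs
        y = beta @ h                  # network output
        e = t - y                     # output error
        sse += e * e
        delta = e * beta * h * (1 - h)  # error propagated back to the hidden layer
        beta += eta * e * h           # output layer update
        W += eta * np.outer(delta, x) # hidden layer update
        b += eta * delta
    sse_history.append(sse)
    if sse < 1e-3:                    # stop when the epoch error is small enough
        break
```

Note that the hidden-layer error term is computed before the output weights are updated, matching the standard per-pattern BP ordering.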
A large number of modified and improved BP based training algorithms have
been proposed by researchers. These interesting developments have found various
applications that span numerous disciplines. However, there are several issues that
inhibit the large scale adoption of neural networks in real world applications. These
issues are among the essential problems that motivated the research performed in this
thesis. The list below outlines the salient issues that will be addressed herein.
(a) Convergence to local minima
As the BP method uses the information from the gradient of the error cost function
to descend to some minimum, the convergence of the algorithm to the global
minimum very much depends upon the location of the initialized weights. This
phenomenon applies to non-convex cost functions where there is more than one
minimum point. It should be noted that the probability of having a non-convex cost
function is high among real-world problems which are normally complex in nature.
If the initialized weights position is far away from the global minimum, the
optimization process will likely be trapped in one of the local minima of a non-convex
cost function. An illustrative example of this case is shown in Figure 2.3, where only
one of the initialized weight positions converges to the global minimum. It is important
to point out that since the cost function E(w) with respect to the neuron weights w is
generally unknown in practical situations, there is no way to confirm whether the
gradient descent process has converged to the global minimum.
Figure 2.3 (a) Gradient descent on a convex cost function, (b) Gradient descent on a
non-convex cost function with randomly initialized weights
(b) Long training time
The BP based training algorithms iteratively tune the weights of the SLFN to form
suitable decision boundaries on the pattern space. Such a process is slow and time
consuming. Moreover, estimating the time or iterations required for the algorithm to
converge to a minimum is non-trivial. Several issues contribute to the total time
required to find the optimal weights for the SLFN such as the selection of the
learning step size, the features of the training algorithm, the complexity of the
network architecture, the size of the training set, and the separability of the sample
patterns. It is common that complex problems with regular network architectures
such as the SLFN may take hours or even days to be solved.
(c) Numerous training parameters
From the above discussions on the issues affecting the training time, it is clear that
the number of free parameters that need to be tuned is daunting. In practical
applications, most of these are selected based on trial and error or user experience. It
is hard to guarantee the optimality in such training procedures unless all possibilities
are tested. However, there are several generally agreed upon guidelines which act as
the first resort. The SLFN is one such network, with elaborate literature describing
its ability to function as a universal approximator, as well as a universal
approximation theorem that suggests that one hidden layer is enough to gain most of the
the advantages offered by neural networks.
2.3.2 Standard Optimization Method Based Algorithms
In the attempt to address the problem of non-ideal solutions produced by the BP based
training algorithms, researchers have proposed alternative optimization methods for the
SLFN. The support vector network, or support vector machine (SVM) [19], [20] by
Cortes and Vapnik is one of the leading optimization based training algorithms which
makes use of the method of Lagrange multipliers. The support vector network proposes
that only the weights connected to the output layer need to be determined. In essence,
the input patterns are first non-linearly mapped to a high dimensional feature space
using some kernel function. Then a linear decision surface is constructed in the feature
space to best separate the classes. Different from the BP methods, the non-linear
mapping function of the support vector network is fixed a priori, and the optimization
method is used to deterministically compute the output layer decision surface with the
best generalization ability.
Although the research related to SVM training algorithms provided significant
findings in terms of the optimal decision surface, there exist two unavoidable drawbacks.
It is generally known that selecting the ideal non-linear mapping function is a non-
trivial task. The non-linear function has to transform the input patterns to the best
separable feature space in order to achieve the optimal generalization performance.
Furthermore, given the complex nature of the optimization procedure and the
polynomials of high degrees that are normally used to form decision surfaces, the
training time is considerably long. What is more, as the training algorithm is
deterministic, each support vector has to be considered simultaneously during the
optimization process and this proportionally increases the training time.
2.3.3 Least Squares Based Algorithms
The radial basis function (RBF) networks are prime examples of training algorithms that
determine the neural network weights using least squares. The RBF networks are a
special case of the SLFN where the hidden layer neurons are substituted using radial
basis functions such as the Gaussian function. An example of a RBF hidden node is
shown below
hᵢ(x) = g(bᵢ ‖x − cᵢ‖)    (2.3.3)

where x is the input pattern vector, cᵢ is the centre of the i-th node, bᵢ is the impact
factor or bias term, g(·) is the non-linear radially symmetric activation function, hᵢ(x) is
the output of the i-th hidden node, and ‖·‖ is the L2-norm. In the RBF networks [21],
[22], the parameters of each hidden node are linear and fixed, and the output is given by
a radially symmetric function of the distance metric between the input pattern and the
centre. Therefore, the RBF network hidden layer performs a fixed non-linear
transformation of the input space to the feature space. The only free parameter is then
the output layer weights that can be determined using the least squares method.
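A minimal RBF network along these lines can be sketched as follows (an illustrative Python/NumPy fragment with Gaussian nodes, centres drawn randomly from synthetic 1-D data, and a shared impact factor; only the output weights are free and are solved by least squares):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 1-D regression target (illustrative only)
X = np.linspace(-3, 3, 60).reshape(-1, 1)
t = np.sin(X).ravel()

# Fixed hidden layer: Gaussian RBF centres picked randomly from the data
centres = X[rng.choice(len(X), 10, replace=False)]
sigma = 1.0  # shared impact factor, chosen for illustration

# Hidden outputs: phi_i(x) = exp(-||x - c_i||^2 / (2 sigma^2))
D = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
Phi = np.exp(-(D ** 2) / (2 * sigma ** 2))

# Only free parameters: output weights, found by least squares
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)
mse = np.mean((Phi @ w - t) ** 2)
```

Because the hidden layer is a fixed non-linear transformation, the whole training step reduces to one linear least squares solve.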
The difficulties in implementing the RBF network reside in selecting the optimal
centres and activation function. Conventionally, the centres are selected randomly from
the training data, but this selection method often requires very large networks. Lowe [21]
proposed adaptive tuning of the centres using Quasi-Newton methods similar to the
iterative BP. Chen et al. [22] later proposed the orthogonal least squares learning
algorithm to determine the optimal centres of the RBF nodes. The orthogonal least
squares method avoids the over-size and ill-conditioning problems that occur frequently
in the random selection of centres. However, it should be noted that longer training time
is required to perform optimization on the RBF hidden layer.
In summary, the literature reveals that the generalization capability, the selection
of free parameters, and the required training time are the major concerns in the design of
robust and computationally efficient training algorithms for the SLFN.
2.4 Extreme Learning Machine
Recently, a new machine learning algorithm called the extreme learning machine (ELM)
was proposed by Huang et al. [3]. Different from the conventional training algorithms
for SLFNs such as the BP method, the ELM proposed that only the output layer weights
need to be tuned. The basic ELM scheme assigns the hidden layer weights and biases
randomly prior to the training stage. Then the input patterns are non-linearly
transformed into the feature space by the hidden layer function, producing the hidden
layer output matrix H, such that Hβ = T, where T represents the target matrix and β the
output layer weights.

Hᵢⱼ = g(wⱼᵀxᵢ + bⱼ)    (2.4.1)

where wⱼ and bⱼ are the randomly assigned weights and bias of the j-th hidden neuron
and xᵢ is the i-th input pattern. Finally the output layer weights are calculated by the
method of least squares using the Moore-Penrose pseudo-inverse H† of H. The
convention in [3] is adopted here for readability.

β = H†T    (2.4.2)
In summary, the basic ELM training algorithm can be defined as:
1) Given a training data set {X, T}, randomly assign the hidden layer weights and
biases of the SLFN.
2) Calculate the hidden layer output matrix H.
3) Solve for the output layer weights using the Moore-Penrose pseudo-inverse, where
β = H†T.
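The three steps above can be sketched directly (an illustrative Python/NumPy fragment with synthetic two-class data; the sigmoid activation, the network sizes, and the function names are chosen for illustration and are not the thesis code):

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def elm_train(X, T, n_hidden):
    """Basic ELM: random hidden layer, pseudo-inverse output weights."""
    W = rng.normal(size=(X.shape[1], n_hidden))  # step 1: random input weights
    b = rng.normal(size=n_hidden)                # step 1: random biases
    H = sigmoid(X @ W + b)                       # step 2: hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                 # step 3: Moore-Penrose solution of H beta = T
    return W, b, beta

def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Two well-separated Gaussian blobs with 1-of-C targets (illustrative data)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
T = np.vstack([np.tile([1, 0], (50, 1)), np.tile([0, 1], (50, 1))])

W, b, beta = elm_train(X, T, n_hidden=20)
preds = np.argmax(elm_predict(X, W, b, beta), axis=1)
labels = np.repeat([0, 1], 50)
train_acc = np.mean(preds == labels)
```

There is no iteration anywhere: the only tuned parameters are the output weights, obtained in a single pseudo-inverse computation.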
Significant improvements in training speed and generalization performance have
been reported for the ELM [18]. It is seen that the assignment of the hidden layer
weights and biases using randomly generated values reduces the model complexity
tremendously in terms of the number of parameters that require tuning. Furthermore, the
analytical solution of the output layer weights in (2.4.2) converts the output weights
optimization problem into solving a system of linear equations. The solution of the
linear system using the pseudo-inverse produces the smallest training error
and the smallest norm (including the overdetermined and underdetermined cases).
Achieving the smallest norm actually maximizes the generalization ability of the SLFN
classifier by reducing the output variance in the testing phase. In addition, Bartlett’s
theory [185] also suggests that for neural networks achieving small training error, the
smaller the norm of the network weights the better the performance.
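The smallest-norm property of the pseudo-inverse solution can be checked numerically. In the underdetermined sketch below (illustrative Python/NumPy; the matrix sizes are arbitrary), any other exact solution differs from the pseudo-inverse solution by a null-space component of H and therefore has a larger norm:

```python
import numpy as np

rng = np.random.default_rng(4)

# Underdetermined case: 3 training samples, 6 hidden neurons
H = rng.normal(size=(3, 6))   # hidden layer output matrix
t = rng.normal(size=3)        # targets

beta = np.linalg.pinv(H) @ t  # minimum-norm least squares solution

# Construct another exact solution by adding a null-space direction of H
null_dir = np.linalg.svd(H)[2][-1]  # last right singular vector: H @ null_dir ~ 0
other = beta + 0.5 * null_dir
```

Both `beta` and `other` fit the targets exactly, but `beta` lies in the row space of H and so has the smaller norm, which is the property Bartlett's theory connects to generalization.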
The reduction in the number of parameters to tune also allows the SLFN training
to be less sensitive to human intervention. The only training parameter to specify for the
ELM is the number of neurons, while the activation functions are assumed to be
sigmoidal with additive type of hidden nodes. Hence the ELM training algorithm is
much simpler than the BP and SVM, where the model initialization in both the latter
cases requires educated estimates of the solutions. Unlike an earlier proposed randomly
initialized hidden layer weights method in [21], which actually uses randomly selected
training patterns, the ELM hidden layer weights are completely independent of the
sample dataset. In fact, the ELM hidden layer weights and biases are usually pre-
generated without knowledge of the input patterns and output target variables.
Although the ELM was initially designed for the SLFN, it was later extended to
generalized SLFNs which allow non-neuron-alike hidden nodes [23], [24]. For
instance, the ELM literature provides theoretical proof that hard-limiting functions,
such as the threshold function, which are not infinitely differentiable, are still capable
of universal approximation [93]. As more function types become applicable to the ELM,
the hidden layer of the ELM can be seen to function as an arbitrary feature mapping layer. This
observation bears strong similarities to the operation of the kernel functions in the
SVM training algorithm. Considerable research effort has been put into the comparison
and integration of the ELM and SVM, and the findings are discussed in detail in section
2.6.
Lastly, it is worth noting that several studies have examined the feasibility of
the ELM learning algorithm [186], [187]. In [186], Wang et al.
proposed experiments to compare the direct-input SLFN (without any hidden layer)
with the random-weights ELM. The experimental results confirm that the randomly
generated weights do have some positive effects on the SLFN for many classification
and regression tasks. Horata et al. [187] studied various methods to compute the
pseudo-inverse, as it is the most time-consuming step; the basic ELM typically uses
the SVD to compute it. They subsequently proposed QR-ELM and GENINV-ELM, which
significantly reduce the computational time required to calculate the pseudo-inverse
while maintaining acceptable levels of accuracy. The following sub-section examines
the learning theories developed for the ELM.
2.4.1 Learning Theories of ELM
The universal approximation and interpolation capabilities of neural networks have been
thoroughly researched over the past decades [3-7], [23-26]. Broomhead and Lowe [26]
suggested that feedforward neural networks may be viewed as performing a simple
curve-fitting operation in high dimensional space. Learning is then an attempt to
produce a best fit surface in the high dimensional space using a finite set of data points
(sample patterns). Generalization can then be seen as interpolating between the data
points on the fitted surface. A brief review of the learning theories developed for the
SLFN will be given below.
The early works on universal approximation stem from the initial problem of
finding a group of single-variable functions to represent a high-order polynomial
function. The solution to this problem was proposed by Kolmogorov, who proved that
"any continuous function defined on an n-dimensional cube is representable by sums
and superpositions of continuous functions of exactly one variable" [2], with the
mathematical model shown in (2.4.3), where f(x) is the target function with input
variables x = {x_p | p = 1, …, n}, ψ_qp(·) are the individual continuous functions of exactly one
variable, and the sums of their outputs are superimposed through the continuous
outer functions Φ_q:

f(x) = ∑_{q=1}^{2n+1} Φ_q [ ∑_{p=1}^{n} ψ_qp(x_p) ]   (2.4.3)
The function in (2.4.3) shows great similarities with the mathematical model of
the SLFN in (2.4.4), where f(x_j) is the output function for the j-th input vector x_j from
a total of N sample vectors, w_i and b_i are the weights and biases of the hidden layer
neurons with activation function g(·), and β_i are the output layer weights. The
number of neurons in the hidden layer and output layer are set as the same here for
convenience.

f(x_j) = ∑_{i=1}^{Ñ} β_i g(w_i · x_j + b_i)   (2.4.4)
After Kolmogorov's theorem was conceived, Cybenko [4] and Funahashi [5] in
the late 1980s used it as a means to derive rigorous proofs that there exists a solution for
the SLFN with any continuous sigmoidal non-linear activation function to approximate
any continuous function on a compact sample set with uniform topology. Hornik et al.
[6] then proved that MLPs in general, including the SLFN, are capable of universal
approximation using any arbitrary bounded non-constant activation function, provided
that a sufficient number of hidden neurons is available. In the early 1990s, Leshno et al.
[7] provided the general proof that the SLFN is capable of universal approximation with
any locally bounded piecewise continuous activation function as long as the activation
function is not a polynomial. An important deviation from Kolmogorov's theorem
found in the above proofs is that instead of allowing for different non-linear functions,
the proofs involving the SLFN use the same non-linear function (neuron activation
function) for all hidden layer neurons. The universal approximation theorem below is
derived from the works of these early researchers [4-6]:
Theorem 2.1: Let φ(·) be a non-constant, bounded, and monotone-increasing
continuous function. Let I_n denote the n-dimensional unit hypercube [0, 1]ⁿ, and let the
space of continuous functions on I_n be denoted by C(I_n). Then given any function
f ∈ C(I_n) and ε > 0, there exist an integer N and sets of real constants α_i, b_i, and w_ij,
where i = 1, …, N and j = 1, …, n, such that we may define

F(x_1, …, x_n) = ∑_{i=1}^{N} α_i φ( ∑_{j=1}^{n} w_ij x_j + b_i )   (2.4.5)

as an approximate realization of the function f, where |F(x_1, …, x_n) − f(x_1, …, x_n)| < ε
for all x_1, …, x_n ∈ I_n.
All the above contributions, including Theorem 2.1, provide existential proof
that there exists some combination of neurons that, when put together correctly, is able
to approximate any continuous function. However, there are no constructive instructions
on how these "best case" networks can be found. Hence numerous training algorithms
have been developed to search the weight space for the ideal set of neuron weights for a given
network architecture to produce the best approximation to a target function.
Unfortunately, there is no short-cut to find out whether the best solution has been found
without testing the entire set of combinations, but a solution is generally acceptable when
the final testing error is negligible for the intended application (i.e. the learning error
E < ε, where ε is a small positive number).
In all of the SLFN learning theories mentioned above, the meta-parameters
(network weights, activation functions, training variables) are required to be freely
adjustable. As a consequence, the training time increases quickly with complex
algorithms that require a lot of tuning. The ELM does not require any tuning at the
hidden layer; in fact, in the typical implementation of the ELM the hidden layer
parameters can be generated randomly. The output layer weights can then be determined
analytically using the pseudo-inverse. It differs significantly from the conventional BP
method in that it does not involve any iterative tuning at any stage of training the SLFN.
The learning performed by the ELM on a finite set of sample data is initially based on the
interpolation theory [3], [18]. In terms of learning a target function, the interpolation
theory and universal approximation are two different approaches to the same problem. In the
interpolation theory, the network should learn all the sample data points well, so that
these data points can be reproduced exactly, and any query outside the set
of training points can be estimated by interpolation or extrapolation from existing points.
The universal approximation theorem does not demand that the sample data points be
learned exactly; rather, it proposes that the total error over the entire sample space be
minimized. However, it should be noted that both interpolation and universal
approximation capabilities allow a neural network to learn a continuous function
within a dense set given a large enough number of neurons, and their differences are
significant only when the number of neurons is limited.
The first function approximation theory for the ELM was introduced in [3],
using the interpolation theory. For a dataset with N sample patterns {x_i | x_i ∈ Rⁿ,
i = 1, …, N} and corresponding target vectors {t_i | t_i ∈ Rᵐ, i = 1, …, N},
such that the SLFN attempts to map x_i → t_i, the SLFN with Ñ neurons can be
written as

f(x_j) = ∑_{i=1}^{Ñ} β_i g(w_i · x_j + b_i),  j = 1, …, N   (2.4.6)

where f(x_j) is the SLFN output vector, β = {β_i | β_i ∈ Rᵐ, i = 1, …, Ñ} are the
output layer weights, w = {w_i | w_i ∈ Rⁿ, i = 1, …, Ñ} are the hidden layer weights,
b = {b_i | b_i ∈ R, i = 1, …, Ñ} are the biases, and g(·) is the activation function. The
function in (2.4.6) can be written compactly in matrix form as Hβ = T, where

H = [ g(w_1 · x_1 + b_1)  ⋯  g(w_Ñ · x_1 + b_Ñ)
              ⋮                      ⋮
      g(w_1 · x_N + b_1)  ⋯  g(w_Ñ · x_N + b_Ñ) ]  (N × Ñ)   (2.4.7)

β = [β_1ᵀ, …, β_Ñᵀ]ᵀ (Ñ × m) and T = [t_1ᵀ, …, t_Nᵀ]ᵀ (N × m)
It has been shown in [3] that, from the interpolation theory point of view, if the
activation function g(·) is infinitely differentiable in any interval, the hidden layer
weights and biases can be randomly generated. Different from the previous work that
randomly selects weights from the training data [21], the random weights generated for
the ELM can be completely data independent. Theorem 2.2 [3] below states the result.

Theorem 2.2: Given any small positive value ε > 0, an activation function g: R → R
which is infinitely differentiable in any interval, and N arbitrary distinct samples
(x_i, t_i) ∈ Rⁿ × Rᵐ, there exists Ñ ≤ N such that for any {(w_i, b_i)}_{i=1}^{Ñ} randomly generated
from any intervals of Rⁿ × R, according to any continuous probability distribution, with
probability one, ‖H_{N×Ñ} β_{Ñ×m} − T_{N×m}‖ < ε.
Theorem 2.2 shows that the ELM is capable of learning the sample points
efficiently with a small amount of error. The SLFN trained according to this theorem is
known as the approximate interpolation network, where the points are learned with
some error. However, the authors in [3] also proved that for the case of N samples, it is
possible for the ELM to learn the samples exactly if Ñ = N, such that
‖H_{N×N} β_{N×m} − T_{N×m}‖ = 0. This extension of Theorem 2.2 is known as the exact interpolation network,
where there exists a set of network parameters that can learn exactly N sample points
with N neurons. The proof provided for this theorem relies on the definite probability of
finding a set of network parameters that makes the hidden layer output matrix full rank,
and hence guarantees invertibility.
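The exact interpolation case is easy to reproduce numerically: with as many hidden neurons as distinct samples, a randomly generated H is square and, with probability one, invertible, so the targets are recovered to machine precision (a toy check; the weight range and target function are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
N = 10                                   # number of distinct samples = number of neurons
X = rng.uniform(-1, 1, size=(N, 1))
T = np.cos(2 * X)

W = rng.uniform(-4, 4, size=(1, N))      # one randomly generated neuron per sample
b = rng.uniform(-4, 4, size=N)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # square N x N hidden layer output matrix

beta = np.linalg.solve(H, T)             # exact solve: H is full rank with probability one
residual = np.max(np.abs(H @ beta - T))  # near zero, up to the conditioning of H
```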
There also exist universal approximation proofs for the ELM where the authors
have attempted to provide rules for the design and training of the SLFN. In [25], an
incremental version of the ELM was used to prove that if the incrementally added
neurons (with randomly selected weights and biases) of a SLFN use (2.4.8) to update
the corresponding new output weight term, the universal approximation theorem below
holds.

Theorem 2.3: Given any bounded non-constant piecewise continuous function g: R → R
for additive nodes, or any integrable piecewise continuous function g: R → R with
∫_R g(x) dx ≠ 0
for radial basis function nodes, for any continuous target function f and
any randomly generated function sequence {g_n}, lim_{n→∞} ‖f − f_n‖ = 0 holds with
probability one if

β_n = ⟨e_{n−1}, g_n⟩ / ‖g_n‖²   (2.4.8)

where e_{n−1} = f − f_{n−1} denotes the residual error of the network with n − 1 neurons.
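A simplified numerical illustration of the update (2.4.8), with inner products taken as sums over sample points and tanh additive nodes (this sketches the idea only, not the full incremental ELM of [25]):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 200)
f = np.sin(np.pi * x)                  # target function sampled on a grid

model = np.zeros_like(f)
residual = f.copy()                    # e_0 = f for the empty network
for _ in range(100):
    w, b = rng.uniform(-4, 4, size=2)  # randomly generated additive node
    g = np.tanh(w * x + b)             # hidden node output over the samples
    beta = (residual @ g) / (g @ g)    # (2.4.8): beta_n = <e_{n-1}, g_n> / ||g_n||^2
    model += beta * g
    residual = f - model               # e_n = e_{n-1} - beta_n g_n
```

Each step performs an exact line search along the new random direction g_n, so the residual norm never increases, which is the mechanism behind the theorem's convergence claim.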
Theorem 2.3 only confirms the universal approximation capability of the ELM
for the incremental learning method, where the network starts with an empty set of
neurons and iteratively adds more neurons as required. Indeed, the universal
approximation capability of the ELM in the most commonly used batch learning mode
was only published later in [23], [24]. Within the derivations of [23], [24], the authors
still used the incremental learning method to first prove that such SLFN training
methods can perform universal approximation, before stating the theorem for the
fixed-structure SLFN by induction. This important proof is shown as Theorem 2.4.
Theorem 2.4: Given any non-constant piecewise continuous function G: Rᵈ × R × Rᵈ → R, if span
{G(a, b, x) : (a, b) ∈ Rᵈ × R} is dense in the function space L², then for any continuous
target function f and any function sequence {g_n(x) = G(a_n, b_n, x)} randomly generated
based on any continuous sampling distribution, lim_{n→∞} ‖f − f_n‖ = 0 holds with
probability one if the output weights β_i are determined by ordinary least squares to
minimize ‖f(x) − ∑_{i=1}^{n} β_i g_i(x)‖.
With the proof provided in Theorem 2.4, the universal approximation capability
of the original ELM [3] and all other modifications of its batch learning type of training
algorithms with fixed SLFN network architecture are justified. In addition, Theorem 2.4
also indicates that the ELM training algorithm can work for the “generalized” SLFNs,
which include a wide range of non-linear and non-differentiable activation functions
where the nodes need not be neuron alike. This essentially means that the ELM training
algorithm allows much greater flexibility in the selection of basis functions to be used in
the SLFN architecture.
Lastly, it is worth noting that the ELM approximation theories state that as
the number of neurons becomes very large (possibly much larger than the number of
training samples), the performance of the SLFN trained with the ELM algorithm will
proportionally improve. The ELM training norm therefore deviates from the
conventional MLP learning rule that at most N neurons are required to learn N
unique sample patterns well. In fact, ELM networks are usually tested with a large
number of neurons at the first attempt. The next sub-sections survey the extensions of
the ELM algorithm and their applications.
2.4.2 Batch Learning ELM
The training algorithms that are developed based on the ELM can be generally
categorized into off-line (batch) learning and on-line (sequential) learning. The batch
learning method trains a group of sample patterns at once and usually requires longer
computational time, while the sequential learning method learns the sample patterns
one by one as new samples arrive (usually in real-time systems). The initial ELM was
designed for batch learning and was later extended to have sequential learning
capabilities. In the design of the basic batch learning ELM, the SLFN hidden layer
weights and biases are assigned randomly, and the output layer weights β are obtained by
solving in the least-squares sense such that Hβ = T. If the number of hidden
nodes is equal to the number of unique training samples, the matrix H is invertible, and
there exists a solution β = H⁻¹T with zero error (‖Hβ − T‖ = 0). However, in practical
cases where the number of samples is either more than or less than the number of
hidden nodes, such that the matrix H is not square, the pseudo-inverse is used, as it
achieves the smallest training error and has the smallest norm. The neural network
theory states that for SLFNs reaching small training error, the smaller the norm of the
network weights, the better the generalization performance [18].
Within the batch learning category, the development of the ELM training
algorithms can be divided into three sub-sections, namely, (i) the optimal selection of
weights and biases, (ii) the use of different types of nodes and activation functions, and
(iii) the automatic model selection algorithms.
2.4.2.1 The Optimal Selection of Weights and Biases
The first sub-section involves the least modification to the architecture of the
SLFN. Instead of randomly generating the network weights and biases, the methods
proposed in [33-39] implement alternative weights selection strategies that improve the
performance of the ELM. The discussions in [33-35] revolve around the ill-posed
solution of the pseudo-inverse when the matrices are not well conditioned. They
propose methods to properly select the hidden layer weights and biases such that the
hidden layer output matrix is well conditioned before calculating the output weights. In
[36], [37], the authors implement evolutionary algorithms to select the hidden layer
weights and biases, whereas the input data is used to generate the suitable hidden layer
weights and biases in [38]. Different from the works mentioned above, the algorithm in
[39] proposes the optimal calculation of the output layer weights when the hidden layer
output matrix is non-full rank.
2.4.2.2 The Use of Different Types of Nodes and Activation Functions
The second sub-section of learning algorithms proposes some changes to the
nodes or activation functions in the SLFN. In [40], the parameterized RBF was used in
the hidden layer so that the RBF node meta-parameters can be tuned to provide better
performance. A method to implement fuzzy activation functions was proposed in [41],
while the authors in [42] proved that the combination of upper integral functions at the
hidden layer can function as a classifier. Other than that, a good deal of attention was
also paid to the development of ELM algorithms for complex SLFNs. The authors in
[43-45] have developed fully complex ELMs with the main purpose of implementing an
ELM-based channel equalizer. In [46], both the sine and cosine were used as activation
functions for periodic function approximation.
2.4.2.3 The Automatic Model Selection Algorithms
The third and final sub-section for batch learning algorithms consists of
automatic model selection methods which determine the optimal architecture for the
ELMs. Generally, the model-selection-based training algorithms in the neural networks
literature can be divided into constructive methods and pruning methods. The
constructive methods start from a single hidden layer neuron, and incrementally add
neurons until a target or stopping criterion is met, while the pruning methods start with a
large number of neurons and selectively remove neurons which are less important. In
[23], [24], [47-49], constructive model selection methods were proposed to iteratively
search for the optimal number of hidden nodes for the SLFN trained with the ELM
algorithm. The convergence rate analysis for the constructive type training algorithms
was also provided in [50], where the main results show that the constructive type
training algorithms have universal approximation capability. Alternatively, the authors
in [51] and [52] proposed optimally pruned ELM algorithms that remove hidden layer
neurons from a large group of pre-generated neurons, based on regularization theories.
In [53], the modified Gram-Schmidt algorithm was used to select the important nodes in
the SLFN, while the fast pruned-ELM in [54] used statistical methods to determine the
best nodes to be retained.
2.4.3 Sequential Learning ELM
Real-time systems require speed and continuous updating capability that is only
available using sequential learning algorithms. Because the ELM is a very fast learning
algorithm by itself, the sequential learning form of the ELM promises even greater
speeds and improved overall performance compared to the conventional methods. The
online sequential extreme learning machine (OS-ELM) and its enhanced versions in
[55-57] extend the simple learning algorithm of the ELM to the sequential case, where
the sample-by-sample learning equations are derived and proven to be equivalent to
the batch learning case within a compact set of training samples. There is also the online
version of the ELM with fuzzy activation functions developed in [58] for function
approximation and classification problems, and the structure-adjustable (constructive or
pruning based) sequential learning ELMs in [59], [60].
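The sample-by-sample update underlying OS-ELM is a recursive least-squares step; a minimal sketch of the two phases, initialization on a small batch and then one-sample updates (the ridge term and all sizes here are illustrative choices, not taken from [55]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(4)
n_hidden = 30
W = rng.uniform(-2, 2, size=(1, n_hidden))     # fixed random hidden layer
b = rng.uniform(-1, 1, size=n_hidden)

# Initialization phase: a small batch makes H0^T H0 invertible.
X0 = rng.uniform(-1, 1, size=(50, 1))
T0 = np.sin(3 * X0)
H0 = sigmoid(X0 @ W + b)
P = np.linalg.inv(H0.T @ H0 + 1e-8 * np.eye(n_hidden))  # tiny ridge for stability
beta = P @ H0.T @ T0

# Sequential phase: fold in one new sample at a time, no retraining from scratch.
for _ in range(500):
    x = rng.uniform(-1, 1, size=(1, 1))
    t = np.sin(3 * x)
    h = sigmoid(x @ W + b)                     # 1 x n_hidden hidden layer row
    P = P - (P @ h.T @ h @ P) / (1.0 + (h @ P @ h.T).item())
    beta = beta + P @ h.T @ (t - h @ beta)
```

The matrix P plays the role of the running inverse of HᵀH, which is why each update costs only a few matrix-vector products.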
2.4.4 ELM Ensembles
In order to boost the performance of the ELM learning algorithms mentioned above
(batch learning and sequential learning methods), the combination of several ELMs to
learn a problem as a group, and to devise a collective output, was developed by
researchers. These approaches which combine the efforts of multiple ELMs are known
as ensemble methods. The most prominent works in this area are the contributions in
[61-64], where different types of combinational theories are implemented to improve
the classification performance of the single ELM. The recently published results in [64]
show that an increased number of ELMs generally improves the classification ability
of the neural classifier ensemble when tested on 19 real-world problems. It was also
found that between 5 and 35 ELMs are sufficient for most cases. In terms
of practical implementations of the ELM ensemble in real-world situations where fast
training is essential, the authors in [65] have proven that ELM ensembles can be
trained in parallel using graphics processing units (GPUs), which reduce the
computational time by several orders of magnitude compared to the usual CPU implementation.
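A toy ensemble of the simplest averaging kind: several ELMs that differ only in their random hidden layers are trained on the same noisy data and their outputs averaged (an illustration of the general idea, not any specific scheme from [61-64]):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    W = rng.uniform(-2, 2, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1, 1, size=n_hidden)
    beta = np.linalg.pinv(sigmoid(X @ W + b)) @ T
    return W, b, beta

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(100, 1))
T = np.sin(3 * X) + 0.1 * rng.standard_normal((100, 1))   # noisy training targets

# 15 independently initialized ELMs; averaging damps the variance that each
# network inherits from its particular random hidden layer.
models = [train_elm(X, T, 25, rng) for _ in range(15)]

def ensemble_predict(Xq):
    return np.mean([sigmoid(Xq @ W + b) @ beta for W, b, beta in models], axis=0)
```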
2.5 Regularized ELM
The basic ELM training algorithm uses the pseudo-inverse to analytically determine the
output weights after the hidden layer weights and biases are randomly generated.
According to the statistical optimization theory [83], the pseudo-inverse is a form of
empirical risk minimization (ERM) strategy, where the focus is on minimizing the
training error for classification of the training sample set. However, achieving the
minimum training error does not directly translate to good classification performance in
the testing phase, where the generalization ability is essential. In fact, over-fitting is
often observed in ELM applications [89]. A good classification model should optimally
balance between the structural risk and empirical risk minimizations. In order to
improve the generalization ability of the ELM, several modified ELMs [88-90] that
make use of regularization techniques in the determination of the output weights were
proposed.
Deng et al. [89] described the ELM as being ERM-themed, noting that the training
process of the ELM does not consider the heteroscedasticity found in real-world sample
datasets. The ELM is therefore susceptible to over-fitting and to outliers during training.
They proposed the regularized ELM (RELM), which introduces a regularization factor γ
for the empirical risk and uses the magnitudes of the output weights to represent the
structural risk in training. Furthermore, an error weighting matrix D was included
to minimize the interference of outliers in the training sample set. The optimization
problem is defined in (2.5.1).

min_β  (1/2)‖β‖² + (γ/2)‖Dε‖²  subject to  ∑_{i=1}^{Ñ} β_i g(w_i · x_j + b_i) − t_j = ε_j,  j = 1, …, N   (2.5.1)
The standard optimization technique using Lagrange multipliers was then used to obtain
the output weights solution in (2.5.2).

β = ( I/γ + Hᵀ D² H )⁻¹ Hᵀ D² T   (2.5.2)

where I is the identity matrix, H is the hidden layer output matrix, and T is the target
matrix. When the matrix D is set as the identity matrix, the solution in (2.5.2) can be
written as in (2.5.3).

β = ( I/γ + Hᵀ H )⁻¹ Hᵀ T   (2.5.3)
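With the error weighting matrix set to the identity, the solution (2.5.3) amounts to a ridge-style replacement of the pseudo-inverse step; a minimal sketch (the value of the regularization factor gamma is an illustrative choice, not a tuned value from [89]):

```python
import numpy as np

def relm_weights(H, T, gamma):
    """RELM output weights for the D = I case: beta = (I / gamma + H^T H)^{-1} H^T T."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / gamma + H.T @ H, H.T @ T)

rng = np.random.default_rng(6)
X = rng.uniform(-1, 1, size=(80, 1))
T = np.sin(3 * X) + 0.2 * rng.standard_normal((80, 1))   # noisy targets invite over-fitting

W = rng.uniform(-2, 2, size=(1, 60))
b = rng.uniform(-1, 1, size=60)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))

beta_reg = relm_weights(H, T, gamma=100.0)   # shrunk, regularized solution
beta_erm = np.linalg.pinv(H) @ T             # plain ERM solution for comparison
```

The regularized weights come out with a much smaller norm than the plain least-squares weights, which is exactly the structural-risk term at work.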
The RELM reported great improvements in the generalization ability compared to the
ELM when tested on 13 real world datasets while still maintaining the fast training
speed. However, the authors used the matrix search method to select the optimal
regularization factor and number of neurons combination. The matrix search method is
time consuming as every possible combination within the specified range of values has
to be tested. Although the time spent to search for the optimal meta-parameters is
essential and significant, most of the results in the literature do not include it.
Toh [88] proposed the total error rate-based ELM (TER-ELM) designed for
pattern classification problems. Instead of minimizing the empirical error (as in basic
ELM), it is desired in the TER-ELM that the classification error of the SLFN be
minimized. The possible classification results are first categorized as True Positive Rate
(TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate
(FNR). The objective is then to minimize the total error rate (TER):

TER = FPR + FNR   (2.5.4)
In the optimization of the TER, Toh used the quadratic function to approximate the step
function commonly used as the hard discriminator (outputs either 1 or 0 only).
Comparisons of other functions such as sigmoid or logistic-like functions, piecewise
power functions, and some non-linear smooth functions in their work pointed out that
the non-linear nature of these functions requires an iterative search method to obtain the
solution. Therefore, the quadratic function remains their choice to derive a
deterministic solution. By defining two class-specific diagonal weighting matrices W₊
and W₋, which represent the weight factors for the positive and negative samples of
each category in a multi-category problem (similar to multi-class classification), the
generalized optimization solution for the SLFN output weights can be written as:

β = ( Hᵀ W H )⁻¹ Hᵀ W T   (2.5.5)

where W = W₊ + W₋, the elements of H and T are ordered according to the two
classes, and K represents the number of categories of the multi-category problem,
such that β = [β_1, …, β_K]. In order to handle the case when the matrix HᵀWH is not full
rank, the regularizer b is included to improve the numerical stability in the regularized
solution in (2.5.6).

β = ( bI + Hᵀ W H )⁻¹ Hᵀ W T   (2.5.6)
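Whatever the specific class weights, the solution (2.5.6) is structurally a weighted, regularized least-squares step; a generic sketch (the function name and the comparison below are mine, not from [88]):

```python
import numpy as np

def weighted_output_weights(H, T, sample_weights, reg):
    """beta = (reg * I + H^T W H)^{-1} H^T W T, with W = diag(sample_weights)."""
    L = H.shape[1]
    HW = H.T * sample_weights            # same as H^T @ diag(sample_weights)
    return np.linalg.solve(reg * np.eye(L) + HW @ H, HW @ T)

rng = np.random.default_rng(7)
H = rng.standard_normal((50, 8))         # stand-in hidden layer output matrix
T = rng.standard_normal((50, 1))

# Up-weighting one class (here: the first 10 rows) biases the fit toward it.
w = np.ones(50)
w[:10] = 5.0
beta_weighted = weighted_output_weights(H, T, w, reg=0.1)
beta_uniform = weighted_output_weights(H, T, np.ones(50), reg=0.1)
```

With uniform weights this reduces exactly to the ridge-regularized ELM solution; non-uniform weights trade overall error for lower error on the favoured class.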
The TER-ELM was first evaluated against the ELM using 42 benchmark classification
datasets from the UCI, StatLog and StatLib repositories. It should be noted that the
regularizer was not used in these experiments (b = 0); instead, the author selected 10 sets
of class-specific weights forming W to verify the effectiveness of the TER
minimization design. The average classification results showed a consistent
improvement by the TER-ELM over the basic ELM. Later comparisons with other
state-of-the-art classifiers reveal that the TER-ELM produces comparable results, but
the simple deterministic methodology employed by the TER-ELM significantly reduces
the network complexity and computational costs.
Comparing the solution in (2.5.6) with the solution for the RELM in (2.5.2), it
can be seen that in (2.5.2) the weight matrix D is sample-specific to reduce the effects of
outliers, while the weight matrix W in (2.5.6) is class-specific to balance the sensitivity
of the classifier. However, both the TER-ELM and RELM algorithms still require extra
time for parameter selection compared to the ELM.
Huang et al. [90] proposed the constrained-optimization-based ELM (CO-ELM)
to extend the work by Deng et al. [89] and Toh [88] to the generalized SLFNs and
kernel functions. In [90], it was first proven that the ELM with a generalized SLFN is
capable of universal classification as long as the number of neurons selected is large
enough for the universal function approximation conditions to be valid. Then the
Lagrange multiplier method for equality constraints is applied to the ELM optimization
functions in (2.5.7) to show that the multi-class single-output type of SLFN is a
special case of the multi-class multi-output type of SLFN for classification, and that
only the multi-output case needs to be considered for the CO-ELM.

min_β  (1/2)‖β‖² + (C/2) ∑_{i=1}^{N} ‖ξ_i‖²  subject to  h(x_i)β = t_iᵀ − ξ_iᵀ,  i = 1, …, N   (2.5.7)
For the case when the number of training samples is less than the number of
hidden nodes, the solution in (2.5.8) is suggested.

β = Hᵀ ( I/C + H Hᵀ )⁻¹ T   (2.5.8)

For the case when the number of training samples is more than the number of
hidden nodes, the solution in (2.5.9) is suggested.

β = ( I/C + Hᵀ H )⁻¹ Hᵀ T   (2.5.9)
Both the output layer weights solutions in (2.5.8) and (2.5.9) tend to reach the
smallest training error and the smallest norm, and selecting which one to use is mainly a
matter of the computational cost required in performing the matrix inverse. They also
suggested that the number of neurons can be set automatically at a large value (e.g. 1000),
as the generalization performance of the CO-ELM is not sensitive to the
dimensionality of the feature space.
For the case when the hidden layer mapping h(x) is unknown, the ELM kernel
matrix in (2.5.10) is proposed.
Ω_ELM = H Hᵀ,  where  Ω_ELM(i, j) = h(x_i) · h(x_j) = K(x_i, x_j)   (2.5.10)
By inserting the ELM kernel into (2.5.8), the corresponding output function of the CO-
ELM classifier can be defined as in (2.5.11), and hence the CO-ELM can be
implemented with kernel functions as well.

f(x) = h(x) Hᵀ ( I/C + H Hᵀ )⁻¹ T = [ K(x, x_1), …, K(x, x_N) ] ( I/C + Ω_ELM )⁻¹ T   (2.5.11)
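A compact sketch of (2.5.11) with an RBF kernel standing in for the unknown feature map (the kernel choice and the gamma and C values are illustrative assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """K(a, b) = exp(-gamma * ||a - b||^2) for all pairs of rows of A and B."""
    d2 = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * d2)

def kernel_elm_fit(X, T, C, gamma):
    # alpha = (I / C + Omega)^{-1} T, with Omega the ELM kernel matrix of (2.5.10)
    omega = rbf_kernel(X, X, gamma)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kernel_elm_predict(Xq, X, alpha, gamma):
    # f(x) = [K(x, x_1), ..., K(x, x_N)] (I / C + Omega)^{-1} T, as in (2.5.11)
    return rbf_kernel(Xq, X, gamma) @ alpha
```

Note that the hidden layer never appears explicitly: everything is expressed through kernel evaluations between sample points.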
The classification performance of the CO-ELM on 12 binary-class datasets and
12 multi-class datasets has confirmed that the CO-ELM always achieves classification
performance comparable to the SVM and LS-SVM for binary-class cases, and
performs much better for multi-class cases. Overall, the CO-ELM
demonstrated much faster training speed than the SVM and LS-SVM.
In summary, the regularization methods are seen as an integral part of the ELM
training algorithm, so much so that the CO-ELM solutions in (2.5.8) and (2.5.9) have
become the preferred form of the ELM for comparisons with other pattern classification
algorithms. The CO-ELM is sometimes directly referred to as the ELM. By proper
tuning of the regularization factor or regularization matrix, the ELM can avoid
over-fitting, thus improving the generalization ability significantly without any
noticeable decrease in training speed. However, finding the ideal regularizer values
remains a costly computational task.
2.6 ELM and SVM
Support vector machines have been a popular choice for pattern classification
applications because of their good generalization performance and solid literature [19,
20]. Given a set of sample training patterns (x_i, t_i), the SVM first maps the input
patterns into a high-dimensional feature space using some kernel function K(x_i, x_j) =
φ(x_i) · φ(x_j), and then the hyperplane that maximally separates the margin between the
two classes is determined using the method of Lagrange multipliers [20]. The outcome
of the optimization procedure is a set of support vectors that define the boundaries of each
class and a decision function that acts as the discriminator. The decision function of the
SVM is given in (2.6.1). The convention in [90] is adopted here for readability.

f(x) = sign( ∑_{s=1}^{N_s} α_s t_s K(x, x_s) + b )   (2.6.1)

where x is the sample input pattern, N_s is the number of support vectors, α_s is the
Lagrange multiplier corresponding to the target t_s of the support vector x_s, and b is the
bias. The SVM was originally developed for binary classification problems and requires
specific configurations, such as the one-against-one or one-against-all methods, to
perform multi-class classification.
From the SVM decision function in (2.6.1) it is easily seen that it resembles the
ELM decision function in (2.4.6), where the kernel function K(x, x_s) behaves like an
additive-type neuron with an activation function that maps the input pattern into feature
space, and the α_s t_s terms act like the output layer weights β. However, the bias term b
performs different functions in the ELM and SVM: in the ELM the bias shifts the
points in the feature space, while the bias in the SVM shifts the separating hyperplane
away from the origin where necessary. Several detailed investigations have been
conducted in [90], [147], [188] to identify the connections between the ELM and SVM.
In [188], Wei et al. performed comparative studies on the learning
characteristics of the ELM and SVM. It was found that both the ELM and SVM obtain
global solutions, the ELM through least squares and the SVM through quadratic programming.
However, the computational load of the SVM is significantly higher when the sample
size is large. The number of support vectors generated by the SVM is also usually larger
than the number of neurons required by the ELM to achieve similar performance.
The decision function of the SVM is therefore more complex than that of the ELM,
and the testing times of the ELM are also much faster than those of the SVM. In
addition, the SVM requires the careful selection of the kernel function, kernel
parameters, and the regularization constant, while the ELM only needs to determine the
optimal number of neurons during training. All the results point to the superiority of
the ELM over the SVM.
Liu et al. [189] and Frenay and Verleysen [190] directly applied the ELM
random-weights hidden layer function as the so-called ELM kernel in the SVM. It was
found that the ELM kernels eliminate the need for tedious kernel parameter selection,
as only the number of neurons needs to be specified, and this number can be set
sufficiently large (e.g. 1000 neurons) without affecting the classification performance,
owing to the regularization parameter. The classification results showed good
generalization performance comparable to standard SVM kernels, with the
significant advantage that the SVM with ELM kernels was extremely fast.
Huang et al. [90] derived the optimization-method-based solution for the ELM
using the same quadratic programming framework as the SVM. Two differences from
the SVM optimization problem were noted for the ELM: (i) instead of using a
conventional kernel function such as the RBF kernel, the ELM uses random feature
mappings; (ii) the bias b is not required in the ELM's optimization constraints, since in
theory the separating hyperplane of the ELM feature space passes through the origin.
The primal and dual Lagrangians were defined for the ELM and the solution to the dual
case was derived. Using the decision function of the dual optimization with the ELM
kernel in (2.6.2), the so-called support vector network for the ELM is shown in [90].

f(x) = sign( ∑_{s=1}^{N_s} α_s t_s K(x, x_s) )   (2.6.2)
The most obvious difference between the SVM and ELM decision functions,
(2.6.1) and (2.6.2), is the missing bias term in the ELM function. The elimination of the
bias term in the optimization-method-based ELM also correspondingly reduces the
optimization constraints compared to the SVM. As a result, the SVM is said to
reach suboptimal solutions within a stricter feature space compared to the optimization-method-based ELM. Experimental results on 13 benchmark binary classification
problems confirmed that the ELM achieves better generalization performance overall.
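The bias-free decision function in (2.6.2) can be sketched as below. This is a hedged illustration: the generic kernel, the hand-set Lagrange multipliers, and the function names are all assumptions for demonstration, not the trained quantities from [90].

```python
import numpy as np

def elm_svm_decision(x, support_X, alpha, t, kernel):
    """Decision function (2.6.2): sign of a kernel expansion with NO bias
    term, since the ELM separating hyperplane passes through the
    feature-space origin."""
    score = sum(a * ti * kernel(x, xi)
                for a, ti, xi in zip(alpha, t, support_X))
    return np.sign(score)

# toy example with a linear kernel and hand-set multipliers
linear = lambda u, v: float(np.dot(u, v))
support_X = [np.array([1.0, 0.0]), np.array([-1.0, 0.0])]
alpha, t = [1.0, 1.0], [1.0, -1.0]
label = elm_svm_decision(np.array([0.5, 0.2]), support_X, alpha, t, linear)
```

Dropping the bias removes the corresponding equality constraint on the multipliers, which is the relaxation the text attributes to the optimization-method-based ELM.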
In [147], Huang et al. further investigated the relationship between the ELM and
the SVM by comparing the LS-SVM and the PSVM with the ELM. It was found that both the LS-SVM
and the PSVM, which use equality constraints instead of the inequality constraints of conventional SVMs,
actually have optimization formulations similar to that of the CO-ELM (henceforth
referred to as the ELM) discussed in Section 2.5. If the bias term is removed from the LS-SVM
and PSVM optimization functions, the resultant learning algorithms are unified
with the ELM. This is a significant finding because the ELM learning theory allows the
SLFN classifier to achieve comparable or better classification performance than
other state-of-the-art pattern classification algorithms. In addition, the
ELM is suited for multiple applications, including binary classification, multi-class
classification, and regression problems. In terms of computational complexity, the ELM
only has to tune the regularization parameter if the hidden layer feature mapping is known
and the number of neurons is set to a large value (e.g., 1000 neurons). The comparatively
simpler ELM algorithm runs up to thousands of times faster than the SVM and the LS-SVM.
Even when the hidden layer feature mapping is not known, kernels can be used in
the ELM. The experimental results further confirmed that the ELM achieves generalization
performance comparable to the SVM and LS-SVM on binary classification
problems, and much better generalization performance in multi-class cases.
In summary, the research incorporating the learning theories of the ELM and
SVM can be categorized as in Table 2.1. It can be seen from Table 2.1 that different
approaches have been used to merge the ELM and SVM learning theories, and all the
methods have reported comparable results or some improvements over the conventional
SVM. In terms of the optimization formulation, the ELM and the SVM are indeed very
similar: the maximal-margin principle of the SVM corresponds to the minimization of
the norm of the output weights in the ELM [90], [147]. The many experiments
comparing the ELM and the SVM suggest that the ELM is the preferred algorithm for
handling large-scale datasets, while the SVM is better when the sample size is small.
Table 2.1 Research incorporating the ELM and SVM

Algorithm | Linked to ELM                                                      | Linked to SVM
ELM       | Basic ELM [3]                                                      | Optimization-method-based ELM [90]
SVM       | SVM with ELM kernels [189], [190]; LS-SVM with bias removed [147]  | Conventional SVM [20]
2.7 ELM and RVFL
The ELM is capable of achieving very fast training and good generalization
performance. However, it should be noted that the idea of using random weights and
biases in the SLFN was first suggested in the mid-1990s. Similar to the ELM, the
random vector functional link (RVFL) net proposed by Pao et al. [27-32] also uses
randomly generated weights and biases. The RVFL is a modified version of the
functional link net, which originally proposed using non-linear functional links,
instead of the usual dot products, at the input connections of the SLFN. The
authors then introduced the use of randomly generated weights and biases in the input layer
as a special case, with the output weights calculated using iterative optimization
methods such as gradient descent.
In terms of network architecture, the main difference in the literature of the ELM
by Huang et al., and the RVFL by Pao et al., is that the RVFL is defined as a “flat”
neural network, which means that it does not have a hidden layer. Hence the inputs of
the RVFL are connected directly to the output layer. However, the inputs of the RVFL
perform similar neuron calculations to those of the ELM hidden layer, with
random weights and biases. Therefore, the effect of this difference in network
architecture on the mathematical formulation of the two networks is indeed subtle.
Nevertheless, the ELM is different from the RVFL in terms of the types of acceptable
hidden layer nodes and the batch learning principle of the output layer weights.
The training procedure of the ELM differs from that of the RVFL in that the
batch learning method using the pseudoinverse is the initial basis for the fast learning
proposed in the ELM algorithm, while the RVFL initially suggested a gradient descent
approach. Although training the RVFL with a similar pseudoinverse is indeed
possible, a vast body of ELM literature is based solely on the batch-learning
pseudoinverse method. The definition of the possible types of hidden layer
nodes is also more general in the ELM literature, where the hidden layer nodes can be
non-neuron-like, and the network is known as the generalized SLFN (including
sigmoid networks, RBF networks, trigonometric networks, threshold networks, fuzzy
inference systems, fully complex neural networks, high-order networks, ridge
polynomial networks, wavelet networks, etc.) [23], [24]. Based on the points compared
herein, the ELM is seen to depart clearly from the conventional neural
network architecture and the traditional iterative training algorithms.
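The batch-learning pseudoinverse step that distinguishes the ELM can be sketched as follows. This is a minimal illustration with tanh hidden nodes; the function names, the neuron count, and the toy target are assumptions for demonstration only.

```python
import numpy as np

def train_elm(X, T, L=60, seed=0):
    """One-shot batch ELM training: random input weights and biases,
    then output weights from the Moore-Penrose pseudoinverse."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], L))  # input weights, never tuned
    b = rng.standard_normal(L)                # hidden biases, never tuned
    H = np.tanh(X @ W + b)                    # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T              # output weights, no iteration
    return W, b, beta

def predict_elm(X, W, b, beta):
    return np.tanh(X @ W + b) @ beta

# quick check: fit a smooth 1-D target in a single batch step
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(200, 1))
T = np.sin(3 * X)
W, b, beta = train_elm(X, T)
mse = float(np.mean((predict_elm(X, W, b, beta) - T) ** 2))
```

The contrast with the RVFL discussion above is in the last line of `train_elm`: the output weights come from one linear-algebra step rather than an iterative gradient descent loop.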
2.8 Applications of ELM
The ELM boasts good generalization performance and low computational complexity.
Therefore, there have been many applications of the ELM in various fields. This section
gives a brief survey of the applications of ELM, with a focus on medical pattern
classification problems. The following sub-sections discuss the applications of the ELM
in the area of medical systems [66-70], [158-162], image processing [163-165], face
recognition [166], [167], handwriting recognition [71], [191], sales forecasting [168-
170], parameter estimation [171-173], information systems [76], [174-175], and control
systems [176], [177].
2.8.1 Medical Systems
One of the most popular fields of application for the ELM is medical systems. The
amount of data generated by modern medical equipment is enormous; hence, fast
intelligent systems are required to infer vital information that can aid
medical practitioners. Due to its good classification performance on large-scale datasets,
the ELM is often used in computer-aided medical diagnosis.
Li et al. [158] proposed a computer-aided diagnosis system for thyroid disease
using the ELM. In order to differentiate between hyperthyroidism, hypothyroidism,
and normal conditions of the thyroid, they developed the so-called PCA-ELM. The PCA-ELM
uses principal component analysis (PCA) to first reduce the dimensionality of the
patient data before using the ELM to learn the classification model. The PCA-ELM was
compared with the PCA-SVM in the experiments, and it was found that the PCA-ELM
performs much better than the PCA-SVM in terms of classification accuracy and tends
to achieve a smaller standard deviation in the 10-fold cross-validation (CV) tests, with a
shorter run time.
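The PCA-then-ELM pipeline can be sketched as below. This is an illustrative SVD-based PCA, not the authors' code from [158]; the function name and toy data are assumptions.

```python
import numpy as np

def pca_reduce(X, k):
    """Project centred data onto its top-k principal components via SVD,
    producing the reduced features that the ELM stage then learns from."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T   # (n, k) matrix, columns ordered by variance

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 10))   # stand-in for raw patient data
Z = pca_reduce(X, 3)                # the ELM is then trained on Z as usual
```

Reducing dimensionality first shrinks the ELM's input weight matrix, which is consistent with the shorter run times reported above.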
Malar et al. [159] developed a novel classifier for mammographic
microcalcifications using wavelet analysis and the ELM. Microcalcifications in breast
tissue act as potential indicators of breast cancer, but they are usually hard to identify
because of the dense nature of breast tissue and the poor contrast of
mammograms. Selected wavelet texture features extracted from a pool of 120 regions of
interest within a mammogram were used as input patterns to train the ELM. The 10-fold
CV tests with the ELM, Bayes classifiers, and the SVM showed that the ELM achieves the
best classification accuracy and sensitivity. The receiver operating characteristic (ROC)
of the ELM also verified its superiority over the other classifiers.
Authors in [160-162] designed ELM based classifiers for electroencephalogram
(EEG) signals classification. The EEG records electrical activity within the brain and is
a very useful method in diagnosing neurological disorders. However, the EEG
recordings are usually taken over a long period of time and are very tedious to
examine by visual inspection. In [160] and [161], the authors developed epileptic
seizure detection systems using the ELM. After passing the EEG signal through the
feature extraction process, the ELM was used to learn the sample pattern sets. Both
studies reported excellent detection results with good sensitivity and specificity. The
authors also agree that the high classification accuracy and low computational cost of
the ELM-based systems give them great potential to be implemented as real-time detection
systems. In [162], Shi and Lu used EEG signals to estimate the continuous vigilance
of humans in human machine interaction situations. Techniques for real-time estimation
of operator vigilance are highly desirable in human machine interactions due to work
safety concerns. Detailed experiments comparing the basic ELM, the ELM with L1 and L2
norm penalties respectively, and the SVM were conducted to investigate the effectiveness
of the learning algorithms. All the ELM-based methods outperformed the SVM in
terms of learning speed. In terms of accuracy, the regularized versions of the ELM
that include the L2 norm perform better than the SVM, while the other ELM algorithms
have results comparable to the SVM.
In [67], the ELM was used to predict the appropriate super-family of proteins for
each new sample protein sequence. Conventionally, a new protein sequence would
need to be compared with the individual identified protein sequences; hence, the testing
process is very time-consuming. The experimental results compared the performances of
the BP training algorithm with the radial basis function ELM and the sigmoidal ELM. It
was found that the ELM training algorithms produced better classification accuracy and
were four orders of magnitude faster than the BP method. In [68], the authors developed
a protein secondary structure prediction framework based on the ELM. The prediction
framework is aimed at obtaining the three-dimensional structure of the protein sequence
in order to determine its functions. The protein secondary structure acts as the
intermediate step in predicting the three-dimensional protein structure. Previous works
in this area are reported to achieve high accuracy but are very time-consuming [68]. The
ELM was combined with a new encoding scheme, a probability-based combining
algorithm, and a helix-postprocessing method in the so-called ELM-PBC-HPP. The
experiment results show that the proposed method provides the same high level of
accuracy at much faster training speeds compared with BP and SVM.
In [66], the authors studied the effectiveness of the ELM in the multi-category
classification of 5 bioinformatics datasets from the UCI machine learning repository. In
all the experiments, the ELM was found to outperform the SVM in terms of training
speed and classification accuracy. The ELM was found to reach its maximum
performance within a set limit of hidden layer nodes. In addition, the optimal size of
the SLFN trained with the ELM was also consistently more compact compared to
conventional SLFN training methods such as the BP. In [69] and [70], the authors
investigated the multi-category classification of microarray gene expressions for cancer
diagnosis. A total of 3 microarray datasets were examined in [69], which consist of the
GCM dataset, Lung dataset, and Lymphoma dataset. A recursive gene feature
elimination method was used for gene selection. The experimental results showed that the
ELM is more robust than the SVM under a varying number of genes selected
for training. Furthermore, it was concluded that as the number of classes becomes large,
the ELM achieves a higher accuracy with less training time and a more compact
network compared to other algorithms. Lastly, the ELM classifier in [70] uses
ANOVA for first-stage gene importance ranking, followed by the training of the
ELM using the minimum gene subset. The results obtained for the performance of the
ELM were consistent with the findings in [69].
2.8.2 Image Processing
Decherchi et al. [163] proposed the Circular-ELM (C-ELM) for the automated
assessment of image quality on electronic devices. The C-ELM implements the
additional augmented input dimension as seen in the circular backpropagation (CBP)
architecture to the basic ELM. It is seen that the C-ELM outperforms the basic ELM
and the conventional CBP algorithm in converting the visual signals into quality scores.
Wang et al. [164] developed ELM-based algorithms to learn the image de-blurring
functions of traditional image filters. It was reported that the newly developed
learned filter-partial differential equations (LF-PDE) de-blurring model overcomes the
limitation of traditional filters in terms of edge protection and allows the user to
customize the learning functions to select certain de-blurring properties.
Yang et al. [165] presented a fingerprint matching system based on the ELM.
The system implements an image pre-processing and feature extraction module to
produce 15-dimensional feature vectors to be classified by the ELM and the RELM. It is
seen that the ELM and RELM outperform the traditional BP and SVM in both
classification accuracy and training time. In addition, the fast training and testing times
of the ELM and RELM are said to make them suitable for real-time implementations.
2.8.3 Face Recognition
Face recognition is one of the most important biometric identification methods with
many possible applications. In [166], Zong and Huang investigated the performance of the
regularized ELM on 4 popular face databases, namely, YALE, ORL, UMIST, and
BioID. The experiment compared ELM-OAO, ELM-OAA, SVM-OAO, SVM-OAA,
and Nearest Neighbour methods with pre-processing applied to all databases to reduce
the dimensions of the samples. In their experimental results, it can be seen that the ELM
and SVM variants achieve comparable classification accuracy while consistently
performing better than Nearest Neighbour. For most cases, the OAA classifiers are
reported to give better results. The advantage of ELM in this application is in the model
simplicity where only one meta-parameter needs to be tuned after the SLFN is assigned
a constant large number of neurons.
Marques and Grana [167] proposed a face recognition system using lattice
independent component analysis (LICA) and the ELM. The so-called LICA-ELM system
performs feature extraction in the first stage and then performs learning and classification on
the generated features. The experimental results from testing on the Color FERET
database show that the ELM and the regularized ELM achieve better classification
performance than the Random Forest decision tree, SVM, and BP methods. It
can be seen that the regularized ELM performs much better than the basic ELM on this small-sized
dataset.
2.8.4 Handwritten Character Recognition
Handwritten character recognition databases have continuously been used as standard
benchmark tests for developing new learning algorithms. In [72], Chacko et al. used the
ELM, E-ELM, and OS-ELM for handwritten Malayalam character recognition.
character images are first pre-processed using the wavelet energy feature extraction
method. The ELM and its variants are then used to learn and classify the sample pattern
vectors. In the experiments, the authors suggested that the Daubechies, symlet, and
biorthogonal wavelets give the best performance. In terms of classification accuracy, the
basic ELM is seen to consistently outperform its variants over a range of
wavelet types and levels of decomposition. Comparison of the ELM performance with
previously reported results in the literature showed that the ELM achieved the best
classification accuracy.
The authors in [71] developed new training algorithms for the SLFN using
unique SLFN structural properties and gradient information to improve on the ELM
algorithm. The experiment was conducted using the MNIST handwritten digits database
[82], which consists of the handwritten digits from 0 to 9. Each digit is a gray-scale
image of 28x28 pixels with the intensity range of 0 (black) to 255 (white). A sample set
of the digit images is given in Figure 2.4.
Figure 2.4 A set of handwritten digits from the MNIST database
The experiment results revealed that the proposed weighted accelerated upper-
layer-solution-aware (WA-USA) algorithm is capable of achieving the same accuracy as
the ELM using only 1/16 of the network size and the testing time. However, the WA-
USA requires up to 6 times longer training time than the ELM. It was also noted
that, compared to the newly proposed algorithms, the classification performance of the
ELM suffers significantly when the number of hidden nodes is small.
2.8.5 Sales Forecasting
Sales forecasting or modelling is a very attractive field of study for businesses. It is
important to predict whether a certain product or some fashion style will attract the
interest of customers. Yu et al. proposed an Intelligent Fast Sales Forecasting Model
(IFSFM) for fashion products [168]. Using the attributes of a fashion sales record, such
as the colour, size, and price, the sales amount can be forecasted. Conventional neural
network methods such as the BP and statistical models are found to be slow compared
to the IFSFM method which implements the ELM. In terms of forecasting error, the
IFSFM obtains the lowest MSE.
Xia et al. [169] presented an adaptive-metrics-of-input ELM, called the AD-ELM,
for fashion retail forecasting. Different from the basic ELM, the AD-ELM uses
adaptive metrics to suppress dramatic changes at the input of the SLFN in order to
avoid large fluctuations on unseen data samples. The final forecasting scheme employs
several ELMs and averages their results as the final forecasted sales amount. The
experiments on several real-world fashion sales datasets show that the AD-ELM and the
ELM perform better than the conventional auto-regression and BP-based neural
network models. The AD-ELM significantly outperforms the ELM in all cases.
Chen and Ou [170] developed the Gray ELM (GELM) with the Taguchi method for
a sales forecasting system. Gray relation analysis is used to extract important
features from the raw sales data to be used as inputs to the ELM. The Taguchi method
was applied to find the optimal number of hidden nodes and the type of activation
function to obtain the best results. The experiment implemented the GELM and several
BP based neural networks to predict 120 days of sales data for lunchboxes. It was found
that the GELM achieves the smallest MSE and has much faster training time.
2.8.6 Parameter Estimation
There is a wide array of applications for parameter estimation of system variables and
attributes. In [171], Xu et al. used the ELM for the real-time frequency stability
assessment of electric power systems. Traditional methods of estimating the frequency
stability involved solving a large set of nonlinear differential-algebraic equations, which
can be computationally expensive and time consuming. The authors proposed the ELM
predictor which uses the power system operational parameters as inputs and outputs the
frequency stability margin. The training procedure involves an off-line training process
before inserting the ELM predictor into the real-time system. The fast training speed of
the ELM caused minimal delay, and the high accuracy and small standard deviation
given the randomly generated weights and biases made the system acceptable for
practical use.
In [172], Wu et al. implemented the ELM for wind speed estimation in a wind
turbine power generation system (WTPGS). The estimation of wind speed is vital for
the optimal control of the turbine shaft in order to achieve maximum power point
tracking. Different from conventional neural-network-based wind speed estimators,
the new scheme is independent of the environmental air density. The ELM uses the real-
time turbine generator information to provide precise estimates of wind speed for the
control system. In addition, the ELM was also implemented in the pitch controller of the
wind turbine to replace the conventional linear controller. Both the experimental and
simulation results showed greatly improved performance compared to conventional
RBF neural networks and PID controllers. It is seen that the ELM-based estimation
scheme can provide near-optimal control, clearly outperforming the conventional
RBF and PID models, and thus produces the best power generation capability among all
the methods.
In [173], Tang et al. proposed the partial least squares-optimized ELM (PLS-OELM)
for mill load prediction. The PLS algorithm was used to extract frequency
spectrum features from the mill shell vibrations, and the latent features are then learned
by the OELM for predicting the mill load. The experiments were conducted using a
laboratory sized ball mill (XMQL-420450) with an accelerometer attached to the middle
of the shell. Comparing the fitting and predictive performances of the PLS-OELM with a
Gaussian kernel against the PLS, PLS-ELM, PLS-BP, and PCA-SVM, it can be seen that the
PLS-OELM achieves better results than all the other algorithms except the PCA-SVM,
under the constraints of small sample size and high dimensionality.
2.8.7 Information Systems
With the growth in the use of digital systems to store information for easy access
globally, there is an increasing demand for intelligent systems to sort and extract
relevant data in large scale digital databases.
Wang et al. [174] combined the OS-ELM with intuitionistic fuzzy sets for
predicting consumer sentiments from online reviews. They focused on the Chinese-character
reviews of 3 databases, using single-classifier and ensemble methods. The
reviews can be categorized as positive, negative, or neutral. For the single-classifier
methods, the OS-ELM and the ELM achieve results comparable to the SVM and
perform much better than the Naïve Bayes classifier. However, it is worth noting
that the OS-ELM and the ELM have much smaller standard deviations and much faster
training speeds. In the ensemble experiments, the authors focused on finding the best
multi-classifier output fusion technique. The experimental results show that the
conventional mathematical averaging methods perform better than the accuracy- or
norm-of-weights-based weighting schemes.
Zheng et al. [175] used the RELM for text categorization. They introduced a
three-stage framework for the classification of text. Latent semantic analysis was
first used to reduce the dimensionality of the input patterns. Then the semantic features
were used to train the RELM classifier. Finally, the RELM was evaluated in single-label
(WebKB database) and multi-label (Reuters-21578) text categorization cases against
other popular machine learning algorithms. The classification results indicate that the
RELM performs better than the ELM and BP in most cases and is comparable to the
SVM. However, it was confirmed that the RELM and the ELM achieve much faster
training and classification speeds. The authors suggested that the RBF function and the
triangular basis function be chosen in the SLFN for text categorization applications.
Zhao et al. [76] proposed an XML document classification scheme using the
ELM. The reduced structured vector space model is first used to
generate a feature vector for each XML sample. Then the ELM and a newly developed
voting-ELM are used to learn and classify the sample patterns. The framework of the
voting-ELM is based on the OAO multi-class classification method and voting theory.
Therefore, in the training and testing phases, the voting-ELM requires more time
than the single ELM classifier. In the simulations on 10 datasets, the voting-ELM
consistently outperforms the single ELM classifier in terms of classification
performance.
2.8.8 Control Systems
Neural networks have found many applications in adaptive control schemes that provide
fast responses to changes in the target system. In [176], Rong and Zhao used the ELM
to develop a direct adaptive neural controller for nonlinear systems. The control
framework consists of an ELM-based neural controller and a sliding mode controller.
The ELM-based neural controller model is compensated by the sliding mode controller
to reduce the effects of modelling error and system disturbances. In addition, the output
layer weights of the ELM are updated using stable adaptive laws derived from a
Lyapunov function instead of the typical pseudoinverse. The control scheme
guarantees the stability and convergence of the nonlinear system. In the experiments,
the inverted pendulum was tested using 2 trajectories, with the basic ELM and the ELM
with the Lyapunov-based solution. It was found that only the ELM with the Lyapunov-based
solution consistently converges to the reference asymptotically. Furthermore, the newly
proposed control framework produces similar results for both sigmoid and RBF nodes,
and the control signals avoid the chattering problem.
In [177], Yang et al. developed a neural-network-based self-learning control
strategy for power transmission line de-icing robots. The control of these robots has
been difficult due to the multiple nonlinearities, plant parameter variations, and external
disturbances. The proposed control framework consists of a fuzzy neural network
controller and an OS-ELM identifier. The fuzzy neural network
controllers are updated adaptively, with the OS-ELM used to model the time-varying
plant dynamics and plant parameter variations. From the simulation results, the tracking
error of the new control scheme is seen to be much better than the conventional control
strategies such as the PD controller. The OS-ELM is also seen to perform better than
other online sequential learning schemes such as the RAN, MRAN, and BP. The
experimental results on the actual robot then confirmed that the new control strategy has
fast transient response and good accuracy.
There are still a large number of ELM applications that could not be
included in this brief survey. However, the general perceptions of the performance and
the learning characteristics of the ELM compared to current state-of-the-art machine
learning techniques can be summarized as follows:
(i) ELM-based algorithms are much faster in training and testing, requiring fewer
computational resources.
(ii) ELM-based algorithms are simpler to implement, having fewer meta-parameters
to tune.
(iii) ELM-based algorithms tend to achieve results comparable to the SVM in binary
cases but perform better than the SVM in multi-class cases.
(iv) The additive type of neuron with sigmoidal activation function and the RBF
node are sufficient for the ELM in most applications.
(v) Regularized versions of the ELM are insensitive to the number of neurons; hence
the SLFN is usually initialized with a large number of neurons, e.g., 1000.
(vi) Regularization parameters of the regularized ELM still require tedious search
algorithms to tune.
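Points (v) and (vi) above refer to the ridge-style output-weight solution used by the regularized ELM variants. A minimal sketch of the commonly cited closed form β = (I/C + HᵀH)⁻¹HᵀT is given below; the function name and toy data are assumptions, and C is the regularization parameter that still needs tuning.

```python
import numpy as np

def regularized_elm_weights(H, T, C=1.0):
    """Regularized output weights: beta = (I/C + H^T H)^{-1} H^T T.
    C trades empirical error against the norm of the output weights."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

rng = np.random.default_rng(3)
H = rng.standard_normal((40, 10))   # stand-in hidden layer output matrix
T = rng.standard_normal((40, 1))    # stand-in targets
beta_small_C = regularized_elm_weights(H, T, C=0.01)   # heavy shrinkage
beta_large_C = regularized_elm_weights(H, T, C=1e6)    # near plain least squares
```

Smaller C shrinks the output-weight norm (the structural-risk term), while very large C recovers the unregularized pseudoinverse solution, which is why the regularized variants are insensitive to an over-large neuron count.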
2.9 Conclusion
This chapter has thoroughly reviewed the SLFN-based training algorithm referred to as
the ELM. It is seen that the ELM is a highly effective machine learning algorithm suited
for both binary and multi-class pattern classification. The ELM is able to learn up to
thousands of times faster than the SVM while demonstrating comparable or better
generalization performance. The regularized ELM extensions have an optimization
formulation similar to that of the SVM, and in some cases the ELM achieves a
unified solution with the SVM. In short, the ELM is a promising emerging technology
that is gaining more and more interest from researchers due to its simple but effective
algorithm.
Chapter 3
Finite Impulse Response
Extreme Learning Machine
A robust training algorithm for a class of single hidden layer feedforward neural
networks (SLFNs) with linear nodes and an input tapped-delay-line memory is
developed in this chapter. In order to remove the effects of the input
disturbances and reduce both the structural and empirical risks of the SLFN, the input
weights of the SLFN are assigned such that the hidden layer of the SLFN performs as a
pre-processor, and the output weights are then trained to minimize the weighted sum of
the output error squares as well as the weighted sum of the output weight squares. The
performance of an SLFN-based signal classifier trained with the proposed robust
algorithm is studied in the experiments section to show the effectiveness and efficiency
of the new scheme.
3.1 Introduction
The applications of single hidden layer feedforward neural networks (SLFNs) have been
receiving a great deal of attention in many engineering disciplines. As shown in [66-72],
[76], [158-177], by properly choosing the number of nodes in both the hidden layer and
the output layer and training the input and the output weights, one may use SLFNs for
function approximation, digital signal and image processing, complex system modeling,
adaptive control, data classification and information retrieval. In practical applications,
the techniques for training the weights of SLFNs are very important in order to
guarantee the good performance of SLFNs. The most popular training technique used
for SLFNs is the gradient-based backpropagation (BP) algorithm [1], [8]. It has been
seen that the BP can be easily implemented from the output layer to the hidden layer of
SLFNs in real-time. However, the slow convergence has limited the BP in many
practical applications where fast on-line training is required. In addition, the sensitivity
of the SLFNs, trained using the BP, with respect to the input disturbances and the large
spread of data is another important issue that needs to be further studied by the
researchers and engineers in neural computing.
In [3], [18], [147] a learning algorithm called extreme learning machine (ELM)
for SLFNs is proposed, where the input weights and the hidden layer biases of an SLFN
are randomly assigned, the SLFN is then simply treated as a linear network and the
output weights of the SLFN are then computed by using the generalized inverse of the
hidden layer output matrix. It has been noted that the ELM has an extremely fast
learning speed and produces good performance in many cases. However, the poor
robustness of SLFNs trained with the ELM has been observed when the
SLFNs are used for signal processing to handle noisy data. For instance, as the input
weights and the hidden layer biases are randomly assigned in an SLFN, the changes of
the hidden layer output matrix sometimes are very large because of the effects of the
input disturbances, which also result in significant changes of the output weight matrix
of the SLFN.
In [89] and [90], two modified ELM algorithms are proposed, where the cost
function consists of the sum of the weighted error squares and the sum of the weighted
output weight squares. In terms of the optimization of the cost function in the output
weight space and the proper choice of the weights of the error squares, the structural and
the empirical risks are balanced and reduced. However, the structural and the empirical
risks are not significantly reduced and the robustness property of the trained SLFN is
not significantly improved because of the random assignment of both the input weights
and the hidden layer biases. According to the statistical learning theory [86], [100-105],
significant changes in the output weight matrix will largely increase both the structural
risk and empirical risk of the SLFNs. Therefore, in order to substantially reduce the
structural and empirical risks and improve the robustness property of SLFNs with
respect to the input disturbances, the proper choice of the input weights of SLFNs is
absolutely necessary.
In this chapter, a new robust training algorithm is proposed for a class of SLFNs,
with both linear nodes and an input tapped-delay-line memory, for signal processing
purposes. Since the output of each linear hidden node in the SLFN is the sum of the
weighted input data, each node can be treated as a finite-impulse-response (FIR) filter.
Therefore, the hidden layer with linear nodes can be designed as the pre-processor of
the input data. For instance, based on the FIR filter design techniques in signal
processing [106-109], it is possible to design the hidden layer as a group of low-pass
filters or high-pass filters or band-pass filters or band-stop filters or other types of filters
for the purpose of the pre-processing of the input data with disturbances and undesired
frequency components. The advantages of the hidden layer’s pre-processing function
are that not only can the input disturbances and the undesired frequency components be
removed, but also both the structural and empirical risks of the SLFNs can be greatly
reduced from the viewpoint of the output of the SLFNs.
For the design of the output weight matrix of the SLFNs, in this chapter, an
objective function which includes both the weighted sum of the output error squares and
the weighted sum of the output weight squares of the SLFNs is chosen [1], [89], [90],
[110-112]. By minimizing this objective function in the output weight space, as well as
the proper choice of the input weights based on the FIR filter design techniques, both
the structural and empirical risks can be balanced and reduced for signal processing
purposes. For the comparison with the ELM and the modified ELM algorithms in [89]
and [90], the new training scheme to be developed in this chapter is referred to as the
FIR-ELM algorithm. According to the ELM theory, the hidden nodes used in SLFNs
need not be neuron-like. It is then convenient to refer to the hidden linear nodes, whose
input weights are designed with the FIR filtering techniques in this chapter, as FIR nodes,
which are one type of the many possible hidden nodes mentioned in [3], [23-25], [90],
[93], [94].
It should be emphasized that the SLFNs considered in [3], [23-25], [89], [90],
[93], [94] use non-linear hidden nodes and linear output nodes without dynamics
and, with the proper choice of the output weights, such SLFNs can uniformly
approximate non-linear input-output mappings. However, the class of SLFNs
considered in this chapter uses both linear hidden nodes and linear output nodes.
In order to make such SLFNs have universal approximation capability, an input tapped-
delay-line memory is added to the input layer. It has been shown in [113] that the SLFN
with linear (or non-linear) nodes, as well as an input tapped-delay-line memory, is
capable of approximating the maps that are causal, time invariant and satisfy certain
continuity and approximately-finite-memory conditions. Since, in many cases of signal
processing, the input and output data have some dynamic relationships, it is thus
convenient to train the SLFNs with linear nodes as well as an input tapped-delay-line
memory to perform as signal processors.
The rest of the chapter is organized as follows: In Section 3.2, a class of SLFNs,
with linear nodes and an input tapped-delay-line memory, as signal classifiers are
formulated, and the issues on the empirical and the structural risks, as well as the
robustness property of the SLFNs with respect to the input disturbances, trained with
the ELM algorithm and the modified ELM algorithm in [3], [23-25], [89], [90], [93],
[94], are studied. In Section 3.3, the design of the input weights using FIR filtering
technique, for reducing both the empirical and structural risks, improving the robustness
of the SLFNs with respect to the input disturbances, and removing some undesired
frequency components is presented. In Section 3.4, the design of the output weights by
the minimization of the weighted sum of the output error squares as well as the
weighted sum of the output weight squares of the SLFNs is discussed in detail. In
Section 3.5, the SLFN-based signal classifiers, trained with the ELM in [3], [23-25],
[93], [94], the modified ELM in [89] and the FIR-ELM developed in this chapter are
simulated and compared in order to show the effectiveness of the proposed FIR-ELM
algorithm for signal processing. Section 3.6 gives the conclusions and some further
work.
3.2 Problem Formulation
The architecture of a class of SLFNs with linear hidden nodes and an input tapped-
delay-line memory is presented in Figure 3.1, where the output layer has linear nodes,
the hidden layer has linear nodes, and $D$ is the unit-delay element. The time-delay
elements added to the input of the neural network form the tapped-delay-line memory,
which indicates that the input sequence $x(k), x(k-1), \ldots, x(k-n+1)$ represents a
time series consisting of the present observation $x(k)$ and the $n-1$ past observations
of the process.
Figure 3.1 A single hidden layer neural network with linear nodes
From Figure 3.1, the input data vector $x(k)$ and the output data vector $y(k)$ can be
expressed as follows:

$$x(k) = [x(k), x(k-1), \ldots, x(k-n+1)]^{T} \qquad (3.2.1)$$

$$y(k) = [y_1(k), y_2(k), \ldots, y_m(k)]^{T} \qquad (3.2.2)$$

the output of the $i$th hidden neuron is computed as:

$$h_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) = w_i^{T} x(k) \qquad (3.2.3)$$

with

$$w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^{T} \qquad (3.2.4)$$

and the $s$th output of the neural network, $y_s(k)$, is of the form:

$$y_s(k) = \sum_{i=1}^{\tilde{N}} \beta_{is}\, h_i(k) \qquad (3.2.5)$$

Thus, the output data vector $y(k)$ can be expressed as:

$$y(k) = \sum_{i=1}^{\tilde{N}} \beta_i\, h_i(k) \qquad (3.2.6)$$

with

$$\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^{T} \qquad (3.2.7)$$

In this chapter, $N$ distinct sample signal data vector pairs $(x_l, t_l)$ are used to
train the SLFN given in Figure 3.1, where $x_l = [x_{l1}, x_{l2}, \ldots, x_{ln}]^{T}$ and
$t_l = [t_{l1}, t_{l2}, \ldots, t_{lm}]^{T}$, for $l = 1, 2, \ldots, N$, are the desired input and output
training data vectors, respectively. For the $l$th input data vector $x_l$, the corresponding
neural output vector can be expressed as:

$$y_l = \sum_{i=1}^{\tilde{N}} \beta_i\, h_i(x_l), \quad l = 1, 2, \ldots, N \qquad (3.2.8)$$

and all $N$ equations can then be written in the following matrix form:

$$H\beta = Y \qquad (3.2.9)$$

where

$$H = \begin{bmatrix} h_1(x_1) & \cdots & h_{\tilde{N}}(x_1) \\ \vdots & & \vdots \\ h_1(x_N) & \cdots & h_{\tilde{N}}(x_N) \end{bmatrix}_{N \times \tilde{N}} \qquad (3.2.10)$$

$$\beta = \begin{bmatrix} \beta_1^{T} \\ \vdots \\ \beta_{\tilde{N}}^{T} \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad Y = \begin{bmatrix} y_1^{T} \\ \vdots \\ y_N^{T} \end{bmatrix}_{N \times m} \qquad (3.2.11)$$

Matrix $H$ is called the hidden layer output matrix [3], and the $i$th column of $H$ is the $i$th
hidden neuron output corresponding to the input vectors $x_1, x_2, \ldots, x_N$.
Remark 3.2.1: It is seen from [3], [23-25], [93], [94] that, as the ELM is used to train an
SLFN, the input weights and the biases of the hidden layer of the SLFN are randomly
assigned. The SLFN is then treated as a linear network and the output weight matrix of
the SLFN is computed using the generalized inverse of the hidden layer output matrix as
follows:
$$\hat{\beta} = H^{\dagger} T \qquad (3.2.12)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$, and $T$ is the
desired output data matrix, expressed as:

$$T = \begin{bmatrix} t_1^{T} \\ \vdots \\ t_N^{T} \end{bmatrix}_{N \times m} \qquad (3.2.13)$$
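As a concrete illustration, the ELM computation described in Remark 3.2.1 can be sketched in a few lines of NumPy; the data sizes, random ranges, and names below are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

# Sketch of the basic ELM step for an SLFN with linear hidden nodes.
# All sizes and data here are illustrative assumptions.
rng = np.random.default_rng(0)

N, n, n_hidden, m = 100, 8, 5, 1                # samples, inputs, hidden nodes, outputs
X = rng.uniform(-1.0, 1.0, size=(N, n))         # input data vectors x_l as rows
T = rng.uniform(-1.0, 1.0, size=(N, m))         # desired output data matrix T

W = rng.uniform(-1.0, 1.0, size=(n_hidden, n))  # randomly assigned input weights
H = X @ W.T                                     # hidden layer output matrix (N x n_hidden)
beta = np.linalg.pinv(H) @ T                    # beta = H† T (Moore-Penrose inverse)

residual = H @ beta - T                         # least-squares training error
```

Because the pseudoinverse gives the least-squares solution, `beta` satisfies the normal equations, i.e. the residual is orthogonal to the columns of `H`.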
However, it has been noted that, when the input weights of an SLFN are
randomly chosen, both the empirical risk and the structural risk of the SLFN are greatly
increased. This undesired characteristic of the SLFN with the ELM can be clearly seen
from Figure 3.2 and Figure 3.3, respectively, where a single-output SLFN with 5 hidden
neurons, trained with the ELM, is used to approximate a straight line within
the range $15 \le x \le 21$. As the input signal, disturbed by a random noise, is fed to the
trained SLFN, the error between the output of the neural network and the straight line
is sometimes very large. Figure 3.4 and Figure
3.5 show the simulation results where the SLFN is trained with the modified ELM in
[89]. It is seen that, although the performance is slightly improved because the
structural and the empirical risks are balanced at the output of the SLFN, the error
between the output of the neural network and the straight line is not significantly
reduced because of the random choice of the input weights of the SLFN.
Figure 3.2 Output of the SLFN with the ELM
Figure 3.3 Output error of the SLFN with the ELM
Figure 3.4 Output of the SLFN with the modified ELM
Figure 3.5 Output error of the SLFN with the modified ELM
On the contrary, when the input weights of the SLFN in Figure 3.1 are assigned
in such a way that each hidden node performs as a linear-phase low-pass FIR filter
and the output weights are computed based on the modified ELM in [89], with the same
training data as in Figure 3.2 and Figure 3.3 (or Figure 3.4 and Figure 3.5), the
robustness of the SLFN with respect to the input disturbances has been greatly
improved and at the same time, the structural risk and the empirical risk are well
balanced and reduced, as seen in Figure 3.6 and Figure 3.7, respectively.
Figure 3.6 Output of the SLFN with the FIR hidden nodes
Figure 3.7 Output error of the SLFN with the FIR hidden nodes
Therefore, it is necessary to properly assign the input weights in order to
improve the robustness property with respect to disturbances and reduce both structural
and empirical risks of SLFNs.
In the following sections, a new robust training algorithm for a class of SLFNs
in Figure 3.1 will be developed. The weight training of the SLFN is divided into two
steps: First, the input weights of the SLFN are designed off-line in the sense that every
hidden node performs as a linear FIR filter and the whole hidden layer plays the role of
a pre-processor of input data to remove the effects of the input disturbances and
significantly reduce the structural and empirical risks. Then the output weights of the
SLFN are designed to minimize the output error of the SLFN and further balance and
reduce the effects of the empirical and the structural risks.
3.3 Design of the Robust Input Weights of SLFNs
By rewriting (3.2.3), the output of the $i$th hidden node, as follows:

$$h_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) \qquad (3.3.1)$$

it is seen that (3.3.1) has the typical structure of an FIR filter [106-109], where the input
weight set $\{w_{i1}, w_{i2}, \ldots, w_{in}\}$ can be treated as the set of the filter coefficients, or the
impulse response coefficients of the filter, and the output $h_i(k)$ is the result of the
convolution sum of the filter impulse response and the input (time series) data; the filter
length is equal to the number $n$ of the input data of the neural network. According to the
signal processing theory in [106-109], if the elements of the input weight vector $w_i$ are
chosen to be symmetrical, that is,

$$w_{ij} = w_{i(n-j+1)}, \quad j = 1, 2, \ldots, \lfloor n/2 \rfloor \qquad (3.3.2)$$

then (3.3.1) is a non-recursive linear-phase FIR filter, which has the advantages that all
outputs of the hidden nodes are stable because of the absence of poles, and that the
finite-precision errors are less severe than in other filter types.
Remark 3.3.1: Without loss of the generality, in the following, only how each hidden
node is designed to perform the function of a low-pass filter is considered. The similar
design methods from the signal processing in [106-109] can be used for designing the
hidden nodes as high-pass filters or band-pass filters or band-stop filters or other types
of filters for the purpose of the pre-processing of the input data to remove the effects of
the input disturbances and the undesired frequency components.
Remark 3.3.2: For practical application, Matlab can be used to develop a look-up table
containing all possible sets of input weights of the SLFN with the characteristics of low-
pass filter, high-pass filter, band-pass filter and other specified filters, respectively.
Based on some observation and understanding of the frequency-spectrum of the input
data, it is then possible to determine what frequency components should be eliminated
or retained, and then a proper set of parameters can be chosen from the look-up table
and assigned to the input weights of the SLFN.
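The look-up-table idea of Remark 3.3.2 can be sketched as follows. The high-pass and band-pass constructions below (spectral inversion and low-pass differencing) are standard window-method techniques from signal processing rather than designs stated in the thesis, and the lengths and cutoffs are illustrative assumptions.

```python
import numpy as np

# Sketch of a look-up table of candidate input-weight sets (Remark 3.3.2).
# Lengths and cutoffs are illustrative; all designs use the rectangular window.
def lowpass(n, wc):
    m = np.arange(n) - (n - 1) / 2.0
    return (wc / np.pi) * np.sinc(wc * m / np.pi)   # sin(wc*m)/(pi*m)

def highpass(n, wc):
    h = -lowpass(n, wc)          # spectral inversion of the low-pass design
    h[(n - 1) // 2] += 1.0       # add a unit impulse at the centre tap (odd n)
    return h

def bandpass(n, w1, w2):
    return lowpass(n, w2) - lowpass(n, w1)   # difference of two low-pass designs

n = 101
weight_table = {
    "lowpass_0.2pi": lowpass(n, 0.2 * np.pi),
    "highpass_0.5pi": highpass(n, 0.5 * np.pi),
    "bandpass_0.2-0.4pi": bandpass(n, 0.2 * np.pi, 0.4 * np.pi),
}
```

A set of weights can then be looked up by the desired pass-band and assigned directly to a hidden node, exactly as the remark suggests.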
Suppose that the desired frequency response of the $i$th hidden node of the SLFN
can be represented by the following discrete-time Fourier transform (DTFT):

$$H_{di}(\omega) = \sum_{k=-\infty}^{\infty} h_d[k]\, e^{-j\omega k} \qquad (3.3.3)$$

where $h_d[k]$ is the corresponding impulse response in the time domain, which can be
expressed as:

$$h_d[k] = \frac{1}{2\pi} \int_{-\pi}^{\pi} H_{di}(\omega)\, e^{j\omega k}\, d\omega \qquad (3.3.4)$$

It is noted that the unit sample response $h_d[k]$ in (3.3.4) is infinite in duration and must
be truncated at the point $k = n-1$, in order to yield an FIR filter of length $n$. Truncation of
$h_d[k]$ to a length $n$ is equivalent to multiplying $h_d[k]$ by a rectangular window
$w[k]$ defined as:

$$w[k] = \begin{cases} 1, & 0 \le k \le n-1 \\ 0, & \text{otherwise} \end{cases} \qquad (3.3.5)$$
Since the Fourier transform of $w[k]$ is given by

$$W(\omega) = \sum_{k=0}^{n-1} e^{-j\omega k} = e^{-j\omega(n-1)/2}\, \frac{\sin(\omega n/2)}{\sin(\omega/2)} \qquad (3.3.6)$$

the frequency response of the truncated FIR filter can then be computed by using the
following convolution:

$$H_i(\omega) = \frac{1}{2\pi} \int_{-\pi}^{\pi} H_{di}(\nu)\, W(\omega - \nu)\, d\nu \qquad (3.3.7)$$

If it is desired that the $i$th hidden node in the SLFN performs as a low-pass filter with
the following desired frequency response:

$$H_{di}(\omega) = \begin{cases} e^{-j\omega(n-1)/2}, & |\omega| \le \omega_c \\ 0, & \omega_c < |\omega| \le \pi \end{cases} \qquad (3.3.8)$$

where $\omega_c$ is the cut-off frequency of the low-pass filter, separating the low frequency
pass-band and the high frequency stop-band, the impulse response of the truncated low-
pass filter can be obtained using (3.3.7) as follows:

$$h[k] = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} e^{-j\omega(n-1)/2}\, e^{j\omega k}\, d\omega = \frac{\sin\!\left(\omega_c\left(k - \frac{n-1}{2}\right)\right)}{\pi\left(k - \frac{n-1}{2}\right)}, \quad 0 \le k \le n-1 \qquad (3.3.9)$$

It is easy to see from (3.3.9) that

$$h[k] = h[n-1-k] \qquad (3.3.10)$$

Then, the weights for the $i$th hidden node can be obtained as follows:

$$w_{i1} = h[0], \quad w_{i2} = h[1], \quad \ldots, \quad w_{in} = h[n-1] \qquad (3.3.11)$$
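As a sketch in Python, the weight assignment (3.3.9)~(3.3.11) for one hidden node can be written as follows; the filter length and cutoff are illustrative assumptions.

```python
import numpy as np

# Rectangular-window low-pass FIR design of (3.3.9): a sketch, with the
# filter length n and cutoff wc chosen purely for illustration.
def lowpass_fir_weights(n, wc):
    """Length-n linear-phase low-pass impulse response, cutoff wc in rad/sample."""
    m = np.arange(n) - (n - 1) / 2.0
    # np.sinc(x) = sin(pi*x)/(pi*x), so this equals sin(wc*m)/(pi*m),
    # with the centre tap (m = 0, odd n) taking its limit value wc/pi.
    return (wc / np.pi) * np.sinc(wc * m / np.pi)

w_i = lowpass_fir_weights(101, 0.2 * np.pi)   # weights of the i-th hidden node
```

The returned coefficients satisfy the symmetry (3.3.10), $h[k] = h[n-1-k]$, so the node is a linear-phase filter, and their sum (the DC gain) is close to one for a sufficiently long filter.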
Remark 3.3.3: In the above, the problem of how the rectangular window method can be
used to design the input weights so that the hidden layer of the SLFN performs as a pre-
processor to filter away the high frequency disturbances added to the input data and
significantly reduce both the structural and empirical risks of the SLFN is considered. In
fact, many other window methods, such as Kaiser method, Hamming method and the
Bartlett method in signal processing, can also be used for designing the input weights so
that the hidden layer functions as the pre-processor of the input data for the purpose of
removing the input disturbance.
Remark 3.3.4: In order to explain why the proper choice of the input weights can
significantly reduce both the structural and the empirical risks of the SLFN, the
optimization of the output weight matrix $\beta$ of the SLFN in Figure 3.1 is first examined,
where $\beta$ is computed using the standard least squares method with the estimate given in
(3.2.12). It is clearly seen that, in the optimization process for determining the optimal
value of $\beta$ in the output weight parameter space, the sensitivity of $\beta$ with respect
to changes of the hidden layer output matrix $H$ is not considered. In fact, a very
sensitive $\beta$ may bring a very high structural risk of the SLFN when the input data is
disturbed by various noises.
For further analysis, let $\Delta H$ represent the change of the hidden layer output
matrix $H$, caused by some input disturbance or noise, and let the corresponding change of
the output weight matrix be represented by $\Delta\beta$. Then, from (3.2.9), the equation
below is obtained:

$$(H + \Delta H)(\beta + \Delta\beta) = T \qquad (3.3.12)$$

Multiplying out the left hand side of (3.3.12) and subtracting $H\beta = T$ on both sides, the
equation becomes

$$H\Delta\beta + \Delta H(\beta + \Delta\beta) = 0 \qquad (3.3.13)$$

and then

$$\Delta\beta = -H^{\dagger}\left(\Delta H(\beta + \Delta\beta)\right) \qquad (3.3.14)$$

where $H^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $H$.

From (3.3.14), the following inequality can be obtained:

$$\|\Delta\beta\| \le \|H^{\dagger}\|\,\|\Delta H\|\,\|\beta + \Delta\beta\| \qquad (3.3.15)$$

and the sensitivity of the output weight matrix $\beta$ can be obtained as follows:

$$\frac{\|\Delta\beta\|}{\|\beta + \Delta\beta\|} \le \|H^{\dagger}\|\,\|\Delta H\| = \|H\|\,\|H^{\dagger}\|\,\frac{\|\Delta H\|}{\|H\|} \qquad (3.3.16)$$

Similar to the definition of the condition number of a square matrix in [1], the
generalized condition number of the hidden layer output matrix $H$ is defined as follows:

$$K(H) = \|H\|\,\|H^{\dagger}\| \qquad (3.3.17)$$

The sensitivity of the output weight matrix $\beta$ can then be expressed as:

$$\frac{\|\Delta\beta\|}{\|\beta + \Delta\beta\|} \le K(H)\,\frac{\|\Delta H\|}{\|H\|} \qquad (3.3.18)$$

It is seen from (3.3.18) that if, in the ideal case, all high frequency disturbance
components can be removed by the hidden nodes with the characteristics of low-pass
FIR filters, the change of the hidden layer output matrix, caused by the high
frequency noise components, is reduced to zero, that is, $\|\Delta H\| = 0$. Then the sensitivity
of the output weight matrix $\beta$ is reduced to zero.

However, it should be noted that, in practice, a well-designed real-time FIR
filter can remove most disturbance components of the input signal and make the value
of $\|\Delta H\|$ very small (but $\|\Delta H\| \ne 0$), and thus the output weight matrix $\beta$ is
insensitive to the changes of the hidden layer output matrix $H$.
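As an illustrative check, the generalized condition number of (3.3.17) and the relative change of the output weights under a small perturbation can be computed directly; the random matrices below are stand-ins for $H$ and $T$, assumed only for demonstration.

```python
import numpy as np

# Computing the generalized condition number K(H) = ||H|| * ||H†|| from
# (3.3.17); H and T here are random stand-ins, used only for illustration.
rng = np.random.default_rng(1)
H = rng.normal(size=(50, 10))                 # hidden layer output matrix
T = rng.normal(size=(50, 1))                  # desired output data matrix

H_pinv = np.linalg.pinv(H)
K = np.linalg.norm(H, 2) * np.linalg.norm(H_pinv, 2)

# Relative change of the output weights under a small perturbation of H
dH = 1e-3 * rng.normal(size=H.shape)
beta = H_pinv @ T
dbeta = np.linalg.pinv(H + dH) @ T - beta
rel_change = np.linalg.norm(dbeta) / np.linalg.norm(beta + dbeta)
```

For any non-zero $H$, $K(H) \ge 1$, since $\|H\|\|H^{\dagger}\| \ge \|HH^{\dagger}\| = 1$; a small $K(H)$ indicates that $\beta$ is insensitive to perturbations of $H$.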
Remark 3.3.5: In fact, not only the input disturbance increases the sensitivity of the
output weight matrix in (3.3.18), but also the randomly assigned input weights in both
the ELM and the modified ELM in [3], [23-25], [89], [90], [93], [94] largely increase
the sensitivity of the output weight matrix . This is because the random choice of the
input weights in SLFNs may result in a significant change of the hidden layer output
matrix . A similar analysis as in (3.3.12)~(3.3.18) can be done.
Remark 3.3.6: It is noted that, if the hidden layer output matrix $H$ is square, the
definition of the generalized condition number in (3.3.17) is the same as the ordinary
one defined in [1]. Although the condition number of a non-square matrix has been
defined in terms of its singular values, as seen in [1], the assumption that $H$ is of full
rank is not always valid for the SLFNs, because, in many cases, some singular values of
the matrix $H$ are zero. Therefore, the definition of the generalized condition number of
the matrix $H$ in (3.3.17) can find a wide application in signal processing and neural
computing.
3.4 Design of the Robust Output Weight Matrix
In this section, the non-linear optimization technique is used to design the optimal
output weight matrix such that the structural and the empirical risks are well balanced
and the effects of the structural and the empirical risks are further reduced. For this
purpose, the optimization problem is stated as follows [1], [89], [90], [110-112]:
Minimize: $\dfrac{d}{2}\|E\|^{2} + \dfrac{\gamma}{2}\|\beta\|^{2} \qquad (3.4.1)$

Subject to: $H\beta = T - E \qquad (3.4.2)$

This problem can be solved conveniently by the method of Lagrange multipliers. For
this, construct the Lagrange function as:

$$L = \frac{d}{2}\sum_{l=1}^{N}\sum_{s=1}^{m} e_{ls}^{2} + \frac{\gamma}{2}\sum_{i=1}^{\tilde{N}}\sum_{s=1}^{m} \beta_{is}^{2} + \sum_{l=1}^{N}\sum_{s=1}^{m} \lambda_{ls}\left(t_{ls} - e_{ls} - h_{l}\beta_{s}\right) \qquad (3.4.3)$$

where $e_{ls}$ is the $(l,s)$th element of the error matrix $E$, $\beta_{is}$ is the $(i,s)$th element of the
output weight matrix $\beta$, $t_{ls}$ is the $(l,s)$th element of the output data matrix $T$, $h_{l}$ is the
$l$th row of the hidden layer output matrix $H$, $\beta_{s}$ is the $s$th column of the output weight
matrix $\beta$, $\lambda_{ls}$ is the $(l,s)$th Lagrange multiplier, and $d$ and $\gamma$ are constant parameters
used to adjust the balance of the structural risk and the empirical risk.

Remark 3.4.1: It is noted from the cost function defined in (3.4.1) that, compared with
the modified ELM algorithm in [89], the norm of the output weight matrix has been
weighted by the constant $\gamma$. This is because the structural risk and the empirical risk of the
SLFN can then be easily balanced and reduced through the adjustment of the values of $d$ and
$\gamma$ in the optimization process. The effects of the ratio $d/\gamma$ on the performance of the SLFN
will be discussed in the simulation section.

Differentiating $L$ in (3.4.3) with respect to $\beta_{is}$ and setting the result to zero, the
equation becomes

$$\frac{\partial L}{\partial \beta_{is}} = \gamma\beta_{is} - \sum_{l=1}^{N}\lambda_{ls}\, h_{li} = 0 \qquad (3.4.4)$$

Since

$$h_{l}\beta_{s} = \sum_{i=1}^{\tilde{N}} h_{li}\,\beta_{is} \qquad (3.4.5)$$

where $h_{li} = h_{i}(x_{l})$ is the $(l,i)$th element of $H$, (3.4.4) can be written for each column
of $\beta$ as

$$\gamma\beta_{s} = H^{T}\lambda_{s}, \quad s = 1, 2, \ldots, m \qquad (3.4.6)$$

Letting

$$\Lambda = [\lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}], \quad \lambda_{s} = [\lambda_{1s}, \lambda_{2s}, \ldots, \lambda_{Ns}]^{T} \qquad (3.4.7)$$

and using (3.4.6),

$$\gamma[\beta_{1}, \beta_{2}, \ldots, \beta_{m}] = H^{T}[\lambda_{1}, \lambda_{2}, \ldots, \lambda_{m}] \qquad (3.4.8)$$

Thus

$$\gamma\beta = H^{T}\Lambda \qquad (3.4.9)$$

or

$$\beta = \frac{1}{\gamma}H^{T}\Lambda \qquad (3.4.10)$$

In addition, differentiating $L$ with respect to $e_{ls}$,

$$\frac{\partial L}{\partial e_{ls}} = d\, e_{ls} - \lambda_{ls} = 0 \qquad (3.4.11)$$

It is then possible to obtain the following relationship:

$$\Lambda = d\, E \qquad (3.4.12)$$

or

$$E = \frac{1}{d}\Lambda \qquad (3.4.13)$$

Considering the constraint in (3.4.2), (3.4.13) can be expressed as:

$$\Lambda = d\left(T - H\beta\right) \qquad (3.4.14)$$

and using (3.4.14) in (3.4.10) leads to

$$\gamma\beta = d\, H^{T}\left(T - H\beta\right) \qquad (3.4.15)$$

Then, the output layer weight matrix is derived as follows:

$$\beta = \left(H^{T}H + \frac{\gamma}{d}I\right)^{-1}H^{T}T \qquad (3.4.16)$$
Remark 3.4.2: Based on the discussions in Section 3.3 and Section 3.4, the proposed
FIR-ELM algorithm can be summarized as follows:
Step 1: Assign the input weights according to (3.3.9) ~ (3.3.11);
Step 2: Calculate the hidden layer output matrix using (3.2.10);
Step 3: Calculate the output weight matrix based on (3.4.16).
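The three steps above can be sketched end-to-end as follows; the per-node cutoffs, the data shapes, and the values of d and γ are illustrative assumptions, not the thesis settings.

```python
import numpy as np

# End-to-end sketch of the FIR-ELM training steps (illustrative settings).
def fir_elm_train(X, T, n_hidden, d=1.0, gamma=0.9):
    """X: (N, n) inputs, T: (N, m) targets; returns input weights W and beta."""
    N, n = X.shape
    m_idx = np.arange(n) - (n - 1) / 2.0
    # Step 1: each hidden node is a linear-phase low-pass FIR filter; here
    # the nodes are given different cutoffs so their responses differ.
    cutoffs = np.linspace(0.1 * np.pi, 0.6 * np.pi, n_hidden)
    W = np.stack([(wc / np.pi) * np.sinc(wc * m_idx / np.pi) for wc in cutoffs])
    # Step 2: hidden layer output matrix, as in (3.2.10)
    H = X @ W.T
    # Step 3: regularized output weights, beta = (H^T H + (gamma/d) I)^(-1) H^T T
    A = H.T @ H + (gamma / d) * np.eye(n_hidden)
    return W, np.linalg.solve(A, H.T @ T)

rng = np.random.default_rng(2)
X = rng.normal(size=(40, 32))                 # 40 training vectors of length 32
T = rng.normal(size=(40, 1))
W, beta = fir_elm_train(X, T, n_hidden=20)
```

Unlike the ELM, no pseudoinverse of a random-weight matrix is needed: the regularizing term $(\gamma/d)I$ keeps the system well conditioned even when the hidden responses are similar.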
3.5 Experiments and Results
To illustrate the FIR-ELM algorithm proposed in this chapter, consider an SLFN-based
classifier, with 20 hidden linear nodes, one output linear node and an input tapped-
delay-line memory, to classify the computer-generated low frequency tones. The input
data vectors to the SLFN are computer-generated low frequency sound clips with
frequencies 100 Hz, 150 Hz, 200 Hz, 300 Hz, …, 900 Hz, respectively, and each of
which is 0.5 seconds long, modulated by an envelope function to create a tone. Figure
3.8 shows the 700 Hz sound clip modulated by the envelope function. The desired
output reference values of the SLFN, which provide the desired values of the signal
classifier states for all input data vectors, are generated by a sine function evaluated at
equally spaced points.
The training data are 10 pairs of the input sound clip data vectors and the
corresponding desired output classifier states. The SLFN is then trained with the ELM
in [3], [23-25], [93], [94], the modified ELM in [89] and the FIR-ELM proposed in this
chapter, respectively. To examine the robustness performances of the SLFN-based
signal classifier trained with the above algorithms, a random noise is added to all sound
clips, as seen in Figure 3.9.
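A clip of the kind described above can be generated, for instance, as follows; the sampling rate, the envelope shape, and the noise level are illustrative assumptions, since the exact thesis settings are not fully specified here.

```python
import numpy as np

# Illustrative generation of a 0.5-second 700 Hz tone with an envelope and
# additive random noise; fs, the envelope, and the noise level are assumptions.
rng = np.random.default_rng(3)
fs = 8000                                      # assumed sampling rate (Hz)
t = np.arange(0.0, 0.5, 1.0 / fs)              # 0.5 seconds of samples
envelope = np.sin(2 * np.pi * t)               # rises to 1 at 0.25 s, back to 0 at 0.5 s
clip = envelope * np.sin(2 * np.pi * 700 * t)  # clean 700 Hz tone
noisy_clip = clip + 0.2 * rng.standard_normal(len(t))
```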
Figure 3.8 A sound clip modulated by the envelope function
Figure 3.9 The disturbed sound clip
Figure 3.10 shows the classification result of the SLFN-based signal classifier
trained with the ELM algorithm, where the simulation is repeated 50 times, showing the
accuracy of the classification, and Figure 3.11 shows the corresponding RMSEs. It is
clearly seen that, as the input weights of the SLFN are arbitrarily assigned with the
ELM in [3], [23-25], [93], [94], both the structural risk and empirical risk of the SLFN
are very high and the effects of the input noise cannot be reduced through the
minimization of the output weights.
Figure 3.10 Signal classification with the ELM algorithm
Figure 3.11 The RMSE with the ELM algorithm
Figure 3.12 and Figure 3.13 show the classification result and the RMSE,
respectively, of the SLFN-based signal classifier trained with the modified ELM
algorithm in [89], with the weighting constant 0.9 for adjusting the balance of the
structural risk and the empirical risk. Since the input weights of the SLFN are arbitrarily
assigned, the high structural and empirical risks result in very poor robustness with
respect to the input disturbances and no noticeable performance improvement is seen,
compared with Figure 3.10 and Figure 3.11 with the ELM algorithm.
Figure 3.12 Signal classification with the modified ELM algorithm
Figure 3.13 The RMSE with the modified ELM algorithm
Figure 3.14 and Figure 3.15 show the classification result and the RMSE,
respectively, of the SLFN-based signal classifier trained with the FIR-ELM algorithm
proposed in this chapter, where the input weights of the SLFN are assigned by using the
rectangular window method with a window length of 1001 and cut-off frequencies of
50 Hz and 1.2 kHz, and the constant parameters $d = 0.9$ and $\gamma = 1$, respectively, to
balance and reduce the structural and empirical risks. It is seen that, compared with the
classification results in Figure 3.10 and Figure 3.11, trained with the ELM algorithm, and
in Figure 3.12 and Figure 3.13, trained with the modified ELM algorithm, the accuracy of
the classification and the robustness of the SLFN with respect to the input noise have been
significantly improved, due to the fact that the input weights are assigned using the
rectangular window low-pass filtering technique and both the structural and the
empirical risks have been greatly reduced.
Figure 3.14 Signal classification with the FIR-ELM algorithm
with the rectangular window
Figure 3.15 The RMSE with the FIR-ELM algorithm with the rectangular window
Figure 3.16 and Figure 3.17 show the classification result and the corresponding
RMSE of the SLFN trained with the FIR-ELM algorithm, where the input weights of
the SLFN are assigned based on the Kaiser window method [106-109], with a window
length of 1001, cut-off frequencies of 50 Hz and 1.2 kHz, a transition width of 0.5 Hz,
and a pass-band ripple of 0.1 dB, and the constant parameters $d$ and $\gamma$ are chosen the
same as in Figure 3.14 and Figure 3.15. It is seen that the accuracy of the classification
and the robustness of the SLFN with respect to the input disturbance are even better
than the ones in Figure 3.14 and Figure 3.15, trained with the rectangular window
method. This is because, in practice, the low-pass filters designed with the Kaiser window
method have better filtering performance than those designed with the rectangular
window method, as is commonly known in signal processing.
Figure 3.16 Signal classification with the FIR-ELM algorithm
with the Kaiser window
Figure 3.17 The RMSE with the FIR-ELM algorithm with the Kaiser window
Table 3.1 shows the comparison results of the average RMSEs of the SLFN
trained with the FIR-ELM (rectangular), the FIR-ELM (Kaiser), and the ELM
algorithms, respectively, as the number of the hidden neurons is increased from 20 to 50:
Table 3.1: Comparison of averaged RMSE for the sound clip recognition experiment

Neurons   FIR-SLFN (Rectangular)   FIR-SLFN (Kaiser)        ELM                      Modified ELM
          low noise   high noise   low noise   high noise   low noise   high noise   low noise   high noise
20        0.0067      0.0335       0.0056      0.0287       0.0667      0.2206       0.0408      0.2100
30        0.0055      0.0324       0.0047      0.0281       0.0472      0.2199       0.0373      0.1998
40        0.0049      0.0285       0.0040      0.0219       0.0454      0.2161       0.0328      0.1804
50        0.0049      0.0171       0.0030      0.0214       0.0352      0.1851       0.0283      0.1688
It is seen that, as the number of the hidden neurons is increased, the performance
of the SLFN with the FIR-ELM has been greatly improved and the SLFN with the FIR-
ELM has demonstrated highly desirable characteristics for real-world signal processing
applications, compared with the SLFN with the ELM and the modified ELM.
In order to analyse the effect of the ratio $d/\gamma$ in (3.4.16) on the performance of
the SLFNs trained with the FIR-ELM algorithm, Figure 3.18 shows the RMSE versus
$d/\gamma$ for the SLFN with 20 hidden linear nodes, a single output linear node and an input
tapped-delay-line memory, where $d/\gamma$ is changed from 0.1 to 10 with a step size of 0.1.
Without loss of generality, more than 20 different groups of input and output training
data have been used for deriving Figure 3.18. It is seen that the RMSE tends to be
reduced as the value of $d/\gamma$ is increased. However, a large value of $d/\gamma$ will
narrow the output range of the neural classifier and thus some output states of the
neural classifier may be lost in the signal classification process. According to our
experience, it is reasonable to choose the value of $d/\gamma$ between 0.7 and 2.0, as in the
simulation results shown in Figure 3.14 ~ Figure 3.17 and Table 3.1.
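The trend in Figure 3.18 can be reproduced qualitatively on synthetic data: with the solution (3.4.16), the training RMSE is non-increasing as $d/\gamma$ grows. The matrices below are random stand-ins, not the thesis data.

```python
import numpy as np

# Effect of the ratio d/gamma on training RMSE for the solution (3.4.16),
# demonstrated on random stand-in data.
rng = np.random.default_rng(4)
H = rng.normal(size=(60, 20))                 # hidden layer output matrix
T = rng.normal(size=(60, 1))                  # desired outputs

rmses = []
for ratio in (0.1, 0.5, 1.0, 2.0, 10.0):      # ratio = d / gamma
    A = H.T @ H + (1.0 / ratio) * np.eye(20)  # regularizer gamma/d = 1/ratio
    beta = np.linalg.solve(A, H.T @ T)
    rmses.append(np.sqrt(np.mean((H @ beta - T) ** 2)))
```

A larger $d/\gamma$ weakens the $\|\beta\|^2$ penalty, so the training error shrinks, consistent with the figure; as noted above, too large a ratio can be undesirable in practice.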
In addition, the simulation of the SLFN classifier with the non-linear sigmoid
hidden nodes trained with the FIR-ELM with the Kaiser window is also presented here.
It has been noted that some non-linear sigmoid hidden nodes can help to reduce the
effects of the disturbances and lower both the structural and empirical risks compared
with the ELM and the modified ELM, as seen in Figure 3.19 and Figure 3.20, where the
non-linear sigmoid function has a large linear region as shown in Figure 3.21. However,
some opposite results have been seen as in Figure 3.22 and Figure 3.23 where the non-
linear sigmoid function has a narrower linear region as seen in Figure 3.24.
Figure 3.18 The RMSE versus d/γ for the FIR-ELM algorithm
Figure 3.19 Signal classification using the SLFN classifier with the non-linear hidden
nodes and trained with the FIR-ELM algorithm
Figure 3.20 The RMSE of the SLFN classifier with the non-linear hidden nodes and
trained with the FIR-ELM algorithm
Figure 3.21 The non-linear sigmoid function with a large linear region
Figure 3.22 Signal classification using the SLFN classifier with the non-linear hidden
nodes and trained with the FIR-ELM algorithm
Figure 3.23 The RMSE of the SLFN classifier with the non-linear hidden nodes and
trained with the FIR-ELM algorithm
Figure 3.24 The non-linear sigmoid function with a narrow linear region
3.6 Conclusion
In this chapter, a robust training algorithm, the FIR-ELM, for a class of SLFNs with
linear nodes and an input tapped-delay-line memory has been developed. It has been
seen that the linear FIR filtering technique has been successfully used to design the
input weights, which makes the hidden layer of the SLFN perform as a pre-processor of
the input data, to remove the input disturbance and the undesired signal components.
Then, the non-linear optimization technique has been used to design the optimal output
weight matrix to minimize the output error and balance and reduce the empirical and
structural risks of the SLFNs. The simulation results have shown the excellent
robustness performance of the SLFNs with the proposed FIR-ELM training algorithm
under noisy environments. Further work is to consider SLFNs with non-linear nodes
and to combine the FIR filtering technique with the pole-placement-like method in
[114-117], in order to enhance the FIR-ELM algorithm with application to signal
processing, image processing and robust control.
Chapter 4
Classification of Bioinformatics Datasets
with Finite Impulse Response Extreme
Learning Machine for Cancer Diagnosis
In this chapter, the classification of the two binary bioinformatics datasets, leukemia and
colon tumor, is further studied by using the neural network based finite impulse
response extreme learning machine (FIR-ELM) developed in Chapter 3. It is seen that a
time series analysis of the microarray samples is first performed to determine the
filtering properties of the hidden layer of the neural classifier with FIR-ELM for feature
identification. The linear separability of the data patterns in the microarray datasets is
then studied. For improving the robustness of the neural classifier against noise and
errors, a frequency-domain gene feature selection (FGFS) algorithm is also proposed. It
is shown in the simulation results that the FIR-ELM algorithm has an excellent
performance for the classification of bioinformatics data in comparison with many
existing classification algorithms.
4.1 Introduction
The analysis of cancer diagnosis data is one of the most important research fields in
medical science and bioengineering [118], [119]. As the complete treatments of cancers
have not been achieved, early diagnosis plays an important role for doctors to help
patients control the metastatic cancer growth. Recently, the microarray gene expression
datasets, consisting of thousands of gene expressions that can be used for the molecular
diagnosis of cancer, have been established [119], [120]. The sample vectors in these
datasets contain a large amount of information about the origin and development of
cancers. However, due to the fact that all of these data contain different levels of noises
and measurement errors from sampling processes, it is difficult for the existing
classification techniques to accurately identify the patterns from the samples [120]. In
order to use the existing classification techniques for pattern classification of the
microarray gene expression datasets, a gene selection algorithm is usually implemented
to reduce the effects of the noise, error, and the high dimensionality of samples [121-
123]. It has been shown in [118] that the gene selection process can help to improve the
performance of the classifier by identifying representative gene expression subsets.
Some of the popular gene selection algorithms include the principal component analysis
(PCA) [121], the singular value decomposition (SVD) [122], the independent
component analysis (ICA) [123], genetic algorithm (GA) [124], and the recursive
feature elimination method [69], [125].
Other related works on the classifiers are the GA-based cancer detection system
proposed in [124], the support vector machine (SVM) in [19], [125], [126] and the
extreme learning machine (ELM) in [66], [70], [127], [128]. In [124], the GA gene
selection algorithm is applied to the microarray dataset to obtain the most informative
gene subsets that are classified using the multilayer perceptron (MLP). In [19], [125],
[126], the support vector machine first maps the input space into the high dimensional
feature space, and then constructs an optimal hyperplane using selected support vectors
to separate the classes. In [66], [70], [127], [128], the ELM simplifies model selection
by randomly generating the hidden layer weights and biases, and performs very fast
batch training using the Moore-Penrose pseudoinverse to deterministically obtain the
output weights. Most importantly, the ELM performs well in several cancer diagnosis
applications [66], [70], [127], [128]. However, all the algorithms mentioned above
are used for the classification of a small subset of genes that does not sufficiently
represent the whole process of the origin and development of cancers in general
[119]. In addition, the robustness issues with respect to noises and disturbances are not
discussed in detail.
Please note that there has not been a standard guideline in the biomedical
industry for producing microarrays, and the microarrays produced by different labs may
contain different profiles of error or different sets of genes. Furthermore, even samples
taken from the same lab may sometimes contain different compositions of cells, which
may bias the accuracy of a classifier [119], [120]. Therefore, the motivation of this
chapter is to develop a general microarray classification technique that is capable of
classifying a variety of genes based on the finite impulse response extreme learning
machine (FIR-ELM) developed by Man et al. [9]. The FIR-ELM algorithm implements
a single hidden layer feedforward neural network (SLFN) as the classifier where the
well known filtering methods such as the finite length low-pass filtering, high-pass
filtering and band-pass filtering in digital signal processing, are adopted to train the
input weights in the hidden layer to extract features from the dataset. The readers may
find the details of the filtering techniques from [9], [106-109].
As seen in [9], the hidden layer of the neural classifier with the FIR-ELM is
based on FIR filter designs. The sample vectors from the microarray datasets are thus
treated as time series input pattern vectors. To show the validity of applying FIR filter
theory in microarray gene feature detection, a time series analysis for a binary
microarray gene expression dataset is first explored with the gene expression dataset
denoted as the time series type data. Then the FIR filtering function is applied to the
time series to reveal spectrum features. The filtered time series are then analysed using
the cross-correlation to show the importance of proper filter design selection to achieve
optimal separation of the classes. In addition, in order to provide a quantitative measure
of the gene features within each dataset, a linear separability analysis on the microarrays
is also discussed in detail based on the recent study in [129] on the linear separability of
microarray gene pairs.
It will be seen that, in order to determine the filtering properties of the hidden
layer of the neural classifier with the FIR-ELM, a frequency domain gene feature
selection (FGFS) algorithm is developed for analyzing the frequency characteristics of
all datasets. The outcomes of the FGFS are then used to determine the optimal FIR
filtering strategy for the input weights’ design. In addition, the effects of the time series
with the randomly ordered gene samples, which have not been previously considered in
similar works [130], [131], are also examined in this chapter.
The rest of this chapter is organized as follows. Section 4.2 describes the
characteristics of the time series type patterns of the microarray samples. Section 4.3
presents the analysis on the linear separability of the patterns. Section 4.4 outlines the
FIR-ELM as well as the FGFS algorithm. Section 4.5 shows the experimental results
and the performance analysis of the FIR-ELM compared with other existing algorithms.
Sections 4.6 and 4.7 give the discussions and conclusions, respectively.
4.2 Time Series Analysis of Microarrays
In order to analyse and classify the data patterns in the microarray gene expression
datasets using the FIR-ELM, all samples in this chapter are expressed as time series
type data. For a binary dataset with $N$ samples, assume that the first $N_1$ samples
belong to class 1 and the other $N_2$ samples belong to class 2, with $N = N_1 + N_2$.
The gene expressions in the two classes can then be denoted as:

$$X_1 = [x_1, x_2, \ldots, x_{N_1}] \quad (4.2.1)$$

$$X_2 = [x_{N_1+1}, x_{N_1+2}, \ldots, x_N] \quad (4.2.2)$$

where

$$x_j = \{g_j(i) \mid i = 1, 2, \ldots, n\}, \quad j = 1, \ldots, N_1 \quad (4.2.3)$$

$$x_j = \{g_j(i) \mid i = 1, 2, \ldots, n\}, \quad j = N_1+1, \ldots, N \quad (4.2.4)$$

$X_1$ and $X_2$ are the sample matrices for class 1 and class 2 respectively, $i$ and $j$ are
the gene and sample indices, and $n$ is the number of genes in a sample.
For the purpose of conducting a bivariate time series analysis, the cells in the
microarray tests are treated as a black box that outputs two types of time series, $Y_1(i)$
and $Y_2(i)$, to represent non-cancerous and cancerous states, or any other binary
phenotypes, as follows:

$$Y_1(i) = \{g_j(i) \mid j = 1, \ldots, N_1\} \quad (4.2.5)$$

$$Y_2(i) = \{g_j(i) \mid j = N_1+1, \ldots, N\} \quad (4.2.6)$$

The samples from each of the binary classes $X_1$ and $X_2$ can then be aggregated
to represent their respective classes as in (4.2.5) and (4.2.6). It is well known that the
gene expression values in (4.2.3) and (4.2.4) have a Lorentzian-like distribution with
many outliers [120], [132]. Hence the median, which is a better maximum likelihood
estimator under impulsive noise conditions than the mean, is preferred for representing
the distribution of each gene in a specific class [118], [133]. Each gene in
(4.2.3) and (4.2.4) is then aggregated by taking the median as a temporal datum as
follows:

$$y_1(i) = \mathrm{median}\{Y_1(i)\} \quad (4.2.7)$$

$$y_2(i) = \mathrm{median}\{Y_2(i)\} \quad (4.2.8)$$

Figure 4.1 shows the two time series type data plots, $y_1(i)$ and $y_2(i)$, for the colon tumor
dataset.
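The median aggregation in (4.2.7) and (4.2.8) can be sketched as follows, using randomly generated expression values in place of a real microarray; the matrix shapes mirror the colon tumor dataset, and all numbers here are illustrative:

```python
import numpy as np

# Hypothetical microarray: rows are samples, columns are genes.
# Shapes mirror the colon tumor dataset; the values are random stand-ins.
rng = np.random.default_rng(0)
n_genes = 2000
X1 = rng.lognormal(mean=7.0, sigma=1.0, size=(40, n_genes))  # class 1 samples
X2 = rng.lognormal(mean=7.0, sigma=1.0, size=(22, n_genes))  # class 2 samples

# (4.2.7)/(4.2.8): aggregate each gene by its median across the samples
# of a class; the median is robust to the heavy-tailed outliers typical
# of expression values.
y1 = np.median(X1, axis=0)  # time series representative of class 1
y2 = np.median(X2, axis=0)  # time series representative of class 2
print(y1.shape, y2.shape)   # (2000,) (2000,)
```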
It is seen that both pattern classes in Figure 4.1 have a similar trend and are
very noisy. In order to determine the input weights of the neural classifier for the pattern
classification purpose, low pass, high pass, and band pass filters are used to filter
$y_1(i)$ and $y_2(i)$, respectively, where all three filters use the normalized cut-off frequency of
0.4, and the band pass filter uses a bandwidth of ±0.05. As the microarray samples are
converted from data vectors to time series, there is no directly derivable sampling
frequency. Thus the normalized frequency is used to represent the unit of cycles per
sample.
Figure 4.1 Aggregated time series for the colon dataset
The filtered time series are defined as:

$$\hat{y}_1(i) = \sum_{m=0}^{M-1} h(m)\, y_1(i-m) \quad (4.2.9)$$

$$\hat{y}_2(i) = \sum_{m=0}^{M-1} h(m)\, y_2(i-m) \quad (4.2.10)$$

where $\{h(0), h(1), \ldots, h(M-1)\}$ are the $(M-1)$th order filter coefficients derived from
the impulse response of the respective FIR filters; a detailed discussion on the
generation of filter coefficients is given in [106-109]. The cross-correlation between the
two filtered time series (4.2.9) and (4.2.10) can then be obtained. The cross-correlation
may be interpreted as a measure of efficiency between the different filters used to
extract important features and reduce the effects of noise. It has been shown in [134]
that the generation of weakly correlated class features improves the machine learning
performance. Also, Yang et al. in [135] shows that the weak correlation can be used as a
decision-making condition to differentiate samples.
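A sketch of the filtering step in (4.2.9) and (4.2.10), using SciPy's `firwin` to generate the FIR coefficients. Note that `firwin` expects cut-offs as fractions of the Nyquist rate, so a normalized cut-off of 0.4 cycles/sample maps to 0.8 here; the input series and filter length are stand-ins:

```python
import numpy as np
from scipy.signal import firwin, lfilter

rng = np.random.default_rng(1)
y1 = rng.normal(size=2000)  # random stand-in for the class-1 median series

numtaps = 101  # filter length (order 100), chosen for illustration

# firwin takes cut-offs as fractions of the Nyquist rate (0.5 cycles/sample),
# so a normalized cut-off of 0.4 cycles/sample becomes 0.8, and the band
# pass edges 0.4 +/- 0.05 become [0.7, 0.9].
h_lp = firwin(numtaps, 0.8)                          # low pass
h_hp = firwin(numtaps, 0.8, pass_zero=False)         # high pass
h_bp = firwin(numtaps, [0.7, 0.9], pass_zero=False)  # band pass

# (4.2.9)/(4.2.10): the filtered series is the convolution of the
# coefficients with the aggregated time series.
y1_lp = lfilter(h_lp, 1.0, y1)
y1_hp = lfilter(h_hp, 1.0, y1)
print(y1_lp.shape)
```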
Before computing the cross-correlations, the non-stationary disturbances in the
two classes need to be removed. A first-order forward finite difference approximation
[136] is applied to the time series to remove the trend and attain stationarity. The
detrended time series are defined as:

$$y'_1(i) = \hat{y}_1(i+1) - \hat{y}_1(i) \quad (4.2.11)$$

$$y'_2(i) = \hat{y}_2(i+1) - \hat{y}_2(i) \quad (4.2.12)$$
Remark 4.2.1: In order to avoid introducing random errors which may be accumulated
throughout the differencing calculations, the first gene term is used as the initial state
instead of an arbitrary value. Thus, the time series (4.2.11) and (4.2.12) are reduced in
length by one to produce the forward finite difference approximation.
Figure 4.2 The filtered and detrended time series $y'_1(i)$ and $y'_2(i)$
Figure 4.2 shows the new time series $y'_1(i)$ and $y'_2(i)$. The cross-correlation of the
two time series can then be obtained using the sample correlation coefficient defined in
[136], which is the well known Pearson correlation coefficient. The sample correlation
coefficient is denoted as:
$$r = \frac{\sum_i \left(y'_1(i) - \bar{y}'_1\right)\left(y'_2(i) - \bar{y}'_2\right)}{\sqrt{\left[\sum_i \left(y'_1(i) - \bar{y}'_1\right)^2\right]\left[\sum_i \left(y'_2(i) - \bar{y}'_2\right)^2\right]}} \quad (4.2.13)$$

where $\bar{y}'_1$ and $\bar{y}'_2$ are the means of $y'_1(i)$ and $y'_2(i)$, respectively.
The correlation coefficient in (4.2.13) measures the level of linear association
between the two time series within [-1, 1], where 0 indicates complete uncorrelation,
1 indicates proportional correlation, and -1 indicates inverse correlation. Table 4.1
shows the correlation coefficients of the two time series $y'_1(i)$ and $y'_2(i)$ after low pass,
high pass, and band pass filtering at the same normalized cut-off frequency of 0.4. The
correlation coefficient of the non-filtered data is also given as a reference. The results in
Table 4.1 show that the high pass filtering produces the time series pairs that are least
correlated among the filtered time series.
Table 4.1: Correlation coefficient of colon dataset binary classes
with different FIR filters
Filter type Low pass High pass Band pass No filter
r 0.7480 0.6746 0.6816 0.6386
Remark 4.2.2: Although the correlation coefficients of the filtered time series are higher
than that of the non-filtered case, they remain useful for comparing the effectiveness of
different filter types. The higher correlation coefficients among the
filtered time series are mainly due to the similarity of the non-distinctive residue
components that remain after the filtering process as seen in the first half of the plots in
Figure 4.2. This is consistent with the motivation of using the filtering process to
discover features at specific frequency ranges. Therefore it is adequate to compare the
results among the filtered time series only to determine the suitability of each filter
design.
To visualize the differences in the actual time series (4.2.11) and (4.2.12), the
residual, i.e. the squared difference between them, can be computed as follows:

$$e(i) = \left(y'_1(i) - y'_2(i)\right)^2 \quad (4.2.14)$$
It can be seen from the plot of the residual in Figure 4.3 that a certain region of
genes contains expression values which are very different between the two classes. The
genes within the indices of 1000 to 1100 show significant residual magnitudes with the
highest at the index of 1015. Closer inspection of the two filtered time series overlaid on
each other to show the gene expressions between indices 1000 to 1040 in Figure 4.4
confirms the observations. The three genes 1014, 1015, and 1016 are seen to behave
dissimilarly (with opposite signs on the y axis), while the genes 1025 and 1026 have
significant differences in magnitude between the two classes.
Remark 4.2.3: Generally the samples from both classes in a binary microarray dataset
tend to look similar. Therefore the genes with large residual magnitudes contribute to
the linear separability of the samples in the microarray dataset. However, the
cross-correlation alone is clearly not sufficient to identify these genes, as can be seen
in Table 4.1. Hence a more comprehensive test for linear separability is required.
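The detrending, correlation, and residual steps of (4.2.11)-(4.2.14) can be sketched on synthetic stand-in series that share a common trend, with one artificially divergent gene inserted at index 1015:

```python
import numpy as np

rng = np.random.default_rng(2)
trend = np.cumsum(rng.normal(size=2000))        # shared slow trend
y1f = trend + rng.normal(scale=0.5, size=2000)  # stand-in filtered series, class 1
y2f = trend + rng.normal(scale=0.5, size=2000)  # stand-in filtered series, class 2
y2f[1015] += 50.0                               # one artificially divergent gene

# (4.2.11)/(4.2.12): first-order forward differencing removes the trend;
# per Remark 4.2.1 the series shrink in length by one.
d1 = np.diff(y1f)
d2 = np.diff(y2f)

# (4.2.13): sample (Pearson) correlation of the detrended series.
r = np.corrcoef(d1, d2)[0, 1]

# (4.2.14): the squared residual highlights genes that differ between classes.
e = (d1 - d2) ** 2
peak = int(np.argmax(e))
print(round(r, 3), peak)
```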
Figure 4.3 Plot of the residual between $y'_1(i)$ and $y'_2(i)$
Figure 4.4 Overlaid plot of $y'_1(i)$ and $y'_2(i)$ for genes 1000 to 1040
4.3 Linear Separability of Microarrays
The time series analysis using the cross correlation in the previous section reveals some
features of the separability of the classes in bioinformatics datasets. In this section,
however, the linear separability will be further investigated to provide a quantitative
measurement of linear separability of the data patterns.
Please note that the single gene test was first developed in [137] where each
gene was tested using all samples to find the total number of linearly separable genes.
Then the gene pair linear separability analysis was developed in [129], where pairs of
genes are tested using all possible combinations. The genes that are found to be linearly
separable in the gene pair analysis are also guaranteed to include the linearly separable
single genes. Therefore the gene pair analysis provides extra information on genes that
may only show the linear separability characteristic in pairs.
The gene pair analysis algorithm in [129] is used here because of the relatively
low computational cost of using an incremental testing approach. First, the $N$
samples are separated into their respective classes as in (4.2.1) and (4.2.2). Then
each pair of genes in the dataset can be defined as $(g_i, g_j)$, and for a dataset with
$n$ genes, there would be $n(n-1)/2$ possible combinations. The pairs of genes can then be
projected on the 2D plane. The algorithm states that a pair of genes is linearly separable
if there exists a line $\ell$ where all the points of class 1 are located on one side of $\ell$ and
all the points of class 2 are located on the other side (no point is allowed to reside on
$\ell$ itself). Each gene pair sample is added incrementally and the algorithm stops
whenever a newly introduced sample violates the separability condition.
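For intuition, a gene pair's strict linear separability can be checked directly as a linear-programming feasibility problem. This brute-force check is a sketch only; it does not reproduce the incremental algorithm of [129], which is what keeps the full pairwise search tractable:

```python
import numpy as np
from scipy.optimize import linprog

def linearly_separable(P, Q):
    """Check whether 2-D point sets P and Q can be split by a line.

    Feasibility LP: find (w, b) with w.x + b >= 1 for all x in P and
    w.x + b <= -1 for all x in Q. By rescaling, such (w, b) exist iff
    the sets are strictly linearly separable (no point on the line).
    """
    P, Q = np.asarray(P, float), np.asarray(Q, float)
    # Constraints in the form A_ub @ [w1, w2, b] <= b_ub.
    A = np.vstack([np.hstack([-P, -np.ones((len(P), 1))]),
                   np.hstack([Q, np.ones((len(Q), 1))])])
    b = -np.ones(len(P) + len(Q))
    res = linprog(c=[0.0, 0.0, 0.0], A_ub=A, b_ub=b,
                  bounds=[(None, None)] * 3)  # variables are unbounded
    return bool(res.success)

# A clearly separable gene pair versus an XOR-like inseparable one.
print(linearly_separable([[1, 1], [2, 2]], [[5, 5], [6, 6]]))  # True
print(linearly_separable([[0, 0], [1, 1]], [[0, 1], [1, 0]]))  # False
```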
However, as stated in [129], it might be impossible to find linearly separable
gene pairs in medium to large datasets even if they are highly separable. In order to
solve this problem, a new sample selection process is proposed for testing the leukemia
and colon tumor datasets described in Table 4.2: a subset of samples is randomly
selected from each class, with the number of samples chosen according to the
guidelines provided in [138], which state the minimum number of genes required for
statistical significance. The test is then repeated 20 times to obtain averaged results.
Table 4.2: Summary of leukemia and colon datasets
Dataset Samples Genes Class 1 Class 2
Leukemia 72 7129 47 (ALL) 25 (AML)
Colon Tumor 62 2000 40 (tumor) 22 (normal)
Table 4.3: Linearly separable gene pairs for leukemia and colon datasets
Dataset Samples (N1+N2) Mean Std. Dev
Leukemia 30 (15+15) 29009 15875
Colon Tumor 30 (15+15) 93 171
Table 4.3 shows the sample selection from each class and the number of linearly
separable gene pairs for each dataset. The leukemia dataset has a large number of
linearly separable gene pairs, therefore it should be more easily classified. The colon
tumor dataset however is found to consist of only a small number of linearly separable
gene pairs. Hence the colon tumor dataset is defined intuitively as ‘harder’ to classify
than the leukemia dataset. The results presented here are descriptive of the original data
itself. These results will later be used as a benchmark in the analysis of the FIR-ELM in
the experiments section.
Remark 4.3.1: The standard deviations for the linearly separable gene pairs given in
Table 4.3 are large when compared to their mean values. This is typical for the
microarray datasets as the gene values are often disturbed by systematic and random
noise and hence follow a Lorentzian-like distribution with wider tails [132]. The
random selection of samples during each iteration also contributes to a wider
distribution of the trials as there may be samples with a large number of outliers.
4.4 Outline of the FIR-ELM
Recently, a new training scheme for SLFNs introduced by Huang et al. in [3], called the
ELM, has been shown to reduce the neural network training process to solving a set of
linear equations. The ELM algorithm first initializes the hidden layer weights and biases
randomly and then proceeds to compute the output weights deterministically using the
Moore-Penrose pseudoinverse. The learning capabilities of the ELM have been shown
in [66], [70], [127], [128] to produce good results in terms of classification accuracy.
However, the randomly generated weights produce highly sensitive classifiers that are
prone to noise and other disturbances within the data [9], [139].
The FIR-ELM in [9] is a modified version of the ELM with the purpose of
improving the robustness. The hidden layer weights of the SLFN are designed using the
FIR filter theory and the output layer weights are derived using convex optimization
methods. A brief overview of the FIR-ELM is as follows:
4.4.1 Basic FIR-ELM
For a set of $N$ distinct samples $\{(x_j, t_j) \mid x_j \in \mathbb{R}^n,\; t_j \in \mathbb{R}^m,\; j = 1, \ldots, N\}$, where
$x_j$ is an $n \times 1$ input vector and $t_j$ is an $m \times 1$ target vector, the $\tilde{N}$-neuron SLFN
with the activation function $g(\cdot)$ can be modeled as

$$\sum_{i=1}^{\tilde{N}} \beta_i\, g(w_i \cdot x_j + b_i) = t_j, \quad j = 1, \ldots, N \quad (4.4.1)$$

where $w_i = [w_{i1}, w_{i2}, \ldots, w_{in}]^T$ is the $n$-dimensional weight vector connecting the $i$th
hidden node and the input nodes, $\beta_i = [\beta_{i1}, \beta_{i2}, \ldots, \beta_{im}]^T$ is the output weight vector
connecting the $i$th hidden node and the output nodes, $b_i$ is the bias of the $i$th
hidden node, and $t_j = [t_{j1}, t_{j2}, \ldots, t_{jm}]^T$ is the target output vector with respect to
the input $x_j$.

For any bounded non-constant piecewise continuous activation function $g(\cdot)$, it
has been proven that SLFNs with $\tilde{N}$ hidden nodes can approximate $N$ samples with
zero error such that $\sum_{j=1}^{N} \|o_j - t_j\| = 0$, where $o_j$ is the network output [25]. Therefore
there exist $\beta_i$, $w_i$, and $b_i$ that satisfy (4.4.1), and the equations can be written
compactly in matrix form as

$$H\beta = T \quad (4.4.2)$$

where

$$H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix}, \quad \beta = [\beta_1, \ldots, \beta_{\tilde{N}}]^T, \quad T = [t_1, \ldots, t_N]^T$$
The definition of the SLFN up to (4.4.2) is similar to the ELM. However, in
addition to the typical SLFN architecture, the FIR-ELM introduces an input tapped
delay line with $n-1$ delay units at the front of the SLFN and uses both linear
hidden nodes and linear output nodes. The input tapped delay line represents a finite
depth memory where the current state and $n-1$ past states of a variable are used as the
input to the SLFN. Such an SLFN architecture introduces system dynamics into the
training process and is capable of universal approximation [9]. A diagram of the SLFN
architecture used in this chapter is given in Figure 4.5, where $D$ is the unit delay
element, $k$ is the index of the input sample $c_i(k)$, $w_{ij}$ are the
hidden layer weights, $\beta_i$ are the output layer weights, and $f(\cdot)$ is the
output function.
Figure 4.5 A single hidden layer feedforward neural network with linear nodes and
an input tapped delay line
Remark 4.4.1: The FIR-ELM algorithm requires that the hidden layer weights be
assigned using FIR filter design techniques to reduce disturbances in the data. Hence
given that it is possible to have prior knowledge of the frequency responses from the
training datasets, appropriate hidden layer weights can be designed.
Without loss of generality, a low pass filter for the $i$th hidden layer node can be
represented in the time domain as in [9] as

$$h_i(m) = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} e^{j\omega\left(m - \frac{n-1}{2}\right)}\, d\omega = \frac{\sin\left(\omega_c\left(m - \frac{n-1}{2}\right)\right)}{\pi\left(m - \frac{n-1}{2}\right)}, \quad m = 0, 1, \ldots, n-1 \quad (4.4.3)$$

where $n$ is the filter length and $\omega_c$ is the cut-off frequency. The filter coefficients in
(4.4.3) can then be assigned to the $i$th hidden layer node as shown below:

$$w_i = [h_i(0), h_i(1), \ldots, h_i(n-1)]^T \quad (4.4.4)$$
It is also possible to design other types of filters such as the band pass and high pass
filters depending on the requirement of the dataset.
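A sketch of the truncated-sinc low pass design of (4.4.3): the impulse response is centred for linear phase, and the resulting taps would be assigned to a hidden node as in (4.4.4). The length and cut-off below are illustrative:

```python
import numpy as np

def lowpass_coeffs(n_taps, wc):
    """Truncated-sinc FIR low-pass impulse response, as in (4.4.3).

    n_taps : filter length
    wc     : cut-off frequency in rad/sample (0 < wc < pi)
    """
    m = np.arange(n_taps) - (n_taps - 1) / 2.0  # centre for linear phase
    # sin(wc*m)/(pi*m), with the m = 0 limit wc/pi handled by np.sinc.
    return (wc / np.pi) * np.sinc(wc * m / np.pi)

# Illustrative length and cut-off: 101 taps at 0.4*pi rad/sample.
h = lowpass_coeffs(101, 0.4 * np.pi)
# (4.4.4): these taps become the input weight vector of a hidden node.
print(h.shape)
```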
The optimal output weights for the FIR-ELM are then calculated based on the
minimization of the norms of the output error and the output weight matrix, with two
risk balancing constants $c$ and $d$ introduced to balance the empirical and structural
risk, i.e. minimizing $c\|H\beta - T\|^2 + d\|\beta\|^2$. The output weights can be obtained as

$$\hat{\beta} = \left(H^T H + \frac{d}{c} I\right)^{-1} H^T T \quad (4.4.5)$$
The FIR-ELM algorithm can be summarized as:

1) Given a training data set $\{X, T\}$, design the hidden layer weights of the SLFN as
in (4.4.3) and (4.4.4).

2) Calculate the hidden layer output matrix $H$.

3) Solve for the output weights $\hat{\beta}$ using (4.4.5).
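The three steps can be sketched end-to-end on toy data. The hidden layer construction below (one high-pass prototype obtained by spectral inversion of a truncated sinc, shifted across hidden nodes) is one plausible reading of the tapped-delay-line structure, not necessarily the exact design in [9]; the data, dimensions, and regularization constants are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy two-class data standing in for microarray samples (rows = samples).
n, n_hidden = 201, 40                       # genes per sample, hidden nodes
X = rng.normal(size=(60, n))
T = np.repeat([[1.0], [-1.0]], 30, axis=0)  # targets in {+1, -1}

# Step 1: hidden layer weights from an FIR design. Here each hidden node
# is the same truncated-sinc high-pass prototype (spectral inversion of a
# low pass) shifted in delay -- an illustrative choice, not the exact
# construction in [9].
wc = 0.29 * np.pi
m = np.arange(n) - (n - 1) / 2.0
lp = (wc / np.pi) * np.sinc(wc * m / np.pi)
hp = -lp
hp[(n - 1) // 2] += 1.0                     # delta minus low pass = high pass
W = np.stack([np.roll(hp, k) for k in range(n_hidden)])  # (n_hidden, n)

# Step 2: linear hidden nodes, so the hidden layer output is H = X W^T.
H = X @ W.T

# Step 3: regularized output weights, beta = (H'H + (d/c) I)^{-1} H'T,
# with illustrative risk-balancing constants c and d.
c, d = 1.0, 0.01
beta = np.linalg.solve(H.T @ H + (d / c) * np.eye(n_hidden), H.T @ T)
pred = np.sign(H @ beta)
print(beta.shape, pred.shape)
```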
4.4.2 Frequency Domain Gene Feature Selection
It is usually hard to define specific frequency specifications for filtering the microarray
gene expression data, even when prior knowledge of the frequency response is available,
as the data contain many components of similar magnitude. Figure 4.6 shows an example of
the frequency response for a sample in the colon tumor dataset. Therefore, in order to
the frequency response for a sample in the colon tumor dataset. Therefore, in order to
analyse the frequency profiles of the respective datasets, an exhaustive FIR filter design
search algorithm called frequency domain gene feature selection (FGFS) is proposed. In
this chapter, a frequency profile is defined as a collection of filter designs and their
respective classification performances.
First, the samples in a dataset are divided into training and testing sets, $p_1$ and
$p_2$. Then within the training set $p_1$, the samples are split once more into subsets $q_1$ and
$q_2$ specifically for filter design selection. The subsets $q_1$ and $q_2$ will be used iteratively
to evaluate the suitability of different FIR filter designs for the dataset over a range of
normalized cut-off frequencies from 0.1 to 0.9 with a step size of 0.1. A bandwidth of
±0.05 is assigned for the band pass filter. Finally, the frequency profile of the dataset can
be generated from the testing accuracies achieved for each filter design using the
training samples in $p_1$. The best performing FIR filter design is then selected to train
the FIR-ELM using all the samples in $p_1$, and the trained classifier is tested on the
samples in $p_2$. A flow chart of the algorithm is given in Figure 4.7.
Remark 4.4.2: In the above, a gene feature selection algorithm that preserves the
microarray vectors and utilizes all the genes within a sample for classification is
developed. Hence it is different from the conventional gene selection algorithm which
selects subsets of genes. The proposed method is more robust in terms of handling the
noise that may severely affect parts of the gene expression readings, such as
experimental errors which produce outliers and other disturbances which may cause
parts of the sample to be unusable.
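The FGFS search can be sketched as an exhaustive loop over filter types and cut-offs, scoring each design on a held-out filter-design split. The classifier inside the loop is a simple filtered-ridge stand-in rather than the full FIR-ELM, and the cut-off grid is treated directly as fractions of the Nyquist rate:

```python
import numpy as np
from scipy.signal import firwin

rng = np.random.default_rng(5)

# Stand-in filter-design split: q1 to fit, q2 to score each design.
n = 201
Xq1, yq1 = rng.normal(size=(30, n)), np.repeat([1.0, -1.0], 15)
Xq2, yq2 = rng.normal(size=(30, n)), np.repeat([1.0, -1.0], 15)

def score(h, Xtr, ytr, Xte, yte):
    """Filter every sample with taps h, then fit/apply a ridge readout."""
    feats = lambda X: np.array([np.convolve(x, h, mode='same') for x in X])
    Ftr, Fte = feats(Xtr), feats(Xte)
    w = np.linalg.solve(Ftr.T @ Ftr + 1e-2 * np.eye(n), Ftr.T @ ytr)
    return float(np.mean(np.sign(Fte @ w) == yte))

profile = {}  # (filter type, cut-off) -> accuracy: the "frequency profile"
for c in [round(0.1 * k, 1) for k in range(1, 10)]:  # 0.1 .. 0.9
    designs = {('low', c): firwin(101, c),
               ('high', c): firwin(101, c, pass_zero=False)}
    if 0.2 <= c <= 0.8:  # band edges c +/- 0.1 must stay inside (0, 1)
        designs[('band', c)] = firwin(101, [c - 0.1, c + 0.1],
                                      pass_zero=False)
    for key, h in designs.items():
        profile[key] = score(h, Xq1, yq1, Xq2, yq2)

best = max(profile, key=profile.get)
print(best, profile[best])
```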
Figure 4.6 Frequency response of a sample from the colon dataset
Figure 4.7 An FIR filter design search algorithm for FIR-ELM
4.5 Experiments and Results
The performance of the FIR-ELM with the FGFS algorithm is investigated in this
section for both the leukemia and colon tumor datasets. The classification results will
then be compared with popular algorithms such as the MLP, ELM, and SVM. A
conventional frequency based gene selection process known as the discrete cosine
transform (DCT) is implemented for the MLP, ELM, and SVM algorithms to compare
the two different approaches to gene feature selection. Lastly the linear separability of
the hidden layer output of the SLFN for ELM and FIR-ELM is discussed.
4.5.1 Biomedical Datasets
Two binary microarray gene expression datasets are investigated, namely, leukemia and
colon tumor from the Kent Ridge biomedical data repository [140]. The leukemia
dataset consists of two classes of acute leukemia: acute lymphoblastic
leukemia (ALL), arising from lymphoid precursors, and acute myeloid leukemia (AML),
arising from myeloid precursors. There are 72 bone marrow samples in the dataset, with
47 ALL and 25 AML cases, and each contains 7129 gene probes. For the colon tumor
dataset a total of 62 samples were collected from colon-cancer patients where 40
biopsies are tumors and 22 others are normal samples from healthy parts of the colon.
As conventional classifiers tend to have problems classifying microarray data
due to the high number of variables [119], [121], a frequency transformation based gene
feature selection method known as the DCT will be used to perform feature selection for
the MLP, ELM, and SVM tests. The DCT is a well-known method in pattern
recognition to compress the energy in a sequence and it has been successfully
implemented in cancer classification [141] and character recognition [191], [192]
applications. In [141], the author implemented the DCT with neural networks for the
detection of stomach cancer and achieved the classification accuracy of 99.6%, which is
among the highest ever reported. In addition, the DCT belongs to the same family of
frequency-transform based feature selection methods as the FIR filters, with the added
advantage of dimension reduction. Therefore the DCT is the preferred feature selection
algorithm in this chapter. The DCT for a one dimensional array is defined as
$$C(u) = \alpha(u) \sum_{x=0}^{N-1} f(x) \cos\left(\frac{\pi(2x+1)u}{2N}\right), \quad u = 0, 1, \ldots, N-1 \quad (4.5.1)$$

$$\alpha(u) = \begin{cases} \sqrt{1/N}, & u = 0 \\ \sqrt{2/N}, & u \ne 0 \end{cases}$$

where $N$ is the length of the array sequence $f(x)$.
The DCT generates $N$ coefficients that are then used as input data for
evaluating the performance of the MLP, ELM, and SVM algorithms. In order to select
the most relevant coefficients, the 90% criterion is employed to select the coefficients that
represent 90% of the total energy. Although a lower percentage can be chosen, this
criterion is selected to avoid losing too much information from the dataset and reducing
the classification performance. A summary of properties for the datasets and the DCT
feature selection is presented in Table 4.4.
Table 4.4: Selection of DCT coefficients for leukemia and colon datasets
Dataset Samples Genes DCT Coefficients
Leukemia 72 7129 2325
Colon Tumor 62 2000 613
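The 90% energy criterion can be sketched with SciPy's orthonormal DCT-II; keeping the highest-energy coefficients is one common reading of the criterion, and the sample vector is a random stand-in:

```python
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(4)
x = rng.normal(size=2000)  # random stand-in for one expression sample

# (4.5.1): orthonormal DCT-II of the gene vector (Parseval holds, so
# squared coefficients partition the total energy).
C = dct(x, type=2, norm='ortho')

# 90% criterion: keep the highest-energy coefficients that together
# carry 90% of the total squared magnitude.
order = np.argsort(C ** 2)[::-1]
energy = np.cumsum(C[order] ** 2) / np.sum(C ** 2)
k = int(np.searchsorted(energy, 0.90) + 1)
selected = order[:k]
print(k)
```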
4.5.2 Experimental Settings
As the classification of microarray data concerns the critical diagnosis of cancer, the
misclassification rate for each class must also be minimized, hence both the
classification accuracy and minimum sensitivity are usually considered [127], [142].
The sensitivity is defined as the number of correct patterns predicted to be in a class
with respect to the total number of patterns in the class. The minimum sensitivity is
selected from the class with the lowest sensitivity measure within the confusion matrix.
In order to properly select the meta-parameters for each classification algorithm, a
classification performance measurement based on the classification accuracy ($C$) and
the minimum sensitivity ($S$) is used. For each meta-parameter $\lambda$ considered, the
optimal value is selected from

$$\lambda^* = \arg\max_{\lambda} \left(\bar{C}(\lambda) + \bar{S}(\lambda)\right) \quad (4.5.2)$$

where $\bar{C}$ is the mean of $C$ and $\bar{S}$ is the mean of $S$. All the parameters are
evaluated using the repeated 10-fold stratified cross validation (CV) process with the
training data only. The CV process is repeated 20 times for each considered parameter.
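The two quantities entering (4.5.2) can be computed directly from a confusion matrix; the example numbers are read off the FIR-ELM column of Table 4.6, with rows here laid out as true classes:

```python
import numpy as np

def accuracy_min_sensitivity(conf):
    """Accuracy and minimum per-class sensitivity from a confusion
    matrix laid out with rows = true class, columns = predicted class."""
    conf = np.asarray(conf, float)
    acc = np.trace(conf) / conf.sum()
    sens = np.diag(conf) / conf.sum(axis=1)  # per-class sensitivity
    return acc, sens.min()

# Mean counts read off the FIR-ELM column of Table 4.6 (ALL = class 1).
acc, ms = accuracy_min_sensitivity([[46.6, 0.4],
                                    [2.1, 22.9]])
print(round(acc * 100, 2), round(ms * 100, 1))  # 96.53 91.6
```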
For both the MLP and ELM, the number of neurons is considered from 1 to 200
in increments of 1, the sigmoidal activation function is used in the hidden layer, and the
linear activation function is used in the output layer. The scaled conjugate gradient
algorithm is used to train the MLP. The linear kernel is implemented in the SVM, since
testing with several other popular kernels such as the RBF and polynomial kernels gave
poor results. The regularization parameter for the SVM is considered over a
logarithmically spaced range with increments of one decade. The FIR-ELM filter length
is set to match the dimension of the gene sample, and the regularization parameters $c$
and $d$ are selected based on the prior knowledge in [9]. Lastly, the targets for both
datasets are defined as [1, -1].
A repeated 2-fold stratified CV is implemented to train the classifier algorithms
after the meta-parameter selection process is completed. The CV cycle is repeated 20
times for each algorithm to obtain the mean classification accuracy. The confusion
matrices show the mean number of correctly classified as well as misclassified samples
for each algorithm.
4.5.3 Leukemia Dataset
The frequency profile of the leukemia dataset for the low pass, high pass, and band pass
filters are shown in Figures 4.8, 4.9, and 4.10 respectively, with error bars showing the
standard deviation of the classification performance. The optimal filter design based on
(4.5.2) is the high pass filter with a mean normalized cut-off frequency of 0.29.
However, it is not possible to state which filter type is better, due to the large standard
deviations in the frequency profile plots. Instead, the results show that at each iteration
of testing, the optimal filter design depends on the selection of samples for training.
Different filtering criteria may be derived in the meta-parameter selection process based
on the training samples. The possibility of using different filter designs to produce
similar classification performances indicates that different filter design criteria produce
vastly different data patterns which can still be mapped by the output layer of the SLFN.
Therefore the selection of the appropriate filter remains subjective and dependent on the
classification requirements (e.g. the type of noise present).
Figure 4.8 Classification performance for leukemia with low pass filter
Figure 4.9 Classification performance for leukemia with high pass filter
Figure 4.10 Classification performance for leukemia with band pass filter
The classification performance in Table 4.5 shows that the FIR-ELM has
achieved the best result, with an accuracy of 96.53% and a standard deviation of 1.79%
which is better than the benchmark of the SVM. The worst performing algorithm is the
ELM with an accuracy of 76.90% and the largest standard deviation. The confusion
matrix is shown in Table 4.6, where the ALL cases are labeled as class 1 and AML
cases are labeled as class 2. The FIR-ELM has the most similar sensitivities for both
cases. From the meta-parameter selection process, the number of neurons for the MLP
is 6, the ELM requires 174 neurons, and the SVM regularization parameter is 0.0046.
Table 4.5: Classification performance for leukemia dataset
Algorithm MLP SVM ELM FIR-ELM FIR-ELM (R)
Accuracy (%) 88.01 95.50 76.90 96.53 94.49
Std. Dev.(%) 3.78 2.42 6.39 1.79 1.91
Time (s) 10.87 0.66 0.4 1.12 1.12
(R): FIR-ELM with random gene order
Table 4.6: Confusion matrix for classification of leukemia dataset
Algorithm MLP SVM ELM FIR-ELM FIR-ELM (R)
Prediction 1 2 1 2 1 2 1 2 1 2
1 44.3 5.9 46.2 2.45 39.1 8.65 46.6 2.1 46 3
2 2.6 19.1 0.8 22.6 8.0 16.4 0.4 22.9 1 22
Sen* (%) 94.2 76.4 98.3 90.2 83.1 65.4 99.2 91.5 97.9 88.0
(R): FIR-ELM with random gene order, * Sen: sensitivity
4.5.4 Colon Tumor Dataset
The frequency profiles for the colon tumor dataset using the low pass, high pass, and
band pass filters are shown in Figures 4.11, 4.12, and 4.13, with error bars showing the
standard deviation of the classification performance. The optimal filter design based on
(4.5.2) is the high pass filter with a mean normalized cut-off frequency of 0.47. Similar
to the leukemia dataset, it is not possible to state which filter type is better due to the
large standard deviations in the frequency profile plots. The classification performance
of the colon tumor dataset is then presented in Table 4.7. The SVM achieves the highest
mean accuracy followed by the FIR-ELM, which is the best performing algorithm
among the neural network based algorithms. However, due to the large standard
deviations of the classification accuracy, it is not possible to declare a best
classifier for the colon tumor dataset. From the meta-parameter selection process, the
number of neurons for the MLP is 42, the number of neurons for the ELM is 142, and
the SVM regularization parameter is 0.0082.
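As a concrete sketch of this filter-design step, a high pass FIR filter at the reported mean normalized cut-off frequency of 0.47 can be generated with a windowed-sinc design. The tap count (65) and the use of scipy.signal.firwin are illustrative assumptions, not choices made in this chapter:

```python
import numpy as np
from scipy.signal import firwin, freqz

# Illustrative high pass FIR design at the mean normalized cut-off
# frequency of 0.47 reported for the colon tumor dataset. The tap
# count (65) is an assumed value; firwin treats cut-offs as fractions
# of the Nyquist frequency, matching the [0, 1] axis of the profiles.
taps = firwin(65, 0.47, pass_zero=False)

# Magnitude response: small in the low-frequency stop band and near
# unity in the high-frequency pass band.
w, h = freqz(taps, worN=512)
print(round(abs(h[0]), 3), round(abs(h[-1]), 3))
```

The same call with `pass_zero=True` gives a low pass variant, and a two-element cut-off list gives a band pass variant of the kind profiled in Figure 4.13.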
Figure 4.11 Classification performance for colon dataset with low pass filter
(Axes: classification performance (%) versus normalized cut-off frequency, from 0 to 1.)
Figure 4.12 Classification performance for colon dataset with high pass filter
Figure 4.13 Classification performance for colon dataset with band pass filter
(Axes for both figures: classification performance (%) versus normalized cut-off frequency, from 0 to 1.)
The confusion matrix for the colon tumor dataset is shown in Table 4.8. The
tumor cells are labeled as class 1 and the healthy cells are labeled as class 2. It can be
seen from Table 4.8 that the SVM has the best sensitivity for class 1 while the FIR-
ELM has the best sensitivity for class 2.
Table 4.7: Classification performance for colon dataset
Algorithm MLP SVM ELM FIR-ELM FIR-ELM (R)*
Accuracy (%) 69.76 79.76 71.53 76.85 76.61
Std. Dev.(%) 8.52 3.57 6.09 6.18 4.1
Time (s) 5.72 0.41 0.12 22.74 22.74
(R): FIR-ELM with random gene order
Table 4.8: Confusion matrix for classification of colon dataset
Algorithm MLP SVM ELM FIR-ELM FIR-ELM (R)
Prediction 1 2 1 2 1 2 1 2 1 2
1 30.3 9.1 34.5 7.1 39.1 8.7 32.5 6.8 32.5 7
2 9.7 13 5.6 15 8 16.4 7.6 15.2 7.5 15
Sen* (%) 75.8 58.9 86.3 68 78.1 59.6 81.1 69.1 81.3 68.2
(R): FIR-ELM with random gene order, * Sen: sensitivity
4.6 Discussions
Overall, the FIR-ELM has been shown to achieve comparable or better results in both
the leukemia and colon tumor classification problems. While the design of FIR filters
remains an art and is still widely open to interpretation, the method proposed in this
chapter gives a straightforward suggestion based on the conventional training of neural
networks. For both microarray datasets, the ELM is the fastest, followed by the SVM,
the FIR-ELM, and the MLP. The time recorded represents 20 iterations of training and
testing for each algorithm. The results for the randomly permuted gene order case for
both datasets show that the classification accuracy remains similar to that of the
original gene order. Based on these results, the FIR-ELM with FGFS appears to be
insensitive to the gene ordering and is capable of learning from different variants of the
datasets. This is plausible, as the large standard deviations observed in Figures 4.8 to
4.13 indicate that the FGFS algorithm adapts to the sample characteristics in selecting
the optimal filter design.
4.6.1 Linear Separability of the Hidden Layer Output for SLFN
SLFNs typically utilize the hidden layer as a pre-processor to map the input data into
the desired feature space so that the data points become easily separable. The output
layer then maps the features to the target classes. In order to compare the performance
of the chosen hidden layer weights for the ELM and the FIR-ELM, the linearly
separable gene pair testing algorithm is applied to the outputs of the hidden layer of
both classifiers, where the outputs of the hidden layer are defined as
y = [ w_1^T x + b_1, w_2^T x + b_2, ..., w_N^T x + b_N ]    (4.6.1)

It is seen that (4.6.1) omits the activation function from the earlier defined
hidden layer output (4.4.2). This is because the sigmoid activation function would
compress the outputs into a much smaller range and therefore discard the original
mappings of the hidden layer weights.
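A minimal sketch of this pre-activation output, under assumed sizes and ELM-style random weights, is given below; the dimensions are illustrative and not those of the microarray datasets:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_hidden, n_samples = 200, 20, 30     # illustrative sizes

X = rng.standard_normal((n_samples, n_genes))  # gene-expression samples
W = rng.uniform(-1, 1, (n_genes, n_hidden))    # random input weights
b = rng.uniform(-1, 1, n_hidden)               # random hidden biases

# (4.6.1): hidden layer outputs WITHOUT the activation, preserving the
# original linear mapping of the hidden weights for the linearly
# separable gene pair test.
H_linear = X @ W + b                           # shape (n_samples, n_hidden)

# The usual ELM hidden output applies the sigmoid, which compresses
# the values into (0, 1) and discards the linear mapping.
H_sigmoid = 1.0 / (1.0 + np.exp(-H_linear))
print(H_linear.shape, float(H_sigmoid.min()), float(H_sigmoid.max()))
```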
Using the same allocation of samples as in Section 4.3, Table 4.9 shows the
linearly separable gene pair testing results of the hidden layer outputs for ELM and FIR-
ELM. It can be seen from Table 4.9 that the hidden layer of the FIR-ELM reveals more
linearly separable gene pairs compared to the ELM. The empirical results achieved
suggest that the hidden layer design of the FIR-ELM improves the performance in terms
of feature discovery and it is consistent with the improved classification accuracy
obtained for the leukemia and colon tumor datasets. It should be noted that this result is
only meaningful as a relative comparison between the two classifiers under our constraints.
However, when the results in Table 4.9 are compared with Table 4.3 which
shows the linearly separable pairs for the original dataset, the positive correlation
between the number of linearly separable pairs and classification accuracy does not hold.
It is seen that the linearly separable pairs for the leukemia dataset have increased while
the linearly separable pairs for the colon tumor dataset decreased. Ideally the number of
linearly separable pairs is expected to increase to indicate the discovery of more features
at the hidden layer output. This may be due to the transformation of the original data
into the feature space which inhibits direct comparisons.
Table 4.9: Linearly separable gene pairs for the hidden layer output in ELM
and FIR-ELM for leukemia and colon datasets
Algorithm Dataset Mean Std. Dev
ELM Leukemia 17 46
Colon Tumor 3 7
FIR-ELM Leukemia 35420 19396
Colon Tumor 33 63
From the results obtained, it can be concluded that the positive correlation
between the number of linearly separable pairs and classification accuracy holds only
when comparing different SLFN training algorithms. The hidden layer output of SLFNs
need not be more linearly separable than the original dataset to achieve good classification
performance. The criterion above could find many applications in the selection of the
optimal hidden layer weights for SLFNs.
4.7 Conclusion
In this chapter, the FIR-ELM has been implemented for the binary classification of two
biomedical datasets. It has been seen that the microarray gene expression samples are
treated as time series to form the input patterns for the classification with the FIR-ELM.
To assign the optimal input weights of the neural classifier, a frequency domain gene
feature selection (FGFS) algorithm has been proposed to evaluate the suitability of
different FIR filter designs. For both the leukemia and colon tumor datasets, the FIR-
ELM with the FGFS has shown a better performance compared to the other neural
networks based algorithms while achieving comparable results with the SVM. It has
been further shown in the simulation section that the FIR-ELM achieves much better
results in terms of gene feature discovery as compared to the ELM. Future work
includes testing more filter types for the hidden layer designs of the neural classifier
and extending the method to multi-class pattern classification.
Chapter 5
Frequency Spectrum Based
Learning Machine
In this chapter, a new robust single hidden layer feedforward network (SLFN) based
pattern classifier is developed. It is shown that the frequency-spectrums of the desired
feature vectors can be specified in terms of the discrete Fourier transform (DFT)
technique. The input weights of the SLFN are then optimized with the regularization
theory such that the error between the frequency components of the desired feature
vectors and the ones of the feature vectors from the outputs of the hidden layer is
minimized. For linearly separable input patterns, the hidden layer of the SLFN plays the
role of removing the effects of the disturbance from the noisy input data and providing
the linearly separable feature vectors for the accurate classification. However, for non-
linearly separable input patterns, the hidden layer is capable of assigning the DFTs of all
feature vectors to the desired positions in the frequency-domain such that the
separability of all non-linearly separable patterns is maximized. In addition, the output
weights of the SLFN are also optimally designed so that both the empirical and the
structural risks are well balanced and minimized under a noisy environment. Two
simulation examples are presented to show the excellent performance and effectiveness
of the proposed classification scheme.
5.1 Introduction
Neural networks consist of many simple processing nodes operating in parallel. It is such
a parallel architecture of neural networks that allows engineers and scientists to develop
the data-based systems to process vast amounts of data in manufacturing, transportation,
process control, dynamic system modeling, digital signal and image processing and
information retrieval [1], [91], [92], [95-99], [143-146]. It has been further seen from
the extreme learning machine (ELM) and its applications [3], [18], [23-25], [89], [93],
[94] that the neural networks with a single hidden layer and the randomly assigned input
weights and hidden layer biases can perform accurate universal approximation and
powerful parallel processing for complex non-linear mappings and pattern
classifications with vast amounts of data.
In the area of neural computing, the commonly used weight training method is
the gradient-based backpropagation (BP) algorithm [1], [91], [143]. In order to train an
SLFN to learn a predefined set of input and output data pairs with the BP algorithm, an
input pattern is applied as a stimulus to the hidden layer of the neural network and then
is propagated to the output layer to generate an output pattern. The BP can then be
implemented, based on the computed error between the output pattern of the SLFN and
the desired one, to update the weights from the output layer to the hidden layer such that,
as the second input pattern is applied, the error between the generated output pattern and
the desired one is reduced. Such a training process is repeated until the error converges
to zero (or sufficiently small). Obviously, the BP training process is time-consuming
and the slow convergence has limited the BP in many practical applications where vast
amounts of data are present and fast on-line training is required.
Compared with the BP algorithm, the ELM algorithm developed in [3], [18],
[23-25], [89], [93], [94] has revolutionized the training of neural networks in the
following ways: (i) The input weights and the hidden layer biases of the SLFNs can be
randomly assigned if the activation functions of the hidden nodes are infinitely
differentiable; (ii) The SLFNs are simply treated as linear systems and the output
weights of the SLFNs can be analytically determined by using the generalized inverse
of the hidden layer output matrix; (iii) Because of the characteristics of the ELM’s batch
learning, that is, all examples in the training sample set are used in one operation of the
global optimization, the learning speed of the ELM can be much faster than those of the
BP as well as the BP-like algorithms. In view of these remarkable merits, recently the
ELM has received a great deal of attention in the area of computational intelligence with
application and extension to many other areas [18], [147].
However, from the perspective of engineering applications, engineers may be
concerned with whether the SLFNs with the ELM behave with a strong robustness
property with respect to the input disturbances and the random input weights in practice.
As a matter of fact, it has been noted from both the simulations and experiments that the
SLFNs trained with the ELM in many cases behave with a poor robustness property
with respect to the input disturbances. For instance, when the input weights and the
hidden layer biases of an SLFN are randomly assigned, the changes of the hidden layer
output matrix of the SLFN may be very large. This in turn will result in large changes of
the output weight matrix. According to the statistical learning theory [86], [100-103],
the large changes of the output weight matrix will greatly increase both the structural
and empirical risks of the SLFNs, which will in turn degrade the robustness property
of SLFNs with respect to the input disturbances. Therefore, it is necessary to properly
design the input weights such that the hidden layer of SLFNs plays an important role in
improving the robustness with respect to disturbances, and reducing both structural and
empirical risks of SLFNs. In addition, although the SLFNs with random input weights
and hidden biases can exactly learn distinct observations, it seems that the high
potentiality of the hidden layer for information processing in the SLFNs has not been
fully explored as seen in [3], [18], [23-25], [89], [93], [94].
In order to improve the robustness of the SLFNs trained with the ELM, a new
training algorithm, called FIR-ELM, has been developed recently in [9]. It is seen that
the linear FIR filtering technique is used to design the input weights such that the
hidden layer of the SLFN performs as a pre-processor of the input data for removing the
input disturbance and the undesired signal components. The regularization theory [1] is
then adopted to design the output weight matrix to minimize the output error, balance
and reduce both the empirical and structural risks of the SLFN. The simulation results in
[9] and [11] have shown excellent robustness performance of the SLFNs with the FIR-ELM
under a noisy environment. It has been noted from [9] that both the hidden nodes
and the output nodes in the SLFN are linear. However, in order to make the SLFN with
the linear nodes have the learning capability, a tapped-delay line with a number of unit-
delay elements is added to the input layer of the SLFN. According to the filtering and
system theory [2], [107], [113], [148], such an SLFN with both the linear nodes and the
input tapped-delay line is equivalent to the SLFN with non-linear nodes regarding the
universal learning capability. Furthermore, the design and the analysis of the SLFNs
with the linear nodes as well as the input tapped-delay line are much easier than those of
the SLFNs with non-linear nodes.
In this chapter, the SLFNs with both the linear nodes and the input tapped-delay
line for pattern classifications will be further studied based on [9], and a new training
algorithm, called DFT-ELM, will be developed. It will be seen that, unlike in the FIR-ELM,
the desired feature vectors for the input pattern vectors are specified in terms of
their frequency-spectrums. The input weights of the SLFN are then trained
with the regularization theory [1], [2] to minimize the error between the frequency
components of the desired feature vectors and the ones of the feature vectors from the
hidden layer of the SLFN.
It will be shown in the simulation section that, if all input patterns are linearly
separable, the hidden layer of the SLFN with the optimal input weights will play a role
of removing the effects of the disturbance from the noisy input data and providing the
linearly separable feature vectors for the accurate pattern classification. However, if the
input patterns are non-linearly separable, the input weights of the SLFN will be trained
in the sense that the hidden layer is capable of assigning the DFTs of all feature vectors
to the “desired positions” in the frequency-domain, specified by the DFTs of the desired
feature vectors, so that the separability of all non-linearly separable patterns are
maximized in the feature space.
In order to make sure that the SLFN can properly learn the maps between the
input and output training data pairs and reduce both the structural and empirical risks,
the regularization technique is also adopted to optimize the output weights. Due to
the significant reduction of both the structural and the empirical risks through
optimizing both the input and the output weights, the SLFN classifier with the DFT-
ELM behaves with a much stronger robustness property with respect to the input
disturbances, compared with the SLFNs, trained with the regularized ELM (R-ELM)
[18] and the FIR-ELM [9], respectively. This merit will be confirmed in the simulation
section through the performance comparisons among the SLFN classifiers with the
DFT-ELM, the FIR-ELM and the R-ELM, respectively.
The rest of the chapter is organized as follows: In Section 5.2, the SLFN
classifier with both linear nodes and a tapped-delay line is formulated, the frequency-
domain presentation of a feature vector is defined, and the relationship between a
feature vector and its frequency-domain presentation is specified in terms of the DFT. In
Section 5.3, the optimizations of both the input and the output weights with the
regularization theory are discussed in detail and the effects of the regularization
parameters and the number of the hidden nodes on the performance of the SLFN
classifier are also explored. In Section 5.4, the SLFN based classifiers, trained with the
DFT-ELM, the FIR-ELM and the R-ELM are compared to demonstrate the
effectiveness and strong robustness of the SLFN classifier with the DFT-ELM. Section
5.5 gives the conclusion and some further work.
5.2 Problem Formulation
A single hidden layer feedforward network based pattern classifier is described in
Figure 5.1:
Figure 5.1 A single hidden layer network with linear nodes and an input delay line
where the output layer has m linear nodes, the hidden layer has N linear nodes, w_ij
(for i = 1, 2, ..., N and j = 1, 2, ..., n) are the input weights, β_is (for
i = 1, 2, ..., N and s = 1, 2, ..., m) are the output weights, y_i(k) (for
i = 1, 2, ..., N) are the outputs of the hidden nodes, and a string of z^{-1} blocks,
seen at the input layer, are unit-delay elements that ensure the input
sequence x(k), x(k-1), ..., x(k-n+1) represents a time series, consisting of both
the present and the past observations of the process.
Remark 5.2.1: It is well known from [113] that an SLFN with both linear nodes and a
string of n-1 unit-delay elements at the input layer, as seen in Figure 5.1, has the
capability of universal approximation of any continuous function. It is because of the
unit-delay elements added to the input layer that every hidden node in the SLFN
performs as an (n-1)th-order FIR filter that can approximate any linear or non-linear
function by properly choosing the input weights. In addition, the design and analysis of
the SLFNs with the linear nodes and the unit-delay elements are much easier than the
ones of the SLFNs with non-linear nodes.
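The FIR-filter view in this remark can be checked numerically: the dot product of a hidden node's weights with the delay-line contents reproduces an FIR convolution of the input sequence. The delay-line length and signal below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8                       # delay-line length (n-1 unit delays), assumed
w = rng.standard_normal(n)  # input weights of one hidden node
x = rng.standard_normal(50) # an arbitrary input sequence

# Hidden-node output at time k: y(k) = sum_j w[j] * x(k - j),
# i.e. the dot product of the weights with the delay-line contents
# [x(k), x(k-1), ..., x(k-n+1)].
y_dot = np.array([w @ x[k - n + 1:k + 1][::-1] for k in range(n - 1, len(x))])

# The same sequence via convolution: each linear hidden node is an
# (n-1)th-order FIR filter with impulse response w.
y_conv = np.convolve(x, w, mode="full")[n - 1:len(x)]

print(np.allclose(y_dot, y_conv))  # True
```

Choosing the weights w therefore amounts to choosing an FIR impulse response for each hidden node.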
As seen in Figure 5.1, the input and the output data vectors of the SLFN can be
expressed in the forms:

x(k) = [ x(k), x(k-1), ..., x(k-n+1) ]^T    (5.2.1)

o(k) = [ o_1(k), o_2(k), ..., o_m(k) ]^T    (5.2.2)

the output of the ith hidden node can be computed as:

y_i(k) = ∑_{j=1}^{n} w_ij x(k-j+1) = w_i^T x(k)    (5.2.3)

with

w_i = [ w_i1, w_i2, ..., w_in ]^T    (5.2.4)

and the sth output of the network, o_s(k), is of the form:

o_s(k) = ∑_{i=1}^{N} β_is y_i(k) = β_s^T y(k)    (5.2.5)

with

β_s = [ β_1s, β_2s, ..., β_Ns ]^T    (5.2.6)

and

y(k) = [ y_1(k), y_2(k), ..., y_N(k) ]^T    (5.2.7)

Thus, the output data vector o(k) can be expressed as:

o(k) = [ o_1(k), o_2(k), ..., o_m(k) ]^T
     = [ β_1^T y(k), β_2^T y(k), ..., β_m^T y(k) ]^T
     = β^T y(k)    (5.2.8)

with

β = [ β_1, β_2, ..., β_m ]    (5.2.9)

Suppose that there are P input pattern vectors x(1), x(2), ..., x(P) and P
corresponding desired output data vectors t(1), t(2), ..., t(P), respectively, for
training the SLFN in Figure 5.1, where

t(k) = [ t_1(k), t_2(k), ..., t_m(k) ]^T    (5.2.10)

Then, using (5.2.8), it can be seen that

[ o(1), o(2), ..., o(P) ]^T = [ β^T y(1), β^T y(2), ..., β^T y(P) ]^T    (5.2.11)

or

O = H β    (5.2.12)

with

H = [ y(1), y(2), ..., y(P) ]^T
  = [ y_1(1)  y_2(1)  ...  y_N(1)
      y_1(2)  y_2(2)  ...  y_N(2)
      ⋮       ⋮            ⋮
      y_1(P)  y_2(P)  ...  y_N(P) ]    (5.2.13)

and

O = [ o(1), o(2), ..., o(P) ]^T    (5.2.14)
Remark 5.2.2: The matrix H in (5.2.13) is the hidden layer output matrix of the SLFN
that contains all output feature vectors corresponding to the input data vectors
x(1), x(2), ..., x(P), respectively. For instance, for the kth given input data vector x(k), the
corresponding feature vector from the outputs of the hidden layer is the kth row of the
matrix H:

h(k) = [ y_1(k), y_2(k), ..., y_N(k) ]    (5.2.15)

Since the output of the ith hidden node, corresponding to the kth input pattern
vector, can be written as:

y_i(k) = w_i^T x(k)    (5.2.16)

the kth output feature vector in (5.2.15) can then be written as:

h(k) = [ w_1^T x(k), w_2^T x(k), ..., w_N^T x(k) ]    (5.2.17)
For further analysis, the frequency components of the kth output feature vector
in (5.2.17) can be expressed, in terms of its discrete Fourier transform (DFT), as follows
[107], [148]:

Y_k[0] = ∑_{i=1}^{N} y_i(k)    (5.2.18)

Y_k[1] = ∑_{i=1}^{N} y_i(k) e^{-j2π(i-1)/N}    (5.2.19)

Y_k[2] = ∑_{i=1}^{N} y_i(k) e^{-j4π(i-1)/N}    (5.2.20)

⋮

Y_k[N-1] = ∑_{i=1}^{N} y_i(k) e^{-j2π(N-1)(i-1)/N}    (5.2.21)

where Y_k[0], Y_k[1], ..., Y_k[N-1] are the samples of the frequency spectrum of the kth
output feature vector in (5.2.17) at the N equally spaced frequencies ω_p = 2πp/N, for
p = 0, 1, ..., N-1, respectively.
For further processing, (5.2.18)-(5.2.21) can be expressed in the following
matrix form:

[ Y_k[0]   ]   [ 1  1                ...  1                       ] [ y_1(k) ]
[ Y_k[1]   ] = [ 1  e^{-j2π/N}       ...  e^{-j2π(N-1)/N}         ] [ y_2(k) ]
[ ⋮        ]   [ ⋮                        ⋮                       ] [ ⋮      ]
[ Y_k[N-1] ]   [ 1  e^{-j2π(N-1)/N}  ...  e^{-j2π(N-1)(N-1)/N}    ] [ y_N(k) ]    (5.2.22)

Define

F(k) = [ Y_k[0], Y_k[1], ..., Y_k[N-1] ]^T    (5.2.23)

E = [ 1  1                ...  1
      1  e^{-j2π/N}       ...  e^{-j2π(N-1)/N}
      ⋮                        ⋮
      1  e^{-j2π(N-1)/N}  ...  e^{-j2π(N-1)(N-1)/N} ]    (5.2.24)

and

y(k) = [ y_1(k), y_2(k), ..., y_N(k) ]^T    (5.2.25)

(5.2.22) can then be written as:

F(k) = E y(k)    (5.2.26)
Using (5.2.3) in (5.2.25) leads to:

y(k) = [ w_1^T x(k)          [ w_1^T
         w_2^T x(k)     =      w_2^T     x(k) = W x(k)    (5.2.27)
         ⋮                     ⋮
         w_N^T x(k) ]          w_N^T ]

Then, (5.2.26) becomes

F(k) = E W x(k)    (5.2.28)

Considering all P input data pattern vectors x(1), x(2), ..., x(P), it can be
seen that

[ F(1), F(2), ..., F(P) ] = E W [ x(1), x(2), ..., x(P) ]    (5.2.29)

or

F = E W X    (5.2.30)

with

F = [ F(1), F(2), ..., F(P) ],   X = [ x(1), x(2), ..., x(P) ]    (5.2.31)

Let the frequency spectrums of the desired feature vectors be described by

D = [ D(1), D(2), ..., D(P) ]    (5.2.32)
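A small numerical sketch of this formulation, under assumed sizes, builds the transformation matrix E of (5.2.24) and obtains the spectra of all hidden layer feature vectors in a single product of the form E W X, cross-checked against a standard FFT:

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, P = 16, 10, 5              # hidden nodes, inputs, patterns (assumed)

W = rng.standard_normal((N, n))  # input weight matrix, rows w_i^T
X = rng.standard_normal((n, P))  # input pattern vectors as columns

# Transformation (DFT) matrix E of (5.2.24): E[p, q] = exp(-j*2*pi*p*q/N)
p, q = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
E = np.exp(-2j * np.pi * p * q / N)

# Spectra of all hidden layer feature vectors at once, as in F = E W X
F = E @ W @ X

# Each column of F is the DFT of the corresponding feature vector W x(k)
assert np.allclose(F, np.fft.fft(W @ X, axis=0))
```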
Remark 5.2.3: The selection of the frequency spectrums of the desired output feature
vectors in (5.2.32) is mainly based on one’s understanding of the characteristics of the
input patterns. For instance, if all input patterns are linearly separable, the frequency
spectrums of the desired feature vectors can be chosen as those of the filtered input
pattern vectors, by removing all frequency components of the disturbances from the
frequency spectrums of the input pattern vectors. In this case, the hidden layer of the
SLFN with the optimal input weights plays a role of pre-processing and providing the
linearly separable feature vectors for the accurate classification in the output layer. In
practice, the frequency spectrums of the desired output feature vectors in (5.2.32) can
be provided by a reference filter for designing the optimal input weight matrix in
(5.2.27). This point can be seen from the first example in the simulation section.
Remark 5.2.4: However, if the input patterns are non-linearly separable, the frequency
spectrums of the desired feature vectors in (5.2.32) should be assigned to the “desired
location” in the frequency-domain in the sense that, through the optimization of the
input weights of the SLFN, the separability of all feature vectors from the outputs of the
hidden layer is maximized in the feature space. This point will be explored in detail in
the second example of the simulation section.
In the next section the regularization theory in [1], [2] and [149] will be used to
design both the optimal input weight matrix W and the optimal output weight matrix β.
Also, the effects of the regularization parameters and the number of the hidden nodes
on the performance of the SLFN classifier will be explored in detail.
5.3 Design of the Optimal Input and Output Weights
As described in Section 5.2, the input weight matrix W in the hidden layer of the SLFN
should be trained such that the error between the frequency components of the desired
feature vectors and the ones of the feature vectors from the outputs of the hidden layer is
minimized. For this purpose, the design issue can be formulated by the following
regularization problem [1], [2], [9], [149]:

Minimize { (γ_1/2) ‖E_1‖² + (λ_1/2) ‖W‖² }    (5.3.1)

Subject to E_1 = D − E W X    (5.3.2)

where E_1 is the error between the frequency components of the desired feature vectors
and the ones of the corresponding feature vectors from the outputs of the hidden layer,
γ_1 and λ_1 are the positive real regularization parameters, and ‖W‖² is the regularizer
which, through the proper choice of the regularization parameters γ_1 and λ_1, is used
to solve the ill-posed inverse of the data matrix X X^T (see the later discussion).
The optimization problem in (5.3.1) with the constraint in (5.3.2) can be solved
conveniently by using the method of Lagrange multipliers [2], [112] and [149]. For this,
construct the Lagrange function L_1 as:
L_1 = (γ_1/2) ∑_{p=1}^{N} ∑_{k=1}^{P} e_pk² + (λ_1/2) ∑_{i=1}^{N} ∑_{j=1}^{n} w_ij²
      − ∑_{p=1}^{N} ∑_{k=1}^{P} α_pk ( e_pk − d_pk + ∑_{i=1}^{N} ε_pi ∑_{j=1}^{n} w_ij x_j(k) )    (5.3.3)

where e_pk is the (p,k)th element of the error matrix E_1 defined in (5.3.2), w_ij is the (i,j)th
element of the input weight matrix W, d_pk is the (p,k)th element of D, the desired
frequency spectrums of the output feature vectors defined in (5.2.32),
∑_i ε_pi ∑_j w_ij x_j(k) is the (p,k)th element of E W X, the matrix of the frequency-
spectrum samples of the output feature vectors of the hidden layer, ε_pi is the (p,i)th
element of the transformation matrix E in (5.2.24), w_i is the ith column vector of W^T, the
transpose of the input weight matrix W, x_j(k) is the jth element of the input pattern
vector x(k), and α_pk is the (p,k)th Lagrange multiplier.
Differentiating L_1 with respect to w_ij, it can be seen that

∂L_1/∂w_ij = λ_1 w_ij − ∑_{p=1}^{N} ∑_{k=1}^{P} ε*_pi α_pk x_j(k)    (5.3.4)

where ε*_pi denotes the complex conjugate of ε_pi. It is noted that w_i and x(k) can be expressed as:

w_i = [ w_i1, w_i2, ..., w_in ]^T    (5.3.5)

and

x(k) = [ x_1(k), x_2(k), ..., x_n(k) ]^T    (5.3.6)

(5.3.4) can then be written as:

∂L_1/∂w_i = λ_1 w_i − ∑_{p=1}^{N} ∑_{k=1}^{P} ε*_pi α_pk x(k)    (5.3.7)

Letting ∂L_1/∂w_i = 0,

λ_1 w_i = ∑_{k=1}^{P} ( ∑_{p=1}^{N} ε*_pi α_pk ) x(k)    (5.3.8)

Noting that ∑_p ε*_pi α_pk is the (i,k)th element of E^H A, where A = [α_pk] is the
Lagrange multiplier matrix and E^H is the conjugate transpose of E, (5.3.8) can be
stacked over i = 1, 2, ..., N as

λ_1 [ w_1^T
      w_2^T
      ⋮
      w_N^T ] = E^H A [ x(1), x(2), ..., x(P) ]^T    (5.3.9)

i.e.

λ_1 W = E^H A X^T    (5.3.10)

Thus

W = (1/λ_1) E^H A X^T    (5.3.11)

In addition, differentiating L_1 with respect to e_pk,

∂L_1/∂e_pk = γ_1 e_pk − α_pk    (5.3.12)

Using the Kuhn-Tucker condition, ∂L_1/∂e_pk = 0, [110], [112], the following
relationship can be obtained:

γ_1 e_pk = α_pk    (5.3.13)

thus,

A = γ_1 E_1    (5.3.14)

Considering the constraint in (5.3.2), (5.3.14) can be expressed as:

A = γ_1 ( D − E W X )    (5.3.15)

and using (5.3.15) in (5.3.11) leads to

λ_1 W = γ_1 E^H ( D − E W X ) X^T = γ_1 E^H D X^T − γ_1 E^H E W X X^T    (5.3.16)

Considering the fact that

E^H E = N I    (5.3.17)

where I is the unity matrix, (5.3.16) can then be expressed as

λ_1 W + γ_1 N W X X^T = γ_1 E^H D X^T    (5.3.18)

Then, the optimal input weight matrix is derived as follows:

W* = (1/N) E^H D X^T ( (λ_1/(γ_1 N)) I + X X^T )^{-1}    (5.3.19)
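The closed-form input weights can be verified numerically. The sketch below, with assumed dimensions, regularization values, and desired spectra, computes the optimal W and checks that it makes the gradient of the regularized cost vanish:

```python
import numpy as np

rng = np.random.default_rng(3)
N, n, P = 16, 10, 5            # hidden nodes, inputs, patterns (assumed)
lam, gam = 1e-3, 1.0           # illustrative regularization parameters

X = rng.standard_normal((n, P))
p, q = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
E = np.exp(-2j * np.pi * p * q / N)                 # DFT matrix: E^H E = N*I
D = np.fft.fft(rng.standard_normal((N, P)), axis=0) # assumed desired spectra

# (5.3.19), written in the algebraically equivalent form
# W = gam * E^H D X^T (lam*I + gam*N*X X^T)^{-1}
W = gam * E.conj().T @ D @ X.T @ np.linalg.inv(
        lam * np.eye(n) + gam * N * (X @ X.T))

# Stationarity of the regularized cost: lam*W - gam*E^H (D - E W X) X^T = 0
grad = lam * W - gam * E.conj().T @ (D - E @ W @ X) @ X.T
print(np.max(np.abs(grad)))    # zero up to floating point
```

Because E^H E = N I for the DFT matrix, the inverse involves only the n-by-n matrix X X^T, not the hidden layer dimension.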
Remark 5.3.1: In conventional regularization theory [1], the regularization parameter
γ_1 in (5.3.1) is set to 1 and the regularization for solving the ill-posed inverse problem
of the data matrix X X^T depends only on the proper choice of the small regularization
parameter λ_1, in the sense that the regularization term (λ_1/2)‖W‖² makes the augmented
cost function in (5.3.1) smoother around the minimum point (or vertex) of the cost
function in the input weight space. However, the dynamic behaviors of the hidden layer
of the SLFN are also related to the slope of the augmented cost function in (5.3.1), which is
adjusted by the regularization parameter γ_1. This point can be indirectly seen from
(5.3.19).
Remark 5.3.2: The Lagrange multiplier matrix A in (5.3.15) describes the sensitivity of
the objective function in (5.3.1) with respect to the constraint in (5.3.2), that is, how
tightly the constraint in (5.3.2) is binding at the optimal point with the optimal input
weights in (5.3.19) [2], [110], [112]. It is seen from (5.3.15) that, as the value of the
regularization parameter γ_1 becomes smaller, the cost function in (5.3.1) is less
sensitive with respect to the change of the constraint in (5.3.2). Thus, the Lagrange
multiplier matrix A plays a role of qualitatively estimating the effects of the input
disturbance, the structural risk and the empirical risk on the robustness of the SLFN
classifier with the DFT-ELM, seen from the output of the hidden layer.
Remark 5.3.3: Also, the effects of the regularization parameter γ_1 on the robustness of
the output of the hidden layer can be seen from the slope of the regularized cost function
in (5.3.1), because the value of the regularization parameter γ_1 affects the width or the
steepness of the regularized cost function in (5.3.1). For instance, if the value of the
regularization parameter γ_1 is very large, the slope of the cost function in (5.3.1) will be
very steep. The hidden layer of the SLFN is thus very sensitive to the change of the
input weights and input disturbances. However, as the value of the regularization
parameter γ_1 is reduced, the changing rate of the cost function in (5.3.1) becomes
smaller and the hidden layer of the SLFN is thus less sensitive to the change of the input
weights and input disturbances. Therefore, it is necessary to properly choose the
regularization parameter γ_1 so that the robustness of the hidden layer can be guaranteed.
Remark 5.3.4: The effects of the number of hidden nodes, N, on the smoothness of the ill-
posed data matrix X X^T and the sensitivity of the input weight matrix W have been
clearly revealed in (5.3.19). (i) When the number of the hidden nodes N is large, the
regularization factor λ_1/(γ_1 N) in (5.3.19) is small, and thus the well-posed matrix term
(λ_1/(γ_1 N)) I + X X^T in (5.3.19) can well approximate the ill-posed data matrix X X^T
in the vicinity of its singular point; (ii) Although the inverse term
( (λ_1/(γ_1 N)) I + X X^T )^{-1} in (5.3.19) may be sensitive to the changes of the input
data because of the input disturbances, the sensitivity of the optimal input weight
matrix W* is greatly reduced because the inverse term has been weighted by the small
factor 1/N in (5.3.19). However, it should be noted that a larger number of hidden nodes
may need a longer training time. This point can be clearly seen in the simulation section.
Therefore, in practice, the number of hidden nodes must be chosen properly so that the
SLFN classifier can provide a tradeoff between sensitivity and training time.
Next, the problem of how to obtain the optimal output weight matrix β to
minimize the error between the desired output pattern and the actual output pattern of
the SLFN classifier is considered. Similar to the discussion above on the optimization
of the input weight matrix W, the optimization of the output weight matrix β is stated
as follows:

Minimize { (γ_2/2) ‖E_2‖² + (λ_2/2) ‖β‖² }    (5.3.20)

Subject to E_2 = T − H β    (5.3.21)

with the corresponding Lagrange function L_2:

L_2 = (γ_2/2) ∑_{k=1}^{P} ∑_{s=1}^{m} η_ks² + (λ_2/2) ∑_{i=1}^{N} ∑_{s=1}^{m} β_is²
      − ∑_{k=1}^{P} ∑_{s=1}^{m} μ_ks ( η_ks − t_ks + h(k) β_s )    (5.3.22)

where η_ks is the (k,s)th element of the error matrix E_2, β_is is the (i,s)th element of the
output weight matrix β, t_ks is the (k,s)th element of the desired output data matrix
T = [ t(1), t(2), ..., t(P) ]^T, h(k) is the kth row of the hidden layer output matrix H,
β_s is the sth column of the output weight matrix β, μ_ks is the (k,s)th Lagrange
multiplier, γ_2 and λ_2 are real positive regularization parameters, and ‖β‖² is the
regularizer of the output layer.
Similar to the discussion from (5.3.4) to (5.3.19), it is required to first compute
the partial derivatives ∂L_2/∂β_is and ∂L_2/∂η_ks, and, by solving the equations
∂L_2/∂β_is = 0 and ∂L_2/∂η_ks = 0, the optimal output layer weight matrix can then be
obtained as:

β* = ( (λ_2/γ_2) I + H^T H )^{-1} H^T T    (5.3.23)

with the corresponding sensitivity matrix:

M = γ_2 ( T − H β* )    (5.3.24)

Remark 5.3.5: It is seen from (5.3.23) that the optimal output layer weight matrix β*
depends on the ratio λ_2/γ_2 instead of the individual values of λ_2 and γ_2. However, the
dynamic behavior of the output layer of the SLFN depends not only on the ratio λ_2/γ_2,
but also on the values of λ_2 and γ_2. As discussed in Remarks 5.3.1 and 5.3.2, the
regularization parameters λ_2 and γ_2 as well as the ratio λ_2/γ_2 should be chosen to be
properly small for ensuring the trade-off between the sensitivity and the smoothness of
the SLFN classifier.
Remark 5.3.6: Based on the discussions above, the proposed DFT-ELM
algorithm can be summarized as follows:
Step 1: Define the frequency-spectrum sample matrix D of the desired
feature vectors in (5.2.32);
Step 2: Compute the transformation matrix E from (5.2.24);
Step 3: Compute the optimal input weight matrix W* from (5.3.19);
Step 4: Compute the hidden layer output matrix H from (5.2.13);
Step 5: Compute the optimal output weight matrix β* from (5.3.23).
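A compact end-to-end sketch of Steps 1-5 follows; the sizes, targets, and the reference spectra chosen in Step 1 are all illustrative assumptions rather than values from the chapter:

```python
import numpy as np

rng = np.random.default_rng(4)
N, n, P, m = 16, 10, 12, 3          # hidden nodes, inputs, patterns, outputs
lam1, gam1 = 1e-3, 1.0              # input-layer regularization (assumed)
lam2, gam2 = 1e-3, 1.0              # output-layer regularization (assumed)

X = rng.standard_normal((n, P))     # input pattern vectors as columns
T = rng.standard_normal((P, m))     # desired outputs, one row per pattern

# Step 1: desired frequency-spectrum samples of the feature vectors;
# here simply the DFTs of some placeholder reference feature vectors.
D = np.fft.fft(rng.standard_normal((N, P)), axis=0)

# Step 2: transformation (DFT) matrix E
p, q = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
E = np.exp(-2j * np.pi * p * q / N)

# Step 3: optimal input weights; the real part is kept since the
# network weights are real-valued.
W = (gam1 * E.conj().T @ D @ X.T @ np.linalg.inv(
        lam1 * np.eye(n) + gam1 * N * (X @ X.T))).real

# Step 4: hidden layer output matrix, one feature vector per row
H = (W @ X).T                       # shape (P, N)

# Step 5: optimal output weights (ridge-regularized least squares)
beta = np.linalg.solve((lam2 / gam2) * np.eye(N) + H.T @ H, H.T @ T)

print(W.shape, H.shape, beta.shape)
```

With W and beta in hand, a new pattern x would be classified by evaluating beta^T (W x), as in (5.2.8).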
5.4 Experiments and Results
In order to verify the effectiveness of the proposed DFT-ELM, the following two
classification examples are presented in this simulation section:
5.4.1 Example 1: Classification of Low Frequency Sound Clips
Consider 10 given input pattern vectors that contain the samples of computer-generated
low frequency sound clips with the frequencies of 100 Hz, 150 Hz, 200 Hz,
300 Hz, ..., 900 Hz, respectively, modulated by an envelope function to create tones.
In order to sufficiently reflect the characteristics of the signals to be classified, each
input pattern vector contains 1000 sample data. Figure 5.2 shows the 700 Hz sound clip
modulated by the envelope function. For the classification of the sound clips, consider
the SLFN classifiers with a linear output node, 30 hidden nodes and a tapped-delay-line
memory with 29 unit-delay elements at each of the input layers. The desired output
pattern values of the SLFNs, representing the signal classifier states for the
corresponding 10 training input pattern vectors, are generated by a sine function
sampled with equal increments. The SLFN classifiers are then trained with the R-ELM,
the FIR-ELM and the DFT-ELM, respectively. To examine the robustness performance
of these SLFN classifiers, an additive white Gaussian noise (AWGN) with a signal-to-noise
ratio (SNR) of 10 dB is added to all sound clips during testing. The sound clip
with an SNR of 10 dB is shown in Figure 5.3.
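The 10 dB test condition can be reproduced with a few lines; the sample rate and tone frequency below are assumed values and the tone envelope is omitted:

```python
import numpy as np

rng = np.random.default_rng(5)
fs, f0, dur = 8000, 700, 0.5        # assumed sample rate, tone, duration
t = np.arange(int(fs * dur)) / fs
clip = np.sin(2 * np.pi * f0 * t)   # clean tone (envelope omitted here)

# Add white Gaussian noise scaled so that the SNR is 10 dB, i.e. the
# signal power is 10^(10/10) = 10 times the noise power.
snr_db = 10.0
sig_power = np.mean(clip ** 2)
noise_power = sig_power / 10 ** (snr_db / 10)
noisy = clip + rng.normal(0.0, np.sqrt(noise_power), clip.shape)

measured = 10 * np.log10(sig_power / np.mean((noisy - clip) ** 2))
print(round(measured, 1))           # close to the target 10 dB
```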
Figure 5.2 A sound clip modulated by the envelope function
Figure 5.3 The disturbed sound clip with SNR of 10 dB
(Axes for both figures: amplitude versus time in seconds; panels labelled “Clear Clip” and “High Noise Clip”.)
132
Figure 5.4 and Figure 5.5 show the classification results of the SLFN classifier trained with the R-ELM in [18], where the input weights and the hidden layer biases are generated randomly within [-1, 1], and the logarithmic sigmoid is used as the activation function of the non-linear nodes in the hidden layer. It is seen that the output values of the classifier deviate significantly from the desired values of the classifier states. Such unsatisfactory classification results are mainly due to the fact that the random input weights, the random hidden layer biases and the AWGN make the classifier with the R-ELM experience both large structural and large empirical risks.
Figure 5.6 and Figure 5.7 show the classification results and the corresponding RMSEs of the SLFN classifier trained with the FIR-ELM developed in [9], where the input weights are assigned such that each linear hidden node performs as a low-pass filter with a Kaiser window of length 1001 and cut-off frequencies from 200 Hz to 4 kHz, respectively. The regularization parameters d and λ are chosen to balance and reduce the effects of both the structural and the empirical risks. It is seen that, compared with Figure 5.4 and Figure 5.5 for the R-ELM, the effects of the input white Gaussian noise have been greatly reduced, since the hidden layer, acting as a pre-processor, is capable of removing the high frequency background noise. The accuracy of classification with the FIR-ELM is thus improved, with a much smaller deviation between the classifier outputs and the desired values of the classifier states, as seen from the RMSEs in Figure 5.7.
The classification results of the SLFN classifier with the DFT-ELM are shown in Figure 5.8 and Figure 5.9, respectively. In this simulation, the frequency spectrum sample matrix of the desired feature vectors is chosen from the DFTs of the outputs of the hidden nodes of the SLFN trained with the FIR-ELM in Figure 5.6 and Figure 5.7. It is noted that, after training with the DFT-ELM, the hidden layer of the SLFN classifier has eliminated the effects of the input white Gaussian noise. Most importantly, with the regularization parameters chosen optimally through the experiments, highly accurate classification performance has been achieved, with a much smaller deviation between the classifier outputs and the desired values of the classifier states, compared with the ones with the
R-ELM in Figure 5.4 and Figure 5.5, and the ones with the FIR-ELM in Figure 5.6 and
Figure 5.7, respectively.
Figure 5.4 Classification using R-ELM with SNR of 10 dB
Figure 5.5 The RMSE using R-ELM with SNR of 10 dB
Figure 5.6 The classification using FIR-ELM with SNR of 10 dB
Figure 5.7 The RMSE using FIR-ELM with SNR of 10 dB
Figure 5.8 Classification using DFT-ELM with SNR of 10 dB
Figure 5.9 The RMSE using DFT-ELM with SNR of 10 dB
For further analysis, performance comparisons of the SLFN classifiers, trained with the R-ELM, the FIR-ELM and the DFT-ELM, respectively, are carried out in terms of the RMSE averaged over 50 iterations at different SNRs, as shown in Table 5.1. It is seen that, when the SNR is small (10 dB), the R-ELM is very sensitive to the noisy environment, with the result that both the mean and the standard deviation are much larger than those of the FIR-ELM and the DFT-ELM. It is further noted that, although the hidden layers of the SLFNs trained with the FIR-ELM and the DFT-ELM have the same low-pass filtering property, the DFT-ELM algorithm exhibits a stronger robustness property than the FIR-ELM. The reason is that the input weights of the SLFN with the DFT-ELM are optimized, which ensures that the hidden layer of the SLFN is less sensitive than that of the FIR-ELM with respect to different SNRs.
Table 5.1: Comparisons of the averaged RMSEs of the R-ELM, FIR-ELM and DFT-ELM
SNR (dB) R-ELM FIR-ELM DFT-ELM
Mean Std. Dev. Mean Std. Dev. Mean Std. Dev.
10 0.1761 0.0389 0.0586 0.0138 0.0146 0.0041
20 0.0519 0.0154 0.0216 0.0052 0.0072 0.0010
30 0.0214 0.0057 0.0066 0.0011 0.0062 0.0003
Figure 5.10 shows the curves of both the RMSE and the classification error of the SLFN classifier with the DFT-ELM, where the number of hidden nodes is fixed at 30, but the regularization parameter ratio d1/λ1 = d2/λ2 is changed from 0.01 to 0.1. It is seen that the curves of the RMSE and the classification error intersect at the ratio of 0.02, and this intersection is located at the point that gives the optimal choice of the regularization parameters d1, λ1, d2 and λ2.
The following facts are noted from Figure 5.10: (i) when the regularization parameter ratio is less than 0.02, the classification error is small but the RMSE is relatively large; in this case, the SLFN classifier is sensitive to the input disturbances; (ii) when the regularization parameter ratio is greater than 0.02, the classification error increases but the RMSE is reduced. The small RMSE means that the SLFN classifier is robust with respect to the input disturbances. Therefore, as a tradeoff between the classification error and the RMSE, it is better to choose the values and the ratio of the regularization parameters around the intersection of the curves of the RMSE and the classification error, as seen in Figure 5.10.
Figure 5.10 The RMSE and classification error versus the regularization parameter ratio d/λ for the DFT-ELM
Figure 5.11 shows the curves of the RMSE and the classification error of the SLFN with the DFT-ELM, where the regularization parameters are chosen such that the regularization parameter ratio d1/λ1 = d2/λ2 is fixed at 0.02, but the number of hidden nodes is changed from 4 to 50. It is seen that, when the number of hidden nodes is less than 30, the classification error is large. This is because the feature vectors provided by the hidden layer cannot sufficiently represent the input patterns. However, as the number of hidden nodes increases, the classification performance is gradually improved. This agrees with the discussion in Remark 5.3.3 that, when the number of hidden nodes is large, both the excellent
classification results and strong robustness with respect to the input disturbances can be achieved with the DFT-ELM. It is also noted that the decrease in the classification error as the number of neurons increases is not monotonic. An increase in the number of neurons translates to an increase in the bandwidth of the frequency spectrum matrix, which affects the feature mapping of the DFT-ELM hidden layer. One reason for the non-monotonic decrease of the classification error may be the effects of the non-linear feature mapping performed at the DFT-ELM hidden layer. Comparing the independent effects of tuning the regularization ratio and the number of hidden nodes, finding the optimal number of hidden nodes is the more important of the two.
Figure 5.11 The RMSE and classification error via the number of hidden nodes with
the DFT-ELM
5.4.2 Example 2: Classification of Handwritten Digits
In this example, a handwriting recognition experiment, using the SLFN classifier,
trained with the DFT-ELM, is implemented. The training data are taken from the
MNIST handwritten digits database, where 60,000 training samples and 10,000 testing
samples are collected from multiple disjoint sets of writers [82]. The MNIST
handwritten digits database consists of the handwritten digits from 0 to 9, respectively.
Each digit is a gray-scale image of 28x28 pixels with the intensity range of 0 (black) to
255 (white). In order to conduct an unbiased experiment, the training and testing sets are
all randomly selected from the pool of both the training and the testing samples. A
sample set of digits is shown in Figure 5.12.
Figure 5.12 A set of handwritten digits from the MNIST database
In order to classify the handwritten digits with the DFT-ELM algorithm, each
image, as in Figure 5.13(a), is first divided into 14 rows by 14 columns, that is, the
image is segmented into 196 small images, as seen in Figure 5.13(b).
Figure 5.13(a) Image of digit Figure 5.13(b) Segmented image
The mean of the pixel intensities of each segment in Figure 5.13(b) is computed as:

m_s = (1/(R C)) Σ_{i=1}^{R} Σ_{j=1}^{C} p_s(i, j)                    (5.4.1)

where p_s(i, j) is the intensity of the (i, j)th pixel of the sth segment, and R (= 2) and C (= 2) are the numbers of rows and columns of pixels in each segment, respectively.
The means of all of the segments, from row 1 to row 14, in Figure 5.13(b), are then arranged as the elements of the following time series type data vector:

v = [m_1  m_2  …  m_196]^T                    (5.4.2)
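The segmentation and block averaging described above can be sketched as follows for a 28 × 28 image; the numpy reshaping trick below is one possible implementation of the 14 × 14 segmentation, not necessarily the one used in the thesis.

```python
import numpy as np

def segment_means(image, rows=14, cols=14):
    """Divide a grey-scale image into rows x cols segments and return the
    segment means, scanned row by row, as a time series type data vector."""
    h, w = image.shape
    r, c = h // rows, w // cols               # pixels per segment (2 x 2 here)
    blocks = image.reshape(rows, r, cols, c)  # [seg_row, pix_row, seg_col, pix_col]
    return blocks.mean(axis=(1, 3)).reshape(-1)  # length rows*cols = 196
```

Applied to a 28 × 28 MNIST digit, this yields the 196-element data vector of (5.4.2).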
A sample set of the time series type data vectors, corresponding to a set of the 10
handwritten digits in Figure 5.12, are given in Figure 5.14(a) to Figure 5.14(j),
respectively:
Figure 5.14 (a)
Figure 5.14 (b)
Figure 5.14 (c)
Figure 5.14 (d)
Figure 5.14 (e)
Figure 5.14 (f)
Figure 5.14 (g)
Figure 5.14 (h)
Figure 5.14 (i)
Figure 5.14 (j)
Figure 5.14(a) ~ Figure 5.14(j) Encoded sample data set for images 0 to 9
Obviously, the coded time series type data vectors that represent the 10 handwritten digits in Figure 5.14(a)~Figure 5.14(j) are non-linearly separable. Considering the fact that, for pattern classification purposes, a feature vector is simply used to represent an input pattern in the feature space, one may theoretically assign an arbitrary vector in the feature space to represent an input pattern. Recall the pole-placement method in control engineering [114], where the poles of a closed-loop continuous system can be assigned at any desired locations in the left half of the complex s-plane. Similarly, one may assign the DFTs of all desired feature vectors to the "desired positions" in the frequency domain, in the sense that, after training with the DFT-ELM, the separability of all feature vectors from the outputs of the hidden layer of the SLFN classifier is maximized in feature space.
Based on the above idea concerning the assignment of the DFTs of the desired
feature vectors in frequency domain, the desired feature assignment and the input
weights’ optimization can be summarized in the following three steps:
1) Assume that the SLFN classifier reads each time series type sample data vector in (5.4.2) in one second; the virtual frequency components of the data vectors are then distributed between 0 Hz and 100 Hz.
2) Use the DFTs of 10 sine waves, whose lengths are all equal to the number of hidden nodes and whose frequencies are 2 Hz, 12 Hz, 22 Hz, …, 92 Hz, respectively, as the frequency spectra of the desired feature vectors, representing the corresponding handwritten digits from 0 to 9, respectively, in feature space.
3) Minimize (5.3.1) with the constraint in (5.3.2) to derive the optimized input
weights in (5.3.19).
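The construction of the desired frequency spectra in step 2 can be sketched as follows. The virtual sampling rate of 200 Hz is an assumption made here so that the 0–100 Hz band of step 1 occupies the half spectrum; with 100 hidden nodes every chosen frequency then falls on an exact DFT bin.

```python
import numpy as np

def desired_feature_spectra(num_hidden=100, freqs=range(2, 100, 10), fs=200):
    """Frequency spectra of the desired feature vectors: the DFTs of 10
    sine waves at 2 Hz, 12 Hz, ..., 92 Hz, each of length equal to the
    number of hidden nodes. fs = 200 Hz is an assumed virtual rate."""
    t = np.arange(num_hidden) / fs
    return np.array([np.fft.fft(np.sin(2 * np.pi * f * t)) for f in freqs])
```

Each row is then the assigned "desired position" in the frequency domain for one digit class, and the DFT-ELM optimizes the input weights so that the hidden layer outputs approach these spectra.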
Table 5.2 shows the comparison of the classification accuracies of the SLFN classifiers. It is seen that the SLFN with the DFT-ELM achieves a high degree of classification accuracy as well as very small deviations consistently over the 50 iterations, as the number of hidden nodes is increased from 50 to 200. The classification performances of the SLFNs with both the R-ELM and the
FIR-ELM are improved with the increase in the number of hidden nodes. However, the SLFN with the R-ELM shows a much larger deviation. The reason is that the input weights of the SLFN classifier with the R-ELM are chosen randomly, which makes the output of the SLFN classifier very sensitive to changes in the input pattern vectors. This point has been discussed in detail in both Example 1 of this experimental section and the reference [9].
Table 5.2: Classification accuracies of the handwritten digit classification
with R-ELM, FIR-ELM and DFT-ELM
Neurons R-ELM FIR-ELM DFT-ELM
Mean (%) Std. Dev. (%) Mean (%) Std. Dev. (%) Mean (%) Std. Dev. (%)
50 68.55 1.59 42.49 0.56 86.81 0.24
100 78.57 0.78 74.80 0.33 86.58 0.24
150 82.80 0.51 84.14 0.29 86.30 0.28
200 85.07 0.47 85.24 0.27 86.52 0.24
Please note that the DFT-ELM needs a longer training time compared with both
the R-ELM and the FIR-ELM, especially as the number of hidden nodes is large. This is
because the optimization of the input weights of the SLFN with the DFT-ELM takes a
longer time during the training. However, it is the optimization of the input weights of
the SLFN that makes the DFT-ELM have the high degree of classification accuracy and
very small deviations as seen in Table 5.2.
Figure 5.15 shows the comparisons of the sample means of the classification accuracies of the SLFN classifiers with 50 hidden nodes, trained with the DFT-ELM, the FIR-ELM and the R-ELM, respectively, and tested with 20 samples, as the regularization parameter ratio is changed from 0.0001 to 100. It is seen that the SLFN with the DFT-ELM achieves the best classification accuracy, up to 86.77%; the SLFN with the R-ELM achieves the second best, up to 68.56%; however, the SLFN with the FIR-ELM (where the hidden layer is designed as a band-pass filter with the cut-off
frequencies of 30 Hz and 70 Hz, respectively) achieves up to 47% only. It is also seen that the performance of the SLFN with the FIR-ELM degrades further as the regularization parameter ratio increases. The reason is that the changing rate of the cost function in (5.3.1) becomes smaller as the regularization parameter ratio increases. In such a case, the dominant factor for determining the output weights is the DC component rather than the feature vectors embedded in the data matrix, as seen in (4.16) of [9]. From the viewpoint of signals and systems [107], [148], a larger regularization parameter makes the frequency band of the output layer narrower, centered at the DC component; the useful frequency components of the feature vectors from the outputs of the hidden layer are then removed by the output layer of the SLFN with the FIR-ELM.
Figure 5.15 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-
ELM and R-ELM via the regularization parameter ratio
Figure 5.16 shows the sample means of the classification accuracies of the SLFN classifiers with the DFT-ELM, the FIR-ELM and the R-ELM, respectively, versus the number of training samples. It is noted that the classifier with the DFT-ELM performs significantly better than the ones with both the FIR-ELM and the R-ELM when the number of training samples is greater than 50 sets. This point agrees with the observations in Figure 5.15. The classification performance curve for the DFT-ELM also reveals that
the SLFN with the DFT-ELM is capable of extracting core features with a small sample
pool.
Figure 5.16 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-
ELM and R-ELM via the number of training samples
5.5 Conclusion
In this chapter, a robust SLFN pattern classifier has been developed with the DFT-ELM. Because of the optimal designs of both the input and the output weights, the SLFN classifier has demonstrated excellent performance for the classification of both linearly separable and non-linearly separable patterns. The classification performance of the SLFN with the DFT-ELM has been evaluated and compared with those of the R-ELM and the FIR-ELM in the simulation examples. Further work applying the SLFN classifier with the DFT-ELM to the molecular classification of cancers, biomedical image processing and the fault detection of mechatronic systems is under investigation.
Chapter 6
An Optimal Weight Learning Machine
for Handwritten Digit Image Recognition
An optimal weight learning machine for a single hidden layer feedforward network
(SLFN) with application to handwritten digit image recognition is developed in this
chapter. It is seen that both the input weights and the output weights of the SLFN are
globally optimized with the batch learning type of least squares. All feature vectors of
the classifier can then be placed at the prescribed positions in the feature space in the
sense that the separability of all non-linearly separable patterns can be maximized, and a
high degree of recognition accuracy can be achieved with a small number of hidden
nodes in the SLFN. An experiment for the recognition of the handwritten digit image
from both the MNIST database and the USPS database is performed to show the
excellent performance and effectiveness of the proposed methodology.
6.1 Introduction
Neural network-based pattern classification techniques have been widely used for
handwritten digit image recognition over the last twenty years [82], [150-153]. The
merits of the neural classifiers for image recognition are attributed to (i) their powerful
learning abilities, through training, from a large amount of training data, (ii) their
capability of accurately approximating unknown functions with complex dynamics
embedded in the training data, and (iii) their parallel structures to perform fast and
efficient parallel computing during the training as well as in the process of image
recognition.
It has been noted that most neural pattern classifiers for handwritten digit recognition are designed with multi-layered neural networks, trained with recursive gradient-based backpropagation (BP) algorithms, to perform pre-filtering, feature extraction and pattern recognition. Because the BP training process is time-consuming with slow convergence [1], [2], [91], [144], [145], these types of neural classifiers are difficult to use in many practical applications where fast on-line training is required. In addition, most existing neural classifiers require a large number of hidden nodes in their hidden layers in order to obtain highly separable features in the feature space. Such a requirement will greatly increase the physical size of the neural classifiers' hardware as well as the training time in practice.
In view of all the above issues, the researchers in the areas of pattern
classification and computational intelligence have been exploring the new types of
neural classifiers with a single hidden layer, a small number of hidden nodes, and the
fast training algorithms to fulfill the industrial requirements such as small size hardware
and easy implementation in industrial environments. One of the state-of-the-art neural
classifiers is a single hidden layered feedforward neural network based classifier
developed in [18], [29], [147] where the input weights and the hidden layer biases are
randomized and the output weights are computed by using a generalized inverse of the
hidden output matrix. Although the neural networks with randomized weights have
been studied by many researchers [18], [29], [147], [154-156], the theoretical
background of randomizing the input weights of the SLFN classifiers can be traced to
the functional approximation with the Monte Carlo method [29]. It is seen that, if a
continuous function can be represented by a limit integral, the limit integral can then be
approximated by a sum of weighted activation functions that are the functions of a set of
random samples of a parameter vector in the limit integral domain. With the increase of
the number of random samples of the parameter vector, the accuracy of the Monte Carlo
approximation can be further improved. Obviously, such a Monte Carlo approximation
can be implemented by using an SLFN, where the activation functions in the Monte
Carlo approximation can be represented by the non-linear hidden nodes, the parameter
vector that has been randomly sampled in the Monte Carlo approximation is actually the
input weights and the hidden layer biases, and the output weights of the SLFN play the
role of the weights in the Monte Carlo approximation. The universal approximation
capability of the SLFNs with the random input weights and the trained output weights
has been reported in [18], [29], [156].
Batch learning has actually been used for many years for training neural networks [1], [2]. Unlike sample-by-sample recursive learning methodologies, batch learning uses all the training samples at the same time to derive the optimal weights. If a given set of training samples can sufficiently represent the dynamics of an unknown complex system to be learnt, the weights derived from the batch training are globally optimal. Thus, the issues concerning slow convergence and local minima, which occur frequently in recursive learning schemes, are all avoided in the batch learning process. It has been noted that, for the SLFNs with randomized weights and hidden layer biases, only the output weights are trained with batch learning. Thus, the training speed is extremely fast compared with those of all existing recursive learning algorithms [1], [2], [82], [91], [144], [145], [150-153]. Recently, such a batch learning technique for the SLFNs with randomized input weights and hidden layer biases has been called the Extreme Learning Machine (ELM).
However, with regard to practical pattern classification, researchers have noted many problems with using the ELM. For instance, no guidance has been proposed on how the upper and lower bounds of the random input weights and the hidden layer biases should be chosen, and researchers typically use trial and error to determine the bounds of the random input weights. In many cases, the upper and lower bounds of the random input weights are simply set to -1 and 1, respectively, which results in feature vectors whose elements saturate towards -1 or 1 when the number of sigmoid hidden nodes is very large. The advantage of this sort of bound selection for the random input weights and hidden layer biases is that the feature vectors generated by the hidden layer of the SLFN can spread widely in the high dimensional feature space in most cases, and the output layer, as the pattern classifier, can then easily recognize the patterns represented by the corresponding feature vectors. However, researchers have still been trying to design the optimal
input weights to maximize the separability of the feature vectors, enhance the
robustness of the feature vectors with respect to the changes of the input disturbances,
and thus further improve the classification/recognition accuracy [9].
In this chapter, an optimal weight learning machine (OWLM) for a class of
SLFN classifiers with both the linear nodes and the input tapped-delay line for
handwritten digit image recognition will be developed. Borrowing the concept of model
reference control from control engineering [114], the SLFN classifier with the non-
linear hidden nodes, trained with the ELM in [18], is used as the reference classifier that
provides the reference feature vectors for the outputs of the hidden layer of the proposed
SLFN classifier to follow. The input weights are then optimized to assign the feature
vectors of the neural classifier to the “desired reference positions” defined by the
reference feature vectors, and maximize the separability of all feature vectors in the
feature space. In terms of the optimization of the output weights, the recognition
accuracy at the output layer of the SLFN classifier can be further improved. It will be
seen that both the input weights and the output weights of the proposed SLFN classifier
are optimized by using the regularization theory [1], [2], [149] with the results that (i)
the error between the reference feature vectors and the feature vectors generated by the
outputs of the hidden layer of the new classifier is minimized globally; (ii) the
singularity of the correlation matrix of the feature vectors is avoided by smoothing the
singular point of the correlation matrix of the feature vectors with the regularization
term in the cost function; (iii) the sensitivity of the proposed SLFN classifier to the
input disturbances is reduced by adjusting the regularization parameters; (iv), most
importantly, only a small number of hidden nodes are required in the new SLFN
classifier to achieve a high degree of recognition accuracy for handwritten digit image
recognition.
The rest of the chapter is organized as follows: In Section 6.2, the concept of the
Monte Carlo approximation and the basics of the ELM are formulated. In Section 6.3,
the SLFN classifier with both linear nodes and a tapped-delay line is described, the
optimizations of both the input weights and output weights with the regularization
theory are explored in detail, and the effects of the regularization parameters on the
classification performance of the SLFN classifier are also studied. In Section 6.4, the
SLFN classifiers with the OWLM, the ELM, and the R-ELM are compared to
demonstrate the effectiveness of the new SLFN classifier with the OWLM for the
handwritten digit recognition. Section 6.5 gives conclusions and some further work.
6.2 Problem Formulation
Consider a data set, from a handwritten digits database of interest, with the input pattern vectors x(1), x(2), …, x(N) and the output data vectors y(1), y(2), …, y(N), respectively, where

x(k) = [x_1(k)  x_2(k)  …  x_n(k)]^T                    (6.2.1)

and

y(k) = [y_1(k)  y_2(k)  …  y_m(k)]^T                    (6.2.2)

for k = 1, 2, …, N.
It is assumed that the above input pattern vectors and the output data vectors are generated by the following continuous vector function:

y(k) = f(x(k))                    (6.2.3)

with

f(x(k)) = [f_1(x(k))  f_2(x(k))  …  f_m(x(k))]^T                    (6.2.4)

The ith element of f(x(k)) is represented by the following limit integral [29]:

f_i(x(k)) = lim_{λ→∞} ∫_{D} F[λ, f_i](α) g(α, x(k)) dα                    (6.2.5)

where λ is a scalar parameter, α is a high dimensional parameter vector, g is an activation function, F is an operator, and D is the domain of the parameter vector α.
In order to compute f_i(x(k)) represented by the limit integral in (6.2.5), the right side of (6.2.5) can be approximated as follows:

∫_{D} F[λ, f_i](α) g(α, x(k)) dα ≈ ∫_{D_V} F[λ, f_i](α) g(α, x(k)) dα                    (6.2.6)

where D_V ⊂ D. Considering the complexity of the integrand in (6.2.6), the Monte Carlo method [29] can be used to approximate the right side of (6.2.6) as follows:

f_i(x(k)) ≈ ∫_{D_V} F[λ, f_i](α) g(α, x(k)) dα ≈ (1/M) Σ_{j=1}^{M} |D_V| F[λ, f_i](α_j) g(α_j, x(k)) = Σ_{j=1}^{M} β_ij g(α_j, x(k))                    (6.2.7)

with

β_ij = (|D_V|/M) F[λ, f_i](α_j)                    (6.2.8)

It is noted that the vectors α_1, α_2, …, α_M in (6.2.7) and (6.2.8) are random samples drawn from D_V uniformly. On the other hand, α_1, α_2, …, α_M can be treated as a set of random variables that are uniformly distributed on D_V [29].
Thus, the function f(x(k)) can be approximated as:

f(x(k)) ≈ [ Σ_{j=1}^{M} β_1j g(α_j, x(k))
            Σ_{j=1}^{M} β_2j g(α_j, x(k))
            ⋮
            Σ_{j=1}^{M} β_mj g(α_j, x(k)) ]
155
( ( )) ( )
with
[ ] ( )
[ ] ( )
[ ] ( )
and
( ( )) [ ( ( )) ( ( )) ( ( ))] ( )
It is easy to confirm, based on the discussion in [29], that the approximation error in
(6.2.9) is of the order of √ , and thus the approximation error will converge to zero
as .
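The O(1/√M) behaviour of the Monte Carlo approximation can be illustrated on a simple scalar integral; the integrand below is an arbitrary example chosen for the demonstration, not the operator appearing in (6.2.5).

```python
import numpy as np

def mc_integral(f, a, b, M, seed=0):
    """Monte Carlo estimate of the integral of f over [a, b] using M
    uniform samples; the standard error decays as O(1/sqrt(M))."""
    rng = np.random.default_rng(seed)
    alpha = rng.uniform(a, b, size=M)    # uniform random samples from the domain
    return (b - a) * np.mean(f(alpha))   # |domain| times average of the integrand
```

For example, estimating the integral of sin(πx) over [0, 1] (whose exact value is 2/π) with increasing M shows the estimate tightening around the true value at the 1/√M rate.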
Remark 6.2.1: The functional approximation using the Monte Carlo method in (6.2.9) can be implemented by using a multiple-input-multiple-output (MIMO) single hidden layer feedforward neural network (SLFN), with non-linear hidden nodes and randomized input weights and hidden layer biases, as described in Figure 6.1, where the output layer has m linear nodes, the hidden layer has M non-linear nodes with the non-linear activation function g(α_j, x(k)) = g(w_j^T x(k) + b_j), w_ji (for i = 1, 2, …, n and j = 1, 2, …, M) are the random input weights, b_j (for j = 1, 2, …, M) are the random biases of the hidden layer, β_ij (for i = 1, 2, …, m and j = 1, 2, …, M) are the output weights to be optimized later, and h_j(k) (for j = 1, 2, …, M) are the outputs of the hidden nodes.
Figure 6.1 A single hidden layer neural network with both random input weights and
random hidden layer biases
As seen in Figure 6.1, the input and the output data vectors of the SLFN can be expressed in the forms:

x(k) = [x_1(k)  x_2(k)  …  x_n(k)]^T                    (6.2.14)
y(k) = [y_1(k)  y_2(k)  …  y_m(k)]^T                    (6.2.15)

the output of the jth hidden node can be computed as:

h_j(k) = g(Σ_{i=1}^{n} w_ji x_i(k) + b_j) = g(w_j^T x(k) + b_j)                    (6.2.16)

for j = 1, 2, …, M, with

w_j = [w_j1  w_j2  …  w_jn]^T                    (6.2.17)

and the ith output of the network, y_i(k), is of the form:

y_i(k) = Σ_{j=1}^{M} β_ij g(w_j^T x(k) + b_j) = β_i h(x(k))                    (6.2.18)

for i = 1, 2, …, m, with

β_i = [β_i1  β_i2  …  β_iM]                    (6.2.19)

and

h(x(k)) = [g(w_1^T x(k) + b_1)  g(w_2^T x(k) + b_2)  …  g(w_M^T x(k) + b_M)]^T                    (6.2.20)

Thus,

y(k) = [y_1(k)  y_2(k)  …  y_m(k)]^T = β h(x(k))                    (6.2.21)

with the output weight matrix

β = [β_1^T  β_2^T  …  β_m^T]^T                    (6.2.22)
Remark 6.2.2: Over the last twenty years, much research has been done using the SLFNs in Figure 6.1, with both random input weights and random hidden layer biases, for pattern classification. It has been noted that a good performance can be achieved only when the number of hidden nodes is large enough [18], [29], [147]. In [18], [147], batch learning was used to train the output weights of the SLFNs with both random input weights and random hidden layer biases, for a given set of training data pairs. Because the batch training of the output weights is extremely fast and the global minimum training error can be achieved, such a batch learning technique for the SLFNs with randomized input weights and hidden layer biases has been called the Extreme Learning Machine (ELM). The basics of the ELM are briefly outlined as follows:
Consider a set of N training pairs with the input pattern vectors x(k) ∈ R^n and the desired output vectors d(k) ∈ R^m, for k = 1, 2, …, N, respectively. For the input pattern vectors, the output vectors y(1), y(2), …, y(N) can be generated based on (6.2.21) as follows:

Y = β H                    (6.2.23)

with

H = [ g(w_1^T x(1) + b_1)  g(w_1^T x(2) + b_1)  …  g(w_1^T x(N) + b_1)
      g(w_2^T x(1) + b_2)  g(w_2^T x(2) + b_2)  …  g(w_2^T x(N) + b_2)
      ⋮
      g(w_M^T x(1) + b_M)  g(w_M^T x(2) + b_M)  …  g(w_M^T x(N) + b_M) ]                    (6.2.24)

Y = [y(1)  y(2)  …  y(N)]                    (6.2.25)

Assume that

y(1) = d(1),  y(2) = d(2),  …,  y(N) = d(N)

(6.2.23) can then be written as:

D = β H                    (6.2.26)

with

D = [d(1)  d(2)  …  d(N)]                    (6.2.27)

By solving (6.2.26), the optimal output weight matrix is obtained as follows:

β = D H†                    (6.2.28)

where H† is the Moore-Penrose generalized inverse of the matrix H.
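The ELM computation outlined above can be sketched directly in a few lines. The logarithmic sigmoid activation and the uniform [-1, 1] initialization follow the description in the text, while the dimensions and data in the test are synthetic.

```python
import numpy as np

def elm_train(X, D, M, seed=0):
    """Batch ELM: random input weights and biases, output weights from
    the Moore-Penrose pseudoinverse, beta = D H^+ as in (6.2.28).
    X : (n, N) inputs, D : (m, N) desired outputs, M : hidden nodes."""
    rng = np.random.default_rng(seed)
    n, N = X.shape
    W = rng.uniform(-1, 1, size=(M, n))      # random input weights
    b = rng.uniform(-1, 1, size=(M, 1))      # random hidden layer biases
    H = 1.0 / (1.0 + np.exp(-(W @ X + b)))   # hidden layer output matrix (6.2.24)
    beta = D @ np.linalg.pinv(H)             # optimal output weights (6.2.28)
    return W, b, beta

def elm_predict(W, b, beta, X):
    """Forward pass of the trained SLFN."""
    return beta @ (1.0 / (1.0 + np.exp(-(W @ X + b))))
```

Note that only `beta` is learned; `W` and `b` remain at their random values, which is precisely the property that makes the training a single batch least-squares solve.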
Figure 6.2 ~ Figure 6.4 show the recognition results of the handwritten digit 1 by
using the SLFN classifiers trained with the ELM, where the numbers of the hidden
nodes are 10, 50, and 100, respectively, and both the training data and testing data are
randomly selected from the MNIST database [82].
Figure 6.2 Recognition of the handwritten digits by using the SLFN classifier with 10
hidden nodes trained with the ELM
Figure 6.3 Recognition of the handwritten digits by using the SLFN classifier with 50
hidden nodes trained with the ELM
Figure 6.4 Recognition of the handwritten digits by using the SLFN classifier with 100
hidden nodes trained with the ELM
It can be seen that, when the number of hidden nodes is very small, say 10, the recognition accuracy is only about 36%. However, as the number of hidden nodes increases, the recognition accuracy gradually improves. For instance, the recognition accuracies are about 67% and 78% when the numbers of hidden nodes are 50 and 100, respectively, as seen in Figure 6.3 and Figure 6.4.
Remark 6.2.3: Unlike conventional feature vectors, which contain the primary information of the corresponding input pattern vectors, the feature vectors generated from the output of the hidden layer of the SLFNs with the ELM bear little similarity to the input patterns. Obviously, this characteristic of the feature vectors is mainly due to
the randomness of the input weights and the hidden layer biases. In fact, the main
concern in handwritten digit recognition is whether the generated features can properly
represent the corresponding input handwritten digit patterns, in the sense that the input
handwritten digit patterns can be accurately recognized from the output layer of the
neural classifiers under a noisy environment. Thus, from this viewpoint, it is not
important whether or not the input patterns and the corresponding feature vectors have
similarities.
Remark 6.2.4: In order to avoid the singularities that may occur in the inverse of the
feature vector data correlation matrix for deriving the optimal output weights by using
the batch learning type of least squares method in the ELM, the regularization theory
has been used to smooth the cost function at the singular point of the correlation matrix
of the feature vector data from the output of the hidden layer [18]. Such a modified
ELM is called the Regularized Extreme Learning Machine (R-ELM). However, when
the feature vector data correlation matrix has no singularity, the recognition
performance of the SLFN classifiers with the R-ELM shows no significant difference from that of the SLFNs with the ELM. This point can be clearly seen from Figure 6.5 ~ Figure 6.7, which show the recognition results of the SLFNs trained with the R-ELM, where the numbers of hidden nodes are 10, 50, and 100, respectively.
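As a sketch of the R-ELM output-weight computation described in this remark, the ridge term below keeps the solve well-posed even when the hidden-layer correlation matrix is singular; the helper name and the scalar `ratio` argument are illustrative assumptions:

```python
import numpy as np

def relm_output_weights(H, T, ratio=0.01):
    """Regularized ELM output weights (sketch).

    Minimises ||H @ beta - T||^2 + ratio * ||beta||^2, so H^T H + ratio * I
    is always invertible even when H^T H alone is singular.
    """
    n_hidden = H.shape[1]
    A = H.T @ H + ratio * np.eye(n_hidden)      # regularized correlation matrix
    return np.linalg.solve(A, H.T @ T)
```

As the ratio tends to zero with a full-rank hidden output matrix, the solution approaches the plain ELM (Moore-Penrose) solution, which is consistent with the two methods performing similarly when no singularity occurs.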
Figure 6.5 Recognition of the handwritten digits by using the SLFN classifier with 10
hidden nodes trained with the R-ELM
Figure 6.6 Recognition of the handwritten digits by using the SLFN classifier with 50
hidden nodes trained with the R-ELM
Figure 6.7 Recognition of the handwritten digits by using the SLFN classifier with 100
hidden nodes trained with the R-ELM
Remark 6.2.5: Although the ELM and the R-ELM can work well when the number of hidden nodes is large enough, as shown in [18], [29], [147], a few open problems have
been noted recently. For instance, (i) how can the upper and the lower bounds of the
random input weights as well as the random hidden layer biases be chosen in the sense
that the separability of the feature vectors from the outputs of the hidden layer is
maximized?; (ii) is it possible to minimize the input weights so that the SLFN classifiers
can achieve a better performance than those of the SLFNs trained with both the ELM
and the R-ELM for image recognition?; (iii) can the SLFN classifiers with a small
number of hidden nodes achieve a high degree of classification accuracy in practice?
The first open issue in the above is related to the determination of the integral domain of the limit integral in (6.2.5), or equivalently, the domain of the parameter vector of the hidden layer. Obviously, if this domain cannot be chosen properly, the limit integral in (6.2.5) cannot represent the target function well and thus the function cannot be properly approximated, as seen in (6.2.9).
The second open issue in Remark 6.2.5 is to search for the optimal design of the
input weights so that the SLFN classifiers can perform better than the ones with the
ELM and the R-ELM regarding the recognition accuracy, the robustness with respect to
the input disturbances, and the separability of the feature vectors. In [9], the authors
used the signal processing techniques to design the input weights of the SLFN
classifiers for improving the approximation accuracy, the robustness with respect to the
input disturbances and the separability of the feature vectors. However, the optimal
design of the input weights was not explored.
The third open question in Remark 6.2.5 is to look for the optimal designs of the
SLFN classifiers with a small number of hidden nodes to achieve high recognition
accuracy for practical applications. In the following, the second and the third open
problems in Remark 6.2.5 will be explored through optimizing both the input and output
weights of the SLFNs with a small number of hidden nodes, to achieve a high degree of
accuracy for the handwritten digit image recognition.
6.3 Optimal Weight Learning Machine
Figure 6.8 shows a single hidden layer feedforward neural network with linear nodes and an input tapped delay line, where the output layer has $m$ linear nodes, the hidden layer has $\tilde N$ linear nodes, $w_{kj}$ (for $k = 1, \ldots, n$ and $j = 1, \ldots, \tilde N$) are the input weights, $\beta_{jl}$ (for $j = 1, \ldots, \tilde N$ and $l = 1, \ldots, m$) are the output weights, $h_j(i)$ (for $j = 1, \ldots, \tilde N$) are the outputs of the hidden nodes, and the string of $z^{-1}$s, seen at the input layer, are unit-delay elements that ensure the input sequence $u(k), u(k-1), \ldots, u(k-n+1)$ represents a time series, consisting of both the present and the past observations of the process.
Figure 6.8 A single hidden layer neural network with linear nodes
and an input tapped delay line
Remark 6.3.1: One of the many advantages of using linear hidden nodes in the SLFNs
is that the biases are not required. This is because adding the biases to the linear hidden
nodes is equivalent to shifting the outputs of the hidden nodes up or down. It can be
further seen from the later discussions that the function of the biases can also be
alternatively achieved by the proper adjustment of the positions of the reference feature
vectors in the feature space.
Remark 6.3.2: It has been shown in [9] that an SLFN with both linear nodes and a string of $n-1$ unit-delay elements at the input layer, as in Figure 6.8, has the capability of universal approximation of any continuous function as the number of hidden nodes is large enough [113]. On the other hand, because of the unit-delay elements added to the input layer, every hidden node in the SLFN performs as an $(n-1)$th-order finite impulse response (FIR) filter, which is often used to approximate a linear or a non-linear function by properly choosing the input weights in signal processing and control engineering. Also, in practice, the design and analysis of the SLFNs with linear nodes are much easier than those of the SLFNs with non-linear nodes in Figure 6.1.
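To illustrate the FIR interpretation in this remark: a linear hidden node fed through the tapped delay line computes a moving dot product of its weight vector with the input sequence, i.e. a discrete convolution. The helper below is a hypothetical sketch, not code from the thesis:

```python
import numpy as np

def hidden_node_as_fir(u, w):
    """Output of one linear hidden node fed by a tapped delay line.

    With taps w_0, ..., w_{n-1}, the node computes y(k) = sum_j w_j * u(k - j):
    an (n-1)th-order FIR filter of the input sequence u (zero initial conditions).
    """
    return np.convolve(u, w)[: len(u)]   # full convolution truncated to the input length
```

For example, a two-tap node with weights (0.5, 0.5) is simply a moving average of the input sequence.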
Similar to the discussion in Section 6.2, for the given input pattern vectors $x(1), x(2), \ldots, x(N)$ and the corresponding desired output data vectors $d(1), d(2), \ldots, d(N)$, respectively, linear output equations of the SLFN in Figure 6.8 can be obtained as:

$$Y = H\beta \tag{6.3.1}$$

with

$$Y = \big[\, y(1) \;\; y(2) \;\; \cdots \;\; y(N) \,\big]^T, \qquad H = \begin{bmatrix} w_1^T x(1) & w_2^T x(1) & \cdots & w_{\tilde N}^T x(1) \\ w_1^T x(2) & w_2^T x(2) & \cdots & w_{\tilde N}^T x(2) \\ \vdots & \vdots & \ddots & \vdots \\ w_1^T x(N) & w_2^T x(N) & \cdots & w_{\tilde N}^T x(N) \end{bmatrix} \tag{6.3.2}$$

and

$$D = \big[\, d(1) \;\; d(2) \;\; \cdots \;\; d(N) \,\big]^T \tag{6.3.3}$$

where $w_j \in \mathbb{R}^n$ is the input weight vector of the $j$th linear hidden node and $\beta$ is the output weight matrix.
Remark 6.3.3: The matrix $H$ in (6.3.2) contains all feature vectors $h_1, h_2, \ldots, h_N$, corresponding to the input data vectors $x(1), x(2), \ldots, x(N)$, respectively. For instance, for the $i$th given input pattern vector $x(i)$, the corresponding feature vector generated by the outputs of the hidden layer is the $i$th row of the matrix $H$:

$$h_i = \big[\, w_1^T x(i) \;\; w_2^T x(i) \;\; \cdots \;\; w_{\tilde N}^T x(i) \,\big] \tag{6.3.4}$$
Since the output of the $j$th hidden node, corresponding to the $i$th input pattern vector, can be written as:

$$h_j(i) = w_j^T x(i) \tag{6.3.5}$$

the $i$th feature vector in (6.3.4) can then be written as:

$$h_i = \big[\, h_1(i) \;\; h_2(i) \;\; \cdots \;\; h_{\tilde N}(i) \,\big] = x(i)^T \big[\, w_1 \;\; w_2 \;\; \cdots \;\; w_{\tilde N} \,\big] \tag{6.3.6}$$

Considering all input training data pattern vectors $x(1), x(2), \ldots, x(N)$, it can be seen that,

$$H = \begin{bmatrix} x(1)^T \\ x(2)^T \\ \vdots \\ x(N)^T \end{bmatrix} \big[\, w_1 \;\; w_2 \;\; \cdots \;\; w_{\tilde N} \,\big] \tag{6.3.7}$$

or

$$H = XW \tag{6.3.8}$$

with

$$X = \big[\, x(1) \;\; x(2) \;\; \cdots \;\; x(N) \,\big]^T, \qquad W = \big[\, w_1 \;\; w_2 \;\; \cdots \;\; w_{\tilde N} \,\big] \tag{6.3.9}$$

Let the reference feature vectors be described by

$$\bar F = \big[\, \bar f_1 \;\; \bar f_2 \;\; \cdots \;\; \bar f_N \,\big]^T \tag{6.3.10}$$
Conventionally, the selection of the desired feature vectors in (6.3.10) is
mainly based on one’s understanding of the characteristics of the input patterns.
However, in this chapter, borrowing the idea of model reference control from modern
control engineering [114], the feature vectors generated by the SLFNs trained with the
R-ELM will be used as the “desired reference feature vectors” for the ones generated by
the outputs of the SLFN in Figure 6.8 to follow. Through optimizing the input weights
of the SLFN in Figure 6.8, the feature vectors of the SLFN in Figure 6.8 can be placed
at the “desired position” in feature space. The purpose of such feature vectors’
assignment is to further maximize the separability of the feature vectors so that the
classification or recognition accuracy, seen from the output layer of the SLFN, can be
greatly improved, compared with the SLFN classifiers in Figure 6.1, trained with both
the ELM and the R-ELM.
In the following, the modified regularization technique in [9] and the batch
training methodology in [1], [2] will be used to develop an optimal weight learning machine that optimizes both the input weight matrix $W$ and the output weight matrix $\beta$, in the sense that (i) the error between the reference feature vectors and the feature
vectors generated by the hidden layer of the SLFN classifier in Figure 6.8 can be
minimized and then (ii) the error between the desired output pattern and the actual
output pattern of the SLFN classifier is minimized. For convenience, the learning
machine described below with both the optimal input weights and the optimal output
weights for the SLFN classifier in Figure 6.8 is called the optimal weights learning
machine (OWLM).
The design of the optimal input weights can be formulated by the following optimization problem [9]:

$$\text{Minimize} \quad J = \frac{d_1}{2}\|E\|^2 + \frac{d_2}{2}\|W\|^2 \tag{6.3.11}$$

$$\text{Subject to} \quad E = \bar F - XW \tag{6.3.12}$$

where $E$ is the error between the reference feature vectors and the feature vectors generated by the hidden layer of the SLFN in Figure 6.8, $d_1$ and $d_2$ are the positive real regularization parameters, and $\|W\|^2$ is the regularizer that, through the proper choice of the regularization parameters $d_1$ and $d_2$, is used to smooth the cost function at the singular point of the correlation matrix of the feature vectors to avoid the ill-posed inverse of the data matrix.
The optimization problem in (6.3.11) with the constraint in (6.3.12) can be solved by using the method of Lagrange multipliers, with the following Lagrange function $L$:

$$L = \frac{d_1}{2}\sum_{i=1}^{N}\sum_{j=1}^{\tilde N} e_{ij}^2 + \frac{d_2}{2}\sum_{k=1}^{n}\sum_{j=1}^{\tilde N} w_{kj}^2 + \sum_{i=1}^{N}\sum_{j=1}^{\tilde N} \lambda_{ij}\Big(\bar f_{ij} - \sum_{k=1}^{n} x_{ik} w_{kj} - e_{ij}\Big) \tag{6.3.13}$$

where $e_{ij}$ is the $(i,j)$th element of the error matrix $E$ defined in (6.3.12), $w_{kj}$ is the $(k,j)$th element of the input weight matrix $W$, $\bar f_{ij}$ is the $(i,j)$th element of the reference feature vector matrix $\bar F$ defined in (6.3.10), $x_{ik}$ is the $(i,k)$th element of the input data matrix $X$, and $\lambda_{ij}$ is the $(i,j)$th Lagrange multiplier.

Differentiating $L$ with respect to $w_{kj}$,

$$\frac{\partial L}{\partial w_{kj}} = d_2 w_{kj} - \sum_{i=1}^{N} \lambda_{ij} x_{ik} \tag{6.3.14}$$

Letting $\partial L / \partial w_{kj} = 0$ leads to

$$d_2 w_{kj} = \sum_{i=1}^{N} x_{ik} \lambda_{ij}, \qquad k = 1, \ldots, n, \;\; j = 1, \ldots, \tilde N \tag{6.3.15}$$

Writing (6.3.15) for all $k$ and $j$ in the matrix form gives

$$d_2 \begin{bmatrix} w_{11} & \cdots & w_{1\tilde N} \\ \vdots & \ddots & \vdots \\ w_{n1} & \cdots & w_{n\tilde N} \end{bmatrix} = \begin{bmatrix} x_{11} & \cdots & x_{N1} \\ \vdots & \ddots & \vdots \\ x_{1n} & \cdots & x_{Nn} \end{bmatrix} \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1\tilde N} \\ \vdots & \ddots & \vdots \\ \lambda_{N1} & \cdots & \lambda_{N\tilde N} \end{bmatrix} \tag{6.3.16}$$

Thus

$$d_2 W = X^T \Lambda \tag{6.3.17}$$

or

$$W = \frac{1}{d_2} X^T \Lambda \tag{6.3.18}$$

where $\Lambda$ is the $N \times \tilde N$ Lagrange multiplier matrix with the $(i,j)$th element $\lambda_{ij}$.
In addition, differentiating $L$ with respect to $e_{ij}$,

$$\frac{\partial L}{\partial e_{ij}} = d_1 e_{ij} - \lambda_{ij} \tag{6.3.19}$$

Solving $\partial L / \partial e_{ij} = 0$, the following relationship can be obtained:

$$\lambda_{ij} = d_1 e_{ij} \tag{6.3.20}$$

or, in the matrix form,

$$\Lambda = d_1 E \tag{6.3.21}$$

Considering the constraint in (6.3.12), (6.3.21) can be expressed as:

$$\Lambda = d_1 \big( \bar F - XW \big) \tag{6.3.22}$$

and using (6.3.22) in (6.3.18) leads to

$$W = \frac{d_1}{d_2} X^T \big( \bar F - XW \big) \tag{6.3.23}$$

Then, the optimal input weight matrix is derived as follows:

$$W^{*} = \Big( X^T X + \frac{d_2}{d_1} I \Big)^{-1} X^T \bar F \tag{6.3.24}$$
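The closed-form solution (6.3.24) can be sketched in NumPy as below, using a linear solve rather than an explicit matrix inverse for numerical stability; the function and argument names are illustrative assumptions:

```python
import numpy as np

def owlm_input_weights(X, F_ref, d1=1.0, d2=0.01):
    """Optimal input weights of (6.3.24): W = (X^T X + (d2/d1) I)^(-1) X^T F_ref.

    X: (N, n) input pattern matrix; F_ref: (N, n_hidden) reference feature
    matrix, e.g. the hidden-layer outputs of an SLFN trained with the R-ELM.
    """
    n = X.shape[1]
    A = X.T @ X + (d2 / d1) * np.eye(n)   # regularized input data correlation matrix
    return np.linalg.solve(A, X.T @ F_ref)
```

The resulting $W$ is the stationary point of the regularized cost (6.3.11), which can be verified directly by checking that the gradient $d_1 X^T(XW - \bar F) + d_2 W$ vanishes.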
Remark 6.3.4: It is important to note that, in the conventional regularization theory [1], the regularization parameter $d_1$ is set to 1, and the regularization for solving the ill-posed inverse problem of the data matrix depends only on the small regularization parameter $d_2$ for smoothing around the minimum point of the cost function in the input weight space. However, it has been noted that the value of the regularization parameter $d_1$ affects the width or the sharpness of the regularized cost function in (6.3.11) [9]. For instance, if the value of $d_1$ is very large, the slope of the cost function in (6.3.11) will be very steep. The hidden layer of the SLFN is thus very sensitive to the changes of the input weights and the input disturbances. However, if the value of $d_1$ is very small, the changing rate of the cost function in (6.3.11) will be very small, and the hidden layer of the SLFN is thus insensitive to the changes of the input weights and the input disturbances. Therefore, it is essential to choose the regularization parameter $d_1$ properly so that the sensitivity of the hidden layer can be controlled and also the separability of the feature vectors at the outputs of the hidden layer of the SLFN can be maximized.
Remark 6.3.5: As described in optimization theory [110], the Lagrange multiplier matrix $\Lambda$ in (6.3.22) describes the sensitivity of the cost function in (6.3.11) with respect to the constraint in (6.3.12); that is, the Lagrange multiplier matrix $\Lambda$ determines how tightly the constraint in (6.3.12) is binding at the optimal point with the optimal input weights in (6.3.24). Thus, the Lagrange multiplier matrix $\Lambda$ qualitatively specifies the
effects of the noisy environment on the input weights’ optimization, and the effects of
the structural risk and the empirical risk on the robustness of the SLFN classifier with
the OWLM.
Similarly, the output weight matrix $\beta$ can be optimized to minimize the error between the desired output pattern and the actual output pattern of the SLFN classifier. The optimization is formulated as [9]:

$$\text{Minimize} \quad J_o = \frac{\gamma_1}{2}\|\varepsilon\|^2 + \frac{\gamma_2}{2}\|\beta\|^2 \tag{6.3.25}$$

$$\text{Subject to} \quad \varepsilon = D - H\beta \tag{6.3.26}$$

with the corresponding Lagrange function $L_o$:

$$L_o = \frac{\gamma_1}{2}\sum_{i=1}^{N}\sum_{l=1}^{m} \varepsilon_{il}^2 + \frac{\gamma_2}{2}\sum_{j=1}^{\tilde N}\sum_{l=1}^{m} \beta_{jl}^2 + \sum_{i=1}^{N}\sum_{l=1}^{m} \mu_{il}\big( d_{il} - h_i \beta_l - \varepsilon_{il} \big) \tag{6.3.27}$$

where $\varepsilon_{il}$ is the $(i,l)$th element of the error matrix $\varepsilon$, $\beta_{jl}$ is the $(j,l)$th element of the output weight matrix $\beta$, $d_{il}$ is the $(i,l)$th element of the desired output data matrix $D$, $h_i$ is the $i$th row of the hidden layer output matrix $H$, $\beta_l$ is the $l$th column of the output weight matrix $\beta$, $\mu_{il}$ is the $(i,l)$th Lagrange multiplier, $\gamma_1$ and $\gamma_2$ are the real positive regularization parameters, and $\|\beta\|^2$ is the regularizer of the output layer.

Following the discussions from (6.3.14) to (6.3.24), the optimal output layer weight matrix is obtained as:

$$\beta^{*} = \Big( H^T H + \frac{\gamma_2}{\gamma_1} I \Big)^{-1} H^T D \tag{6.3.28}$$

with the sensitivity matrix $M$:

$$M = \gamma_1 \big( D - H\beta^{*} \big) \tag{6.3.29}$$
Remark 6.3.6: It is seen from (6.3.28) that the optimal output layer weight matrix $\beta^{*}$ depends only on the ratio $\gamma_2/\gamma_1$. However, the sensitivity and the robustness property of the output layer of the SLFN with respect to the changes of the feature vectors depend on the value of $\gamma_1$. Generally, the value of $\gamma_1$ should be chosen properly to ensure the trade-off between the sensitivity and the smoothness of the SLFN classifier.
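Putting the two stages together, a minimal end-to-end OWLM training sketch is given below, under the assumption that the reference features come from a previously trained R-ELM; all function and parameter names are illustrative:

```python
import numpy as np

def owlm_train(X, F_ref, D, d1=1.0, d2=0.01, g1=1.0, g2=0.01):
    """Two-stage OWLM training sketch.

    Stage 1: input weights W make the linear hidden layer track the reference
    features F_ref, as in (6.3.24). Stage 2: output weights beta map the
    realised features H = X W to the desired outputs D, as in (6.3.28).
    The ratios d2/d1 and g2/g1 regularize the two solves.
    """
    n = X.shape[1]
    W = np.linalg.solve(X.T @ X + (d2 / d1) * np.eye(n), X.T @ F_ref)
    H = X @ W                                    # linear hidden nodes, no biases
    beta = np.linalg.solve(H.T @ H + (g2 / g1) * np.eye(H.shape[1]), H.T @ D)
    return W, beta
```

Both stages are batch least-squares solves, so the OWLM retains the fast, non-iterative training character of the ELM family.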
6.4 Experiments and Results
In this section, the handwritten digit recognition using the SLFN classifiers trained with the ELM, the R-ELM and the OWLM is implemented. The training data are first
taken from the MNIST handwritten digits database with 60,000 training samples as well
as 10,000 testing samples from multiple disjoint sets of writers [82]. The MNIST
handwritten digits database consists of the handwritten digits from 0 to 9, respectively.
Each digit is a gray-scale image of 28x28 pixels with the intensity range of 0 (black) to
255 (white). For conducting an unbiased experiment, the training and testing sets are all
randomly selected from the pools of the training and the testing samples. A sample set
of handwritten digits is shown in Figure 6.9.
Figure 6.9 A set of handwritten digits from the MNIST database
Before classifying the handwritten digits, each digit image, as in Figure 6.10(a),
is first divided into 14 rows by 14 columns, that is, the image is segmented into 196
small images, as seen in Figure 6.10(b).
Figure 6.10(a) Image of digit Figure 6.10(b) Segmented image
The mean of the pixel intensities of each small segment is computed as:

$$m_k = \frac{1}{PQ} \sum_{p=1}^{P} \sum_{q=1}^{Q} I_k(p, q) \tag{6.4.1}$$

where $I_k(p, q)$ is the intensity of the $(p, q)$th pixel of the $k$th segment, and $P$ (= 2) and $Q$ (= 2) are the numbers of rows and columns of pixels in each segment, respectively.

The means of all of the segments, from row 1 to row 14, in Figure 6.10(b), are then arranged as the elements of the following time series type data vector:

$$x = \big[\, m_1 \;\; m_2 \;\; \cdots \;\; m_{196} \,\big]^T$$
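The segmentation and block-averaging step of (6.4.1) can be sketched with a single NumPy reshape; the function name is an illustrative assumption:

```python
import numpy as np

def block_mean_features(img):
    """Segment a 28x28 digit image into 14x14 blocks of 2x2 pixels and return
    the 196 block means as one row-major feature vector, as in (6.4.1)."""
    assert img.shape == (28, 28)
    blocks = img.reshape(14, 2, 14, 2)       # (block row, row in block, block col, col in block)
    return blocks.mean(axis=(1, 3)).ravel()  # mean of each 2x2 block, flattened row by row
```

Flattening the 14 × 14 grid row by row is what produces the time-series-type data vector that feeds the tapped delay line of the SLFN in Figure 6.8.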
In this experiment, the randomly generated input weights and the hidden layer biases for the SLFNs with both the ELM and the R-ELM are set within [-1, 1]. The SLFN classifiers with the ELM, the R-ELM and the OWLM are trained with the 1000 digit sets from the training set pool, and tested with the 800 digit sets from the testing pool of the MNIST database, to evaluate the classifiers' performances. The regularization ratios $d_2/d_1$ and $\gamma_2/\gamma_1$ for the OWLM are set to match the regularization ratio for the R-ELM, which is set as 0.01.
The sample reference feature vectors for digit 0 from the hidden layer outputs of
the SLFNs with 10, 50, and 100 hidden nodes, respectively, trained with the R-ELM,
are shown in Figure 6.11(a) ~ Figure 6.11(c), respectively:
Figure 6.11 (a) Sample feature of digit 0 with 10 hidden nodes
Figure 6.11 (b) Sample feature of digit 0 with 50 hidden nodes
Figure 6.11 (c) Sample feature of digit 0 with 100 hidden nodes
Figure 6.11(a) ~ Figure 6.11(c) Sample feature vectors for digit 0
It is seen from Figure 6.11(a) ~ Figure 6.11(c) that the amplitudes of the feature vectors' components are always 1 or -1; the hidden nodes behave like standard binary perceptrons, and the SLFN classifier with the R-ELM is more like the well-known threshold networks. Through a detailed observation, it is seen that the values of the dot products of the input pattern vectors and the input weights are always greater than 1 or less than -1, and thus all the hidden nodes work in their saturation regions with outputs of 1 or -1.
The handwritten digit recognition results of the SLFNs trained with the ELM
and the R-ELM have been presented in Figure 6.2 ~ Figure 6.7, respectively. The
handwritten digit recognition results of the SLFNs trained with the OWLM are given in
Figure 6.12 ~ Figure 6.14, where the numbers of the hidden nodes are 10, 50, and 100,
respectively. It is seen that, compared to the recognition results with the ELM and the
R-ELM in Figure 6.2 ~ Figure 6.7, the recognition accuracy of the SLFN classifier
trained with the OWLM has been greatly improved.
Figure 6.12 Recognition of the handwritten digits by using the SLFN classifier with 10
hidden nodes trained with the OWLM
Figure 6.13 Recognition of the handwritten digits by using the SLFN classifier with 50
hidden nodes trained with the OWLM
Figure 6.14 Recognition of the handwritten digits by using the SLFN classifier with
100 hidden nodes trained with the OWLM
Table 6.1 shows the summary of the comparisons of the recognition accuracies
of the SLFN classifiers with the ELM, R-ELM and the OWLM, where the feature
vectors of the SLFN with the R-ELM are used as the reference feature vectors of the
SLFN with the OWLM, and the input weights of the SLFN with the OWLM are
computed by using (6.3.24). It is seen that the SLFN with the OWLM achieves the
highest degree of recognition accuracy and the smallest standard deviation compared to
those of the SLFNs with both the ELM and the R-ELM, over 50 iterations, as the
number of hidden nodes is increased from 10 to 150. The smallest standard deviations
of the SLFN classifier with the OWLM also show a stronger robustness with respect to
the changes of the digit writing styles, which are equivalent to the bounded input noises.
Obviously, the optimization of both the input weights and the output weights of the
SLFN with the OWLM plays a very important role in greatly improving the recognition
performance.
Table 6.1: Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the MNIST dataset
              ELM                 R-ELM               OWLM
Nodes   mean (%)  std (%)   mean (%)  std (%)   mean (%)  std (%)
10       36.55     3.59      37.52     3.51      49.46     3.19
25       54.51     2.37      55.71     2.65      69.58     1.66
50       68.24     1.70      68.86     1.60      79.57     0.76
100      78.58     0.86      78.74     0.71      84.36     0.32
150      82.83     0.55      82.96     0.53      85.16     0.11
Table 6.2 summarizes the recognition accuracies and the standard deviations of
the SLFN classifiers with the ELM, the R-ELM and the OWLM, respectively, for the
dataset from the USPS database [157]. It is seen that, similar to the results in Table 6.1
for the MNIST database, the OWLM achieves the highest degree of classification
accuracy and the smallest standard deviations compared to those of both the ELM and
the R-ELM. Such a consistency of the OWLM in achieving the best classification
performance for both the MNIST and USPS databases in the above has confirmed its
effectiveness for pattern classification and robustness with respect to the changes of the
handwriting styles.
Table 6.2: Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the USPS dataset
              ELM                 R-ELM               OWLM
Nodes   mean (%)  std (%)   mean (%)  std (%)   mean (%)  std (%)
10       36.69     3.07      36.02     3.04      46.66     2.69
25       54.27     2.30      54.38     2.34      68.07     2.30
50       68.29     1.53      67.54     1.75      79.06     0.91
100      78.47     0.85      78.47     0.84      84.00     0.41
150      82.40     0.67      82.66     0.75      85.45     0.24
From the viewpoint of industrial applications, a recognition accuracy up to about
85% is acceptable in many cases. Most importantly, the SLFN classifier with about 150
hidden nodes that is capable of learning the dataset well is generally favorable in
practical applications.
In the experimental results presented in Table 6.1 and Table 6.2 for the MNIST dataset as well as the USPS dataset, the regularization ratios for both the R-ELM and the OWLM are set to the same value (0.01). In order to study the effects of the changes of the regularization ratios on the classification performances, Figure 6.15
shows the classification accuracies of the SLFN classifiers trained with the R-ELM and the OWLM, respectively, for the MNIST dataset, where each SLFN classifier has 150 hidden nodes and the regularization ratios ($d_2/d_1 = \gamma_2/\gamma_1$) are changed from 0.001 to 1000. It is seen that, when the regularization ratios are chosen between 0.1 and 1, the highest classification accuracy of about 86% can be achieved by the SLFN classifier with the OWLM. However, the highest classification accuracy achieved by the SLFN classifier with the R-ELM is only about 82.5%. Figure 6.16 shows the classification accuracies of the SLFN classifiers trained with the R-ELM and the OWLM, respectively, for the USPS dataset. It is seen that, when the regularization ratios are about 100, the highest classification accuracies can be achieved by the SLFN classifiers trained with the R-ELM and the OWLM, respectively. Again, the SLFN classifier with the OWLM is confirmed to perform much better than the one with the R-ELM.
Figure 6.15 Classification accuracy versus regularization ratio for the MNIST dataset
Figure 6.16 Classification accuracy versus regularization ratio for the USPS dataset
Figure 6.17 and Figure 6.18 show the comparisons of the sample means of the
recognition accuracies of the SLFN classifiers with 150 hidden nodes, trained with the
OWLM, the R-ELM and the ELM, respectively, versus the number of training samples
from both the MNIST and USPS databases. It is noted again that the classifier with the
OWLM performs the best compared to the ones with both the ELM and the R-ELM.
Figure 6.17 Classification accuracies with the OWLM, R-ELM and ELM versus the number of training samples for the MNIST dataset
Figure 6.18 Classification accuracies with the OWLM, R-ELM and ELM versus the number of training samples for the USPS dataset
6.5 Conclusion
In this chapter, a new optimal weight learning machine for the handwritten digit image
recognition has been developed. Because of the optimal designs of both the input and
the output weights, the SLFN classifier with a small number of hidden nodes has
demonstrated an excellent performance for the recognition of the handwritten digit
images. The excellent recognition performance of the SLFN classifier equipped with the
OWLM has been evaluated and compared with those of the SLFN classifiers with both
the ELM and the R-ELM.
Chapter 7
Conclusions and Future Work
In this chapter, the contributions in this thesis are summarized, and some of the
interesting research directions are given as possible topics for future research.
7.1 Summary of Contributions
This thesis has investigated the robust designs of the training algorithms for SLFN
pattern classifiers. The focus of this research is on the methods of optimally designing
both the input and output weights of the SLFNs, in order to minimize the prediction
risks, and reduce the effects of noise and undesired pattern components when learning
from a finite sample set. Different from conventional neural network training algorithms,
the proposed new algorithms have not only demonstrated a fast batch learning
characteristic, but also achieved strong robustness with respect to disturbances and
highly non-linear separability. The key contributions of this thesis are summarized as
follows.
In Chapter 3, the FIR-ELM training algorithm has been proposed by using the
FIR filtering techniques to assign the input weights of the SLFNs, while the output layer
weights have also been optimally designed with the regularization method for balancing
and reducing both the structural and empirical prediction risks. The robustness and
generalization capability of the FIR-ELM have been verified with the excellent
classification results in the classification of the audio clips.
Chapter 4 has investigated the performance of the FIR-ELM in the classification
of bioinformatics datasets for cancer diagnosis. It has been seen that the frequency
domain feature selection method proposed in this chapter plays a very important role in determining the filtering type for the input weights' design. The simulation results have further shown that the FIR-ELM performs very well in the pattern classification of both the leukemia and colon cancer datasets.
In Chapter 5, the DFT-ELM training algorithm has been developed for the
SLFNs. The DFT-ELM addresses the feature assignment in the frequency-domain in
terms of the optimal design of the input weights. Different from the FIR-ELM, the
regularization theory is used to design both the input weights and output weights in
order to balance and reduce the structural and empirical prediction risks in the training
phase. The two-stage optimization process has significantly reduced the structural sensitivity of the SLFN classifiers, making them even more robust in practical implementations, as seen in the simulation example.
In Chapter 6, an optimal weights training algorithm, called the OWLM, has been
developed for the SLFNs based on the ELM framework. The reason for using the feature vectors of an ELM-based classifier as the reference feature vectors is that, most of the time, the ELM algorithm can produce separable features when the number of hidden nodes is large enough. Both the theoretical analysis and the simulation examples have shown the excellent classification performance of the OWLM.
7.2 Future Research
This section lists some interesting research directions for possible future research.
7.2.1 Ensemble Methods
All of the theories developed in this thesis relate to single SLFN classifiers only; however, there are interesting works on combinations of SLFNs in the literature that exploit the strength of neural classifier groups in making better classification decisions.
One of the recent works includes the voting based ELM ensemble [64], where the
improved performance is obtained for almost every case tested using real-world datasets when a number of SLFNs trained with the ELM work in parallel to make co-operative classification decisions. It would be interesting to implement the FIR-ELM, the DFT-
ELM, or the OWLM as an ensemble architecture, in order to investigate the possible
gains in performance. Especially in the cases of FIR-ELM and DFT-ELM, the ensemble
architecture expands the possible design space for the hidden layer weights in terms of
the digital filtering theory.
7.2.2 Analytical Determination of Regularization Parameters
For all of the SLFN training algorithms developed in this thesis, the weight training is performed with the regularization method, where the regularizer is commonly defined as the sum of squared weight magnitudes, as seen in chapters 3, 5, and 6. Two
regularization constants are introduced to balance both the structural and empirical risks
respectively in the optimization processes. However, throughout the simulation studies,
the optimal values of these regularization constants are determined empirically using
some version of cross validation. It is therefore highly desired that the optimal values of
the regularization constants can be calculated deterministically in a finite time, such that
the chosen values can automatically balance both the structural and empirical risks
based on the characteristics of the sample datasets, or some a priori information.
7.2.3 Analysis of the Effects of Non-linear Nodes
It can be seen from chapters 3, 5, and 6 that the designs of the training algorithms use
the SLFNs with the linear hidden nodes, as well as the linear output nodes, and an input
tapped delay line memory. This SLFN architecture contains finite depth memory which
makes the learning with dynamics possible. Moreover, the analysis of these
architectures has convenient interpretations in terms of frequency domain and spatial
domain mappings. However, the conventional neural network architecture uses non-linear nodes at the hidden layer of the SLFNs, in order to allow the SLFNs with no
dynamics to learn the complex non-linear mappings. It is then interesting to examine the
theoretical interpretation of the SLFNs used in chapters 3, 5, and 6 if non-linear nodes
are used instead. Some ideas of the possible outcomes are investigated in the simulation
section of chapter 3, where it was found that different non-linear nodes produce vastly
different results, even if only the meta-parameters of the activation function are changed.
A more thorough investigation is required to test the effects of different types of
activation functions at the hidden layer of the SLFNs, in order to specify the theoretical
significance in the choice of nodes to be selected.
7.2.4 Multi-class Classification of Real-World Dataset
The analysis of the practical implementation of the FIR-ELM in chapter 4 shows that
the FIR-ELM has good performance in classifying bioinformatics datasets with binary
outcomes. However, as seen in [69], [125], [126], current advances in the field of cancer diagnosis have shown that it is actually possible to predict multiple types of cancer diseases from the same sample datasets. Therefore, an important question to answer is whether the FIR-ELM can be easily extended to the multi-class case, where the possibility of implementing a single unit classifier, the one-against-all (OAA) classifier, the one-against-one (OAO) classifier, as well as the ensemble type classifier architecture should be considered.
Lastly, it is also interesting to expand the real-world applications of the DFT-
ELM and the OWLM so that more complex problems can be solved in order to
demonstrate the effectiveness of these learning schemes.
Bibliography
[1] S. Haykin, Neural networks and learning machines (3rd Edition), Pearson,
Prentice Hall, New Jersey, 2009.
[2] S. Kumar, Neural networks, McGraw-Hill Companies, Inc., 2006.
[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine:
Theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489-501,
Dec. 2006.
[4] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”
Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.
[5] K. I. Funahashi, “On the approximate realization of continuous mappings by
neural networks,” Neural Networks, vol. 2, pp. 183-192, 1989.
[6] K. Hornik, “Approximation capabilities of multilayer feedforward networks,”
Neural Networks, vol. 4, pp. 251-257, 1991.
[7] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, “Multilayer feedforward
networks with a nonpolynomial activation function can approximate any
function,” Neural Networks, vol. 6, pp. 861-867, 1993.
[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal
representations by error propagation,” Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, vol. 1, pp. 318-362, 1986.
[9] Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A new robust training
algorithm for a class of single hidden layer neural networks,”
Neurocomputing, vol. 74, pp. 2491-2501, 2011.
[10] Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A modified ELM
algorithm for single-hidden layer feedforward neural networks with linear
nodes,” 6th IEEE Conference on Industrial Electronics and Applications,
ICIEA 2011, pp. 2524-2529, 21-23 Jun. 2011.
[11] K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of Bioinformatics
Dataset using Finite Impulse Response Extreme Learning Machine for
Cancer Diagnosis,” Neural Computing & Applications, Available online 30
Jan. 2012, doi:10.1007/s00521-012-0847-z.
[12] K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of microarray
datasets using finite impulse response extreme learning machine for cancer
diagnosis,” 37th Annual Conference on IEEE Industrial Electronics Society,
IECON 2011, pp. 2347-2352, 7-10 Nov. 2011.
[13] W. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in
nervous activity,” Bulletin of Mathematical Biophysics, vol. 7, pp. 115-133,
1943.
[14] J. C. Principe, N. R. Euliano, and W. C. Lefebvre, Neural and Adaptive
Systems: Fundamentals through Simulations, John Wiley & Sons, Inc., New
York, 1999.
[15] S. Kleene, “Representation of events in nerve nets and finite automata,” In
C. Shannon and J. McCarthy, editors, Automata Studies, pp. 3-42,
Princeton University Press, Princeton, N.J., 1956.
[16] F. Rosenblatt, “The Perceptron: A Probabilistic Model for Information
Storage and Organization in the Brain,” Cornell Aeronautical Laboratory,
Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.
[17] M. Minsky and S. Papert, “Perceptrons: An Introduction to Computational
Geometry,” M.I.T. Press, Cambridge, Mass., 1969.
[18] G.-B. Huang, D. H. Wang, and Y. Lan, “Extreme learning machines: a
survey,” Int. J. Mach. Learn. & Cyber, vol. 2, no. 2, pp. 107-122, 2011.
[19] S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.
[20] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning,
vol. 20, pp. 273-297, 1995.
[21] D. Lowe, “Adaptive radial basis function nonlinearities, and the problem of
generalisation,” First IEE International Conference on Artificial Neural
Networks, 1989, (Conf. Publ. No. 313), pp. 171-175, 16-18 Oct. 1989.
[22] S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares
learning algorithm for radial basis function networks,” IEEE Transactions
on Neural Networks, vol. 2, no. 2, pp. 302-309, Mar. 1991.
[23] G.-B. Huang and L. Chen, “Enhanced random search based incremental
extreme learning machine,” Neurocomputing, vol. 71, no. 16-18, pp. 3460-
3468, Oct. 2008.
[24] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,”
Neurocomputing, vol. 70, no. 16-18, pp. 3056-3062, Oct. 2007.
[25] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using
incremental constructive feedforward networks with random hidden nodes,”
IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.
[26] D. S. Broomhead and D. Lowe, “Multi-variable Functional Interpolation
and Adaptive Networks,” Complex Systems, vol. 2, no. 3, pp. 269-303,
1988.
[27] Y. H. Pao and Y. Takefuji, “Functional-link net computing: theory, system
architecture, and functionalities,” IEEE Comput., vol. 25, no. 5, pp. 76-79,
May 1992.
[28] Y. H. Pao, G. H. Park, and D. J. Sobajic, “Learning and generalization
characteristics of random vector functional-link net,” Neurocomputing, vol.
6, pp. 163-180, 1994.
[29] B. Igelnik and Y. H. Pao, “Stochastic choice of basis functions in adaptive
function approximation and the functional-link net,” IEEE Transactions on
Neural Networks, vol. 6, no. 6, pp. 1320-1329, Nov. 1995.
[30] G.-H. Park, Y.-H. Pao, B. Igelnik, K. G. Eyink, and S. R. LeClair, “Neural-
net computing for interpretation of semiconductor film optical ellipsometry
parameters,” IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 816-
829, Jul. 1996.
[31] B. Igelnik, Y.-H. Pao, S. R. LeClair, and C. Y. Shen, “The ensemble
approach to neural-network learning and generalization,” IEEE Transactions
on Neural Networks, vol. 10, no. 1, pp. 19-30, Jan. 1999.
[32] Z. Meng and Y. H. Pao, “Visualization and self-organization of
multidimensional data through equalized orthogonal mapping,” IEEE
Transactions on Neural Networks, vol. 11, no. 4, pp. 1031-1038, Jul. 2000.
[33] Y. Wang, F. Cao, and Y. Yuan, “A study on effectiveness of extreme
learning machine,” Neurocomputing, vol. 74, no. 16, pp. 2483-2490, Sept.
2011.
[34] X. Tang and M. Han, “Partial Lanczos extreme learning machine for single-
output regression problems,” Neurocomputing, vol. 72, no. 13-15, pp. 3066-
3076, Aug. 2009.
[35] G. Zhao, Z. Shen, C. Miao, and Z. Man, “On improving the conditioning of
extreme learning machine: A linear case,” 7th International Conference on
Information, Communications and Signal Processing, ICICS 2009, pp. 1-5,
8-10 Dec. 2009.
[36] H. T. Huynh and Y. Won, “Evolutionary algorithm for training compact
single hidden layer feedforward neural networks,” IEEE International Joint
Conference on Neural Networks, IJCNN 2008, pp. 3028-3033, 1-8 Jun.
2008.
[37] Q.-Y. Zhu, A.-K. Qin, P. N. Suganthan, and G.-B. Huang, “Evolutionary
extreme learning machine,” Pattern Recognition, vol. 38, no. 10, pp. 1759-
1763, Oct. 2005.
[38] X.-K. Wei and Y.-H. Li, “Linear programming minimum sphere set
covering for extreme learning machines,” Neurocomputing, vol. 71, no. 4-6,
pp. 570-575, Jan. 2008.
[39] Y. Yuan, Y. Wang, and F. Cao, “Optimization approximation solution for
regression problem based on extreme learning machine,” Neurocomputing,
vol. 74, no. 16, pp. 2475-2482, Sept. 2011.
[40] F.-N. Francisco, H.-M. César, S.-M. Javier, and A. G. Pedro, “MELM-
GRBF: A modified version of the extreme learning machine for generalized
radial basis function neural networks,” Neurocomputing, vol. 74, no. 16, pp.
2502-2510, Sept. 2011.
[41] H. T. Huynh and Y. Won, “Extreme Learning Machine with Fuzzy
Activation Function,” Fifth International Joint Conference on INC, IMS and
IDC, NCM 2009, pp. 303-307, 25-27 Aug. 2009.
[42] X.-Z. Wang, A. Chen, and H. Feng, “Upper integral network with extreme
learning mechanism,” Neurocomputing, vol. 74, no. 16, pp. 2520-2525, Sept.
2011.
[43] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “Fully
complex extreme learning machine,” Neurocomputing, vol. 68, pp. 306-314,
Oct. 2005.
[44] G.-B. Huang, M.-B. Li, L. Chen, and C.-K. Siew, “Incremental extreme
learning machine with fully complex hidden nodes,” Neurocomputing, vol.
71, no. 4-6, pp. 576-583, Jan. 2008.
[45] J.-S. Lim, “Recursive DLS solution for extreme learning machine-based
channel equalizer,” Neurocomputing, vol. 71, no. 4-6, pp. 592-599, Jan.
2008.
[46] F. Han and D.-S. Huang, “Improved extreme learning machine for function
approximation by encoding a priori information,” Neurocomputing, vol. 69,
no. 16-18, pp. 2369-2373, Oct. 2006.
[47] Y. Lan, Y. C. Soh, and G.-B. Huang, “Constructive hidden nodes selection
of extreme learning machine for regression,” Neurocomputing, vol. 73, no.
16-18, pp. 3191-3199, Oct. 2010.
[48] S.-S. Kim and K.-C. Kwak, “Incremental modeling with rough and fine
tuning method,” Applied Soft Computing, vol. 11, no. 1, pp. 585-591, Jan.
2011.
[49] J. Deng, K. Li, and G. W. Irwin, “Fast automatic two-stage nonlinear model
identification based on the extreme learning machine,” Neurocomputing, vol.
74, no. 16, pp. 2422-2429, Sept. 2011.
[50] L. Chen, G.-B. Huang, and H. K. Pung, “Systemical convergence rate
analysis of convex incremental feedforward neural networks,”
Neurocomputing, vol. 72, no. 10-12, pp. 2627-2635, Jun. 2009.
[51] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-
ELM: Optimally Pruned Extreme Learning Machine,” IEEE Transactions
on Neural Networks, vol. 21, no. 1, pp. 158-162, Jan. 2010.
[52] Y. Miche, M. Heeswijk, P. Bas, O. Simula, and A. Lendasse, “TROP-ELM:
A double-regularized ELM using LARS and Tikhonov regularization,”
Neurocomputing, vol. 74, no. 16, pp. 2413-2421, Sept. 2011.
[53] J. Yin, F. Dong, and N. Wang, “Modified Gram-Schmidt Algorithm for
Extreme Learning Machine,” Second International Symposium on
Computational Intelligence and Design, ISCID 2009, vol. 2, pp. 517-520,
12-14 Dec. 2009.
[54] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. Zhu, “A fast pruned-extreme
learning machine for classification problem,” Neurocomputing, vol. 72, no.
1-3, pp. 359-366, Dec. 2008.
[55] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A fast
and accurate online sequential learning algorithm for feedforward networks,”
IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.
[56] Y. Jun and M. J. Er, “An enhanced online sequential extreme learning
machine algorithm,” in Proc. of Control and Decision Conference, CCDC
2008, pp. 2902-2907, Yantai, Shandong, 2-4 Jul. 2008.
[57] Y. Lan, Y. C. Soh, and G.-B. Huang, “A constructive enhancement for
Online Sequential Extreme Learning Machine,” International Joint
Conference on Neural Networks, IJCNN 2009, pp. 1708-1713, 14-19 Jun.
2009.
[58] H.-J. Rong, G.-B. Huang, N. Sundararajan, and P. Saratchandran, “Online
Sequential Fuzzy Extreme Learning Machine for Function Approximation
and Classification Problems,” IEEE Transactions on Systems, Man, and
Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067-1072, Aug. 2009.
[59] H. T. Huynh and Y. Won, “Online training for single hidden-layer
feedforward neural networks using RLS-ELM,” IEEE International
Symposium on Computational Intelligence in Robotics and Automation,
CIRA 2009, pp. 469-473, 15-18 Dec. 2009.
[60] G. Li, M. Liu, and M. Dong, “A new online learning algorithm for structure-
adjustable extreme learning machine,” Computers & Mathematics with
Applications, vol. 60, no. 3, pp. 377-389, Aug. 2010.
[61] D. H. Wang, “ELM-based multiple classifier systems,” in Proc. of 9th
International Conference on Control, Automation, Robotics and Vision,
Singapore, Dec. 2006.
[62] Y. Lan, Y.-C. Soh, and G.-B. Huang, “Ensemble of online sequential
extreme learning machine,” Neurocomputing, vol. 72, no. 13-15, pp. 3391-
3395, 2009.
[63] N. Liu and H. Wang, “Ensemble based extreme learning machine,” IEEE
Signal Processing Letters, vol. 17, no. 8, pp. 754-757, 2010.
[64] J. Cao, Z. Lin, G.-B. Huang, and N. Liu, “Voting based extreme learning
machine,” Information Sciences, vol. 185, no. 1, pp. 66-77, Feb. 2012.
[65] M. Heeswijk, Y. Miche, E. Oja, and A. Lendasse, “GPU-accelerated and
parallelized ELM ensembles for large-scale regression,” Neurocomputing,
vol. 74, no. 16, pp. 2430-2437, Sept. 2011.
[66] T. Helmy and Z. Rasheed, “Multi-category bioinformatics dataset
classification using extreme learning machine,” IEEE Congress on
Evolutionary Computation, CEC 2009, pp. 3234-3240, 18-21 May 2009.
[67] D. Wang and G.-B. Huang, “Protein sequence classification using extreme
learning machine,” in Proc. IEEE International Joint Conference on Neural
Networks, IJCNN 2005, vol. 3, pp. 1406-1411, 31 Jul.-4 Aug. 2005.
[68] G. Wang, Y. Zhao, and D. Wang, “A protein secondary structure prediction
framework based on the Extreme Learning Machine,” Neurocomputing, vol.
72, no. 1-3, pp. 262-268, Dec. 2008.
[69] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran,
“Multicategory Classification Using An Extreme Learning Machine for
Microarray Gene Expression Cancer Diagnosis,” IEEE/ACM Transactions
on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485-495,
Jul.-Sept. 2007.
[70] S. Baboo and S. Sasikala, “Multicategory classification using an Extreme
Learning Machine for microarray gene expression cancer diagnosis,” IEEE
International Conference on Communication Control and Computing
Technologies, ICCCCT 2010, pp. 748-757, 7-9 Oct. 2010.
[71] D. Yu and L. Deng, “Efficient and effective algorithms for training single-
hidden-layer neural networks,” Pattern Recognition Letters, vol. 33, no. 5,
pp. 554-558, Apr. 2012.
[72] B. P. Chacko, V. R. V. Krishnan, G. Raju, and P. B. Anto, “Handwritten
character recognition using wavelet energy and extreme learning machine,”
Int. J. Mach. Learn. & Cyber, vol. 3, pp. 149-161, 2012.
[73] F.-C. Li, P.-K. Wang, and G.-E. Wang, “Comparison of the primitive
classifiers with Extreme Learning Machine in credit scoring,” IEEE
International Conference on Industrial Engineering and Engineering
Management, IEEM 2009, pp. 685-688, 8-11 Dec. 2009.
[74] G. Duan, Z. Huang, and J. Wang, “Extreme Learning Machine for Bank
Clients Classification,” International Conference on Information
Management, Innovation Management and Industrial Engineering, 2009,
vol. 2, pp. 496-499, 26-27 Dec. 2009.
[75] W. Deng, Q.-H. Zheng, S. Lian, and L. Chen, “Adaptive personalized
recommendation based on adaptive learning,” Neurocomputing, vol. 74, no.
11, pp. 1848-1858, May 2011.
[76] X.-G. Zhao, G. Wang, X. Bi, P. Gong, and Y. Zhao, “XML document
classification based on ELM,” Neurocomputing, vol. 74, no. 16, pp. 2444-
2451, Sept. 2011.
[77] Y. Sun, Y. Yuan, and G. Wang, “An OS-ELM based distributed ensemble
classification framework in P2P networks,” Neurocomputing, vol. 74, no.
16, pp. 2438-2443, Sept. 2011.
[78] W. Deng, Q.-H. Zheng, and L. Chen, “Real-Time Collaborative Filtering
Using Extreme Learning Machine,” IEEE/WIC/ACM International Joint
Conferences on Web Intelligence and Intelligent Agent Technologies, WI-
IAT 2009, vol. 1, pp. 466-473, 15-18 Sept. 2009.
[79] Q. J. B. Loh and S. Emmanuel, “ELM for the Classification of Music
Genres,” 9th International Conference on Control, Automation, Robotics
and Vision, ICARCV 2006, pp. 1-6, 5-8 Dec. 2006.
[80] I. W. Sandberg, “General structures for classification,” IEEE Transactions
on Circuits and Systems I, vol. 41, pp. 372-376, May 1994.
[81] G.-B. Huang, Y. Chen, and H. A. Babri, “Classification ability of single
hidden layer feedforward neural networks,” IEEE Transactions on Neural
Networks, vol. 11, pp. 799-801, May 2000.
[82] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning
Applied to Document Recognition,” in Proc. of the IEEE, vol. 86, no. 11, pp.
2278-2324, 1998.
[83] V. S. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory,
and Methods (2nd Edition), John Wiley & Sons, Inc., New York, 2007.
[84] A. N. Tikhonov, “On solving ill-posed problem and method of regularization,”
Doklady Akademii Nauk USSR, vol. 153, pp. 501-504, 1963.
[85] A. N. Tikhonov and V. Y. Arsenin, “Solution of Ill-Posed Problems,”
Washington, DC: Winston, 1977.
[86] V. N. Vapnik, The Nature of Statistical Learning Theory (2nd Edition), New
York, Springer, 1999.
[87] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University
Press, Walton Street, Oxford, 1995.
[88] K.-A. Toh, “Deterministic neural classification,” Neural Computation, vol.
20, no. 6, pp. 1565-1595, Jun. 2008.
[89] W. Deng, Q.-H. Zheng, and L. Chen, “Regularized extreme learning
machine,” in Proc. IEEE Symp. CIDM, pp. 389-395, Mar. 30-Apr. 2, 2009.
[90] G.-B. Huang, X. Ding, and H. Zhou, “Optimization method based extreme
learning machine for classification,” Neurocomputing, vol. 74, Issues 1-3,
pp. 155-163, Dec. 2010.
[91] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural
computation, Addison-Wesley Publishing Company, 1991.
[92] Z. Man, H. R. Wu, and M. Palaniswami, “An adaptive tracking controller
using Neural Networks for a class of nonlinear systems,” IEEE Transactions
on Neural Networks, vol. 9, no. 5, pp. 947-955, 1998.
[93] G.-B. Huang, Q. Zhu, K. Mao, C. K. Siew, P. Saratchandran, and N.
Sundararajan, “Can threshold networks be trained directly?,” IEEE
Transactions on Circuits and Systems II, vol. 53, no. 3, pp. 187-191, 2008.
[94] G.-B. Huang and H. A. Babri, “Upper bounds on the number of hidden
neurons in feedforward networks with arbitrary bounded nonlinear
activation functions,” IEEE Transactions on Neural Networks, vol. 9, pp.
224-228, 1998.
[95] I. Mrazova and D. H. Wang, “Improved generalization of neural classifiers
with enforced internal representation,” Neurocomputing, vol. 70, pp. 2940-
2952, 2007.
[96] D. H. Wang, Y.-S. Kim, S. C. Park, C. S. Lee, and Y. K. Han, “Learning
based neural similarity metrics for multimedia data mining,” Soft
Computing, vol. 11, pp. 335-340, 2007.
[97] D. H. Wang and X. H. Ma, “A hybrid image retrieval system with user's
relevance feedback using neurocomputing,” Informatica, vol. 29, pp. 271-
279, 2005.
[98] P. Bao and D. H. Wang, “An edge-preserving image reconstruction using
neural network,” International Journal of Mathematical Imaging and Vision,
vol. 14, pp. 117-130, 2001.
[99] D. H. Wang and P. Bao, “Enhancing the estimation of plant Jacobian for
adaptive neural inverse control,” Neurocomputing, vol. 34, pp. 99-115, 2000.
[100] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and
regression trees, Wadsworth International, Belmont, CA, 1984.
[101] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical
foundations, Cambridge University Press, Cambridge, 1999.
[102] L. Devroye, L. Gyorfi, and G. Lugosi, A probabilistic theory of pattern
recognition, Springer-Verlag, New York, 1996.
[103] R. Duda and P. Hart, Pattern classification and scene analysis, John Wiley,
New York, 1973.
[104] K. Fukunaga, Introduction to statistical pattern recognition, Academic
Press, New York, 1972.
[105] M. Kearns and U. Vazirani, An introduction to computational learning
theory, MIT Press, Cambridge, Massachusetts, 1994.
[106] J. Proakis and D. Manolakis, Digital signal processing (3rd Edition),
Prentice Hall, 1996.
[107] S. M. Kuo, B. H. Lee, and W. Tian, Real-time digital signal processing,
John Wiley & Sons Ltd, 2007.
[108] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-time signal
processing, Prentice Hall, 1999.
[109] E. C. Ifeachor and B. W. Jervis, Digital signal processing: A practical
approach (2nd Edition), Prentice Hall, 2002.
[110] S. S. Rao, Engineering optimization: Theory and practice, John Wiley &
Sons, Inc., 1996.
[111] P. S. Iyer, Operations research, Tata McGraw-Hill, 2008.
[112] F. S. Hillier and G. J. Lieberman, Introduction to operations research,
McGraw-Hill, 2005.
[113] I. W. Sandberg, “Approximation theorems for discrete-time systems,” IEEE
Transactions on Circuits and Systems, vol. 38, no. 5, pp. 564-566, 1991.
[114] K. Ogata, Modern Control Engineering, Prentice Hall PTR, Upper Saddle
River, NJ, 2001.
[115] W. J. Rugh, Linear system theory (2nd Edition), Prentice Hall, 1996.
[116] M. S. Santina, A. R. Stubberud, and G. H. Hostter, Digital control design,
Saunders College Publishing, 1998.
[117] K. J. Astrom and B. Wittenmark, Computer-controlled systems (3rd Edition),
Prentice Hall, 1997.
[118] S. Dudoit and J. Fridlyand, “Introduction to classification in microarray
experiments,” in D. P. Berrar, W. Dubitzky, M. Granzow, (Eds.), A
Practical Approach to Microarray Data Analysis, Norwell, MA, Kluwer,
2002.
[119] Y. Lu and J. Han, “Cancer classification using gene expression data,”
Information Systems, vol. 28, no. 4, pp. 243-268, 2003.
[120] W. Huber, A. C. Heydebreck, and M. Vingron, “Analysis of microarray
gene expression data,” In Martin Bishop et al., editor, Handbook of
Statistical Genetics, Chichester, UK, John Wiley & Sons, Ltd, 2003.
[121] J. Misra, W. Schmitt, D. Hwang, L. Hsiao, S. Gullans, and G.
Stephanopoulos, “Interactive Exploration of Microarray Gene Expression
Patterns in a Reduced Dimensional Space,” Genome Res, vol. 12, no. 7, pp.
1112-1120, 2002.
[122] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular Value
Decomposition and Principal Component Analysis,” in D. P. Berrar, W.
Dubitzky, M. Granzow, (Eds.), A Practical Approach to Microarray Data
Analysis, Norwell, MA, Kluwer, pp. 91-109, 2003.
[123] X. Liao, N. Dasgupta, S. M. Lin, and L. Carin, “ICA and PLS modelling for
functional analysis and drug sensitivity for DNA microarray signals,” in
Proc. Workshop on Genomic Signal Processing and Statistics, 2002.
[124] A. Chen and J.-C. Hsu, “Exploring novel algorithms for the prediction of
cancer classification,” 2nd International Conference on Software
Engineering and Data Mining, SEDM, pp. 378-383, 2010.
[125] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M.
Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald,
M. Loda, E. Lander, and T. Golub, “Multiclass Cancer Diagnosis Using
Tumor Gene Expression Signatures,” in Proc. Natl. Acad. Sci. USA, vol. 98,
no. 26, pp. 15149-15154, 2002.
[126] D. Baboo and M. Sasikala, “Multicategory Classification Using Support
Vector Machine for Microarray Gene Expression Cancer Diagnosis,” Global
Journal of Computer Science and Technology, 2010.
[127] J. Sanchez-Monedero, M. Cruz-Ramirez, F. Fernandez-Navarro, J.
Fernandez, P. Gutierrez, and C. Hervas-Martinez, “On the suitability of
Extreme Learning Machine for gene classification using feature selection,”
10th International Conference on Intelligent Systems Design and
Applications, ISDA 2010, pp. 507-512, Nov. 29 2010-Dec. 1 2010.
[128] A. Bharathi and A. Natarajan, “Microarray gene expression cancer diagnosis
using Machine Learning algorithms,” International Conference on Signal
and Image Processing, ICSIP 2010, pp. 275-280, 15-17 Dec. 2010.
[129] G. Unger and B. Chor, “Linear Separability of Gene Expression Data Sets,”
IEEE/ACM Transactions on Computational Biology and Bioinformatics,
vol. 7, no. 2, pp. 375-381, Apr.-Jun. 2010.
[130] T. Pham, D. Beck, and H. Yan, “Spectral Pattern Comparison Methods for
Cancer Classification Based on Microarray Gene Expression Data,” IEEE
Transactions on Circuits and Systems I: Regular Papers, vol. 53, no. 11,
pp. 2425-2430, Nov. 2006.
[131] Y. Liu, J. Shen, and J. Cheng, “Cancer Classification Based on the
“Fingerprint” of Microarray Data,” The 1st International Conference on
Bioinformatics and Biomedical Engineering, ICBBE 2007, pp. 176-179, 6-8
Jul. 2007.
[132] J. P. Brody, B. A. Williams, B. J. Wold, and S. R. Quake, “Significance and
statistical errors in the analysis of DNA microarray data,” in Proc. Natl.
Acad. Sci. USA, vol. 99, no. 20, pp. 12975-12978, 2002.
[133] G. Arce and Y. Li, “Median power and median correlation theory,” IEEE
Transactions on Signal Processing, vol. 50, no. 11, pp. 2768-2776, Nov.
2002.
[134] R. Salakhutdinov, “Learning in Markov random fields using tempered
transitions,” in Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A.
Culota (Eds.), Advances in neural information processing systems, 22,
Cambridge, MA, MIT Press, 2009.
[135] L. Yang, H. Yan, Y. X. Dong, and L. Y. Fei, “A kind of correlation
classification distance of whole phase based on weight,” International
Conference on Environmental Science and Information Application
Technology, ESIAT 2010, vol. 3, pp. 668-671, 17-18 Jul. 2010.
[136] C. Chatfield, The Analysis of Time Series: An Introduction (6th Edition),
Chapman and Hall, 2004.
[137] A. Ben-Dor, A. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z.
Yakhini, “Tissue Classification with Gene Expression Profiles,” J.
Computational Biology, vol. 7, no. 3/4, pp. 559-583, 2000.
[138] S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.
R. Golub, and J. P. Mesirov, “Estimating Dataset Size Requirements for
Classifying DNA Microarray Data,” J. Computational Biology, vol. 10, no.
2, pp. 119-142, 2003.
[139] Y. Miche, P. Bas, C. Jutten, O. Simula, and A. Lendasse, “A methodology
for building regression models using Extreme Learning Machine: OP-ELM,”
in ESANN 2008, European Symposium on Artificial Neural Networks,
Bruges, Belgium, 2008.
[140] J. Li and H. Liu, "Kent ridge bio-medical data set repository," School of
Computer Engineering, Nanyang Technological University, Singapore,
2004. [Online]. Available: http://levis.tongji.edu.cn/gzli/data/mirror-
kentridge.html.
[141] A. M. Sarhan, “Cancer classification based on microarray gene expression
data using DCT and ANN,” Journal of Theoretical and Applied Information
Technology (JATIT), vol. 6, no. 2, pp. 208-216, 2009.
[142] A. H. Ali, “Self-Organization Maps for Prediction of Kidney Dysfunction,”
in Proc. 16th Telecommunications Forum TELFOR, Belgrade, Serbia, 2008.
[143] S. Haykin, Adaptive filter theory (Third Edition), Prentice Hall, New Jersey,
1996.
[144] X. Yao, “Evolving artificial neural networks,” in Proc. of the IEEE, vol. 87,
no. 9, pp. 1423-1447, 1999.
[145] G. Zhang, “Neural networks for classification: A survey,” IEEE
Transactions on Systems, Man, and Cybernetics, Part C, vol. 30, no. 4, pp.
451-462, 2000.
[146] Z. Man, S. Liu, H. R. Wu, and X. Yu, “A new adaptive back-propagation
algorithm based on Lyapunov stability theory for neural networks,” IEEE
Transactions on Neural Networks, vol. 17, no. 6, pp. 1580-1591, 2006.
[147] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme Learning Machine
for Regression and Multiclass Classification,” IEEE Transactions on
Systems, Man, and Cybernetics, Part B: Cybernetics, vol. PP, no. 99, pp. 1-
17, 2011.
[148] C. L. Phillips, J. M. Parr, and E. A. Riskin, Signals, Systems, and
Transforms, Prentice Hall, 2003.
[149] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural
networks architectures,” Neural Computation, vol. 7, pp. 219-269, 1995.
[150] S. Knerr, L. Personnaz, and G. Dreyfus, “Handwritten digit recognition by
neural networks with single-layer training,” IEEE Transactions on Neural
Networks, vol. 3, pp. 962–968, 1992.
[151] C. Liu, K. Nakashima, H. Sako, and H. Fujisawa, “Handwritten digit
recognition: benchmarking of state-of-the-art techniques,” Pattern
Recognition, vol. 36, pp. 2271–2285, 2003.
[152] E. Kussul and T. Baidyk, “Improved method of handwritten digit
recognition tested on MNIST database,” Image and Vision Computing, vol.
22, pp. 971–981, 2004.
[153] B. Zhang, M. Fu, and H. Yan, “A nonlinear neural network model of
mixture of local principal component analysis: application to handwritten
digits recognition,” Pattern Recognition, vol. 34, pp. 203–214, 2001.
[154] D. J. Albers, J. C. Sprott, and W. D. Dechert, “Routes to chaos in neural
networks with random weights,” Int. J. of Bifurcation and Chaos, vol. 8, no.
7, pp. 1463-1478, 1998.
[155] W. F. Schmidt, M. A. Kraaijveld, and R. P. W. Duin, “Feed forward neural
networks with random weights,” in Proc. of the 11th IAPR International
Conference on Pattern Recognition, vol. 2, pp. 1–4, 1992.
[156] I. Y. Tyukin and D. V. Prokhorov, “Feasibility of random basis function
approximators for modeling and control,” in Proc. of the IEEE International
Conference on Control Applications (CCA) and Intelligent Control (ISIC),
pp. 1391–1399, 2009.
[157] Y. LeCun, B. Boser, J. S. Denker, R. E. Howard, W. Habbard, L. D. Jackel,
and D. Henderson, “Handwritten digit recognition with a back-propagation
network,” Advances in neural information processing systems, vol. 2, pp.
396-404, 1990.
[158] L. N. Li, J. H. Ouyang, H. L. Chen, and D. Y. Liu, “A Computer Aided
Diagnosis System for Thyroid Disease Using Extreme Learning Machine,” J.
Med. Syst., 2012. doi:10.1007/s10916-012-9825-3
[159] E. Malar, A. Kandaswamy, D. Chakravarthy, and A. G. Dharan, “A novel
approach for detection and classification of mammographic
microcalcifications using wavelet analysis and extreme learning machine,”
Comp. in Bio. and Med., vol. 42, no. 9, pp. 898-905, 2012.
[160] Y. Song, J. Crowcroft, and J. Zhang, “Automatic epileptic seizure detection
in EEGs based on optimized sample entropy and extreme learning machine,”
Journal of Neuroscience Methods, vol. 210, no. 2, pp. 132–146, 2012.
[161] Q. Yuan, W. Zhou, Y. Liu, and J. Wang, “Epileptic seizure detection with
linear and nonlinear features,” Epilepsy and Behavior, vol. 24, no. 4, pp.
415–421, 2012.
[162] L.-C. Shi, and B.-L. Lu, “EEG-based vigilance estimation using extreme
learning machines,” Neurocomputing, vol. 102, pp. 135–143, 2013.
[163] S. Decherchi, P. Gastaldo, R. Zunino, E. Cambria, and J. Redi, “Circular-
ELM for the reduced-reference assessment of perceived image
quality,” Neurocomputing, vol. 102, pp. 78-89, 2013.
[164] L. Wang, Y. Huang, X. Luo, Z. Wang, and S. Luo, “Image deblurring with
filters learned by extreme learning machine,” Neurocomputing, vol. 74, no.
16, pp. 2464–2474, 2011.
[165] J. Yang, S. Xie, D. Park, and Z. Fang, “Fingerprint Matching based on
Extreme Learning Machine,” Neural Computing and Applications, vol. 14,
pp. 1–11, 2012.
[166] W. Zong, and G.-B. Huang, “Face recognition based on extreme learning
machine,” Neurocomputing, vol. 74, no. 16, pp. 2541–2551, 2011.
[167] I. Marqués, and M. Graña, “Face recognition with lattice independent
component analysis and extreme learning machines,” Soft Comput., vol. 16,
no. 9, pp. 1525–1537, 2012.
[168] Y. Yu, T.-M. Choi, and C.-L. Hui, “An intelligent fast sales forecasting
model for fashion products,” Expert Syst. Appl., vol. 38, no. 6, pp. 7373–
7379, 2011.
[169] M. Xia, Y. Zhang, L. Weng, and X. Ye, “Fashion retailing forecasting based
on extreme learning machine with adaptive metrics of inputs,” Knowl.-
Based Syst., vol. 36, pp. 253–259, 2012.
[170] F. L. Chen, and T. Y. Ou, “Sales forecasting system based on Gray extreme
learning machine with Taguchi method in retail industry,” Expert Syst.
Appl., vol. 38, no. 3, pp. 1336–1345, 2011.
[171] Y. Xu, Y. Dai, Z. Y. Dong, R. Zhang, and K. Meng, “Extreme learning
machine-based predictor for real-time frequency stability assessment of
electric power systems,” Neural Computing and Applications, vol. 22, no.
3–4, pp. 501-508, 2013.
[172] S. Wu, Y. Wang, and S. Cheng, “Extreme learning machine based wind
speed estimation and sensorless control for wind turbine power generation
system,” Neurocomputing, vol. 102, pp. 163–175, 2013.
[173] J. Tang, D. Wang, and T. Chai, “Predicting mill load using partial least
squares and extreme learning machines,” Soft Comput., vol. 16, no. 9, pp.
1585–1594, 2012.
[174] H. Wang, G. Qian, and X. Q. Feng, “Predicting consumer sentiments using
online sequential extreme learning machine and intuitionistic fuzzy sets,”
Neural Comput Appl., 2012. doi:10.1007/s00521-012-0853-1
[175] W. Zheng, Y. Qian, and H. Lu, “Text categorization based on regularization
extreme learning machine,” Neural Computing and Applications, vol. 22, no.
3–4, pp. 447–456, 2013.
[176] H.-J. Rong and G.-S. Zhao, “Direct adaptive neural control of nonlinear
systems with extreme learning machine,” Neural Computing and
Applications, vol. 22, no. 3–4, pp. 577–586, 2013.
[177] Y. Yang, Y. Wang, X. Yuan, Y. Chen, and L. Tan, “Neural network-based
self-learning control for power transmission line deicing robot”, Neural
Computing & Applications, 2012. doi:10.1007/s00521-011-0789-x
[178] A. K. Jain, R. P. W. Duin, and J. Mao, “Statistical pattern recognition: a
review,” IEEE Trans Pattern Anal Machine Intell, vol. 22, pp. 4–37, 2000.
[179] J. Anderson, A. Pellionisz, and E. Rosenfeld, Neurocomputing 2: Directions
for Research, Cambridge Mass., MIT Press, 1990.
[180] X. Liu, C. Gao, and P. Li, “A comparative analysis of support vector
machines and extreme learning machines,” Neural Networks, vol. 33, pp.
58–66, 2012.
[181] R. A. Fisher, “The use of multiple measurements in taxonomic problems,”
Annals of Eugenics 7, Part II, pp. 179–188, 1936.
[182] H. Kim, B. L. Drake, and H. Park, “Multiclass classifiers based on
dimension reduction with generalized lda,” Pattern Recogn., vol. 40, no. 11,
pp. 2939–2945, 2007.
[183] M. D. Richard, and R. Lippmann, “Neural network classifiers estimate
Bayesian a posteriori probabilities,” Neural Comput., vol. 3, pp. 461–483,
1991.
[184] P. Gallinari, S. Thiria, R. Badran, and F. Fogelman-Soulie, “On the
relationships between discriminant analysis and multilayer perceptrons,”
Neural Networks, vol. 4, pp. 349–360, 1991.
[185] P. L. Bartlett, “The sample complexity of pattern classification with neural
networks: The size of the weights is more important than the size of the
network,” IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.
[186] R. Wang, S. Kwong, and X. Wang, “A study on random weights between
input and hidden layers in extreme learning machine,” Soft Comput., vol. 16,
no. 9, pp. 1465–1475, 2012.
[187] P. Horata, S. Chiewchanwattana, and K. Sunat, “A comparative study of
pseudo-inverse computing for the extreme learning machine classifier,”
Data Mining and Intelligent Information Technology Applications (ICMiA),
2011 3rd International Conference on, pp. 40–45, 24–26 Oct. 2011.
[188] X.-K. Wei, Y.-H. Li, and Y. Feng, “Comparative Study of Extreme
Learning Machine and Support Vector Machine”, Proceedings of
International Conference on Intelligent Sensing and Information Processing,
pp. 1089–1095, 2006.
[189] Q. Liu, Q. He, and Z. Shi, “Extreme support vector machine classifier,”
Lecture Notes in Computer Science, vol. 5012, pp. 222–233, 2008.
[190] B. Frénay, and M. Verleysen, “Using SVMs with randomised feature spaces:
An extreme learning approach,” in Proc. 18th ESANN, Bruges, Belgium, pp.
315–320, Apr. 28–30, 2010.
[191] A. M. Sarhan, “A Comparison of Vector Quantization and Artificial Neural
Network Techniques in Typed Arabic Character Recognition,” International
Journal of Applied Engineering Research (IJAER), vol. 4, no. 5, pp. 805-
817, May, 2009.
[192] A. M. Sarhan, and O. I. Al-Helalat, “A Novel Approach to Arabic Characters
Recognition Using A Minimum Distance Classifier,” In Proceedings of the
World Congress on Engineering, London, U.K., July 2007.
Appendix

Matlab Codes

1.1 Matlab code for FIR-ELM

function [FIRELMrmse, FIRELMstd, ELMrmse, ELMstd] = ...
    FIRELM(sample, target, filmat, reg_d, reg_r, test_snr, numite)
% Finite Impulse Response Extreme Learning Machine algorithm
%
% [FIRELMrmse, FIRELMstd, ELMrmse, ELMstd] = ...
%     FIRELM(sample, target, filmat, reg_d, reg_r, test_snr, numite)
%
% Inputs:
%   sample     is a M x N sample pattern matrix with:
%              M sample pattern vectors of length N
%
%   target     is a M x 1 class label matrix with:
%              M scalar target labels
%
%   filmat     is a N x n hidden layer matrix where:
%              each column is a N x 1 FIR filter
%              such that n is the number of neurons
%
%   reg_d      is the first regularization parameter
%              such that the regularization ratio is d/r
%
%   reg_r      is the second regularization parameter
%              such that the regularization ratio is d/r
%
%   test_snr   is the SNR value in dB for the sample
%              patterns during testing
%
%   numite     is the number of iterations to run the
%              testing with noisy samples
%
% Outputs:
%   FIRELMrmse is the mean RMSE for the FIR-ELM after
%              numite iterations
%
%   FIRELMstd  is the standard deviation of the FIR-ELM
%              RMSE values over numite iterations
%
%   ELMrmse    is the mean RMSE for the ELM after
%              numite iterations
%
%   ELMstd     is the standard deviation of the ELM
%              RMSE values over numite iterations
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao,
%   “A new robust training algorithm for a class of
%   single hidden layer neural networks,” Neurocomputing,
%   vol. 74, pp. 2491-2501, 2011.
%
% Email: kevinhklee@live.com.au
%
% Copyright (C) 2011 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
% http://www.gnu.org/copyleft/gpl.html
%% Set up experiment parameters
rpt = numite;
Am = sample;
Output = target;
xaxis = 1:size(Am,1);

%% Neural network parameters
NumData = size(Am,1);
NumInput = size(Am,2);
NumNeuron = size(filmat,2);

%% Train FIR-ELM
% initialize hidden layer weights using designed filter matrix
Nw = filmat;

% initialize random biases matrix
B = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    B(:,i) = 1*rand() - .5;
end

% solve for output layer weights
NG = Am*Nw + B;
d = reg_d;
r = reg_r;
Beta2 = inv((eye(size(NG'*NG)).*(d/r)) + NG'*NG)*NG'*Output;

%% Train ELM
% initialize hidden layer weights using random method
W = zeros(NumInput,NumNeuron);
for j = 1:NumInput
    for i = 1:NumNeuron
        W(j,i) = 2*rand() - 1;
    end
end
% initialize random biases matrix
b = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    b(:,i) = 2*rand() - 1;
end

% solve for output layer weights
G = Am*W + b;
Hinv = pinv(G);
Beta = Hinv*Output;

%% Testing phase
% calculate standard output for FIR-ELM
A = Am;
NG = A*Nw + B;
O2D = NG*Beta2;

% perform testing with noisy samples
RMSE_Random = zeros(1,rpt);
RMSE_FIR = zeros(1,rpt);
for q = 1:rpt

    % Add noise (awgn requires the Communications System Toolbox)
    A = awgn(Am,test_snr,'measured');

    % @Random Case@
    G = A*W + b;
    O1 = G*Beta;

    % @FIR Filter@
    NG = A*Nw + B;
    O2 = NG*Beta2;

    % Calculate RMSE of random case
    count = 0;
    error = zeros(1,length(xaxis));
    for t = xaxis
        count = count+1;
        error(count) = Output(count)-O1(count);
    end
    RMSE_Random(q) = sqrt((sum(error.^2))/count);
    % Plot output of random case
    figure(1)
    plot(xaxis,Output,'-.',xaxis,O1,'o'); grid on; title('ELM')
    axis([0 11 0.6 2.5]);
    legend('target','output'); hold on;
    figure(2)
    plot(xaxis,error); title('Error ELM');
    axis([0 11 -1 1]);
    grid on; hold on;

    % Calculate RMSE of FIR case
    count = 0;
    error2 = zeros(1,length(xaxis));
    for t = xaxis
        count = count+1;
        error2(count) = Output(count)-O2(count);
    end
    RMSE_FIR(q) = sqrt((sum(error2.^2))/count);

    % Plot output of FIR case
    figure(3)
    plot(xaxis,O2D,'-.',xaxis,O2,'o'); title('FIR-ELM'); grid on;
    axis([0 11 0.6 2.5]);
    legend('target','output'); hold on;
    figure(4)
    plot(xaxis,error2); title('Error FIR-ELM')
    axis([0 11 -1 1]);
    grid on; hold on;
end

ELMrmse = mean(RMSE_Random);
FIRELMrmse = mean(RMSE_FIR);
ELMstd = std(RMSE_Random);
FIRELMstd = std(RMSE_FIR);

%**************************************************************
% End of Code: FIRELM.m
%**************************************************************
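A minimal call sketch for FIRELM may be helpful. Everything below is an illustrative assumption rather than a setting from the thesis experiments: in practice filmat would hold designed FIR filter coefficients, not the random stand-in used here.

```matlab
% Hypothetical FIRELM call -- all values are illustrative assumptions
sample = randn(10, 8);             % 10 sample patterns of length 8
target = randi([1 2], 10, 1);      % scalar class labels
filmat = randn(8, 20);             % stand-in for 20 designed FIR filters
% regularization ratio d/r = 1/100, 20 dB test SNR, 50 noisy trials
[firRmse, firStd, elmRmse, elmStd] = ...
    FIRELM(sample, target, filmat, 1, 100, 20, 50);
```

The returned means and standard deviations allow the FIR-ELM and the baseline ELM to be compared over the same noisy test trials.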
215
1.2 Matlab code for DFT-ELM

function [DFTELMacc] = DFTELM(trainSet, testSet, tt, ts, HT, ...
    reg_d1, reg_r1, reg_d2, reg_r2)
% Discrete Fourier Transform Extreme Learning Machine algorithm
%
% [DFTELMacc] = DFTELM(trainSet, testSet, tt, ts, HT, reg_d1, ...
%     reg_r1, reg_d2, reg_r2)
%
% Inputs:
%   trainSet  is a M1 x N sample pattern matrix with:
%             M1 sample training pattern vectors of
%             length N
%
%   testSet   is a M2 x N sample pattern matrix with:
%             M2 sample testing pattern vectors of
%             length N
%
%   tt        is the training class label matrix with:
%             M1 target label vectors (1-of-C)
%
%   ts        is the testing class label matrix with:
%             M2 target label vectors (1-of-C)
%
%   HT        is a M1 x n target frequency spectrum
%             matrix for the SLFN hidden layer outputs,
%             where n is the number of neurons
%
%   reg_d1    is the first regularization parameter
%             for the hidden layer weights (used in d1/r1)
%
%   reg_r1    is the second regularization parameter
%             for the hidden layer weights (used in d1/r1)
%
%   reg_d2    is the first regularization parameter
%             for the output layer weights (used in d2/r2)
%
%   reg_r2    is the second regularization parameter
%             for the output layer weights (used in d2/r2)
%
% Output:
%   DFTELMacc is the accuracy (%) obtained for one
%             iteration of training and testing
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and S-Y. Khoo,
%   “A robust single-hidden layer feedforward network
%   based pattern classifier,” IEEE Transactions on Neural
%   Networks and Learning Systems, 23(12), pp. 1974-1986, 2012.
%
% Email: kevinhklee@live.com.au
%
% Copyright (C) 2012 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
% http://www.gnu.org/copyleft/gpl.html

%% Set up experiment parameters
% Neural network parameters
[x,y] = size(trainSet);
[x2,y2] = size(testSet);

NumData = x;
NumInput = y;
NumNeuron = size(HT,2);

%% Train DFT-ELM
% Create A-bar (DFT) matrix
Abar = zeros(NumNeuron,NumNeuron);
for a = 0:NumNeuron-1
    for b = 0:NumNeuron-1
        Abar(a+1,b+1) = exp(-1j*((2*pi)/NumNeuron)*a*b);
    end
end

% Calculate hidden layer target matrix
HTD = HT*Abar;

% Calculate hidden layer weights
W = (1/NumNeuron)*inv((eye(NumInput,NumInput))*(reg_d1/(reg_r1*NumNeuron)) ...
    + trainSet'*trainSet)*trainSet'*HTD;

% Actual hidden layer output with new weights
G1 = trainSet*W;
G2 = abs(G1*Abar);

% Activation function -> linear
G = G2;

% Calculate output layer weights
Beta = inv((eye(size(G'*G)).*(reg_d2/reg_r2)) + G'*G)*G'*tt;

%% Testing phase
% Calculate test data output
NG1 = testSet*W;
NG2 = abs(NG1*Abar);
OTELM = NG2*Beta;

% Calculate classification accuracy
count = 0;
for i = 1:x2
    target = find(ts(i,:) == max(ts(i,:)));
    output = find(OTELM(i,:) == max(OTELM(i,:)));
    if(output == target)
        count = count + 1;
    end
end

DFTELMacc = (count/x2) * 100;

%**************************************************************
% End of Code: DFTELM.m
%**************************************************************
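A minimal call sketch for DFTELM, using only base MATLAB. All of the sizes, class counts, and regularization values below are illustrative assumptions; in particular, the all-ones HT is only a stand-in for a designed target frequency spectrum.

```matlab
% Hypothetical DFTELM call -- all values are illustrative assumptions
trainSet = randn(100, 16);                       % 100 training patterns
testSet  = randn(40, 16);                        % 40 testing patterns
trLab = randi([1 3], 100, 1);                    % 3 classes
teLab = randi([1 3], 40, 1);
tt = zeros(100, 3); tt(sub2ind(size(tt), (1:100)', trLab)) = 1;  % 1-of-C
ts = zeros(40, 3);  ts(sub2ind(size(ts), (1:40)',  teLab)) = 1;
HT = ones(100, 20);                              % stand-in target spectrum
acc = DFTELM(trainSet, testSet, tt, ts, HT, 1, 100, 1, 100);
```

The sub2ind construction builds the 1-of-C target matrices without any toolbox dependency.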
1.3 Matlab code for OWLM

function [OWLMacc] = OWLM(trainSet, testSet, tt, ts, NumNeuron, ...
    reg_d1, reg_r1, reg_d2, reg_r2)
% Optimal Weights Learning Machine algorithm
%
% [OWLMacc] = OWLM(trainSet, testSet, tt, ts, NumNeuron, ...
%     reg_d1, reg_r1, reg_d2, reg_r2)
%
% Inputs:
%   trainSet  is a M1 x N sample pattern matrix with:
%             M1 sample training pattern vectors of
%             length N
%
%   testSet   is a M2 x N sample pattern matrix with:
%             M2 sample testing pattern vectors of
%             length N
%
%   tt        is the training class label matrix with:
%             M1 target label vectors (1-of-C)
%
%   ts        is the testing class label matrix with:
%             M2 target label vectors (1-of-C)
%
%   NumNeuron is the number of hidden layer neurons set
%             for the SLFN to be trained by OWLM
%
%   reg_d1    is the first regularization parameter
%             for the hidden layer weights (used in d1/r1)
%
%   reg_r1    is the second regularization parameter
%             for the hidden layer weights (used in d1/r1)
%
%   reg_d2    is the first regularization parameter
%             for the output layer weights (used in d2/r2)
%
%   reg_r2    is the second regularization parameter
%             for the output layer weights (used in d2/r2)
%
% Output:
%   OWLMacc   is the accuracy (%) obtained for one
%             iteration of training and testing
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and S-Y. Khoo,
%   “An optimal weight learning machine for handwritten
%   digit image recognition,” Signal Processing, 93(6),
%   pp. 1624-1638, 2013.
%
% Email: kevinhklee@live.com.au
%
% Copyright (C) 2012 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
% http://www.gnu.org/copyleft/gpl.html
%% Set up experiment parameters
% Neural network parameters
[x,y] = size(trainSet);
[x2,y2] = size(testSet);

NumData = x;
NumInput = y;

%% Train OWLM
% Initialize random weights matrix
scaleW = 1;   % scaling of weights magnitude
              % eg: 1 = [1 -1], 10 = [10 -10]
W = zeros(NumInput,NumNeuron);
for j = 1:NumInput
    for i = 1:NumNeuron
        W(j,i) = 2*rand() - 1;
    end
end
W = W*scaleW;

% Initialize random biases matrix
scaleb = 1;   % scaling of bias magnitude
              % eg: 1 = [1 -1], 10 = [10 -10]
b = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    b(:,i) = 2*rand() - 1;
end
b = b*scaleb;

% random weights hidden layer output
HELM = trainSet*W + b;

% activation function -> tansig (Neural Network Toolbox)
HELM = tansig(HELM);

% calculate new hidden layer weights
CW = inv((eye(size(trainSet'*trainSet)).*(reg_d1/reg_r1)) ...
    + trainSet'*trainSet)*trainSet'*HELM;

% calculate optimized hidden layer output
Hnew = trainSet*CW;

% calculate output layer weights
Beta = inv((eye(size(Hnew'*Hnew)).*(reg_d2/reg_r2)) ...
    + Hnew'*Hnew)*Hnew'*tt;
%% Testing phase
% calculate test data output
NG1 = testSet*CW;
OTELM = NG1*Beta;

% calculate classification accuracy
count = 0;
for i = 1:x2
    target = find(ts(i,:) == max(ts(i,:)));
    output = find(OTELM(i,:) == max(OTELM(i,:)));
    if(output == target)
        count = count + 1;
    end
end

OWLMacc = (count/x2) * 100;

%**************************************************************
% End of Code: OWLM.m
%**************************************************************
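A minimal call sketch for OWLM in the same style. The sizes and parameter values are illustrative assumptions, loosely modelled on the handwritten-digit setting of the associated publication rather than taken from it.

```matlab
% Hypothetical OWLM call -- all values are illustrative assumptions
trainSet = randn(200, 64);           % e.g. vectorised 8x8 digit images
testSet  = randn(50, 64);
trLab = randi([1 10], 200, 1);       % 10 digit classes
teLab = randi([1 10], 50, 1);
tt = zeros(200, 10); tt(sub2ind(size(tt), (1:200)', trLab)) = 1;  % 1-of-C
ts = zeros(50, 10);  ts(sub2ind(size(ts), (1:50)',  teLab)) = 1;
acc = OWLM(trainSet, testSet, tt, ts, 40, 1, 100, 1, 100);
```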
List of Publications

Journal Papers
1. Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A new robust training
algorithm for a class of single hidden layer neural networks,” Neurocomputing,
vol. 74, pp. 2491-2501, 2011.
2. K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of Bioinformatics
Dataset using Finite Impulse Response Extreme Learning Machine for Cancer
Diagnosis,” Neural Computing & Applications, Available online 30 Jan. 2012,
Doi: 10.1007/s00521-012-0847-z.
3. Z. Man, K. Lee, D. H. Wang, Z. Cao, and S. Khoo, “An optimal weight learning
machine for handwritten digit image recognition,” Signal Processing, Available
online 1 Aug. 2012, Doi: 10.1016/j.sigpro.2012.07.016.
4. Z. Man, K. Lee, D. H. Wang, Z. Cao, and S. Khoo, “A Robust Single-Hidden
layer Feedforward Neural Network based Signal Classifier,” under minor
revision at IEEE Transactions on Neural Networks and Learning Systems, 2012.
Conference Papers
5. Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A modified ELM algorithm
for single-hidden layer feedforward neural networks with linear nodes,” 6th
IEEE Conference on Industrial Electronics and Applications, ICIEA 2011, pp.
2524-2529, 21-23 Jun. 2011.
6. K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of microarray datasets
using finite impulse response extreme learning machine for cancer diagnosis,”
37th Annual Conference on IEEE Industrial Electronics Society, IECON 2011,
pp. 2347-2352, 7-10 Nov. 2011.