Robust Single Hidden Layer Feedforward Neural Networks for Pattern Classification

Submitted in total fulfilment of the requirements of the degree of Doctor of Philosophy

Kevin (Hoe Kwang) Lee

Faculty of Engineering and Industrial Sciences Swinburne University of Technology

Melbourne, Australia



Abstract

The artificial neural network (ANN) has become one of the most widely researched

artificial intelligence techniques since the 1970s. The ANN is known to be capable of

providing good performance in many practical applications such as system modelling,

forecasting, classification, and regression, using only a finite number of available

samples. Its strength resides in its capability of learning complex non-linear mappings

and performing parallel processing. Given many learning paradigms developed for the

ANNs, the generalization capability and robustness are the key design criteria to ensure

the high quality of performance in practical implementations. This thesis focuses on the

development of robust supervised training algorithms for a class of ANNs known as the

single hidden layer feedforward neural networks (SLFNs).

In conventional SLFN training algorithms, the backpropagation learning method

is widely adopted to iteratively tune the weights and biases of SLFNs. The slow

learning speed of the iterative algorithms has been a major bottleneck in practice. In

addition, the backpropagation training process has a high probability of stopping at local

minima of the cost functions in weight space, because the final training outcome is

heavily dependent on user-initialized parameters.

The extreme learning machine (ELM) has recently emerged as a learning

technique for SLFNs. It has been shown that the ELM can overcome most

disadvantages of conventional training algorithms. Different from the conventional

training algorithms used for SLFNs, the ELM proposes that (i) the input weights and the

hidden layer biases can be randomly assigned without training, and (ii) the output

weights can be analytically determined using the Moore-Penrose pseudoinverse. Thus,

the ELM has prominent advantages of extremely fast learning speed, less user

intervention, and good generalization performance.


Based on the ELM, three new training algorithms are developed and verified for

SLFNs in this thesis. The first training algorithm considers the design of the input

weights of the SLFNs according to the finite-impulse-response filtering theory, such

that the outputs of the SLFN hidden layer become less sensitive to the input pattern

noise. In addition, the optimal design of the output weights of the SLFNs will also be

developed using both the regularization theory and the prediction risk minimization

theory. The proposed optimal designs of both the input and output weights will then

significantly improve the robustness of the SLFN classifiers in dealing with real-world

datasets. Specifically, the classification of bioinformatics datasets related to cancer

diagnosis will be studied in detail. The relationship between the linear separability of

the bioinformatics sample patterns and the classification performance of the above

proposed training algorithm will be investigated together with a newly developed

frequency domain feature selection method.

The second training algorithm considers the optimal design of the input weights

based on the discrete Fourier transform of feature vectors. The feature assignment

method is developed based on the pole placement theory in control systems, which

optimally determines the target features for pattern classes, such that the noise

components can be reduced, and the respective target features are maximally separated.

In the third training algorithm, unlike the one developed in the second scheme,

the desired feature vectors are defined first in the feature space, and the input weights

are then designed in the sense that, in the training phase, the feature vectors from the

hidden layer outputs of the SLFNs can be placed at the desired positions in the feature

space, specified by the desired feature vectors. The reference feature vectors, in this

algorithm, are chosen from the hidden layer outputs of the SLFN classifiers trained with

the ELM. Both the regularization theory and the prediction risk minimization theory

will be employed to balance and reduce the optimization risks when determining the

hidden layer and the output layer weights. In addition, the performance of all three

proposed training algorithms for the SLFNs will be verified using a series of sound clip

recognition and handwriting recognition experiments.


Acknowledgement

I would like to first thank my supervisors Prof. Zhihong Man and Dr. Zhenwei Cao for

their most valuable guidance and support during the past 4 years of my research that has

culminated in the production of this thesis. They have made every effort to provide

academic and financial advice to make my PhD life an interesting and rewarding one.

Especially, I would like to give my greatest gratitude to Prof. Zhihong Man for his

constant supervision of my research progress, and for making sure that I met the

requirements and deadlines throughout the course of the past 4 years. There has not

been another lecturer who is so accessible and enthusiastic in teaching me the much

needed research and writing skills that are vital to complete this thesis.

I am grateful to Swinburne University for awarding me the SUPRA scholarship

and providing me with a comfortable and conducive working environment. Special

thanks to Cathy, Melissa, and Adrianna from the research student administration and

support team; all of you have helped me settle in at my workplace and attended to my

enquiries promptly and always with a welcoming smile. I am also indebted to the senior

technical staff, Walter and Phil, who swiftly resolved any issues I had with my research

equipment and building access cards. The little things that these university staff did

made all the difference in my everyday life as a PhD candidate.

My research life would have been a little bit boring if not for the friends in my

research group. Therefore, to Fei Siang, Aiji, Sui Sin, Wang Hai, and Tuan Do, I am

very thankful to have you all as friends and fellow researchers. The countless sharing

sessions we had proved invaluable in keeping our PhD lives a little more vibrant.

Last but not least, I would like to thank my family for their unrelenting support

during my times of doubt and hardship. Without them I would not have come this far.



Declaration

This is to certify that:

1. This thesis contains no material which has been accepted for the award to the candidate of any other degree or diploma, except where due reference is made in the text of the examinable outcome.

2. To the best of the candidate’s knowledge, this thesis contains no material

previously published or written by another person except where due reference is made in the text of the examinable outcome.

3. The work is based on the joint research and publications; the relative

contributions of the respective authors are disclosed.

________________________

Kevin (Hoe Kwang) Lee, 2013



Contents

1.0 Introduction

1.1. Motivations

1.1.1. Selection of Robust Hidden Layer Weights

1.1.2. Optimal Design of Output Layer Weights

1.2. Objectives and Major Contributions

1.3. Organization of the Thesis

2.0 Literature Review

2.1. Introduction

2.2. Statistical Classifiers

2.3. Single Hidden Layer Feedforward Neural Networks

2.3.1. Gradient Descent Based Algorithms

2.3.2. Standard Optimization Method Based Algorithms

2.3.3. Least Squares Based Algorithms

2.4. Extreme Learning Machine

2.4.1. Learning Theories of ELM

2.4.2. Batch Learning ELM

2.4.3. Sequential Learning ELM

2.4.4. ELM Ensembles

2.5. Regularized ELM

2.6. ELM and SVM

2.7. ELM and RVFL

2.8. Applications of ELM

2.8.1. Medical Systems

2.8.2. Image Processing

2.8.3. Face Recognition

2.8.4. Handwritten Character Recognition

2.8.5. Sales Forecasting

2.8.6. Parameter Estimation

2.8.7. Information Systems

2.8.8. Control Systems

2.9. Conclusion

3.0 Finite Impulse Response Extreme Learning Machine

3.1. Introduction

3.2. Problem Formulation

3.3. Design of the Robust Input Weights of SLFNs

3.4. Design of the Robust Output Weight Matrix

3.5. Experiments and Results

3.6. Conclusion

4.0 Classification of Bioinformatics Datasets with FIR-ELM for Cancer Diagnosis

4.1. Introduction

4.2. Time Series Analysis of Microarrays

4.3. Linear Separability of Microarrays

4.4. Outline of the FIR-ELM

4.4.1. Basic FIR-ELM

4.4.2. Frequency Domain Gene Feature Selection

4.5. Experiments and Results

4.5.1. Biomedical Datasets

4.5.2. Experimental Settings

4.5.3. Leukemia Dataset

4.5.4. Colon Tumor Dataset

4.6. Discussions

4.6.1. Linear Separability of the Hidden Layer Output for SLFN

4.7. Conclusion

5.0 Frequency Spectrum Based Learning Machine

5.1. Introduction

5.2. Problem Formulation

5.3. Design of the Optimal Input and Output Weights

5.4. Experiments and Results

5.4.1. Example 1: Classification of Low Frequency Sound Clips

5.4.2. Example 2: Classification of Handwritten Digits

5.5. Conclusion

6.0 An Optimal Weight Learning Machine for Handwritten Digit Image Recognition

6.1. Introduction

6.2. Problem Formulation

6.3. Optimal Weight Learning Machine

6.4. Experiments and Results

6.5. Conclusion

7.0 Conclusions and Future Work

7.1. Summary of Contributions

7.2. Future Research

7.2.1. Ensemble Methods

7.2.2. Analytical Determination of Regularization Parameters

7.2.3. Analysis of the Effects of Non-linear Nodes

7.2.4. Multi-class Classification of Real-World Dataset

Bibliography

Appendix: Matlab Codes

List of Publications


List of Figures

2.1 A single hidden layer feedforward neural network

2.2 The backpropagation training algorithm

2.3 (a) Gradient descent on a convex cost function (b) Gradient descent on a non-convex cost function with randomly initialized weights

2.4 A set of handwritten digits from the MNIST database

3.1 A single hidden layer neural network with linear nodes

3.2 Output of the SLFN with the ELM

3.3 Output error of the SLFN with the ELM

3.4 Output of the SLFN with the modified ELM

3.5 Output error of the SLFN with the modified ELM

3.6 Output of the SLFN with the FIR hidden nodes

3.7 Output error of the SLFN with the FIR hidden nodes

3.8 A sound clip modulated by the envelope function

3.9 The disturbed sound clip

3.10 Signal classification with the ELM algorithm

3.11 The RMSE with the ELM algorithm

3.12 Signal classification with the modified ELM algorithm

3.13 The RMSE with the modified ELM algorithm

3.14 Signal classification with the FIR-ELM algorithm with the rectangular window

3.15 The RMSE with the FIR-ELM algorithm with the rectangular window

3.16 Signal classification with the FIR-ELM algorithm with the Kaiser window

3.17 The RMSE with the FIR-ELM algorithm with the Kaiser window

3.18 The RMSE via d/γ for the FIR-ELM algorithm

3.19 Signal classification using the SLFN classifier with the non-linear hidden nodes and trained with the FIR-ELM algorithm

3.20 The RMSE of the SLFN classifier with the non-linear hidden nodes and trained with the FIR-ELM algorithm

3.21 The non-linear sigmoid function

3.22 Signal classification using the SLFN classifier with the non-linear hidden nodes and trained with the FIR-ELM algorithm

3.23 The RMSE of the SLFN classifier with the non-linear hidden nodes and trained with the FIR-ELM algorithm

3.24 The non-linear sigmoid function

4.1 Aggregated time series for the colon dataset

4.2 The filtered and detrended time series and

4.3 Plot of residual between and

4.4 Overlaid plot of and for genes 1000 to 1040

4.5 A single hidden layer feedforward neural network with linear nodes and an input tapped delay line

4.6 Frequency response of a sample from the colon dataset

4.7 An FIR filter design search algorithm for FIR-ELM

4.8 Classification performance for leukemia with low pass filter

4.9 Classification performance for leukemia with high pass filter

4.10 Classification performance for leukemia with band pass filter

4.11 Classification performance for colon dataset with low pass filter

4.12 Classification performance for colon dataset with high pass filter

4.13 Classification performance for colon dataset with band pass filter

5.1 A single hidden layer network with linear nodes and an input delay line

5.2 A sound clip modulated by the envelope function

5.3 The disturbed sound clip with SNR of 10 dB

5.4 Classification using R-ELM with SNR of 10 dB

5.5 The RMSE using R-ELM with SNR of 10 dB

5.6 The classification using FIR-ELM with SNR of 10 dB

5.7 The RMSE using FIR-ELM with SNR of 10 dB

5.8 Classification using DFT-ELM with SNR of 10 dB

5.9 The RMSE using DFT-ELM with SNR of 10 dB

5.10 The RMSE and class error via for the DFT-ELM

5.11 The RMSE and classification error via the number of hidden nodes with the DFT-ELM

5.12 A set of handwritten digits from the MNIST database

5.13 (a) Image of digit and (b) Segmented image

5.14 (a)~(j) Encoded sample data set for images 0 to 9

5.15 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-ELM and R-ELM via the regularization parameter ratio

5.16 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-ELM and R-ELM via the number of training samples

6.1 A single hidden layer neural network with both random input weights and random hidden layer biases

6.2 Recognition of the handwritten digits by using the SLFN classifier with 10 hidden nodes trained with the ELM

6.3 Recognition of the handwritten digits by using the SLFN classifier with 50 hidden nodes trained with the ELM

6.4 Recognition of the handwritten digits by using the SLFN classifier with 100 hidden nodes trained with the ELM

6.5 Recognition of the handwritten digits by using the SLFN classifier with 10 hidden nodes trained with the R-ELM

6.6 Recognition of the handwritten digits by using the SLFN classifier with 50 hidden nodes trained with the R-ELM

6.7 Recognition of the handwritten digits by using the SLFN classifier with 100 hidden nodes trained with the R-ELM

6.8 A single hidden layer neural network with linear nodes and an input tapped delay line

6.9 A set of handwritten digits from the MNIST database

6.10 (a) Image of digit and (b) Segmented image

6.11 (a)~(c) Sample feature vectors for digit 0

6.12 Recognition of the handwritten digits by using the SLFN classifier with 10 hidden nodes trained with the OWLM

6.13 Recognition of the handwritten digits by using the SLFN classifier with 50 hidden nodes trained with the OWLM

6.14 Recognition of the handwritten digits by using the SLFN classifier with 100 hidden nodes trained with the OWLM

6.15 Classification accuracy versus regularization ratio for the MNIST dataset

6.16 Classification accuracy versus regularization ratio for the USPS dataset

6.17 Classification accuracies with OWLM, R-ELM and ELM via the number of training samples for the MNIST dataset

6.18 Classification accuracies with OWLM, R-ELM and ELM via the number of training samples for the USPS dataset


List of Tables

2.1 Research incorporating the ELM and SVM

3.1 Comparison of averaged RMSE for sound clip recognition experiment

4.1 Correlation coefficient of colon dataset binary classes with different FIR filters

4.2 Summary of leukemia and colon datasets

4.3 Linearly separable gene pairs for leukemia and colon datasets

4.4 Selection of DCT coefficients for leukemia and colon datasets

4.5 Classification performance for leukemia dataset

4.6 Confusion matrix for classification of leukemia dataset

4.7 Classification performance for colon dataset

4.8 Confusion matrix for classification of colon dataset

4.9 Linearly separable gene pairs for the hidden layer output in ELM and FIR-ELM for leukemia and colon datasets

5.1 Comparisons of the R-ELM, FIR-ELM and DFT-ELM

5.2 Classification accuracies of the handwritten digit classification with

6.1 Classification accuracies of the handwritten digit classification with the ELM, R-ELM and OWLM for the MNIST dataset

6.2 Classification accuracies of the handwritten digit classification with the ELM, R-ELM and OWLM for the USPS dataset



List of Abbreviations and Acronyms

AI – artificial intelligence

ALL – acute lymphoblastic leukemia

AML – acute myeloid leukemia

ANN – artificial neural network

AWGN – additive white Gaussian noise

BC – Bayesian classifier

BP – backpropagation

CBP – circular backpropagation

C-ELM – circular-ELM

CO-ELM – constrained-optimization based ELM

CV – cross-validation

DCT – discrete cosine transform

DFT – discrete Fourier transform

DFT-ELM – discrete Fourier transform extreme learning machine

DTFT – discrete time Fourier transform

EEG – electroencephalogram

ELM – extreme learning machine

ERM – empirical risk minimization

FGFS – frequency domain gene feature selection

FIR – finite impulse response

FIR-ELM – finite impulse response extreme learning machine

FLDA – Fisher’s linear discriminant analysis

FNR – false negative rate

FPR – false positive rate

GA – genetic algorithm

ICA – independent component analysis


LDA – linear discriminant analysis

MIMO – multiple-inputs-multiple-outputs

MLP – multilayer perceptron

MSE – mean-squared-error

OAA – one against all

OAO – one against one

OS – online sequential

OWLM – optimal weights learning machine

PCA – principal component analysis

RBF – radial basis function

RELM – regularized extreme learning machine

RMSE – root-mean-square error

ROC – receiver operating characteristic

RVFL – random vector functional link

SLFN – single hidden layer feedforward neural network

SNR – signal to noise ratio

SRM – structural risk minimization

SVD – singular value decomposition

SVM – support vector machine

TER – total error rate

TNR – true negative rate

TPR – true positive rate


Chapter 1

Introduction


Artificial intelligence (AI) techniques have become increasingly popular in

recent years. AI techniques provide scientists and engineers with alternative

methods for solving complex problems that do not have parametric solutions or for

which the parametric solutions are too difficult to express analytically. Among many AI

methods, the artificial neural network (ANN) is a versatile technique that has been widely

applied to many areas such as system modelling, economic forecasting, fault diagnosis,

bioinformatics, handwriting recognition, and information management. Its strength is

largely due to its capability of learning complex non-linear mappings and performing

parallel processing [1], [2]. Given a finite set of sample data from an unknown system,

the ANN is capable of providing a practical solution for system modelling or pattern

classification in a much shorter time frame compared to traditional analytical methods.

Although the use of ANNs has already been a great success in numerous real-world

applications, several major drawbacks have been noted in practice. For instance, a

long training time may be required for recursive learning, and the training process may

easily stop at local minima. These issues have inhibited the use of ANNs in

situations where fast online training is required, global solutions are essential,

and a large amount of data needs to be processed. Recently, Huang et al. [3] proposed a

compact training algorithm known as the extreme learning machine (ELM) for single

hidden layer feedforward neural networks (SLFNs). The ELM has been shown to

provide extremely fast training for SLFNs, while achieving a globally optimal

least-squares solution for the output weights.
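
As a minimal sketch of the ELM idea, the following Python (NumPy) fragment trains an SLFN by drawing the input weights and hidden layer biases at random, leaving them untrained, and computing the output weights with the Moore-Penrose pseudoinverse of the hidden layer output matrix. The toy two-class data, the sigmoid activation, the uniform weight range, and the names `train_elm` and `predict_elm` are illustrative assumptions and are not taken from the thesis.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    """Train an SLFN with the basic ELM recipe (illustrative sketch)."""
    n_features = X.shape[1]
    # (i) Input weights and hidden biases are drawn at random and never trained.
    W = rng.uniform(-1.0, 1.0, size=(n_features, n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = sigmoid(X @ W + b)               # hidden layer output matrix
    # (ii) Output weights: least-squares solution via the pseudoinverse of H.
    beta = np.linalg.pinv(H) @ T
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# Hypothetical toy problem: two Gaussian blobs with one-hot class targets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(100, 2)),
               rng.normal(+2.0, 1.0, size=(100, 2))])
labels = np.repeat([0, 1], 100)
T = np.eye(2)[labels]

W, b, beta = train_elm(X, T, n_hidden=20, rng=rng)
predicted = predict_elm(X, W, b, beta).argmax(axis=1)
train_accuracy = (predicted == labels).mean()
```

No iterative tuning takes place: the only fitted quantity is `beta`, which is obtained in a single linear algebra step. This is where the speed advantage over gradient-based training comes from.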


The aim of this research is to develop new robust training algorithms for

SLFNs that function as pattern classifiers with fewer hidden nodes while achieving

strong robustness and high classification accuracy. Several computer-generated

and real-world pattern classification examples and experiments are provided

to demonstrate the effectiveness of the proposed algorithms.

1.1 Motivations

It is well known that the SLFN has the capability of performing universal

approximation [4-7]. This advantage makes SLFNs highly adaptable to various

environments where the processes may have uncertainties and complex dynamics, and

the collected data may be degraded by disturbances and noise. However, many

existing training algorithms developed for the SLFN are based on the conventional

backpropagation (BP) [8]. Such recursive BP-based algorithms suffer from

long training times, slow convergence, and stopping at local minima during training,

as mentioned in the previous section.
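
The dependence of gradient-based training on initialization can be illustrated with a small sketch. The one-dimensional cost function below is a hypothetical stand-in for a non-convex training cost, chosen only because it has one global and one local minimum; it is not a cost function from the thesis, and the step size and iteration count are arbitrary.

```python
def cost(w):
    # Hypothetical non-convex cost with two minima (not from the thesis).
    return w**4 - 3.0 * w**2 + w

def cost_gradient(w):
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w0, learning_rate=0.01, n_steps=500):
    """Plain gradient descent from the initial weight w0."""
    w = w0
    for _ in range(n_steps):
        w -= learning_rate * cost_gradient(w)
    return w

w_from_left = gradient_descent(-2.0)   # settles near w = -1.30 (global minimum)
w_from_right = gradient_descent(2.0)   # settles near w = +1.13 (local minimum)
```

Both runs use the same update rule and step size; only the starting point differs, yet the second run ends at a minimum with a higher cost.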

Although the recently proposed ELM [3] has provided some partial solutions to

the issues mentioned above, there is still an urgent need to focus on the development of

robust neural network pattern classifiers to deal with real-world data that are highly

disturbed or have non-linearly separable patterns. Therefore the research performed

herein intends to cover this important area by building upon the compact concept

introduced in the ELM. Some important issues that are revealed by our recent study on

the ELM [9-12] include:

• Non-optimal assignment of hidden layer weights for the SLFN.

• Lack of a holistic prediction risk minimization strategy in the design of the SLFN weights and biases.

1.1.1 Selection of Robust Hidden Layer Weights

Zhu et al. [37] were amongst the first to state that the randomly selected weights

and biases in the hidden layer of the ELM are not optimal, in the sense that the

projections provided by the random weights generate stochastic results for each trial. It

is now generally accepted that the ELM requires repeated trials to obtain reliable

results. Since then, many modified algorithms have been proposed to initialize the

hidden layer weights and biases of the SLFN being trained with the ELM algorithm.

The attempts in [33-38] provided some improvements over the random selection

method either in terms of accuracy achieved, or condition of the network matrices.

However, these methods approach the problem of weight selection from a

mathematical and biological computing point of view, which is generally a global

optimization problem with respect to the provided training samples. As the hidden layer

of the SLFN is often seen as the projection layer, which is responsible for mapping the input

space into the feature space, the selection of the weights should take into consideration

the robustness of the outputs generated as well. From an engineering point of view, the

robustness refers to the consistent performance of the SLFN given a wide range of

unseen samples that are noisy. Therefore this thesis focuses on developing robust

hidden layer weights assignment strategies.

1.1.2 Optimal Design of Output Layer Weights

In the ELM training algorithm, the pseudo-inverse is a least squares solution for the

output layer weights of the SLFN. It is also known as an empirical risk minimization

(ERM) operation. The ERM works well for learning algorithms with a finite number of

training samples when the sample set size is large enough such that the sample points

are dense in the pattern space. However, the problem becomes complex when the

number of dimensions involved is large due to the curse of dimensionality. The curse of

dimensionality states that as the number of dimensions of a sample pattern increases, it

becomes exponentially harder to approximate the underlying function. This is because,

as the number of dimensions increases, the sampled patterns have a high probability of

being far apart [83]. When learning is performed on such sample points which are far

apart, the interpolation between the points becomes hard to determine because there

may be more than one solution based on the ERM principle.
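
The sparsity effect described above can be made concrete with a short numerical sketch. Assuming samples drawn uniformly from the unit hypercube (an illustrative choice, not a dataset from the thesis), the fragment below measures the average distance from each sample to its nearest neighbour for a fixed sample size and an increasing number of dimensions. The function name `mean_nearest_neighbour_distance` is hypothetical.

```python
import numpy as np

def mean_nearest_neighbour_distance(n_samples, n_dims, rng):
    """Average distance from each sample to its closest other sample."""
    points = rng.uniform(0.0, 1.0, size=(n_samples, n_dims))
    # Pairwise squared distances: |a - b|^2 = |a|^2 + |b|^2 - 2 a.b
    sq_norms = (points ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (points @ points.T)
    np.fill_diagonal(sq_dists, np.inf)                    # ignore self-distances
    nearest_sq = np.maximum(sq_dists.min(axis=1), 0.0)    # guard against rounding
    return np.sqrt(nearest_sq).mean()

rng = np.random.default_rng(0)
distances = {d: mean_nearest_neighbour_distance(200, d, rng) for d in (2, 10, 100)}
```

With the number of samples held at 200, the nearest neighbour of a typical point moves further away as the dimension grows, so the same sample budget covers the pattern space ever more thinly.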

In order to find an optimal solution to the ill-posed problem stated above,

some constraints need to be introduced in the learning process such that the


interpolation between the sample points is smooth. In the absence of large sample

datasets that are dense in pattern space, the alternative is to introduce a penalty term in

the cost function of the output weights. Two of the commonly used cost functions with

penalties are given in (1.1.1) and (1.1.2).

$$J_1 = \|\varepsilon\|^2 + \lambda \|\beta\|_1 \qquad (1.1.1)$$

$$J_2 = \|\varepsilon\|^2 + \lambda \|\beta\|_2^2 \qquad (1.1.2)$$

where they are known as the L1-norm regularization and the L2-norm regularization (Tikhonov

regularization [84], [85]), respectively. The regularization constant $\lambda$ acts to scale the

importance of the regularization term $\|\beta\|$, which is the magnitude of the SLFN output

layer weights $\beta$, with respect to the output error $\varepsilon$. The purpose of using the magnitude of

the output layer weights as a penalty is to ensure that solutions of the interpolation

function will have small magnitude, which restricts the optimization of the cost

function to smoother solutions. This characteristic is also known as the structural risk

minimization (SRM) principle discussed by Vapnik in [86]. In the context of training

the SLFN, the prediction risk can be defined as the sum of the approximation risk and

estimation risk, where risk is embodied by error.

$$R_{\text{prediction}} = R_{\text{approximation}} + R_{\text{estimation}} \qquad (1.1.3)$$

The approximation risk represents the closeness of fit with respect to the sample

targets, while the estimation risk represents the generalization ability over unseen

samples. The regularization constant $\lambda$ controls the emphasis of the optimization process

with a trade-off between achieving the SRM, which increases the robustness and

generalization capability, and the ERM, which reduces the classifier error based on the

training sets. If the focus is placed on the ERM, it is easy to over-fit the data and hence

degrade the final testing performance, while placing heavier emphasis on the SRM may

reduce the size of the feature space significantly and distort the outputs; this issue is

more formally known as the bias-variance dilemma [83], [87]. Selecting the optimal

value of the regularization constant $\lambda$ is usually done using cross-validation in practical applications.
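
A minimal sketch of the L2-norm (Tikhonov) case (1.1.2) is given below, assuming the cost takes the common form of the squared output error plus the regularization constant times the squared norm of the output weights. Under that assumption the output weights have the closed form (HᵀH + λI)⁻¹HᵀT, where H is the hidden layer output matrix and T the target matrix. The data are synthetic, the names `ridge_output_weights` and `select_lambda` are hypothetical, and a single hold-out split stands in for full cross-validation.

```python
import numpy as np

def ridge_output_weights(H, T, lam):
    """Minimise ||H @ beta - T||^2 + lam * ||beta||^2 in closed form."""
    n_hidden = H.shape[1]
    return np.linalg.solve(H.T @ H + lam * np.eye(n_hidden), H.T @ T)

def select_lambda(H_train, T_train, H_val, T_val, candidates):
    """Return the candidate with the lowest validation mean squared error."""
    errors = []
    for lam in candidates:
        beta = ridge_output_weights(H_train, T_train, lam)
        errors.append(float(np.mean((H_val @ beta - T_val) ** 2)))
    return candidates[int(np.argmin(errors))], errors

# Synthetic stand-in for a hidden layer output matrix and noisy targets.
rng = np.random.default_rng(0)
H = rng.normal(size=(60, 40))
true_beta = rng.normal(size=(40, 1))
T = H @ true_beta + rng.normal(scale=2.0, size=(60, 1))

H_train, T_train = H[:45], T[:45]      # samples used to fit the output weights
H_val, T_val = H[45:], T[45:]          # held-out samples used to compare lambdas
candidates = [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]
best_lam, val_errors = select_lambda(H_train, T_train, H_val, T_val, candidates)

beta_small = ridge_output_weights(H_train, T_train, 1e-3)
beta_large = ridge_output_weights(H_train, T_train, 100.0)
```

A larger regularization constant shrinks the output weights (`beta_large` has a smaller norm than `beta_small`), trading closeness of fit on the training samples for a smoother solution.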


Three major contributions to improve the robustness of the ELM training

algorithm using regularization methods were proposed respectively by Toh [88], Deng

et al. [89], and Huang et al. [90] recently. The authors recognized that the ELM

algorithm is susceptible to perturbations and noise in the input sample sets and

introduced variants of the regularization method discussed above. However, an open

question still revolves around finding a deterministic method to estimate the optimal

value of the regularization constant $\lambda$. In this thesis, the formulation of the optimization

functions will be reconsidered to better understand the effects of regularization in

reducing the prediction risks.

All the issues above need to be addressed appropriately in the design of the new

training algorithms in this thesis.

1.2 Objectives and Major Contributions

The aim of this thesis is to develop a new breed of robust and optimized training

algorithms for the SLFN pattern classifiers with new strategies of optimizing both the

input and the output weights of the SLFNs that are capable of processing a wide variety

of real-world data with varying noise profiles and non-linear separability.

The major contributions of the thesis are outlined as follows:

• Develop robust input weights assignment strategies for the SLFN pattern classifiers that are less sensitive to noises and undesired features.

• Optimally design the output weights of the SLFN pattern classifier such that the sensitivity of the SLFN outputs can be significantly reduced with respect to changes at the hidden layer outputs.

• Develop a new optimal weights design for the ELM pattern classifier by minimizing the total prediction risk of the SLFNs.

• Develop and test the practical implementation of the robust pattern classifiers designed in this thesis using real-world bioinformatics datasets with application to cancer diagnosis.


In summary, the research performed in this thesis develops and implements a

group of robust neural network based pattern classifiers that are designed to handle

real-world datasets degraded by a wide range of noise and disturbances. The developed

pattern classifiers have the potential to find a vast range of applications, especially in

the fields of bioinformatics, data mining, telecommunications, and signal and image

processing.


1.3 Organization of the Thesis

This thesis explores the design of new training algorithms for SLFN pattern classifiers

that are strongly robust with respect to noise and undesired features, and

improve the classification accuracy in the real-world applications. The rest of the thesis

is organized as follows:

Chapter 2 begins with a brief overview of conventional pattern classifiers with a focus

on SLFNs, and the conventional training algorithms developed for SLFNs. A

comprehensive survey of ELM with emphasis on the latest technical developments and

applications is then presented to provide the background for the following main chapters of

this thesis.

Chapter 3 proposes a new finite impulse response extreme learning machine (FIR-ELM).

The new algorithm is characterized by an input weight assignment strategy

based on the finite impulse response digital filtering concept, which gives the SLFN

classifiers the capability of eliminating the effects of noise and reducing the

sensitivity of the hidden layer outputs. The optimal output layer weights are then

designed to balance and reduce both the empirical and structural prediction risks. The

resulting training algorithm is then verified by classifying noisy audio clips in the

simulation example.
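
The filtering idea behind this chapter can be sketched as follows. The fragment designs a low-pass FIR filter with the windowed-sinc method and a Kaiser window, then treats the filter taps as the input weights of a single linear hidden node fed from a tapped delay line. This is only an illustration of FIR filtering under assumed settings (21 taps, a cutoff of 0.1 cycles per sample, a Kaiser shape parameter of 5, synthetic sinusoids); it is not the weight assignment rule of the FIR-ELM itself, and the names `lowpass_fir` and `gain` are hypothetical.

```python
import numpy as np

def lowpass_fir(n_taps, cutoff, beta=5.0):
    """Windowed-sinc low-pass FIR design; cutoff in cycles/sample (0 to 0.5)."""
    n = np.arange(n_taps) - (n_taps - 1) / 2.0
    ideal = 2.0 * cutoff * np.sinc(2.0 * cutoff * n)   # ideal low-pass impulse response
    taps = ideal * np.kaiser(n_taps, beta)              # taper with a Kaiser window
    return taps / taps.sum()                            # normalise to unit gain at DC

def gain(taps, freq):
    """Magnitude of the frequency response at freq (cycles/sample)."""
    k = np.arange(len(taps))
    return abs(np.sum(taps * np.exp(-2j * np.pi * freq * k)))

taps = lowpass_fir(n_taps=21, cutoff=0.1)

# Hypothetical linear hidden node: its input weights are the filter taps,
# applied to a 21-sample window taken from a tapped delay line.
t = np.arange(21)
clean = np.sin(2 * np.pi * 0.02 * t)                # slow component (passband)
disturbance = 0.5 * np.sin(2 * np.pi * 0.4 * t)     # fast component (stopband)
node_output_clean = taps @ clean
node_output_noisy = taps @ (clean + disturbance)
```

Because the disturbance lies in the stopband, the node output changes very little when it is added to the input, which is the sense in which filter-shaped input weights make the hidden layer outputs less sensitive to input noise.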

Chapter 4 applies the FIR-ELM developed in Chapter 3 to the classification of

bioinformatics datasets for cancer diagnosis. Considering the complexity of the gene

microarray samples, a frequency domain feature selection algorithm is first developed to

determine the filtering technique used for designing the input weights. The FIR-ELM is


then implemented to design both the input and output weights. In the simulation

example, the effectiveness of the FIR-ELM for the classification of the gene microarray

samples is confirmed with high classification accuracy by comparison with a few

existing neural classification algorithms.

Chapter 5 develops a new frequency spectrum based learning machine for the SLFNs

with the discrete Fourier transform (DFT) theory. In order to minimize the effects of the

input pattern disturbances and maximally separate patterns in the feature space, the

frequency spectra of the input patterns are analysed for defining the desired feature

vectors in the frequency-domain. The input weights are then optimized to ensure that, in

the frequency-domain, the DFT of the feature vectors can be placed at their desired

positions. In addition, the output layer weights are optimally designed to balance and

reduce the empirical and structural prediction risks. The feasibility and good

performance of the new algorithm are further confirmed in the simulation examples with

both linearly separable and non-linearly separable data.

Chapter 6 investigates an optimal weight learning machine (OWLM) for handwritten

digit image recognition. Unlike the input weight learning machine developed in the

frequency-domain in Chapter 5, a strategy of using the feature vectors of an ELM based

SLFN classifier as the reference features is first formulated. The input weights are then

optimized in the sense that the real feature vectors from the SLFN’s hidden layer

outputs can be placed at the locations specified by the reference feature vectors in the

feature space. It will be seen in the simulation examples that the OWLM behaves with a

strong robustness and achieves a high accuracy for the handwritten digit image

recognition of both the MNIST and USPS handwriting datasets.

Chapter 7 summarises the main contributions of this thesis and presents a few open

questions for future work.

The author’s publications based on this thesis’ research are given at the end of

this thesis. In addition, the Matlab codes for each of the new learning algorithms

developed in this thesis are provided in the Appendix.



Chapter 2

Literature Review

Neural networks based pattern classifiers have been applied in many areas of

engineering and science due to their ease of use and good generalization performance.

Conventional training algorithms for neural networks, such as the backpropagation (BP)

method, are known to face the issues of tedious training time, stopping at local minima,

having many meta-parameters to tune, and having high variance (unpredictable)

solutions. Recently, Huang et al. proposed the extreme learning machine (ELM) for

single hidden layer feedforward neural networks, which is able to learn thousands of

times faster than the BP and even the support vector machine (SVM). The learning

process of the ELM involves only two main steps: (1) Compute the hidden layer outputs

with randomly generated input weights and biases; (2) Compute the output weights

using pseudo-inverse of the data correlation matrix from the hidden layer. The

classification performance of the ELM has been shown to be comparable to or better than

that of the SVM, especially in the multi-class pattern classification problems.

Research into the optimization frameworks of both the ELM and the LS-SVM has

further proven that the ELM generates the global optimal solutions compared to the LS-

SVM, and in fact, the ELM is closely related to a simplified LS-SVM optimization with

milder constraints. Most importantly, the ELM method represents a unified solution for

binary and multi-class pattern classification problems. This chapter provides a review of

the ELM and its extensions, including some issues on the ELM, and the applications of

the ELM.


2.1 Introduction

Pattern classification problems have been receiving a lot of attention from researchers in

diverse fields of study for more than half a century [178]. The main goal of the

research in this area is to develop a set of rules or a discriminating system to accurately

identify patterns and allocate them to their appropriate classes. The earliest form of

pattern classification is based on the concept of template matching [178]. It requires a

database of all possible outcomes for any new samples to be compared against in order

to determine the best match. This basic method is still used today in the form of look-up

tables for information retrieval in established and low variability situations. The main

flaw of the template matching is its susceptibility to noise and transformations within

the input patterns. In order to build more reliable pattern classifiers, intensive research

has been done to build robust classification models that make use of the meta-data of

sample datasets. Essentially, pattern classifiers are categorized as either parametric or

non-parametric models. Parametric classifiers require the statistical information of the

sample pattern variables to build the decision function, while the non-parametric

classifiers learn directly from the sample patterns to build empirical discriminators. The

discussions in this chapter are focused on the development in the field of supervised

non-parametric pattern classifiers only, specifically, the neural networks based pattern

classifiers.


Neural networks have been used extensively in pattern classification problems

ever since the 1970s. A neural network consists of highly parallel connections between

layers of nodes that function as the localized processing units. Its massive parallel

architecture allows it to learn complex nonlinear mappings that are not possible to be

learnt using linear modelling methods. However, the main reason why neural networks

have gained a lot of interest from researchers is their simple implementation in real world

situations. Unlike the conventional statistical decision theory based classifiers, neural

networks belong to a class of non-parametric classifiers that learn from samples of a

system only without needing to know the statistical model of the pattern variables. After

training a neural network using some learning algorithm and a sample dataset, the

neural network is able to classify new unseen samples with good generalization and

accuracy. It was said in [179] that neural networks hide the statistics of the decision


making theory from the user and hence allow them to be appreciated by a wide audience.

There exist many modern applications of neural networks based pattern classifiers that

were triggered by the increasing availability of large scale datasets and the need to make

sense of them. Successfully implemented applications include medical systems [66-70],

[158-162], image processing [163-165], face recognition [166], [167], handwriting

recognition [71], [72], sales forecasting [168-170], parameter estimation [171-173],

information systems [76], [174], [175], and control systems [176], [177].

Recently, a novel machine learning algorithm called the extreme learning

machine (ELM) was proposed by Huang et al. [3]. The ELM is a fast and highly

effective learning algorithm designed for single hidden layer feedforward neural

networks (SLFN). The uniqueness of the ELM lies in the random selection of hidden

layer weights that do not require any tuning, and the analytical solution of the output

layer weights using the pseudo-inverse. Obviously, the training procedures of the ELM

differ significantly from the conventional iterative methods used to train neural

networks such as the backpropagation (BP) method. Many researchers [3], [18], [23-25],

[90], [147] have reported that the ELM is capable of learning thousands of times faster

than the BP methods and tends to perform better in terms of classification accuracy and

generalization on unseen samples. Some researchers have also compared the ELM to

the popular support vector machine (SVM) commonly used for classification problems.

It was found in [180] that the ELM is capable of learning much faster than the SVM,

and the generalization performance of the ELM is comparable to or better than that of the SVM

when the sample size is large. In terms of computational costs, the ELM is seen to be

the clear choice for handling large datasets when compared to the SVM [147], [180].

Furthermore, there are also an increasing number of studies done to compare the

learning theories of ELM and SVM. In [147], it was reported that the LS-SVM

optimization framework is related to the ELM and the LS-SVM optimization solutions

are actually suboptimal compared to the ELM due to the LS-SVM’s stricter

optimization conditions. The ELM has been proven multiple times to perform better

than the SVM in multi-class pattern classification [3], [147] and is recognized as the

state-of-the-art by many. A large number of researchers have proposed modified

versions of the ELM training algorithm to either further improve its performance or

customize the learning process for specific applications [23], [24], [33-65]. From the


fast growing literature on the ELM, it can be easily seen that the ELM has been

successfully implemented in a wide range of real world classification problems [66-72],

[76], [158-177].

The aim of this chapter is to provide a review of the development of ELM and

its extensions in the field of pattern classification, as well as to present some discussions

on the issues encountered within the ELM literature. The rest of this chapter is

organized as follows: Section 2.2 gives a brief overview of statistical classifiers. Section

2.3 introduces the SLFN and the conventional neural network training algorithms.

Section 2.4 provides a thorough survey of the ELM and its learning theories. Section 2.5

examines the regularized versions of the ELM. Section 2.6 and Section 2.7 analyse the

relationship between the ELM and the SVM and RVFL respectively. Section 2.8

surveys the applications of ELM in various fields, with a focus on medical pattern

classification problems. Lastly, Section 2.9 gives the conclusions.

2.2 Statistical Classifiers

The objective of the research in statistical classifiers is to construct the optimal classifier

given the accurate statistical model of the variables that make up the sample input

pattern. Direct analytical solutions for such optimal classifiers are obtainable when the

full statistical parameters are available. What is more, the statistical classifiers allow for

the generalization of new unseen samples, which is a highly desirable characteristic. For

this reason, many early applications implement statistical classifiers for complex cases

where it is impossible to catalogue all possible pattern mappings. Two commonly

known statistical classifiers are first introduced in this section, then the relationship

between the statistical classifiers and neural networks is discussed in brief to explain

why neural networks can achieve good performance as pattern classifiers.

One of the basic statistical classification algorithms is known as the Fisher’s

linear discriminant analysis (FLDA) [181]. The FLDA projects high dimensional

sample pattern vectors into a reduced dimensional space using linear transformations. It

is desired that the linear transformations chosen will minimize the intra-class separation

and maximize the inter-class separation, thereby creating a well clustered feature space


with multiple dis-joint regions for classification. The linear transformation matrix is

calculated using the first and second order statistics of the classes. In multi-class cases,

the FLDA is generally known as the linear discriminant analysis (LDA). A great

limitation of the LDA is that it only works well for linearly separable cases, where the

linear transformation creates well separated sample distributions after projection. Hence

the LDA is often used as a dimension reduction tool instead before applying other

nonlinear pattern classifiers such as the SVM [182].
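The two-class projection described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming NumPy and synthetic Gaussian data; the function name `fisher_direction` and all numerical values are hypothetical and are not taken from [181].

```python
import numpy as np

def fisher_direction(X1, X2):
    """Return the unit-length Fisher discriminant direction for two classes."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: sum of the two per-class scatter matrices.
    Sw = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    w = np.linalg.solve(Sw, m1 - m2)
    return w / np.linalg.norm(w)

# Synthetic, illustrative data: two correlated Gaussian classes.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
X1 = rng.multivariate_normal([0.0, 0.0], cov, size=200)
X2 = rng.multivariate_normal([2.0, 0.0], cov, size=200)

w = fisher_direction(X1, X2)
z1, z2 = X1 @ w, X2 @ w                      # one-dimensional projections
threshold = 0.5 * (z1.mean() + z2.mean())    # midpoint of the projected means
# w points from class 2 towards class 1, so class 1 projects above the threshold.
accuracy = 0.5 * ((z1 > threshold).mean() + (z2 <= threshold).mean())
```

The direction is computed from the class means and the within-class scatter only, which is why the method depends on the first and second order statistics of the classes.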

Another well-known statistical classification algorithm is the Bayesian Classifier

(BC) [103]. Different from the LDA which only uses the prior probabilities of sample

variables, the BC uses the a posteriori probabilities to arrive at a classification decision.

Given the prior probabilities $P(C_1)$ and $P(C_2)$ and the conditional probability density functions $p(\mathbf{x} \mid C_1)$ and $p(\mathbf{x} \mid C_2)$ in a binary case, the a posteriori probabilities (2.2.1) and (2.2.2) of a new sample $\mathbf{x}$ are interpreted as the likelihood that the sample belongs to a specific class ($C_1$ or $C_2$) after the sample is collected.

$$P(C_1 \mid \mathbf{x}) = p(\mathbf{x} \mid C_1)\,P(C_1) / p(\mathbf{x}) \tag{2.2.1}$$

$$P(C_2 \mid \mathbf{x}) = p(\mathbf{x} \mid C_2)\,P(C_2) / p(\mathbf{x}) \tag{2.2.2}$$

The maximum likelihood rule in the Bayesian theory states that we should choose the class that yields the largest a posteriori probability as the output class [103].

$$\hat{C} = \arg\max_{C_k \in \{C_1, C_2\}} P(C_k \mid \mathbf{x}) \tag{2.2.3}$$

It is worth noting that the BC produces an inference from the available statistics

only, without any transformation of the sample patterns. Therefore its applications are

limited to finding the optimal threshold or decision boundary in the feature space.
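As a small worked example of the Bayesian decision described above, the following Python sketch evaluates the a posteriori probabilities for a binary problem. It assumes univariate Gaussian class-conditional densities; the priors, means and standard deviations are illustrative values chosen here, not results from [103].

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a univariate normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def posteriors(x, priors, means, stds):
    """A posteriori probabilities of each class given x, via Bayes' rule."""
    joint = [p * gaussian_pdf(x, m, s) for p, m, s in zip(priors, means, stds)]
    evidence = sum(joint)            # p(x), the normalising term
    return [j / evidence for j in joint]

# Hypothetical two-class problem (all numbers are illustrative).
priors = [0.7, 0.3]
means = [0.0, 2.0]
stds = [1.0, 1.0]

post = posteriors(1.0, priors, means, stds)
# Decision: choose the class with the largest a posteriori probability.
predicted_class = max(range(len(post)), key=post.__getitem__)
```

At x = 1.0 the sample is equally far from both class means, so the two likelihoods are equal and the a posteriori probabilities reduce to the priors (0.7 and 0.3); the first class is therefore chosen.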

Both the LDA and the BC require that the statistical parameters and probability

distribution functions related to the sample variable be available and accurate. In

practical applications, these parameters are usually not known and have to be estimated

from the sample dataset. Considering the effects of heteroscedasticity in any random

sample set, it is obvious that the sample statistics may not estimate the population


statistics well. Therefore, it is hard to implement accurate statistical classifiers in real

world applications.

Neural networks are known as one of the most popular non-parametric

classifiers. However, it has long been established that the neural classifiers and

statistical classifiers share many similarities, and the link between neural classifiers and

statistical classifiers has been thoroughly investigated by researchers [145]. The simple

illustration of this link by Richard and Lippmann [183] provided great insight into the

black-box like operation of neural classifiers. It was described in [145], [183] that a neural classifier can be viewed as a mapping function $F(\mathbf{x}): \mathbb{R}^d \to \mathbb{R}^M$, where a $d$-dimensional input $\mathbf{x}$ is mapped to an $M$-dimensional output $\mathbf{y}$. Typical neural network training algorithms such as the BP perform least squares optimization to minimize the mean-squared-error (MSE). Based on the statistical least squares estimation theory, the mapping function $F(\mathbf{x})$ which minimizes the expected squared error $E[\lVert \mathbf{y} - F(\mathbf{x}) \rVert^2]$ is actually the conditional expectation of $\mathbf{y}$ given $\mathbf{x}$, hence $F(\mathbf{x}) = E[\mathbf{y} \mid \mathbf{x}]$. If the classification targets are set as the one-of-C format (one element is unity, the other elements are zero), the following proof [183] is then valid for the case when the output belongs to the $k$-th class:

$$\begin{aligned}
F_k(\mathbf{x}) &= E[y_k \mid \mathbf{x}] \\
&= 1 \cdot P(y_k = 1 \mid \mathbf{x}) + 0 \cdot P(y_k = 0 \mid \mathbf{x}) \\
&= P(y_k = 1 \mid \mathbf{x}) \\
&= P(C_k \mid \mathbf{x})
\end{aligned} \tag{2.2.4}$$

It can be seen that the least squares estimate of the mapping function is exactly

the a posteriori probability. This significant finding proves that the neural classifier

outputs are actually estimates of the a posteriori probabilities defined in statistical

classifiers. The reason why the neural networks based methods do not directly give

the a posteriori probability as in (2.2.1) is that most neural network training

algorithms are not optimal in the least squares sense. Training issues such as

convergence to local minima and limited sample size restrict the approximation

capabilities of the neural network. In order to obtain the best estimate, the neural


network is said to require a large network architecture and unlimited samples. When

such conditions are met, the sum of the neural classifier outputs tends to approach unity.
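The link between least squares fitting and the a posteriori probabilities can be checked numerically in an idealised setting. The sketch below is not a neural network: it assumes a discrete input and a lookup-table model with one free output vector per input value, which is flexible enough to reach the least squares optimum exactly. All data and probabilities are synthetic and illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_values, n_classes = 3000, 3, 2

# Discrete input x in {0, 1, 2}; the class label depends on x at random.
x = rng.integers(0, n_values, size=n)
p_class1_given_x = np.array([0.1, 0.5, 0.9])     # illustrative values
labels = (rng.random(n) < p_class1_given_x[x]).astype(int)

# One-of-C targets and an indicator (lookup-table) input representation.
T = np.eye(n_classes)[labels]
Phi = np.eye(n_values)[x]

# Least squares fit: one free output vector per input value.
F, *_ = np.linalg.lstsq(Phi, T, rcond=None)

# Empirical posterior: class frequencies within each input value.
counts = Phi.T @ T
empirical_posterior = counts / counts.sum(axis=1, keepdims=True)
```

In this setting the fitted outputs `F` coincide with the empirical class frequencies for each input value, and each row of `F` sums to one, which mirrors the behaviour expected of a sufficiently large network trained on enough samples.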

In a conventional multilayer perceptron (MLP) based neural classifier, the hidden

layers act as the feature mapping stage while the output layer acts as the feature space

decision function. The statistical equivalent of the neural classifier can be seen as a

combination of the LDA and BC. The hidden layers of the MLP are found to transform

sample patterns into clusters [184], which is similar to the function of the LDA.

However, the feature mapping capabilities of the LDA are limited compared to the

nonlinear mappings of the MLP hidden layers which are capable of nonlinear

discriminant analysis. The superior feature mapping capability of the neural classifiers

helps to explain why they consistently perform better than linear discriminant methods.

The output layer of the MLP, which is conventionally trained by BP, is then the least

squares estimate to the optimal decision boundary in BC. Nevertheless, the limited

number of samples, the existence of noise and outliers, the lack of a large enough

architecture, and the randomness of conventional BP based neural classifiers tend to

produce results with high variance compared to the optimal BC decision boundary.

The ELM utilizes the SLFN which is a specific type of MLP with only one

hidden layer. Instead of using the BP method which produces varying models

depending on the initialization of the SLFN, the ELM uses the pseudo-inverse to

analytically determine the output layer weights. As a result of that, the output layer

weights of the ELM tend to achieve the smallest training error and the smallest norm.

Therefore it can be said that the ELM provides a better and more stable estimate of the a

posteriori probability compared to the BP in terms of optimization strategy with all

other conditions being similar.

2.3 Single Hidden Layer Feedforward Neural Networks

Neural networks belong to a class of non-parametric machine learning paradigms that are

capable of both classification and regression. One of the most researched neural

networks based pattern classifiers is the SLFN. The SLFN is the main focus of this

research, where the analysis and discussions concentrate on feedforward neural


networks that allow the flow of information in only one direction (left to right) in the

network. A SLFN with 4 input nodes $\{x_1, x_2, x_3, x_4\}$, 1 hidden layer with 4 neurons $\{h_1, h_2, h_3, h_4\}$, and 1 output neuron $\{o_1\}$ connected to the output node $\{y\}$ is shown in Figure 2.1; the bias term is omitted here for clarity. The hidden layer weights are represented by $\mathbf{W}$ and the output layer weights are represented by $\boldsymbol{\beta}$.

Figure 2.1 A single hidden layer feedforward neural network

The SLFN in Figure 2.1 can be modelled by the equation in (2.3.1),

$$y(\mathbf{x}) = \boldsymbol{\beta}^T g(\mathbf{W}^T \mathbf{x}) \tag{2.3.1}$$

where $\mathbf{x} = [x_1 \; x_2 \; x_3 \; x_4]^T$, $\mathbf{W} \in \mathbb{R}^{4 \times \tilde{N}}$ is the hidden layer weights matrix, $\boldsymbol{\beta} \in \mathbb{R}^{\tilde{N} \times 1}$ is the output layer weights matrix, $\tilde{N}$ is the number of neurons (here $\tilde{N} = 4$), and $g(\cdot)$ is the activation function of each hidden layer neuron. The hidden layer activation functions are usually selected as sigmoids and the same activation function is assumed for every neuron in a layer. The output activation functions are normally linear. For the case where the bias is defined explicitly, the equation (2.3.2) models the same SLFN, where $\mathbf{b} \in \mathbb{R}^{\tilde{N} \times 1}$ is the bias matrix. In the discussions of the later chapters, the SLFN representation in (2.3.1) is normally used, unless specifically noted.

$$y(\mathbf{x}) = \boldsymbol{\beta}^T g(\mathbf{W}^T \mathbf{x} + \mathbf{b}) \tag{2.3.2}$$
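The forward computation of such a network can be sketched as follows. This is a minimal illustration assuming NumPy, sigmoid hidden neurons and a linear output layer; the function name `slfn_forward`, the array shapes and the symbols `W`, `beta` and `b` are assumptions made for this example, with the hidden layer weights stored one column per hidden neuron.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def slfn_forward(x, W, beta, b=None):
    """Output of an SLFN with sigmoid hidden neurons and a linear output layer.

    x:    input pattern, shape (d,)
    W:    hidden layer weights, shape (d, n_hidden)
    beta: output layer weights, shape (n_hidden, n_outputs)
    b:    optional hidden layer biases, shape (n_hidden,)
    """
    a = W.T @ x                    # weighted sums entering the hidden neurons
    if b is not None:
        a = a + b                  # explicit bias form
    return beta.T @ sigmoid(a)     # linear combination of hidden outputs

rng = np.random.default_rng(0)
d, n_hidden = 4, 4                 # 4 input nodes and 4 hidden neurons
W = rng.standard_normal((d, n_hidden))
beta = rng.standard_normal((n_hidden, 1))
x = np.array([0.5, -1.0, 0.25, 2.0])

y_no_bias = slfn_forward(x, W, beta)
y_zero_bias = slfn_forward(x, W, beta, b=np.zeros(n_hidden))
```

With a zero bias vector the two forms give the same output, which is why the bias-free representation can be used without loss of clarity in the discussions.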

From the pattern classification point of view, the SLFN output layer usually

consists of as many neurons as the available classes (i.e. $\mathbf{y} = [y_1 \; \cdots \; y_C]$, $C$ = number of

classes), where each neuron represents the “active” or “inactive” state of a class. The

corresponding training target for each sample input pattern is usually generated from the

1 of C template, which sets a 1 for the target class (output neuron) and leaves the others

as 0. With this setup, the discriminant function at the output of the neural classifier

needs to only choose the neuron with the largest output value as the predicted output

class. The scalable size of the output vector allows the neural network to be trained for

both binary and multi-class pattern classification problems using the same learning algorithm.
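The target encoding and the decision rule described above can be written compactly. The following Python sketch assumes integer class labels starting at 0; the function names and the example network outputs are hypothetical.

```python
import numpy as np

def one_of_c(labels, n_classes):
    """Encode integer class labels as 1-of-C target rows."""
    targets = np.zeros((len(labels), n_classes))
    targets[np.arange(len(labels)), labels] = 1.0
    return targets

def predict_class(outputs):
    """Pick the output neuron with the largest value for each sample."""
    return np.argmax(outputs, axis=1)

labels = np.array([2, 0, 1])
T = one_of_c(labels, n_classes=3)

# Hypothetical network outputs for three samples (one row per sample).
outputs = np.array([[0.1, 0.2, 0.9],
                    [0.8, 0.3, -0.1],
                    [0.4, 0.5, 0.2]])
predicted = predict_class(outputs)
```

The same two functions serve the binary case (two output neurons) and the multi-class case, since only the width of the target matrix changes.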


Many researchers have proposed novel methods to determine the ideal network

parameters and neuron weights for a given SLFN to properly learn a training sample set.

All the training algorithms can be broadly categorized into either supervised or

unsupervised learning. In supervised learning, the training dataset consists of sample

patterns and their corresponding class labels. The class labels act as the target values against

which the outputs of the trained SLFN are compared to determine the

training error. This error can then be utilized to further tune the SLFN weights to

produce better results. On the other hand, unsupervised learning requires no class labels

and the training algorithm works with the sample patterns alone to determine unique

structures or relationships among samples. This thesis intends to develop robust

supervised training algorithms for the SLFN. A detailed discussion on the three main

approaches [18] used in the design of supervised training algorithms for the SLFN is

given in the following sub-sections.

2.3.1 Gradient Descent Based Algorithms

The most popular training algorithms used for the neural networks in general are based

on the gradient descent method. The BP method is the most prominent representation of

this class of training algorithm to date. It was developed by Rumelhart et al. [8] in the

1980s and has since been the cornerstone of neural networks training. It is known as the

first training algorithm to iteratively update the neuron weights in the multilayer

perceptron (MLP) to reduce the output error with respect to a target. Before this, the

error correction learning algorithm for perceptrons only worked for a single processing

element. The importance of the BP algorithm lies in its ability to extend and scale the


training process to suit any neural network architecture. In order to perform training

using the BP algorithm there must exist some error term, which is usually calculated as

the difference between the network output and the intended target value. The additive

type of hidden layer neurons is most commonly used in these algorithms.

    Given: a sample set {X,T} of size n, a SLFN with hidden neurons and 1 output neuron
    Initialize: neuron weights and biases, target error, learning rate η, maximum iterations MaxIter
    Repeat for each iteration up to MaxIter:
        Input a pattern and calculate the output
        Calculate the error
        Update the output layer neuron weights
        Calculate the propagated error for the hidden layer neurons
        Update the hidden layer weights
        If the total error for an epoch is small enough, stop the training
    (Training also stops when the iteration count reaches MaxIter)
    Result: trained SLFN with neuron weights and biases

Figure 2.2 The backpropagation training algorithm [2]

Figure 2.2 shows an example of the BP training algorithm. It is seen that the BP training algorithm described in Figure 2.2 employs the sequential update method, where the neuron weights are modified based on the error produced by each input pattern directly. Another popular weight update rule called batch update introduces a slight change to the update procedure. In the batch update process, the whole set of sample patterns is presented to the network and the errors from the respective samples are summed together to produce an aggregated error term that is used to update the neuron weights; the iteration through the entire set of input patterns is called an epoch. It has

been shown in [2] that for small values of the learning rate η both the sequential update

and batch update produce very similar weights updates for a complete cycle of the input

pattern set. However, in terms of computational efficiency, the sequential update is

faster and therefore is the preferred method for practical implementations.
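The difference between the two update rules can be made concrete with a short Python sketch. It is an illustration under stated assumptions rather than the exact listing of Figure 2.2: sigmoid hidden neurons, a single linear output neuron without output bias, a sum-of-squares error, and a synthetic one-dimensional regression target. The function name `train_bp` and all parameter values are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_bp(X, T, n_hidden=5, lr=0.01, max_iter=200, target_error=1e-3,
             batch=False, seed=0):
    """Train an SLFN (sigmoid hidden layer, one linear output) by gradient descent.

    Returns the weights and the list of per-epoch sum-of-squares errors.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.uniform(-0.5, 0.5, size=(d, n_hidden))   # hidden layer weights
    b = rng.uniform(-0.5, 0.5, size=n_hidden)        # hidden layer biases
    beta = rng.uniform(-0.5, 0.5, size=n_hidden)     # output layer weights
    history = []
    for _ in range(max_iter):
        acc_W, acc_b, acc_beta = np.zeros_like(W), np.zeros_like(b), np.zeros_like(beta)
        for x, t in zip(X, T):
            h = sigmoid(x @ W + b)                   # hidden layer outputs
            e = t - beta @ h                         # output error
            delta = e * beta * h * (1.0 - h)         # error propagated to the hidden layer
            step_W, step_b, step_beta = np.outer(x, delta), delta, e * h
            if batch:                                # batch: accumulate over the epoch
                acc_W += step_W
                acc_b += step_b
                acc_beta += step_beta
            else:                                    # sequential: update after each pattern
                W += lr * step_W
                b += lr * step_b
                beta += lr * step_beta
        if batch:                                    # batch: one update per epoch
            W += lr * acc_W
            b += lr * acc_b
            beta += lr * acc_beta
        epoch_error = 0.5 * np.sum((T - sigmoid(X @ W + b) @ beta) ** 2)
        history.append(epoch_error)
        if epoch_error < target_error:               # stop once the error is small enough
            break
    return W, b, beta, history

# Synthetic, illustrative data: 20 samples of a sine curve.
X = np.linspace(-1.0, 1.0, 20).reshape(-1, 1)
T = np.sin(np.pi * X[:, 0])
*_, hist_seq = train_bp(X, T, batch=False)
*_, hist_batch = train_bp(X, T, batch=True)
```

Both variants start from the same weights because they share the seed, so the two error histories can be compared directly for a chosen learning rate.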

A large number of modified and improved BP based training algorithms have

been proposed by researchers. These interesting developments have found various

applications that span numerous disciplines. However, there are several issues that

inhibit the large scale adoption of neural networks in real world applications. These

issues are among the essential problems that motivated the research performed in this

thesis. The list below outlines the salient issues that will be addressed herein.

(a) Convergence to local minima

As the BP method uses the information from the gradient of the error cost function

to descend to some minimum, the convergence of the algorithm to the global

minimum very much depends upon the location of the initialized weights. This

phenomenon applies to non-convex cost functions where there is more than one

minimum point. It should be noted that the probability of having a non-convex cost

function is high among real-world problems which are normally complex in nature.

If the initialized weights position is far away from the global minimum, the

optimization process will likely be trapped in one of the local minima of a non-

convex cost function. An illustrative example of this case is shown in Figure 2.3,

where only some of the initialized weights converge to the global minimum. It is

important to point out that since the cost function $J(\mathbf{w})$ with respect to the neuron

weights ($\mathbf{w}$) is generally unknown in practical situations, there is no way to confirm

whether the gradient descent process converged to the global minimum.



Figure 2.3 (a) Gradient descent on a convex cost function, (b) Gradient descent on a

non-convex cost function with randomly initialized weights

(b) Long training time

The BP based training algorithms iteratively tune the weights of the SLFN to form

suitable decision boundaries on the pattern space. Such a process is slow and time

consuming. Moreover, estimating the time or iterations required for the algorithm to

converge to a minimum is non-trivial. Several issues contribute to the total time

required to find the optimal weights for the SLFN such as the selection of the

learning step size, the features of the training algorithm, the complexity of the

network architecture, the size of the training set, and the separability of the sample

patterns. It is common that complex problems with regular network architectures

such as the SLFN may take hours or even days to be solved.

(c) Numerous training parameters

From the above discussions on the issues affecting the training time, it is clear that

the number of free parameters that need to be tuned is daunting. In practical

applications, most of these are selected based on trial and error or user experience. It

is hard to guarantee the optimality in such training procedures unless all possibilities

are tested. However, there are several generally agreed upon guidelines which act as

the first resort. Using the SLFN is one such guideline, with extensive literature describing

its ability to function as a universal approximator, as well as a universal

approximation theorem that suggests that one hidden layer is enough to gain most of

the advantages offered by neural networks.
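Issue (a) above can be illustrated with a one-dimensional example. The Python sketch below runs plain gradient descent on a hand-picked non-convex cost function from two different initial weights; the cost function, learning rate and starting points are assumptions chosen purely for illustration.

```python
def cost(w):
    """Non-convex cost with a global minimum near w = -1.30 and a local one near w = 1.13."""
    return w**4 - 3.0 * w**2 + w

def grad(w):
    """Derivative of the cost with respect to w."""
    return 4.0 * w**3 - 6.0 * w + 1.0

def gradient_descent(w0, lr=0.01, n_steps=2000):
    """Follow the negative gradient from the initial weight w0."""
    w = w0
    for _ in range(n_steps):
        w -= lr * grad(w)
    return w

w_from_left = gradient_descent(-2.0)    # reaches the global minimum
w_from_right = gradient_descent(2.0)    # is trapped in the local minimum
```

The two runs differ only in their initial weight, yet they settle in different minima with different cost values, and nothing in the second run itself signals that a better solution exists.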

2.3.2 Standard Optimization Method Based Algorithms

In the attempt to address the problem of non-ideal solutions produced by the BP based

training algorithms, researchers have proposed alternative optimization methods for the

SLFN. The support vector network, or support vector machine (SVM) [19], [20] by

Cortes and Vapnik is one of the leading optimization based training algorithms which

makes use of the method of Lagrange multipliers. The support vector network proposes

that only the weights connected to the output layer need to be determined. In essence,

the input patterns are first non-linearly mapped to a high dimensional feature space

using some kernel function. Then a linear decision surface is constructed in the feature

space to best separate the classes. Different from the BP methods, the non-linear

mapping function of the support vector network is fixed a priori, and the optimization

method is used to deterministically compute the output layer decision surface with the

best generalization ability.

Although the research related to SVM training algorithms provided significant

findings in terms of the optimal decision surface, there exist two unavoidable drawbacks.

It is generally known that selecting the ideal non-linear mapping function is a non-

trivial task. The non-linear function has to transform the input patterns to the best

separable feature space in order to achieve the optimal generalization performance.

Furthermore, given the complex nature of the optimization procedure and the

polynomials of high degrees that are normally used to form decision surfaces, the

training time is considerably long. What is more, as the training algorithm is

deterministic, each support vector has to be considered simultaneously during the

optimization process and this proportionally increases the training time.

2.3.3 Least Squares Based Algorithms

The radial basis function (RBF) networks are prime examples of training algorithms that

determine the neural network weights using least squares. The RBF networks are a


special case of the SLFN where the hidden layer neurons are substituted using radial

basis functions such as the Gaussian function. An example of a RBF hidden node is

shown below

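$$h_i = \phi(b_i \lVert \mathbf{x} - \mathbf{c}_i \rVert) \tag{2.3.3}$$

where $\mathbf{x}$ is the input pattern vector, $\mathbf{c}_i$ is the centre of the $i$th node, $b_i$ is the impact factor or bias term, $\phi(\cdot)$ is the non-linear radially symmetrical activation function, $h_i$ is the output of the $i$th hidden node, and $\lVert \cdot \rVert$ is the L2-norm. In the RBF networks [21],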

[22], the parameters of each hidden node are linear and fixed, and the output is given by

a radially symmetric function of the distance metric between the input pattern and the

centre. Therefore, the RBF network hidden layer performs a fixed non-linear

transformation of the input space to the feature space. The only free parameter is then

the output layer weights that can be determined using the least squares method.
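A minimal sketch of this scheme is given below, assuming NumPy, Gaussian hidden nodes with a shared width, and centres drawn at random from the training data. The dataset (points inside or outside a circle), the number of centres and the width are illustrative choices, not values from [21] or [22].

```python
import numpy as np

def rbf_hidden(X, centres, width):
    """Gaussian hidden layer outputs exp(-(width * ||x - c||)^2) for every sample/centre pair."""
    dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
    return np.exp(-(width * dists) ** 2)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
labels = (np.linalg.norm(X, axis=1) < 0.6).astype(int)   # inside/outside a circle
T = np.eye(2)[labels]                                     # 1-of-C targets

# Fixed hidden layer: centres drawn at random from the training data.
centres = X[rng.choice(len(X), size=20, replace=False)]
H = rbf_hidden(X, centres, width=2.0)

# The output layer weights are the only free parameters: solve by least squares.
beta, *_ = np.linalg.lstsq(H, T, rcond=None)
train_accuracy = (np.argmax(H @ beta, axis=1) == labels).mean()
```

Because the hidden layer is fixed before training, the whole fit reduces to a single linear least squares problem in `beta`.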

The difficulties in implementing the RBF network reside in selecting the optimal

centres and activation function. Conventionally, the centres are selected randomly from

the training data, but this selection method often requires very large networks. Lowe [21]

proposed adaptive tuning of the centres using Quasi-Newton methods similar to the

iterative BP. Chen et al. [22] later proposed the orthogonal least squares learning

algorithm to determine the optimal centres of the RBF nodes. The orthogonal least

squares method avoids the over-size and ill-conditioning problems that occur frequently

in the random selection of centres. However, it should be noted that longer training time

is required to perform optimization on the RBF hidden layer.

In summary, the literature reveals that the generalization capability, the selection

of free parameters, and the required training time are the major concerns in the design of

robust and computationally efficient training algorithms for the SLFN.

2.4 Extreme Learning Machine

Recently, a new machine learning algorithm called the extreme learning machine (ELM)

was proposed by Huang et al. [3]. Different from the conventional training algorithms


for SLFNs such as the BP method, the ELM proposed that only the output layer weights

need to be tuned. The basic ELM scheme assigns the hidden layer weights and biases

randomly prior to the training stage. Then the input patterns are non-linearly

transformed into the feature space by the hidden layer function, producing the hidden

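layer output matrix $\mathbf{H}$, such that $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where $\mathbf{T}$ represents the target matrix.

$$\mathbf{H} = \left[\, g(\mathbf{W}^T \mathbf{x}_1 + \mathbf{b}) \;\; \cdots \;\; g(\mathbf{W}^T \mathbf{x}_n + \mathbf{b}) \,\right]^T \tag{2.4.1}$$

Finally the output layer weights are calculated by the method of least squares using the Moore-Penrose pseudo-inverse. The convention in [3] is adopted here for readability.

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger} \mathbf{T} \tag{2.4.2}$$

In summary, the basic ELM training algorithm can be defined as:

1) Given a training data set $[\mathbf{X}, \mathbf{T}]$, randomly assign the hidden layer weights and biases of the SLFN.

2) Calculate the hidden layer output $\mathbf{H}$.

3) Solve for $\boldsymbol{\beta}$ using the Moore-Penrose pseudo-inverse, where $\boldsymbol{\beta} = \mathbf{H}^{\dagger} \mathbf{T}$.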
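The three steps above can be sketched in Python as follows. This is a minimal illustration assuming NumPy, sigmoid hidden neurons and uniformly distributed random weights in [-1, 1]; the function names, the weight range and the synthetic three-class dataset are assumptions of this example and are not prescribed by [3].

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def elm_train(X, T, n_hidden, rng):
    """Basic ELM: random hidden layer, pseudo-inverse output layer."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))  # step 1: random weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)                 # step 1: random biases
    H = sigmoid(X @ W + b)                                    # step 2: hidden layer outputs
    beta = np.linalg.pinv(H) @ T                              # step 3: output layer weights
    return W, b, beta

def elm_predict(X, W, b, beta):
    """Network outputs for the rows of X."""
    return sigmoid(X @ W + b) @ beta

# Synthetic, illustrative data: three well separated Gaussian clusters.
rng = np.random.default_rng(0)
centres = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
labels = np.repeat(np.arange(3), 50)
X = centres[labels] + 0.4 * rng.standard_normal((150, 2))
T = np.eye(3)[labels]                                         # 1-of-C targets

W, b, beta = elm_train(X, T, n_hidden=30, rng=rng)
predicted = np.argmax(elm_predict(X, W, b, beta), axis=1)
train_accuracy = (predicted == labels).mean()
```

No iteration is involved: the hidden layer is never revisited after its random assignment, and the only computation of note is one pseudo-inverse.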

Significant improvements in training speed and generalization performance have

been reported for the ELM [18]. It is seen that the assignment of the hidden layer

weights and biases using randomly generated values reduces the model complexity

tremendously in terms of the number of parameters that require tuning. Furthermore, the

analytical solution of the output layer weights in (2.4.2) converts the output weights

optimization problem into solving a system of linear equations. The solution to the

linear system using the pseudo-inverse produces the smallest training error

and the smallest norm (including the overdetermined and underdetermined cases).

Achieving the smallest norm actually maximizes the generalization ability of the SLFN

classifier by reducing the output variance in the testing phase. In addition, Bartlett’s

theory [185] also suggests that for neural networks achieving small training error, the

smaller the norm of the network weights the better the performance.
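The minimum-norm property of the pseudo-inverse solution can be checked numerically. The sketch below is illustrative only and uses assumed random matrices in place of a real hidden layer output matrix: in an underdetermined case it builds a second exact solution by adding a null-space component and compares the two norms.

```python
import numpy as np

rng = np.random.default_rng(1)
# Underdetermined case: 5 samples, 12 hidden neurons, so many exact solutions exist.
H = rng.normal(size=(5, 12))
T = rng.normal(size=(5, 1))

beta_pinv = np.linalg.pinv(H) @ T

# Any other exact solution differs from beta_pinv by a vector in the null space of H.
null_proj = np.eye(12) - np.linalg.pinv(H) @ H
beta_other = beta_pinv + null_proj @ rng.normal(size=(12, 1))

residual_pinv = np.linalg.norm(H @ beta_pinv - T)
residual_other = np.linalg.norm(H @ beta_other - T)
norm_pinv = np.linalg.norm(beta_pinv)
norm_other = np.linalg.norm(beta_other)
```

Both solutions reproduce the targets, but `norm_pinv` is the smaller of the two because the pseudo-inverse solution has no null-space component.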


The reduction in the number of parameters to tune also allows the SLFN training

to be less sensitive to human intervention. The only training parameter to specify for the ELM is the number of neurons, while the activation functions are assumed to be sigmoidal with additive hidden nodes. Hence the ELM training algorithm is much simpler than BP and the SVM, both of which require educated initial estimates of the solution. Unlike the earlier random hidden layer weight method in [21], which actually draws the weights from randomly selected training patterns, the ELM hidden layer weights are completely independent of the sample dataset. In fact, the ELM hidden layer weights and biases are usually pre-generated without knowledge of the input patterns and output target variables.

Although the ELM was initially designed for the SLFN, it was later extended to include generalized SLFNs, which allow hidden nodes that are not neuron-like [23], [24]. For instance, the ELM literature provides theoretical proof that hard-limiting functions such as the threshold function, which is not infinitely differentiable, are capable of universal approximation [93]. As more function types become applicable to the ELM, the hidden layer of the ELM is seen to function as an arbitrary feature mapping layer. This observation bears strong similarities to the operation of the kernel functions in the SVM training algorithm. Considerable research effort has been put into the comparison and integration of the ELM and SVM, and the findings are discussed in detail in Section 2.6.


Lastly, it is worth noting that several studies have examined the feasibility of the ELM learning algorithm [186], [187]. In [186], Wang et al. proposed experiments to compare the direct input SLFN (without any hidden layer) with the random weights ELM. The experimental results confirm that the randomly generated weights do have some positive effects on the SLFN for many classification and regression tasks. Horata et al. [187] studied various methods of computing the pseudo-inverse, as this is the most time-consuming step and the basic ELM typically uses the SVD to compute it. They subsequently proposed the QR-ELM and GENINV-ELM, which significantly reduce the computational time required to calculate the pseudo-inverse while maintaining acceptable levels of accuracy. The following sub-section examines the learning theories developed for the ELM.


2.4.1 Learning Theories of ELM

The universal approximation and interpolation capabilities of neural networks have been

thoroughly researched over the past decades [3-7], [23-26]. Broomhead and Lowe [26]

suggested that feedforward neural networks may be viewed as performing a simple

curve-fitting operation in high dimensional space. Learning is then an attempt to

produce a best fit surface in the high dimensional space using a finite set of data points

(sample patterns). Generalization can then be seen as interpolating between the data

points on the fitted surface. A brief review of the learning theories developed for the

SLFN will be given below.

The early works on universal approximation stem from the initial problem of finding a group of single variable functions to represent a high order polynomial function. The solution to this problem was proposed by Kolmogorov, who proved that “any continuous function defined on an $n$-dimensional cube is representable by sums and superpositions of continuous functions of exactly one variable” [2], with the mathematical model shown in (2.4.3), where $f(\mathbf{x})$ is the target function with input variables $\{x_i \mid i = 1, \ldots, n\}$, $\phi_{ij}(x_i)$ are the individual continuous functions of one variable $\{x_i \mid i = 1, \ldots, n\}$, and the sums of the outputs of the individual continuous functions are connected to the weighting variables $\{\alpha_j \mid j = 1, \ldots, 2n+1\}$ to be superimposed.

$$f(\mathbf{x}) = \sum_{j=1}^{2n+1} \alpha_j \left[ \sum_{i=1}^{n} \phi_{ij}(x_i) \right] \tag{2.4.3}$$

The function in (2.4.3) shows great similarities with the mathematical model of the SLFN in (2.4.4), where $f(\mathbf{x}_j)$ is the output function for the $j$-th input vector from a total of $N$ sample vectors, $\mathbf{w}_i$ and $b_i$ are the weights and biases of the $\tilde{N}$ hidden layer neurons with activation function $g(\cdot)$, and $\beta_i$ are the output layer weights. Here the hidden layer and the output layer are taken to have the same number of neurons.

$$f(\mathbf{x}_j) = \sum_{i=1}^{\tilde{N}} \beta_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) \tag{2.4.4}$$

After Kolmogorov’s theorem was conceived, Cybenko [4] and Funahashi [5] in the late 1980s used it as a means to derive rigorous proofs that there exists a solution for the SLFN with any continuous sigmoidal non-linear activation function to approximate any continuous function on a compact sample set with uniform topology. Hornik et al. [6] then proved that MLPs in general, including the SLFN, are capable of universal approximation using any arbitrary bounded non-constant activation function, provided that a sufficient number of hidden neurons is available. In the early 1990s, Leshno et al. [7] provided the general proof that the SLFN is capable of universal approximation with any locally bounded piecewise continuous activation function as long as the activation function is not a polynomial. An important deviation from Kolmogorov’s theorem in the above proofs is that, instead of allowing for different non-linear functions, the proofs involving the SLFN use the same non-linear function (the neuron activation function) for all hidden layer neurons. The universal approximation theorem below is derived from the works of these early researchers [4-6]:

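Theorem 2.1: Let $\varphi(\cdot)$ be a non-constant, bounded, and monotone-increasing continuous function. Let $I_n$ denote the $n$-dimensional unit hypercube $[0, 1]^n$, and the space of continuous functions on $I_n$ be denoted by $C(I_n)$. Then given any function $f \in C(I_n)$ and $\epsilon > 0$, there exists an integer $M$ and sets of real constants $\alpha_i$, $b_i$, and $w_{ij}$, where $i = 1, \ldots, M$ and $j = 1, \ldots, n$, such that we may define

$$F(x_1, \ldots, x_n) = \sum_{i=1}^{M} \alpha_i \, \varphi\left( \sum_{j=1}^{n} w_{ij} x_j + b_i \right) \tag{2.4.5}$$

as an approximate realization of the function $f(\cdot)$, where $|F(x_1, \ldots, x_n) - f(x_1, \ldots, x_n)| < \epsilon$ for all $x_1, \ldots, x_n$ in the input space.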


All the above contributions, including Theorem 2.1, provide existence proofs that some combination of neurons, when put together correctly, is able to approximate any continuous function. However, they give no constructive instructions on how these “best case” networks can be found. Hence numerous training algorithms have been developed to search the weight space for the ideal set of neuron weights for a given network architecture to produce the best approximation to a target function. Unfortunately, there is no short-cut to determine whether the best solution has been found without testing the entire set of combinations, but it is generally considered acceptable if the final testing error is negligible for the intended application (i.e. let the learning error be smaller than $\epsilon$, where $\epsilon$ is a small positive number).

In all of the SLFN learning theories mentioned above, the meta-parameters (network weights, activation functions, training variables) are required to be freely adjustable. As a consequence, the training time increases quickly for complex algorithms that require a lot of tuning. The ELM does not require any tuning at the hidden layer; in fact, in the typical implementation of the ELM the hidden layer weights and biases are generated randomly. The output layer weights can then be determined analytically using the pseudo-inverse. The ELM differs significantly from the conventional BP method in that it does not involve iterative tuning at any stage of training the SLFN. The learning performed by the ELM on a finite set of sample data was initially based on the interpolation theory [3], [18]. In terms of learning a target function, the interpolation theory and universal approximation are two different approaches to the problem. In the interpolation theory, the network should learn all the sample data points well, so that these data points can be reproduced exactly, and any query outside the set of training points can be estimated by interpolation or extrapolation from the existing points. The universal approximation theorem does not demand that the sample data points be learned exactly; rather, it proposes that the total error over the entire sample space be minimized. However, it should be noted that both interpolation and universal approximation capabilities allow a neural network to learn a continuous function within a dense set given a large enough number of neurons, and their differences are significant only when the number of neurons is limited.

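The first function approximation theory for the ELM was introduced in [3], using the interpolation theory. For a dataset with $N$ sample patterns $\{\mathbf{x}_j \mid \mathbf{x}_j \in \mathbb{R}^n, \; j = 1, \ldots, N\}$ and corresponding target vectors $\{\mathbf{t}_j \mid \mathbf{t}_j \in \mathbb{R}^m, \; j = 1, \ldots, N\}$, such that the SLFN attempts to map $\mathbf{x}_j \mapsto \mathbf{t}_j$, the SLFN with $\tilde{N}$ neurons can be written as

$$f(\mathbf{x}_j) = \sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i), \quad j = 1, \ldots, N \tag{2.4.6}$$

where $f(\mathbf{x}_j)$ is the SLFN output vector, $\{\boldsymbol{\beta}_i \mid \boldsymbol{\beta}_i \in \mathbb{R}^m, \; i = 1, \ldots, \tilde{N}\}$ are the output layer weights, $\{\mathbf{w}_i \mid \mathbf{w}_i \in \mathbb{R}^n, \; i = 1, \ldots, \tilde{N}\}$ are the hidden layer weights, $\{b_i \mid b_i \in \mathbb{R}, \; i = 1, \ldots, \tilde{N}\}$ are the biases, and $g(\cdot)$ is the activation function. The function in (2.4.6) can be written compactly in matrix form as $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$, where

$$\mathbf{H} = \begin{bmatrix} g(\mathbf{w}_1 \cdot \mathbf{x}_1 + b_1) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(\mathbf{w}_1 \cdot \mathbf{x}_N + b_1) & \cdots & g(\mathbf{w}_{\tilde{N}} \cdot \mathbf{x}_N + b_{\tilde{N}}) \end{bmatrix}_{N \times \tilde{N}} \tag{2.4.7}$$

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \vdots \\ \boldsymbol{\beta}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad \mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m}$$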

It has been shown in [3] that from the interpolation theory point of view, if the activation function $g(\cdot)$ is infinitely differentiable in any interval, the hidden layer weights and biases can be randomly generated. Different from the previous work that randomly selects weights from the training data [21], the random weights generated for the ELM can be completely data independent. Theorem 2.2 [3] below states the results.

Theorem 2.2: Given any small positive value $\epsilon > 0$, activation function $g: \mathbb{R} \to \mathbb{R}$ which is infinitely differentiable in any interval, and $N$ arbitrary distinct samples $(\mathbf{x}_j, \mathbf{t}_j) \in \mathbb{R}^n \times \mathbb{R}^m$, there exists $\tilde{N} \leq N$ such that for any $\{\mathbf{w}_i, b_i\}_{i=1}^{\tilde{N}}$ randomly generated from any intervals of $\mathbb{R}^n \times \mathbb{R}$, according to any continuous probability distribution, with probability one, $\|\mathbf{H}_{N \times \tilde{N}} \boldsymbol{\beta}_{\tilde{N} \times m} - \mathbf{T}_{N \times m}\| < \epsilon$.

Theorem 2.2 shows that the ELM is capable of learning the sample points efficiently with a small amount of error. The SLFN trained according to this theorem is known as the approximate interpolation network, where the points are learned with some error. However, the authors in [3] also proved that for the case of $N$ samples, it is possible for the ELM to learn the samples exactly if $\tilde{N} = N$, such that $\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\| = 0$. This extension of Theorem 2.2 is known as the exact interpolation network, where there exists a set of network parameters that can learn exactly $N$ sample points with $N$ neurons. The proof provided for this theorem relies on the fact that, with probability one, the randomly generated network parameters make the hidden layer output matrix full rank, and hence guarantee invertibility.
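The exact interpolation case can be illustrated on a small assumed example. The sketch below uses as many sigmoid hidden neurons as samples, with weights drawn from a continuous (uniform) distribution; the problem sizes and sampling ranges are arbitrary choices for illustration and are not taken from [3].

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, m = 8, 4, 2                       # samples, input dimension, output dimension
X = rng.normal(size=(N, n))             # distinct samples (with probability one)
T = rng.normal(size=(N, m))             # arbitrary targets

# As many hidden neurons as samples, parameters drawn from a continuous distribution.
W = rng.uniform(-3.0, 3.0, size=(n, N))
b = rng.uniform(-3.0, 3.0, size=N)
H = 1.0 / (1.0 + np.exp(-(X @ W + b)))  # square N x N hidden layer output matrix

rank = np.linalg.matrix_rank(H)
beta = np.linalg.solve(H, T)            # direct solve, possible because H is invertible
error = np.linalg.norm(H @ beta - T)
```

Full rank holds in exact arithmetic with probability one, but in floating point the matrix can become badly conditioned as the number of neurons grows, so the small problem size here is deliberate.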

There also exist universal approximation proofs for the ELM in which the authors have attempted to provide rules for the design and training of the SLFN. In [25], an incremental version of the ELM was used to prove that if the incrementally added neurons (with randomly selected weights and biases) of a SLFN use (2.4.8) to update the corresponding new output weight term, the universal approximation theorem below holds.

Theorem 2.3: Given any bounded non-constant piecewise continuous function $g: \mathbb{R} \to \mathbb{R}$ for additive nodes or any integrable piecewise continuous function $g: \mathbb{R} \to \mathbb{R}$ and $\int_{\mathbb{R}} g(x)\,dx \neq 0$ for radial basis function nodes, for any continuous target function $f$ and any randomly generated function sequence $\{g_n\}$, $\lim_{n \to \infty} \|f - f_n\| = 0$ holds with probability one if

$$\beta_n = \frac{\langle e_{n-1}, g_n \rangle}{\|g_n\|^2} \tag{2.4.8}$$

where $f_n$ denotes the network output with $n$ hidden nodes and $e_{n-1} = f - f_{n-1}$ is the residual error before the $n$-th node is added.
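The update in (2.4.8) can be sketched on a finite sample set. The code below is an illustration under stated assumptions, not the algorithm of [25] verbatim: the function-space inner product is approximated by a dot product over the training samples, the nodes are sigmoid additive nodes, and the function name `incremental_elm` and the sine target are hypothetical.

```python
import numpy as np

def incremental_elm(X, t, max_nodes, rng):
    """Add random sigmoid nodes one at a time; each new output weight is
    <e, h> / ||h||^2, where e is the current residual on the samples."""
    n_features = X.shape[1]
    residual = t.astype(float).copy()   # e_0 = f, since the empty network outputs 0
    nodes, history = [], [np.linalg.norm(residual)]
    for _ in range(max_nodes):
        w = rng.uniform(-1.0, 1.0, size=n_features)
        b = rng.uniform(-1.0, 1.0)
        h = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # new node's output on all samples
        beta = (residual @ h) / (h @ h)
        residual = residual - beta * h            # previously added weights stay fixed
        nodes.append((w, b, beta))
        history.append(np.linalg.norm(residual))
    return nodes, history

rng = np.random.default_rng(3)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
t = np.sin(3.0 * X[:, 0])
nodes, history = incremental_elm(X, t, max_nodes=200, rng=rng)
```

Each step projects the residual onto the new node's output, so the residual norm recorded in `history` can never increase. How quickly it falls depends on the random nodes drawn.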

Theorem 2.3 only confirms the universal approximation capability of the ELM for the incremental learning method, where the network starts with an empty set of neurons and iteratively adds more neurons as required. Indeed, the universal approximation capability of the ELM in the most commonly used batch learning mode was only published later in [23], [24]. Within the derivations of [23], [24], the authors still used the incremental learning method to first prove that such SLFN training methods could perform universal approximation, before stating the theorem for the fixed structure SLFN by induction. This important result is given as Theorem 2.4.


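Theorem 2.4: Given any non-constant piecewise continuous function $g: \mathbb{R} \to \mathbb{R}$, if span $\{g(\mathbf{w}, b, \mathbf{x}) : (\mathbf{w}, b) \in \mathbb{R}^n \times \mathbb{R}\}$ is dense in the function space $L^2$, for any continuous target function $f$ and any function sequence $\{g_n(\mathbf{x}) = g(\mathbf{w}_n, b_n, \mathbf{x})\}$ randomly generated based on any continuous sampling distribution, $\lim_{n \to \infty} \|f - f_n\| = 0$ holds with probability one if the output weights $\beta_i$ are determined by ordinary least squares to minimize $\left\| f(\mathbf{x}) - \sum_{i=1}^{n} \beta_i g_i(\mathbf{x}) \right\|$.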

With Theorem 2.4, the universal approximation capability of the original ELM [3] and of all its batch learning variants with fixed SLFN network architecture is justified. In addition, Theorem 2.4 also indicates that the ELM training algorithm can work for the “generalized” SLFNs, which include a wide range of non-linear and non-differentiable activation functions where the nodes need not be neuron-like. This essentially means that the ELM training algorithm allows much greater flexibility in the selection of basis functions to be used in the SLFN architecture.

Lastly, it is worth noting that all the ELM approximation theories state that as the number of neurons becomes very large (possibly much larger than the number of training samples), the performance of the SLFN trained with the ELM algorithm will correspondingly improve. Therefore the ELM training norm deviates from the conventional MLP learning rule that at most $N$ neurons are required to learn $N$ unique sample patterns well. In fact, ELM networks are usually tested with a large number of neurons at the first attempt. The next sub-sections survey the extensions of the ELM algorithm and their applications.

2.4.2 Batch Learning ELM

The training algorithms that are developed based on the ELM can be generally categorized into off-line or batch learning and on-line or sequential learning. The batch learning method trains on a group of sample patterns at once and usually requires longer computational time, while the sequential learning method learns the sample patterns one-by-one as new samples arrive (usually in real-time systems). The initial ELM was designed for batch learning and was later extended to have sequential learning capabilities. In the design of the basic batch learning ELM, the SLFN hidden layer weights and biases are assigned randomly, and the output layer weights are obtained by solving for $\boldsymbol{\beta}$ in the least squares sense such that $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$. If the number of hidden nodes is equal to the number of unique training samples, the matrix $\mathbf{H}$ is invertible, and there exists a solution $\boldsymbol{\beta} = \mathbf{H}^{-1}\mathbf{T}$ with zero error ($\|\mathbf{H}\boldsymbol{\beta} - \mathbf{T}\| = 0$). However, in practical cases where the number of samples is either more than or less than the number of hidden nodes, such that the matrix $\mathbf{H}$ is not square and cannot be inverted directly, the pseudo-inverse is used as it achieves the smallest training error and yields the smallest-norm solution. The neural network theory states that for SLFNs reaching small training error, the smaller the norm of the network weights, the better the generalization performance [18].

Within the batch learning category, the development of the ELM training algorithms can be divided into three sub-sections, namely, (i) the optimal selection of weights and biases, (ii) the use of different types of nodes and activation functions, and (iii) the automatic model selection algorithms.

The Optimal Selection of Weights and Biases

The first sub-section involves the least modification to the architecture of the

SLFN. Instead of randomly generating the network weights and biases, the methods

proposed in [33-39] implement alternative weights selection strategies that improve the

performance of the ELM. The discussions in [33-35] revolve around the ill-posed

solution of the pseudo-inverse when the matrices are not well conditioned. They

propose methods to properly select the hidden layer weights and biases such that the

hidden layer output matrix is well conditioned before calculating the output weights. In

[36], [37], the authors implement evolutionary algorithms to select the hidden layer

weights and biases, whereas the input data is used to generate the suitable hidden layer

weights and biases in [38]. Different from the works mentioned above, the algorithm in

[39] proposes the optimal calculation of the output layer weights when the hidden layer

output matrix is non-full rank.

The Use of Different Types of Nodes and Activation Functions

The second sub-section of learning algorithms proposes some changes to the

nodes or activation functions in the SLFN. In [40], the parameterized RBF was used in

the hidden layer so that the RBF node meta-parameters can be tuned to provide better

performance. A method to implement fuzzy activation functions was proposed in [41],

while the authors in [42] proved that the combination of upper integral functions at the

hidden layer can function as a classifier. Other than that, a good deal of attention was

also paid to the development of ELM algorithms for complex SLFNs. The authors in

[43-45] have developed fully complex ELMs with the main purpose of implementing an

ELM-based channel equalizer. In [46], both the sine and cosine were used as activation functions for periodic function approximation.

The Automatic Model Selection Algorithms

The third and final sub-section for batch learning algorithms consists of

automatic model selection methods which determine the optimal architecture for the

ELMs. Generally, model selection based training algorithms in the neural networks literature can be divided into constructive methods and pruning methods. The

constructive methods start from a single hidden layer neuron, and incrementally add

neurons until a target or stopping criterion is met, while the pruning methods start with a

large number of neurons and selectively remove neurons which are less important. In

[23], [24], [47-49], constructive model selection methods were proposed to iteratively

search for the optimal number of hidden nodes for the SLFN trained with the ELM

algorithm. The convergence rate analysis for the constructive type training algorithms

was also provided in [50], where the main results show that the constructive type

training algorithms have universal approximation capability. Alternatively, the authors

in [51] and [52] proposed optimally pruned ELM algorithms that remove hidden layer

neurons from a large group of pre-generated neurons, based on regularization theories.

In [53], the modified Gram-Schmidt algorithm was used to select the important nodes in

the SLFN, while the fast pruned-ELM in [54] used statistical methods to determine the

best nodes to be retained.


2.4.3 Sequential Learning ELM

Real-time systems require speed and a continuous updating capability that are only available using sequential learning algorithms. Because the ELM is a very fast learning algorithm by itself, the sequential learning form of the ELM promises even greater speeds and improved overall performance compared to the conventional methods. The online sequential extreme learning machine (OS-ELM) and its enhanced versions in [55-57] extend the simple learning algorithm of the ELM to the sequential case, where the sample-by-sample learning equations are derived and proven to be equivalent to the batch learning case within a compact set of training samples. There is also an online version of the ELM with fuzzy activation functions, developed in [58] for function approximation and classification problems, and the structure-adjustable (constructive or pruning based) sequential learning ELMs in [59], [60].
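The equivalence between sample-by-sample and batch learning can be illustrated with a standard recursive least-squares update. The sketch below assumes this update is representative of the sequential equations referred to above; the exact formulation in [55-57] is not reproduced here, and the data, network size, and helper name `hidden` are hypothetical.

```python
import numpy as np

def hidden(X, W, b):
    """Sigmoid hidden layer output for a fixed random hidden layer."""
    return 1.0 / (1.0 + np.exp(-(X @ W + b)))

rng = np.random.default_rng(4)
n_features, n_hidden = 3, 6
X = rng.normal(size=(200, n_features))
T = np.sin(X.sum(axis=1, keepdims=True))
W = rng.uniform(-2.0, 2.0, size=(n_features, n_hidden))
b = rng.uniform(-2.0, 2.0, size=n_hidden)

# Initialisation phase: a first batch with at least n_hidden samples.
H0 = hidden(X[:40], W, b)
P = np.linalg.inv(H0.T @ H0)
beta = P @ H0.T @ T[:40]

# Sequential phase: one sample at a time, no past data revisited.
for k in range(40, 200):
    h = hidden(X[k:k + 1], W, b)                      # 1 x n_hidden row
    P = P - (P @ h.T @ h @ P) / (1.0 + h @ P @ h.T)
    beta = beta + P @ h.T @ (T[k:k + 1] - h @ beta)

# Batch least-squares solution on the full data for comparison.
H = hidden(X, W, b)
beta_batch = np.linalg.pinv(H) @ T
max_gap = np.max(np.abs(H @ beta - H @ beta_batch))
```

Only the small matrix `P` and the current `beta` are kept between updates, which is what makes this form suitable for streaming data.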

2.4.4 ELM Ensembles

In order to boost the performance of the ELM learning algorithms mentioned above (batch learning and sequential learning methods), researchers have combined several ELMs to learn a problem as a group and to produce a collective output. These approaches, which combine the efforts of multiple ELMs, are known as ensemble methods. The most prominent works in this area are the contributions in [61-64], where different combination schemes are implemented to improve on the classification performance of a single ELM. The recently published results in [64] show that increasing the number of ELMs generally improves the classification ability of the neural classifier ensemble when tested on 19 real-world problems. It was also found that between 5 and 35 ELMs is sufficient for most cases. In terms of practical implementations of the ELM ensemble in real-world situations where fast training is essential, the authors in [65] have shown that ELM ensembles can be trained in parallel using graphics processing units (GPUs), with training several orders of magnitude faster than on the usual CPU.
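As a simple illustration of the ensemble idea, the sketch below trains several basic ELMs that differ only in their random hidden layers and combines them by majority vote. Majority voting is an assumption chosen for brevity; the combination schemes in [61-64] may differ, and the helper names and toy data are hypothetical.

```python
import numpy as np

def train_elm(X, T, n_hidden, rng):
    """Basic ELM with a random sigmoid hidden layer (see Section 2.4)."""
    W = rng.uniform(-1.0, 1.0, size=(X.shape[1], n_hidden))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return W, b, np.linalg.pinv(H) @ T

def elm_scores(X, model):
    W, b, beta = model
    return (1.0 / (1.0 + np.exp(-(X @ W + b)))) @ beta

rng = np.random.default_rng(8)
# Assumed toy data: two overlapping Gaussian classes.
X = np.vstack([rng.normal(-1.5, 1.0, (100, 2)), rng.normal(1.5, 1.0, (100, 2))])
labels = np.repeat([0, 1], 100)
T = np.eye(2)[labels]

# Each member differs only in its random hidden layer.
members = [train_elm(X, T, n_hidden=10, rng=rng) for _ in range(15)]
votes = np.stack([elm_scores(X, m).argmax(axis=1) for m in members])   # 15 x 200
ensemble_pred = (votes.mean(axis=0) > 0.5).astype(int)                 # majority vote
ensemble_accuracy = np.mean(ensemble_pred == labels)
```

Because the members are trained independently, the list comprehension that builds `members` is the part that could be distributed across parallel hardware.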


2.5 Regularized ELM

The basic ELM training algorithm uses the pseudo-inverse to analytically determine the output weights after the hidden layer weights and biases are randomly generated. According to the statistical optimization theory [83], the pseudo-inverse is a form of empirical risk minimization (ERM) strategy, where the focus is on minimizing the training error for classification of the training sample set. However, achieving the minimum training error does not directly translate to good classification performance in the testing phase, where the generalization ability is essential. In fact, over-fitting is often observed in ELM applications [89]. A good classification model should balance structural risk minimization against empirical risk minimization. In order to improve the generalization ability of the ELM, several modified ELMs [88-90] that make use of regularization techniques in the determination of the output weights were proposed.


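Deng et al. [89] described the ELM as being ERM themed and noted that the training process of the ELM does not consider the heteroscedasticity in real world sample datasets. Therefore the ELM is susceptible to over-fitting and outliers during training. They proposed the regularized ELM (RELM), which introduces a regularization factor $\gamma$ for the empirical risk and uses the magnitudes of the output weights to represent the structural risk in training. Furthermore, an error weighting factor matrix $\mathbf{D}$ was included to minimize the interference of outliers in the training sample set. The optimization problem is defined in (2.5.1).

$$\min \; \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{1}{2}\gamma\|\mathbf{D}\boldsymbol{\varepsilon}\|^2 \quad \text{s.t.} \quad \sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i \, g(\mathbf{w}_i \cdot \mathbf{x}_j + b_i) - \mathbf{t}_j = \boldsymbol{\varepsilon}_j, \quad j = 1, \ldots, N \tag{2.5.1}$$

The standard optimization technique using Lagrange multipliers was then used to obtain the output weights solution in (2.5.2).

$$\boldsymbol{\beta} = \left( \frac{\mathbf{I}}{\gamma} + \mathbf{H}^T \mathbf{D}^2 \mathbf{H} \right)^{-1} \mathbf{H}^T \mathbf{D}^2 \mathbf{T} \tag{2.5.2}$$

where $\mathbf{I}$ is the identity matrix, $\mathbf{H}$ is the hidden layer output matrix, and $\mathbf{T}$ is the target matrix. When the matrix $\mathbf{D}$ is set as the identity matrix, the solution in (2.5.2) can be written as in (2.5.3).

$$\boldsymbol{\beta} = \left( \frac{\mathbf{I}}{\gamma} + \mathbf{H}^T \mathbf{H} \right)^{-1} \mathbf{H}^T \mathbf{T} \tag{2.5.3}$$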
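The regularized solution can be sketched as a single linear solve. The code below is illustrative only: it assumes the weighted ridge form $(\mathbf{I}/\gamma + \mathbf{H}^T\mathbf{D}^2\mathbf{H})\boldsymbol{\beta} = \mathbf{H}^T\mathbf{D}^2\mathbf{T}$ with a diagonal $\mathbf{D}$, and the function name `relm_output_weights` and the random test matrices are hypothetical.

```python
import numpy as np

def relm_output_weights(H, T, gamma, d=None):
    """Solve (I/gamma + H^T D^2 H) beta = H^T D^2 T; d holds the diagonal of D
    (per-sample error weights). d=None means D is the identity matrix."""
    n_hidden = H.shape[1]
    if d is None:
        d = np.ones(H.shape[0])
    HtD2 = H.T * (d ** 2)                  # equals H^T D^2 without forming D
    A = np.eye(n_hidden) / gamma + HtD2 @ H
    return np.linalg.solve(A, HtD2 @ T)

rng = np.random.default_rng(5)
H = rng.normal(size=(60, 15))
T = rng.normal(size=(60, 2))

beta_small_gamma = relm_output_weights(H, T, gamma=1e-2)   # strong regularization
beta_large_gamma = relm_output_weights(H, T, gamma=1e8)    # almost no regularization
beta_pinv = np.linalg.pinv(H) @ T
```

A small regularization factor shrinks the output weights, while a very large one recovers the pseudo-inverse solution of the basic ELM. This is the trade-off between structural and empirical risk described above.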

The RELM was reported to improve the generalization ability considerably compared to the ELM when tested on 13 real world datasets, while still maintaining the fast training speed. However, the authors used the matrix search method to select the optimal combination of regularization factor and number of neurons. The matrix search method is time consuming, as every possible combination within the specified range of values has to be tested. Although the time spent searching for the optimal meta-parameters is essential and significant, most of the results in the literature do not account for it.

Toh [88] proposed the total error rate-based ELM (TER-ELM) designed for pattern classification problems. Instead of minimizing the empirical error (as in the basic ELM), the TER-ELM aims to minimize the classification error of the SLFN. The possible classification results are first categorized as True Positive Rate (TPR), True Negative Rate (TNR), False Positive Rate (FPR), and False Negative Rate (FNR). The objective is then to minimize the total error rate (TER):

$$\mathrm{TER} = \mathrm{FPR} + \mathrm{FNR} \tag{2.5.4}$$

In the optimization of the TER, Toh used a quadratic function to approximate the step function commonly used as the hard discriminator (which outputs either 1 or 0 only). Comparisons with other functions such as the sigmoid or logistic-like functions, piecewise power functions, and some non-linear smooth functions in that work pointed out that the non-linear nature of these functions requires an iterative search method to obtain the solution. Therefore, the quadratic function was retained in order to derive a

deterministic solution. By defining two class-specific diagonal weighting matrices


and , which represents the weight factors for the positive and negative samples of

each category in a multi-category problem (similar to multi-class classification), the

generalized optimization solution for the SLFN output weights can be written as:

( ) ( )

where , the elements of and are ordered according to the two

classes, and represents the number of categories of the multi-category problem,

such that [ ]. In order to handle the case when the matrix is not full

rank, the regularizer is included to improve the numerical stability in the regularized

solution in (2.5.6).

( ) ( )

The TER-ELM was first evaluated against the ELM using 42 benchmark datasets for

classification from the UCI, StatLog and StatLib datasets. It should be noted that the

regularizer was not used in the experiments ( ), instead, the author selected 10 sets

of class-specific weights that form , to verify the effectiveness of the TER

minimization design. The average classification results showed a consistent

improvement by the TER-ELM over the basic ELM. Later comparisons with other

state-of-the-art classifiers reveal that the TER-ELM produces comparable results, but

the simple deterministic methodology employed by the TER-ELM significantly reduces

the network complexity and computational costs.

Comparing the solution in (2.5.6) with the solution for the RELM in (2.5.2), it

can be seen that in (2.5.2) the weight matrix is sample specific to reduce the effects of

outliers, while the weight matrix in (2.5.6) is class specific to balance the sensitivity

of the classifier. However, both the TER-ELM and RELM algorithms still require extra

time for parameter selection compared to the ELM.

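Huang et al. [90] proposed the constrained-optimization based ELM (CO-ELM) to extend the work by Deng et al. [89] and Toh [88] to the generalized SLFNs and the kernel functions. In [90], it was first proven that the ELM with generalized SLFN is capable of universal classification as long as the number of neurons selected is large enough for the universal function approximation conditions to be valid. Then the Lagrange multiplier method for equality constraints is applied to the ELM optimization problem in (2.5.7) to show that the multi-class single output type of SLFN is a special case of the multi-class multi output type of SLFN for classification, and that only the multi output case needs to be considered for the CO-ELM.

$$\min \; \frac{1}{2}\|\boldsymbol{\beta}\|^2 + \frac{C}{2}\sum_{i=1}^{N}\|\boldsymbol{\xi}_i\|^2 \quad \text{s.t.} \quad \mathbf{h}(\mathbf{x}_i)\boldsymbol{\beta} = \mathbf{t}_i^T - \boldsymbol{\xi}_i^T, \quad i = 1, \ldots, N \tag{2.5.7}$$

where $\mathbf{h}(\mathbf{x}_i)$ is the hidden layer output (row) vector for the sample $\mathbf{x}_i$, $\boldsymbol{\xi}_i$ is the training error vector, and $C$ is the regularization factor.

For the case when the number of training samples is less than the number of hidden neurons (the dimensionality of the feature space), the solution in (2.5.8) is suggested.

$$\boldsymbol{\beta} = \mathbf{H}^T \left( \frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T \right)^{-1} \mathbf{T} \tag{2.5.8}$$

For the case when the number of training samples is more than the number of hidden neurons, the solution in (2.5.9) is suggested.

$$\boldsymbol{\beta} = \left( \frac{\mathbf{I}}{C} + \mathbf{H}^T\mathbf{H} \right)^{-1} \mathbf{H}^T \mathbf{T} \tag{2.5.9}$$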
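The two regularized least-squares forms, one inverting an $N \times N$ matrix and the other an $L \times L$ matrix (with $L$ the number of hidden neurons), give the same output weights for a given regularization factor and differ only in the size of the matrix inverted. The sketch below checks this numerically on assumed random data; the function names are hypothetical.

```python
import numpy as np

def co_elm_small_n(H, T, C):
    """beta = H^T (I/C + H H^T)^-1 T: inverts an N x N matrix."""
    N = H.shape[0]
    return H.T @ np.linalg.solve(np.eye(N) / C + H @ H.T, T)

def co_elm_large_n(H, T, C):
    """beta = (I/C + H^T H)^-1 H^T T: inverts an L x L matrix."""
    L = H.shape[1]
    return np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)

rng = np.random.default_rng(6)
H = rng.normal(size=(40, 25))      # N = 40 samples, L = 25 hidden neurons
T = rng.normal(size=(40, 3))
beta_a = co_elm_small_n(H, T, C=10.0)
beta_b = co_elm_large_n(H, T, C=10.0)
gap = np.max(np.abs(beta_a - beta_b))
```

In practice one would pick whichever form inverts the smaller matrix: the first when samples are few, the second when samples greatly outnumber hidden neurons.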

Both output layer weight solutions in (2.5.8) and (2.5.9) tend to reach the smallest training error and the smallest norm, and the choice between them is mainly a matter of the computational cost of performing the matrix inverse. The authors also suggested that the number of neurons can simply be set to a large number (e.g. 1000), as the generalization performance of the CO-ELM is not sensitive to the dimensionality of the feature space.

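For the case when the hidden layer mapping $\mathbf{h}(\mathbf{x})$ is unknown, the ELM kernel matrix in (2.5.10) is proposed.

$$\boldsymbol{\Omega}_{ELM} = \mathbf{H}\mathbf{H}^T: \quad \Omega_{ELM\,i,j} = \mathbf{h}(\mathbf{x}_i) \cdot \mathbf{h}(\mathbf{x}_j) = K(\mathbf{x}_i, \mathbf{x}_j) \tag{2.5.10}$$

By inserting the ELM kernel into (2.5.8), the corresponding output function of the CO-ELM classifier can be defined as in (2.5.11), and hence the CO-ELM can be implemented for kernel functions as well.

$$f(\mathbf{x}) = \mathbf{h}(\mathbf{x})\mathbf{H}^T \left( \frac{\mathbf{I}}{C} + \mathbf{H}\mathbf{H}^T \right)^{-1} \mathbf{T} = \begin{bmatrix} K(\mathbf{x}, \mathbf{x}_1) \\ \vdots \\ K(\mathbf{x}, \mathbf{x}_N) \end{bmatrix}^T \left( \frac{\mathbf{I}}{C} + \boldsymbol{\Omega}_{ELM} \right)^{-1} \mathbf{T} \tag{2.5.11}$$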
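The kernel form of the output function can be sketched as follows. This is a minimal illustration under stated assumptions: the Gaussian (RBF) kernel, its width, the regularization factor, the $\pm 1$ target coding, and all function names are choices made for the example, not values taken from [90].

```python
import numpy as np

def rbf_kernel(A, B, width):
    """K(a, b) = exp(-||a - b||^2 / width); the RBF choice is an assumption."""
    sq = (A ** 2).sum(1)[:, None] + (B ** 2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / width)

def kernel_elm_fit(X, T, C, width):
    """Return alpha = (I/C + Omega)^-1 T, with Omega the N x N kernel matrix."""
    omega = rbf_kernel(X, X, width)
    return np.linalg.solve(np.eye(len(X)) / C + omega, T)

def kernel_elm_predict(X_new, X_train, alpha, width):
    """f(x) = [K(x, x_1) ... K(x, x_N)] alpha, one row per new pattern."""
    return rbf_kernel(X_new, X_train, width) @ alpha

rng = np.random.default_rng(7)
X = np.vstack([rng.normal(-1.5, 0.6, (40, 2)), rng.normal(1.5, 0.6, (40, 2))])
labels = np.repeat([0, 1], 40)
T = np.where(np.eye(2)[labels] > 0, 1.0, -1.0)   # +1 for the true class, -1 otherwise
alpha = kernel_elm_fit(X, T, C=100.0, width=2.0)
pred = kernel_elm_predict(X, X, alpha, width=2.0).argmax(axis=1)
train_accuracy = np.mean(pred == labels)
```

No hidden layer weights appear anywhere in this form. The number of neurons is replaced by the kernel parameter as the quantity to be chosen.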


The classification results of the CO-ELM on 12 binary-class datasets and 12 multi-class datasets confirmed that the CO-ELM achieves classification performance comparable to that of the SVM and LS-SVM for binary-class cases, and performs much better for multi-class cases. Overall, the CO-ELM demonstrated much faster training speed than the SVM and LS-SVM.

In summary, the regularization methods are seen as an integral part of the ELM training algorithm, so much so that the CO-ELM solutions in (2.5.8) and (2.5.9) have become the preferred form of the ELM for comparisons with other pattern classification algorithms. The CO-ELM is sometimes directly referred to as the ELM. By proper tuning of the regularization factor or regularization matrix, the ELM can avoid overfitting, thus improving the generalization ability significantly without any noticeable decrease in training speed. However, finding the ideal regularizer values remains a computationally costly task.

2.6 ELM and SVM

Support vector machines have been a popular choice for pattern classification applications because of their good generalization performance and well-established literature [19, 20]. Given a set of sample training patterns $(\mathbf{x}_i, t_i)$, the SVM first maps the input patterns into a high dimensional feature space using some kernel function $K(\mathbf{x}, \mathbf{x}_i) = \phi(\mathbf{x}) \cdot \phi(\mathbf{x}_i)$, and then the hyperplane that maximizes the margin between the two classes is determined using the method of Lagrange multipliers [20]. The outcomes of the optimization procedure are the support vectors that define the boundaries of each class and a decision function that acts as the discriminator. The decision function of the SVM is given in (2.6.1). The convention in [90] is adopted here for readability.

$$f(\mathbf{x}) = \mathrm{sign}\left( \sum_{s=1}^{N_s} \alpha_s t_s K(\mathbf{x}, \mathbf{x}_s) + b \right) \tag{2.6.1}$$

where $\mathbf{x}$ is the sample input pattern, $N_s$ is the number of support vectors $\mathbf{x}_s$, $\alpha_s$ is the Lagrange multiplier corresponding to the target $t_s$ of the support vector $\mathbf{x}_s$, and $b$ is the bias. The SVM was originally developed for binary classification problems and requires specific configurations such as the one-against-one or one-against-all methods to perform multi-class classification.

The SVM decision function in (2.6.1) resembles the ELM decision function in (2.4.6): the kernel function $K(\mathbf{x}, \mathbf{x}_s)$ behaves like the additive type neuron with an activation function that maps the input pattern into feature space, and the terms $\alpha_s t_s$ act like the output layer weights $\boldsymbol{\beta}_i$. However, the bias term $b$ plays different roles in the ELM and the SVM. In the ELM the bias term shifts the points in the feature space, while the bias in the SVM shifts the separating hyperplane away from the origin where necessary. Several detailed investigations have been conducted in [90], [147], [188] to identify the connections between the ELM and SVM.

In [188], Wei et al. performed comparative studies on the learning characteristics of the ELM and SVM. It was found that both the ELM and SVM obtain global solutions, the ELM through least squares and the SVM through quadratic programming. However, the computational load of the SVM is significantly higher when the sample size is large. The number of support vectors generated by the SVM is also usually larger than the number of neurons required by the ELM to achieve similar performance. The decision function of the SVM is therefore more complex than that of the ELM, and the testing time of the ELM is correspondingly much shorter than that of the SVM. In addition, the SVM requires careful selection of the kernel function, kernel parameters, and the regularization constant, while the ELM only needs to determine the optimal number of neurons during training. Taken together, the results in [188] favour the ELM over the SVM.

Liu et al. [189] and Frenay and Verleysen [190] directly applied the ELM

random weights hidden layer function as the so called ELM kernel in SVM. It was

found that the ELM kernel eliminates the need for tedious kernel parameter selection

as only the number of neurons needs to be specified, and this number can be set

sufficiently large (e.g. 1000 neurons) without affecting the classification performance

because of the regularization parameter. The classification results showed good

generalization performance that is comparable to standard SVM kernels, with the

significant advantage that the SVM with ELM kernels was extremely fast.
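
The ELM kernel described above can be illustrated with a short sketch. It assumes a sigmoid additive hidden layer and a kernel defined as the inner product of two hidden layer outputs divided by the number of neurons; the 1/p normalization, the uniform weight range, and the function names are assumptions made for this illustration, not details taken from [189] or [190].

```python
import numpy as np


def elm_feature_map(X, W, b):
    """Sigmoid additive hidden layer h(x) = g(W x + b), applied row-wise."""
    return 1.0 / (1.0 + np.exp(-(X @ W.T + b)))


def elm_kernel(X1, X2, W, b):
    """Kernel matrix K[i, j] = h(x1_i) . h(x2_j) / p, with p hidden neurons."""
    H1 = elm_feature_map(X1, W, b)
    H2 = elm_feature_map(X2, W, b)
    return H1 @ H2.T / W.shape[0]


rng = np.random.default_rng(0)
n_features, n_hidden = 4, 1000
W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_features))  # random input weights
b = rng.uniform(-1.0, 1.0, size=n_hidden)                # random hidden biases
X = rng.normal(size=(6, n_features))

K = elm_kernel(X, X, W, b)   # symmetric, positive semi-definite Gram matrix
```

Because the kernel is an explicit inner product in the random feature space, the resulting Gram matrix is symmetric and positive semi-definite, so it can be handed to any SVM solver that accepts a precomputed kernel matrix. The only quantity to choose is the number of hidden neurons.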

Huang [90] derived the optimization method based solution for the ELM

using a quadratic programming framework similar to that of the SVM. Two differences from

the SVM optimization problem were noted for the ELM: (i) instead of using a

conventional kernel function such as the RBF kernel, the ELM uses random feature

mappings; (ii) the bias is not required in the ELM’s optimization constraints since in

theory the separating hyperplane of the ELM feature space passes through the origin.

The primal and dual Lagrangians were defined for the ELM and the solution to the dual

problem was derived. Using the decision function of the dual optimization with the ELM

kernel in (2.6.2), the so called support vector network for ELM is shown in [90].

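$$ f(\mathbf{x}) = \operatorname{sign}\left( \sum_{s=1}^{N_s} \alpha_s t_s K_{ELM}(\mathbf{x}, \mathbf{x}_s) \right) \tag{2.6.2} $$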

The most obvious difference between the SVM and ELM decision functions

(2.6.1) and (2.6.2) is the missing bias term in the ELM function. The elimination of the

bias term for the optimization method based ELM also correspondingly reduced the

optimization constraints compared to the SVM. As a result of that, the SVM is said to

reach suboptimal solutions within a stricter feature space, compared to the optimization

method based ELM. The experiment results on 13 benchmark binary classification

problems confirmed that the ELM achieves better generalization performance overall.
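
The effect of dropping the bias can be made explicit by writing the two primal problems side by side. The following is a sketch in standard soft-margin notation, which is assumed here and may differ from the notation of [90]: $\phi$ denotes the SVM feature map, $\mathbf{h}$ the ELM hidden layer mapping, $\xi_i$ the slack variables, $C$ the regularization constant, and $N$ the number of training samples.

```latex
\begin{align*}
\text{SVM:}\quad
  & \min_{\mathbf{w},\, b,\, \boldsymbol{\xi}}\;
    \tfrac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{N} \xi_i
  && \text{s.t.}\;\;
    t_i \bigl( \mathbf{w} \cdot \phi(\mathbf{x}_i) + b \bigr) \ge 1 - \xi_i,
    \;\; \xi_i \ge 0, \\
\text{ELM:}\quad
  & \min_{\boldsymbol{\beta},\, \boldsymbol{\xi}}\;
    \tfrac{1}{2}\lVert \boldsymbol{\beta} \rVert^{2} + C \sum_{i=1}^{N} \xi_i
  && \text{s.t.}\;\;
    t_i \, \boldsymbol{\beta} \cdot \mathbf{h}(\mathbf{x}_i) \ge 1 - \xi_i,
    \;\; \xi_i \ge 0 .
\end{align*}
```

In the SVM, setting the derivative of the Lagrangian with respect to $b$ to zero produces the dual equality constraint $\sum_{i} \alpha_i t_i = 0$. Without a bias term this condition does not arise, so the ELM dual retains only the box constraints $0 \le \alpha_i \le C$.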


In [147], Huang et al. further investigated the relationship between ELM and

SVM by comparing the LS-SVM and PSVM with ELM. It was found that both the LS-

SVM and PSVM, which use equality constraints instead of the inequality constraints of conventional SVMs,

actually have an optimization formulation similar to that of the CO-ELM (henceforth

referred to as ELM) discussed in section 2.5. If the bias term is removed from the LS-

SVM and PSVM optimization function, the resultant learning algorithms are unified

with ELM. This is a significant finding because the ELM learning theory allows the

SLFN classifier to achieve comparable or better classification performance when

compared to other state-of-the-art pattern classification algorithms. In addition, the

ELM is suited for multiple applications, including binary classification, multi-class

classification, and regression problems. In terms of computational complexity, the ELM

only has to tune the regularization parameter if the hidden layer feature mapping is known

and the number of neurons is set to a large value (e.g. 1000 neurons). The comparatively

simpler ELM algorithm runs up to thousands of times faster than the SVM and LS-

SVM. Even when the hidden layer feature mapping is not known, kernels can be used in

the ELM. The experiment results further confirmed that the ELM achieves comparable

generalization performance compared to the SVM and LS-SVM for binary classification

problems, and the ELM has much better generalization performance in multi-class cases.
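
The regularized least-squares solution that underlies this unified view can be sketched as below. The sketch assumes a sigmoid hidden layer with uniformly drawn random weights, one-hot targets, and toy data; the function names, the toy clusters, and the parameter values are illustrative assumptions and do not reproduce the experiments of [147].

```python
import numpy as np


def hidden_output(X, W, b):
    """Sigmoid additive hidden layer, one row of H per input pattern."""
    return 1.0 / (1.0 + np.exp(-(X @ W.T + b)))


def train_regularized_elm(X, T, n_hidden=100, C=10.0, seed=0):
    """Random hidden layer, then ridge-regularized least squares for beta."""
    rng = np.random.default_rng(seed)
    W = rng.uniform(-1.0, 1.0, size=(n_hidden, X.shape[1]))
    b = rng.uniform(-1.0, 1.0, size=n_hidden)
    H = hidden_output(X, W, b)
    # beta = (I/C + H^T H)^{-1} H^T T : an L x L system, cheap when N >> L
    beta = np.linalg.solve(np.eye(n_hidden) / C + H.T @ H, H.T @ T)
    return W, b, beta


def predict(X, W, b, beta):
    """Multi-class decision: index of the largest output node."""
    return np.argmax(hidden_output(X, W, b) @ beta, axis=1)


# Toy data: three well-separated 2-D clusters, 30 patterns each.
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [2.0, 2.0], [-2.0, 2.0]])
y = np.repeat(np.arange(3), 30)
X = centers[y] + 0.3 * rng.normal(size=(90, 2))
T = np.eye(3)[y]                      # one-hot targets, one output node per class

W, b, beta = train_regularized_elm(X, T)
train_accuracy = np.mean(predict(X, W, b, beta) == y)

# Equivalent N x N form: beta = H^T (I/C + H H^T)^{-1} T
H = hidden_output(X, W, b)
beta_dual = H.T @ np.linalg.solve(np.eye(len(X)) / 10.0 + H @ H.T, T)
```

The same code handles binary and multi-class problems, since only the number of columns of the target matrix changes. The two closed forms give the same output weights; which one is cheaper depends on whether the number of samples or the number of neurons is larger.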

In summary, the research incorporating the learning theories of the ELM and

SVM can be categorized as in Table 2.1. It can be seen from Table 2.1 that different

approaches have been used to merge the ELM and SVM learning theories, and all the

methods have reported comparable results or some improvements over the conventional

SVM. In terms of the optimization formulation, the ELM and SVM are indeed very

similar, where the maximal margin theory of the SVM is similar to the minimization of

the norm of the output weights in ELM [90], [147]. From the many experiments

comparing the ELM and the SVM, it is seen that the ELM is the preferred algorithm for

handling large scale datasets while the SVM is better when the sample size is small.


Table 2.1 Research incorporating the ELM and SVM

Algorithm   Linked to ELM                        Linked to SVM

ELM         Basic ELM [3]                        Optimization method based ELM [90]

SVM         SVM with ELM kernels [189, 190];     Conventional SVM [20]
            LS-SVM with bias removed [147]

2.7 ELM and RVFL

The ELM is capable of achieving very fast training and good generalization

performance. However, it should be noted that the idea of using random weights and

biases in the SLFN was first suggested in the mid-1990s. Similar to the ELM, the

random vector functional link (RVFL) net proposed by Pao et al. [27-32] also uses

randomly generated weights and biases. The RVFL is a modified version of the

functional link net which initially proposed the use of functional links that are non-

linear, instead of the usual dot product, at the input connections of the SLFN. Then the

authors introduced the use of randomly generated weights and biases in the input layer

as a special case, and the output weights could be calculated by minimizing a quadratic

cost function with methods such as gradient descent.

In terms of network architecture, the main difference in the literature of the ELM

by Huang et al., and the RVFL by Pao et al., is that the RVFL is defined as a “flat”

neural network, which means that it does not have a hidden layer. Hence the inputs of

the RVFL are connected directly to the output layer. However, the inputs of the RVFL

contain the similar hidden layer neuron calculations as the ELM hidden layer with

random weights and biases. Therefore, the effect of the difference in network

architecture on the mathematical formulation of both networks is indeed subtle.

Nevertheless, the ELM is different from the RVFL in terms of the types of acceptable

hidden layer nodes and the batch learning principle of the output layer weights.

The training procedure of the ELM differs from the RVFL in the sense that the

batch learning method using the pseudoinverse is the initial basis for fast learning


proposed for the ELM algorithm, while the RVFL initially suggests the gradient descent

approach. Although the training of the RVFL with the similar pseudoinverse is indeed

possible, the ELM has vast amounts of literature that is solely based on the batch

learning pseudoinverse method. The definition of the possible types of hidden layer

nodes is also more general in the ELM literature, where the hidden layer nodes can be

non-neuron alike, and the network is known as the generalized-SLFNs (including

sigmoid networks, RBF networks, trigonometric networks, threshold networks, fuzzy

inference systems, fully complex neural networks, high-order networks, ridge

polynomial networks, wavelet networks, etc.) [23], [24]. Based on the points compared

herein, the ELM is seen to differ clearly from the conventional neural

network architecture and the traditional iterative training algorithms.
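
The architectural difference discussed in this section can be made concrete with a small sketch. It assumes the common RVFL convention in which the enhanced input pattern concatenates the original inputs with the randomly weighted nodes; this convention, the tanh activation, and the toy target are assumptions for illustration and are not stated in this section. Both networks are solved here with the pseudoinverse so that only the architecture differs.

```python
import numpy as np

rng = np.random.default_rng(0)
N, n_in, n_hidden = 50, 3, 20
X = rng.normal(size=(N, n_in))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1]            # toy regression target

# Random weights and biases shared by both networks (never trained).
W = rng.uniform(-1.0, 1.0, size=(n_hidden, n_in))
b = rng.uniform(-1.0, 1.0, size=n_hidden)
H = np.tanh(X @ W.T + b)                        # random hidden-node outputs

D_elm = H                                       # ELM: output layer sees hidden nodes only
D_rvfl = np.hstack([X, H])                      # RVFL: "flat" net on inputs plus random nodes

beta_elm = np.linalg.pinv(D_elm) @ y            # batch least-squares output weights
beta_rvfl = np.linalg.pinv(D_rvfl) @ y

res_elm = np.linalg.norm(D_elm @ beta_elm - y)
res_rvfl = np.linalg.norm(D_rvfl @ beta_rvfl - y)
```

In this form the two models differ only in the columns of the matrix passed to the least-squares solver, which is why the mathematical formulations are so close. Since the RVFL matrix contains every ELM column, its training residual cannot exceed that of the ELM; this says nothing about generalization.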

2.8 Applications of ELM

The ELM boasts good generalization performance and low computational complexity.

Therefore there have been many applications of the ELM in various fields. This section

gives a brief survey of the applications of ELM, with a focus on medical pattern

classification problems. The following sub-sections discuss the applications of the ELM

in the area of medical systems [66-70], [158-162], image processing [163-165], face

recognition [166], [167], handwriting recognition [71], [191], sales forecasting [168-

170], parameter estimation [171-173], information systems [76], [174-175], and control

systems [176], [177].

2.8.1 Medical Systems

One of the most popular fields of application for the ELM is in medical systems. The

amount of data generated by medical equipment nowadays is very large;

hence fast intelligent systems are required to infer vital information that could aid

medical practitioners. Due to its good classification performance for large scale datasets,

the ELM is often used in computer aided medical diagnosis.

Li et al. [158] proposed a computer aided diagnosis system for thyroid disease

using the ELM. In order to differentiate between the hyperthyroidism, hypothyroidism,


and normal condition of the thyroid, they developed the so called PCA-ELM. The PCA-

ELM uses the principal component analysis (PCA) to first reduce the dimensions of the

patient data before using the ELM to learn the classification model. The PCA-ELM was

compared with the PCA-SVM in the experiments, and it was found that the PCA-ELM

performs much better than the PCA-SVM in terms of classification accuracy and tends

to achieve a smaller standard deviation in the 10 fold cross-validation (CV) tests with

shorter run time.

Malar et al. [159] developed a novel classifier for mammographic

microcalcifications using wavelet analysis and the ELM. Microcalcifications in breast

tissue act as potential indicators of breast cancer, but they are usually hard to identify

because of the dense nature of the breast tissue and the poor contrast of the

mammogram. Selected wavelet texture features extracted from a pool of 120 regions of

interest within a mammogram are used as input patterns to train the ELM. The 10 fold

CV tests with the ELM, Bayes classifiers, and SVM showed that the ELM achieves the

best classification accuracy and sensitivity. The receiver operating characteristic (ROC)

of the ELM also verified its superiority over the other classifiers.

Authors in [160-162] designed ELM based classifiers for electroencephalogram

(EEG) signals classification. The EEG records electrical activity within the brain and is

a very useful method in diagnosing neurological disorders. However, the EEG

recordings are usually taken over a long period of time and are very tedious to

examine by visual inspection. In [160] and [161], the authors developed epileptic

seizure detection systems using the ELM. After passing the EEG signal through the

feature extraction process, the ELM was used to learn the sample pattern sets. Both

studies reported excellent detection results with good sensitivity and specificity. The

authors also agree that the high classification accuracy and low computational cost of

the ELM based systems have great potential to be implemented as real-time detection

systems. In [162], Shi and Lu used the EEG signals to estimate the continuous vigilance

of humans in human machine interaction situations. Techniques for real-time estimation

of operator vigilance are highly desirable in human machine interactions due to work

safety concerns. Detailed experiments comparing the basic ELM, ELM with L1 and L2

norm penalty respectively, and the SVM were conducted to investigate the effectiveness


of the learning algorithms. All the ELM based methods outperformed the SVM in

terms of learning speed. In terms of accuracy, the regularized ELM with

the L2 norm penalty performs better than the SVM, while the other ELM algorithms

have results comparable to those of the SVM.

In [67], the ELM was used to predict the appropriate super-family of proteins for

each new sample protein sequence. Conventionally the new protein sequence would

need to be compared with the individual identified protein sequences, hence the testing

process is very time consuming. The experiment results compared the performances of

the BP training algorithm with the radial basis function ELM and the sigmoidal ELM. It

was found that the ELM training algorithms produced better classification accuracy and

were four orders of magnitude faster than the BP method. In [68], the authors developed

a protein secondary structure prediction framework based on the ELM. The prediction

framework is aimed at obtaining the three-dimensional structure of the protein sequence

in order to determine its functions. The protein secondary structure acts as the

intermediate step in predicting the three-dimensional protein structure. Previous works

in this area are reported to achieve high accuracy but to be very time consuming [68]. The

ELM was combined with a new encoding scheme, a probability-based combining

algorithm, and a helix-postprocessing method in the so called ELM-PBC-HPP. The

experiment results show that the proposed method provides the same high level of

accuracy at much faster training speeds compared with BP and SVM.

In [66], the authors studied the effectiveness of the ELM in the multi-category

classification of 5 bioinformatics datasets from the UCI machine learning repository. In

all the experiments, the ELM was found to outperform the SVM in terms of training

speed and classification accuracy. The maximum performance of the ELM is revealed

to be achieved within a set limit of hidden layer nodes. In addition, the optimal size of

the SLFN trained with the ELM was also consistently more compact compared to

conventional SLFN training methods such as the BP. In [69] and [70], the authors

investigated the multi-category classification of microarray gene expressions for cancer

diagnosis. A total of 3 microarray datasets were examined in [69], which consist of the

GCM dataset, Lung dataset, and Lymphoma dataset. A recursive gene feature

elimination method was used for gene selection. The experiment results showed that the


ELM is more robust compared to the SVM under a varying number of genes selected

for training. Furthermore, it was concluded that as the number of classes becomes large,

the ELM achieves a higher accuracy with less training time and a more compact

network compared to other algorithms. Lastly, the ELM classifier in [70] uses the

ANOVA for the first stage gene importance ranking, followed by the training of the

ELM using the minimum gene subset. The results obtained for the performance of the

ELM were consistent with the findings in [69].

2.8.2 Image Processing

Decherchi et al. [163] proposed the Circular-ELM (C-ELM) for the automated

assessment of image quality on electronic devices. The C-ELM implements the

additional augmented input dimension as seen in the circular backpropagation (CBP)

architecture to the basic ELM. It is seen that the C-ELM outperforms the basic ELM

and the conventional CBP algorithm in converting the visual signals into quality scores.

Wang et al. [164] developed ELM based algorithms to learn the image de-

blurring functions of traditional image filters. It was reported that the newly developed

learned filter-partial differential equations (LF-PDE) de-blurring model overcomes the

limitation of traditional filters in terms of edge protection and allows the user to

customize the learning functions to select certain de-blurring properties.

Yang et al. [165] presented a fingerprint matching system based on the ELM.

The system implements an image pre-processing and feature extraction module to

produce 15-dimensional feature vectors to be classified by the ELM and RELM. It is

seen that the ELM and RELM outperform the traditional BP and SVM in both

classification accuracy and training time. In addition, the fast training and testing times

of the ELM and RELM are said to make them suitable for real-time implementations.

2.8.3 Face Recognition

Face recognition is one of the most important biometric identification methods with

many possible applications. In [166], Zong and Huang investigated the performance of


regularized ELM on 4 popular face databases, namely, YALE, ORL, UMIST, and

BioID. The experiment compared ELM-OAO, ELM-OAA, SVM-OAO, SVM-OAA,

and Nearest Neighbour methods with pre-processing applied to all databases to reduce

the dimensions of the samples. In their experimental results, it can be seen that the ELM

and SVM variants achieve comparable classification accuracy while consistently

performing better than Nearest Neighbour. For most cases, the OAA classifiers are

reported to give better results. The advantage of ELM in this application is in the model

simplicity where only one meta-parameter needs to be tuned after the SLFN is assigned

a constant large number of neurons.

Marques and Grana [167] proposed a face recognition system using the lattice

independent component analysis (LICA) and ELM. The so called LICA-ELM system

performs feature extraction in the first stage and then does learning and classification on

the generated features. The experimental results from testing on the Color FERET

database show that the ELM and the regularized ELM achieve better classification

performance compared to the Random Forest decision tree, SVM, and BP methods. It

can be seen that the regularized ELM performs much better than the ELM for the small

sized dataset.

2.8.4 Handwritten Character Recognition

Handwritten character recognition databases have continuously been used as standard

benchmark tests for developing new learning algorithms. In [72], Chacko et al. used the

ELM, E-ELM, and OS-ELM for the handwritten Malayalam character recognition. The

character images are first pre-processed using the wavelet energy feature extraction

method. The ELM and its variants are then used to learn and classify the sample pattern

vectors. In the experiments, the authors suggested that the Daubechies, symlet and

biorthogonal wavelets give the best performance. In terms of classification accuracy, the

basic ELM is seen to be consistently more efficient than its variants over a range of

wavelet types and levels of decomposition. Comparison of the ELM performance with

previously reported results in the literature showed that the ELM achieved the best

classification accuracy.


The authors in [71] developed new training algorithms for the SLFN using

unique SLFN structural properties and gradient information to improve on the ELM

algorithm. The experiment was conducted using the MNIST handwritten digits database

[82], which consists of the handwritten digits from 0 to 9. Each digit is a gray-scale

image of 28x28 pixels with the intensity range of 0 (black) to 255 (white). A sample set

of the digit images is given in Figure 2.4.

Figure 2.4 A set of handwritten digits from the MNIST database

The experiment results revealed that the proposed weighted accelerated upper-

layer-solution-aware (WA-USA) algorithm is capable of achieving the same accuracy as

the ELM using only 1/16 of the network size and the testing time. However, the WA-

USA requires up to 6 times longer training time compared to the ELM. It was also noted

that the classification performance of the ELM suffers significantly when the number of

hidden nodes is small compared to the newly proposed algorithms.

2.8.5 Sales Forecasting

Sales forecasting or modelling is a very attractive field of study for businesses. It is

important to predict whether a certain product or some fashion style will attract the

interest of customers. Yu et al. proposed an Intelligent Fast Sales Forecasting Model

(IFSFM) for fashion products [168]. Using the attributes of a fashion sales record such

as the colour, size, and price, the sales amount can be forecasted. Conventional neural

network methods such as the BP and statistical models are found to be slow compared

to the IFSFM method which implements the ELM. In terms of forecasting error, the

IFSFM obtains the lowest MSE.

Xia et al. [169] presented an Adaptive metrics of input ELM called the AD-

ELM for fashion retail forecasting. Different from the basic ELM, the AD-ELM uses

the adaptive metrics to avoid dramatic changes at the input of the SLFN in order to

avoid large fluctuations on unseen data samples. The final forecasting scheme employs


several ELMs and averages their results as the final sales amount forecasted. The

experiments on several real world fashion sales datasets show that the AD-ELM and

ELM perform better than the conventional auto-regression and BP based neural

networks models. The AD-ELM significantly outperforms the ELM in all cases.

Chen and Ou [170] developed the Gray ELM (GELM) with Taguchi method for

a sales forecasting system. The Gray relation analysis is used to extract important

features from the raw sales data to be used as inputs to the ELM. The Taguchi method

was applied to find the optimal number of hidden nodes and the type of activation

function to obtain the best results. The experiment implemented the GELM and several

BP based neural networks to predict 120 days of sales data for lunchboxes. It was found

that the GELM achieves the smallest MSE and has much faster training time.

2.8.6 Parameter Estimation

There is a wide array of applications for parameter estimation of system variables and

attributes. In [171], Xu et al. used the ELM for the real-time frequency stability

assessment of electric power systems. Traditional methods of estimating the frequency

stability involved solving a large set of nonlinear differential-algebraic equations, which

can be computationally expensive and time consuming. The authors proposed the ELM

predictor which uses the power system operational parameters as inputs and outputs the

frequency stability margin. The training procedure involves an off-line training process

before inserting the ELM predictor into the real-time system. The fast training speed of

the ELM caused minimal delay, and the high accuracy and small standard deviation

given the randomly generated weights and biases made the system acceptable for

practical use.

In [172], Wu et al. implemented the ELM for wind speed estimation in a wind

turbine power generation system (WTPGS). The estimation of wind speed is vital for

the optimal control of the turbine shaft in order to achieve the maximum power point

tracking. Different from the conventional neural networks based wind speed estimators,

the new scheme is independent of the environmental air density. The ELM uses the real-

time turbine generator information to provide precise estimates of wind speed for the


control system. In addition, the ELM was also implemented in the pitch controller of the

wind turbine to replace the conventional linear controller. Both the experimental and

simulation results showed greatly improved performance compared to conventional

RBF neural networks and PID controllers. It is seen that the ELM based estimation

scheme can provide almost optimal control, clearly outperforming the conventional

RBF and PID models, thus producing the best power generation capabilities among all

the methods.

In [173], Tang et al. proposed the partial least squares-optimized ELM (PLS-

OELM) for mill load prediction. The PLS algorithm was used to extract frequency

spectrum features from the mill shell vibrations and the latent features are then learned

by the OELM for predicting the mill load. The experiments were conducted using a

laboratory sized ball mill (XMQL-420450) with an accelerometer attached to the middle

of the shell. Comparing the fitting and predictive performances of the PLS-OELM with

Gaussian kernel with PLS, PLS-ELM, PLS-BP, and PCA-SVM, it can be seen that the

PLS-OELM achieves better results than all the other algorithms except the PCA-SVM,

under the constraints of small sample size and high dimensionality.

2.8.7 Information Systems

With the growth in the use of digital systems to store information for easy access

globally, there is an increasing demand for intelligent systems to sort and extract

relevant data in large scale digital databases.

Wang et al. [174] combined the OS-ELM with intuitionistic fuzzy sets for

predicting consumer sentiments from online reviews. They focused on the Chinese

character reviews of 3 databases using single classifier and ensemble methods. The

reviews can be categorized into positive, negative or neutral. For the single classifier

methods, the OS-ELM and ELM achieve results comparable to the SVM and

perform much better than the Naïve Bayesian classifier. However, it is worth noting

that the OS-ELM and ELM have much smaller standard deviations and much faster

training speed. In the ensemble experiments, the authors focused on finding the best

multi-classifier output fusion technique. The experimental results show that the


conventional mathematical averaging methods perform better than the accuracy or

norms of weights weighting schemes.

Zheng et al. [175] used the RELM for text categorization. They introduced a

three stage framework for the classification of text. The latent semantic analysis was

first used to reduce the dimensionality of the input patterns. Then the semantic features

were used to train the RELM classifier. Finally the RELM was evaluated in single label

(WebKB database) and multi label (Reuters-21578) text categorization cases against

other popular machine learning algorithms. The classification results indicate that the

RELM performs better than the ELM and BP in most cases and is comparable to the

SVM. However, it was confirmed that the RELM and ELM achieve much faster

training and classification speed. The authors suggested that the RBF function and

triangulation basis function be chosen in the SLFN for text categorization applications.

Zhao et al. [76] proposed an XML document classification scheme using the

ELM. The reduced structured vector space model was first used to

generate a feature vector for each XML sample. Then the ELM and a newly developed

voting-ELM were used to learn and classify the sample patterns. The framework of the

voting-ELM is based on the OAO multi-class classification method and voting theory.

Therefore in the training and testing phases, the voting-ELM requires more time

compared to the single ELM classifier. In the simulation of 10 datasets, the voting-ELM

consistently outperforms the single ELM classifier in terms of classification


2.8.8 Control Systems

Neural networks have found many applications in adaptive control schemes that provide

fast responses to changes in the target system. In [176], Rong and Zhao used the ELM

to develop a direct adaptive neural controller for nonlinear systems. The control

framework consists of an ELM-based neural controller and a sliding mode controller.

The ELM-based neural controller model is compensated by the sliding mode controller

to reduce the effects of modelling error and system disturbances. In addition, the output

layer weights of the ELM are updated using the stable adaptive laws derived from a


Lyapunov function instead of the typical pseudo-inverse. The control scheme

guarantees the stability and convergence of the nonlinear system. In the experiments,

the inverted pendulum was tested using 2 trajectories, with the basic ELM and the ELM

with Lyapunov based solution. It was found that only the ELM with Lyapunov based

solution consistently converges to the reference asymptotically. Furthermore, the newly

proposed control framework produces similar results for both sigmoid and RBF nodes

and the control signals avoid the chattering problem.

In [177], Yang et al. developed a neural networks based self-learning control

strategy for power transmission line de-icing robots. The control of these robots has

been hard due to the multiple nonlinearities, plant parameter variations, and external

disturbances. The proposed control framework consists of a fuzzy neural network

controller and an OS-ELM identifier. It can be seen that the fuzzy neural network

controllers are updated adaptively while using the OS-ELM to model the time-varying

plant dynamics and plant parameter variations. From the simulation results, the tracking

error of the new control scheme is seen to be much better than the conventional control

strategies such as the PD controller. The OS-ELM is also seen to perform better than

other online sequential learning schemes such as the RAN, MRAN, and BP. The

experimental results on the actual robot then confirmed that the new control strategy has

fast transient response and good accuracy.

There are still a large number of ELM applications which are not able to be

included in this brief survey. However, the general perceptions of the performance and

the learning characteristics of the ELM compared to current state-of-the-art machine

learning techniques can be summarized as follows:

(i) ELM based algorithms are much faster in training and testing, requiring fewer

computational resources.

(ii) ELM based algorithms are simpler to implement, having fewer meta-parameters

to tune.

(iii) ELM based algorithms tend to achieve comparable results to SVM in binary

cases but perform better than SVM in multi-class cases.


(iv) The additive type of neuron with sigmoidal activation function and the RBF

node are sufficient for the ELM in most applications.

(v) Regularized versions of ELM are insensitive to the number of neurons; hence

the SLFN is usually initialized with a large number of neurons, e.g. 1000.

(vi) Regularization parameters of the regularized ELM still require tedious search

algorithms to tune.

2.9 Conclusion

This chapter has thoroughly reviewed the SLFN based training algorithm referred to as

the ELM. It is seen that the ELM is a highly effective machine learning algorithm suited

for both binary and multi-class pattern classifications. The ELM is able to learn up to

thousands of times faster than the SVM and demonstrates comparable or better

generalization performance. The regularized ELM extensions have an optimization

formulation similar to that of the SVM, and in some cases, the ELM achieves

the unified solution with SVM. In short, the ELM is a promising emerging technology

that is gaining more and more interest from researchers due to its simplistic but effective




Chapter 3

Finite Impulse Response

Extreme Learning Machine

A robust training algorithm for a class of single hidden layer feedforward neural

networks (SLFNs) with linear nodes and an input tapped-delay-line memory is

developed in this chapter. In order to remove the effects of the input

disturbances and reduce both the structural and empirical risks of the SLFN, the input

weights of the SLFN are assigned such that the hidden layer of the SLFN performs as a

pre-processor, and the output weights are then trained to minimize the weighted sum of

the output error squares as well as the weighted sum of the output weights squares. The

performance of an SLFN-based signal classifier trained with the proposed robust

algorithm is studied in the experiments section to show the effectiveness and efficiency

of the new scheme.

3.1 Introduction

The applications of single hidden layer feedforward neural networks (SLFNs) have been

receiving a great deal of attention in many engineering disciplines. As shown in [66-72],

[76], [158-177], by properly choosing the number of nodes in both the hidden layer and

the output layer and training the input and the output weights, one may use SLFNs for

function approximation, digital signal and image processing, complex system modeling,

adaptive control, data classification and information retrieval. In practical applications,

the techniques for training the weights of SLFNs are very important in order to


guarantee the good performance of SLFNs. The most popular training technique used

for SLFNs is the gradient-based backpropagation (BP) algorithm [1], [8]. It has been

seen that the BP can be easily implemented from the output layer to the hidden layer of

SLFNs in real-time. However, the slow convergence has limited the BP in many

practical applications where fast on-line training is required. In addition, the sensitivity

of the SLFNs, trained using the BP, with respect to the input disturbances and the large

spread of data is another important issue that needs to be further studied by the

researchers and engineers in neural computing.

In [3], [18], [147] a learning algorithm called extreme learning machine (ELM)

for SLFNs is proposed, where the input weights and the hidden layer biases of an SLFN

are randomly assigned, the SLFN is then simply treated as a linear network and the

output weights of the SLFN are then computed by using the generalized inverse of the

hidden layer output matrix. It has been noted that the ELM has an extremely fast

learning speed and produces good performance in many cases. However, poor robustness has been observed when SLFNs trained with the ELM are used in signal processing to handle noisy data. For instance, because the input weights and the hidden layer biases of an SLFN are randomly assigned, input disturbances can cause very large changes in the hidden layer output matrix, which in turn cause significant changes in the output weight matrix of the SLFN.

In [89] and [90], two modified ELM algorithms are proposed, where the cost

function consists of the sum of the weighted error squares and the sum of the weighted

output weight squares. In terms of the optimization of the cost function in the output

weight space and the proper choice of the weights of the error squares, the structural and

the empirical risks are balanced and reduced. However, the structural and the empirical

risks are not significantly reduced and the robustness property of the trained SLFN is

not significantly improved because of the random assignment of both the input weights

and the hidden layer biases. According to the statistical learning theory [86], [100-105],

significant changes in the output weight matrix will largely increase both the structural

risk and empirical risk of the SLFNs. Therefore, in order to substantially reduce the

structural and empirical risks and improve the robustness property of SLFNs with


respect to the input disturbances, the proper choice of the input weights of SLFNs is

absolutely necessary.

In this chapter, a new robust training algorithm is proposed for a class of SLFNs,

with both linear nodes and an input tapped-delay-line memory, for signal processing

purposes. Since the output of each linear hidden node in the SLFN is the sum of the

weighted input data, each node can be treated as a finite-impulse-response (FIR) filter.

Therefore, the hidden layer with linear nodes can be designed as the pre-processor of

the input data. For instance, based on the FIR filter design techniques in signal

processing [106-109], it is possible to design the hidden layer as a group of low-pass

filters or high-pass filters or band-pass filters or band-stop filters or other types of filters

for the purpose of the pre-processing of the input data with disturbances and undesired

frequency components. The advantages of the hidden layer's pre-processing function are that not only can the input disturbances and the undesired frequency components be removed, but both the structural and empirical risks of the SLFNs can also be greatly reduced from the viewpoint of the output of the SLFNs.

For the design of the output weight matrix of the SLFNs, in this chapter, an

objective function which includes both the weighted sum of the output error squares and

the weighted sum of the output weight squares of the SLFNs is chosen [1], [89], [90],

[110-112]. By minimizing this objective function in the output weight space, as well as

the proper choice of the input weights based on the FIR filter design techniques, both

the structural and empirical risks can be balanced and reduced for signal processing

purposes. For the comparison with the ELM and the modified ELM algorithms in [89]

and [90], the new training scheme to be developed in this chapter is referred to as the

FIR-ELM algorithm. According to the ELM theory, the hidden nodes used in SLFNs need not be neuron-like. It is therefore convenient to refer to the linear hidden nodes, whose input weights are designed with the FIR filtering techniques in this chapter, as FIR nodes, which are one type of the many possible hidden nodes mentioned in [3], [23-25], [90], [93], [94].

It should be emphasized that the SLFNs considered in [3], [23-25], [89], [90],

[93], [94] use the non-linear hidden nodes and the linear output nodes without dynamics


and, with the proper choice of the output weights, the SLFNs can uniformly

approximate non-linear input-output mappings. However, the class of SLFNs

considered in this chapter use both the linear hidden nodes and the linear output nodes.

In order to make such SLFNs have universal approximation capability, an input tapped-

delay-line memory is added to the input layer. It has been shown in [113] that the SLFN

with linear (or non-linear) nodes, as well as an input tapped-delay-line memory, is

capable of approximating the maps that are causal, time invariant and satisfy certain

continuity and approximately-finite-memory conditions. Since, in many cases of signal

processing, the input and output data have some dynamic relationships, it is thus

convenient to train the SLFNs with linear nodes as well as an input tapped-delay-line

memory to perform as signal processors.

The rest of the chapter is organized as follows: In Section 3.2, a class of SLFNs,

with linear nodes and an input tapped-delay-line memory, as signal classifiers are

formulated, and the issues on the empirical and the structural risks, as well as the

robustness property of the SLFNs with respect to the input disturbances, trained with

the ELM algorithm and the modified ELM algorithm in [3], [23-25], [89], [90], [93],

[94], are studied. In Section 3.3, the design of the input weights using FIR filtering

technique, for reducing both the empirical and structural risks, improving the robustness

of the SLFNs with respect to the input disturbances, and removing some undesired

frequency components is presented. In Section 3.4, the design of the output weights by

the minimization of the weighted sum of the output error squares as well as the

weighted sum of the output weight squares of the SLFNs is discussed in detail. In

Section 3.5, the SLFN-based signal classifiers, trained with the ELM in [3], [23-25],

[93], [94], the modified ELM in [89] and the FIR-ELM developed in this chapter are

simulated and compared in order to show the effectiveness of the proposed FIR-ELM

algorithm for signal processing. Section 3.6 gives the conclusions and some further work.


3.2 Problem Formulation

The architecture of a class of SLFNs with linear hidden nodes and an input tapped-delay-line memory is presented in Figure 3.1, where the output layer has $m$ linear nodes, the hidden layer has $\tilde{N}$ linear nodes, D is the unit-delay element, and the $n-1$ time-delay elements added to the input of the neural network form the tapped-delay-line memory, which indicates that the input sequence $x(k), x(k-1), \dots, x(k-n+1)$ represents a time series consisting of the present observation $x(k)$ and the past $n-1$ observations of the process.

Figure 3.1 A single hidden layer neural network with linear nodes

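From Figure 3.1, the input data vector $\mathbf{x}(k)$ and the output data vector $\mathbf{o}(k)$ can be expressed as follows:

$$\mathbf{x}(k) = [x(k) \;\; x(k-1) \;\; \cdots \;\; x(k-n+1)]^T \tag{3.2.1}$$

$$\mathbf{o}(k) = [o_1(k) \;\; o_2(k) \;\; \cdots \;\; o_m(k)]^T \tag{3.2.2}$$

the output of the $i$th hidden neuron is computed as:

$$y_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) = \mathbf{w}_i^T \mathbf{x}(k) \tag{3.2.3}$$

with

$$\mathbf{w}_i = [w_{i1} \;\; w_{i2} \;\; \cdots \;\; w_{in}]^T \tag{3.2.4}$$

and the $i$th output of the neural network, $o_i(k)$, is of the form:

$$o_i(k) = \sum_{p=1}^{\tilde{N}} \beta_{pi}\, y_p(k) \tag{3.2.5}$$

Thus, the output data vector $\mathbf{o}(k)$ can be expressed as:

$$\mathbf{o}(k) = \sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i\, y_i(k) = \sum_{i=1}^{\tilde{N}} \boldsymbol{\beta}_i\, \mathbf{w}_i^T \mathbf{x}(k) \tag{3.2.6}$$

with

$$\boldsymbol{\beta}_i = [\beta_{i1} \;\; \beta_{i2} \;\; \cdots \;\; \beta_{im}]^T \tag{3.2.7}$$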

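In this chapter, $N$ distinct sample signal data vector pairs $(\mathbf{x}_i, \mathbf{t}_i)$ are used to train the SLFN given in Figure 3.1, where $\mathbf{x}_i = [x_{i1} \;\; x_{i2} \;\; \cdots \;\; x_{in}]^T$ and $\mathbf{t}_i = [t_{i1} \;\; t_{i2} \;\; \cdots \;\; t_{im}]^T$, for $i = 1, 2, \dots, N$, are the desired input and output training data vectors, respectively. For the $i$th input data vector $\mathbf{x}_i$, the corresponding neural output vector $\mathbf{o}_i$ can be expressed as:

$$\mathbf{o}_i = \sum_{p=1}^{\tilde{N}} \boldsymbol{\beta}_p\, \mathbf{w}_p^T \mathbf{x}_i \tag{3.2.8}$$

and all $N$ equations can then be written in the following matrix form:

$$\mathbf{H}\boldsymbol{\beta} = \mathbf{O} \tag{3.2.9}$$

where

$$\mathbf{H} = \begin{bmatrix} \mathbf{w}_1^T\mathbf{x}_1 & \cdots & \mathbf{w}_{\tilde{N}}^T\mathbf{x}_1 \\ \vdots & & \vdots \\ \mathbf{w}_1^T\mathbf{x}_N & \cdots & \mathbf{w}_{\tilde{N}}^T\mathbf{x}_N \end{bmatrix}_{N \times \tilde{N}} \tag{3.2.10}$$

$$\boldsymbol{\beta} = \begin{bmatrix} \boldsymbol{\beta}_1^T \\ \vdots \\ \boldsymbol{\beta}_{\tilde{N}}^T \end{bmatrix}_{\tilde{N} \times m} \quad \text{and} \quad \mathbf{O} = \begin{bmatrix} \mathbf{o}_1^T \\ \vdots \\ \mathbf{o}_N^T \end{bmatrix}_{N \times m} \tag{3.2.11}$$

Matrix $\mathbf{H}$ is called the hidden layer output matrix [3], and the $i$th column of $\mathbf{H}$ is the $i$th hidden neuron output corresponding to the input vectors $\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_N$.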
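The construction of the tapped-delay-line input vectors and of the hidden layer output matrix can be sketched in Python with NumPy. This is a minimal sketch rather than the implementation used in this thesis: the helper names `tapped_delay_vector` and `hidden_layer_output`, the zero initial conditions of the delay line, the random weights and the toy dimensions are all assumptions made for illustration.

```python
import numpy as np

def tapped_delay_vector(x, k, n):
    """Return [x(k), x(k-1), ..., x(k-n+1)].
    Samples before time 0 are taken as zero (an assumption of this sketch)."""
    return np.array([x[k - j] if k - j >= 0 else 0.0 for j in range(n)])

def hidden_layer_output(X, W):
    """X: (N, n), rows are the input vectors x_i.
    W: (N_tilde, n), rows are the input weight vectors w_p.
    Returns H of shape (N, N_tilde) with H[i, p] = w_p^T x_i."""
    return X @ W.T

rng = np.random.default_rng(0)
n, n_hidden, N = 4, 3, 5            # toy sizes (hypothetical)
signal = rng.standard_normal(20)    # stand-in for a measured time series

# One input vector per time instant k = 10, ..., 14.
X = np.stack([tapped_delay_vector(signal, k, n) for k in range(10, 10 + N)])
W = rng.standard_normal((n_hidden, n))
H = hidden_layer_output(X, W)
```

Each column of `H` collects the outputs of one linear hidden node over all input vectors, which is the property stated for the hidden layer output matrix.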

Remark 3.2.1: It is seen from [3], [23-25], [93], [94] that, as the ELM is used to train an SLFN, the input weights and the biases of the hidden layer of the SLFN are randomly assigned. The SLFN is then treated as a linear network and the output weight matrix of the SLFN is computed using the generalized inverse of the hidden layer output matrix as

$$\boldsymbol{\beta} = \mathbf{H}^{\dagger}\mathbf{T} \tag{3.2.12}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of the matrix $\mathbf{H}$, and $\mathbf{T}$ is the desired output data matrix, expressed as:

$$\mathbf{T} = \begin{bmatrix} \mathbf{t}_1^T \\ \vdots \\ \mathbf{t}_N^T \end{bmatrix}_{N \times m} \tag{3.2.13}$$
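The ELM computation described in this remark, random input weights followed by a generalized-inverse solution for the output weights, can be illustrated as follows. The sketch assumes linear hidden nodes without biases, uniformly distributed random input weights and synthetic data; none of these choices is taken from the experiments of this chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
N, n, n_hidden, m = 50, 8, 5, 1     # toy sizes (hypothetical)

X = rng.standard_normal((N, n))     # rows: input vectors x_i
T = rng.standard_normal((N, m))     # rows: desired outputs t_i

# ELM: the input weights are drawn at random and never trained.
W = rng.uniform(-1.0, 1.0, (n_hidden, n))

H = X @ W.T                         # hidden layer output matrix
beta = np.linalg.pinv(H) @ T        # output weights from the Moore-Penrose inverse
O = H @ beta                        # network outputs on the training inputs
```

Because the pseudo-inverse gives the least squares solution, the training residual `O - T` is orthogonal to the columns of `H`. Nothing in this computation accounts for how `beta` reacts to a perturbation of `H`.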

However, it has been noted that, when the input weights of an SLFN are

randomly chosen, both the empirical risk and the structural risk of the SLFN are greatly

increased. This undesired characteristic of the SLFN with the ELM can be clearly seen

from Figure 3.2 and Figure 3.3, respectively, where a single output SLFN with 5 hidden

neurons, trained with the ELM, is used to approximate a straight line within

the range . As the signal , disturbed by a random noise ( )

( ) , is input to the trained SLFN, the error between the output of the neural

network and the straight line is very large sometimes. Figure 3.4 and Figure

3.5 show the simulation results where the SLFN is trained with the modified ELM in

[89]. It is seen that, although the performance is improved a little bit because of the

balance of the structural risk and the empirical risk seen from the output of the SLFN,

the error between the output of the neural network and the straight line is


not significantly reduced because of the random choice of the input weights of the SLFN.


Figure 3.2 Output of the SLFN with the ELM

Figure 3.3 Output error of the SLFN with the ELM

Figure 3.4 Output of the SLFN with the modified ELM

Figure 3.5 Output error of the SLFN with the modified ELM

In contrast, when the input weights of the SLFN in Figure 3.1 are assigned in such a way that each hidden node performs as a linear-phase low-pass FIR filter and the output weights are computed based on the modified ELM in [89], with the same training data as in Figure 3.2 and Figure 3.3 (or Figure 3.4 and Figure 3.5), the robustness of the SLFN with respect to the input disturbances is greatly improved and, at the same time, the structural risk and the empirical risk are well balanced and reduced, as seen in Figure 3.6 and Figure 3.7, respectively.


Figure 3.6 Output of the SLFN with the FIR hidden nodes

Figure 3.7 Output error of the SLFN with the FIR hidden nodes

Therefore, it is necessary to properly assign the input weights in order to

improve the robustness property with respect to disturbances and reduce both structural

and empirical risks of SLFNs.

In the following sections, a new robust training algorithm for a class of SLFNs

in Figure 3.1 will be developed. The weight training of the SLFN is divided into two

steps: First, the input weights of the SLFN are designed off-line in the sense that every

hidden node performs as a linear FIR filter and the whole hidden layer plays the role of

a pre-processor of input data to remove the effects of the input disturbances and

significantly reduce the structural and empirical risks. Then the output weights of the

SLFN are designed to minimize the output error of the SLFN and further balance and

reduce the effects of the empirical and the structural risks.

3.3 Design of the Robust Input Weights of SLFNs

By rewriting (3.2.3), the output of the $i$th hidden node can be expressed as follows:

$$y_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) = \mathbf{w}_i^T \mathbf{x}(k) \tag{3.3.1}$$

It is seen that (3.3.1) has the typical structure of an FIR filter [106-109], where the input weight set $\{w_{ij}\}$ can be treated as the set of the filter coefficients or the impulse response coefficients of the filter, the output $y_i(k)$ is the result of the convolution sum of the filter impulse response and the input (time series) data, and the filter length is equal to the number of the input data of the neural network. According to the signal

processing theory in [106-109], if the elements of the input weight vector $\mathbf{w}_i$ are chosen to be positive and symmetrical, that is, $w_{ij} > 0$ and

$$w_{ij} = w_{i(n-j+1)}, \quad j = 1, 2, \dots, n \tag{3.3.2}$$

(3.3.1) is a non-recursive linear phase FIR filter, which has the advantages that all outputs of the hidden nodes are stable because of the absence of the poles, and the finite-precision errors are less severe than in other filter types.
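The equivalence between a linear hidden node fed by the tapped-delay line and an FIR filter can be checked numerically. The following sketch uses random weights and a random input series (both assumptions) and compares a direct evaluation of the weighted sum of delayed inputs with the convolution sum computed by `numpy.convolve`.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
w = rng.standard_normal(n)      # input weights w_i1, ..., w_in of one hidden node
x = rng.standard_normal(30)     # input time series x(0), x(1), ...

# Direct evaluation of the hidden node output for k >= n-1,
# where all delayed samples x(k), ..., x(k-n+1) exist.
y_direct = np.array([
    sum(w[j] * x[k - j] for j in range(n))   # zero-based j corresponds to w_{i,j+1}
    for k in range(n - 1, len(x))
])

# The same values as a convolution sum; 'valid' keeps only fully overlapped outputs.
y_conv = np.convolve(x, w, mode="valid")
```

The two results agree sample by sample, so any FIR design method that returns a coefficient vector can be used directly as an input weight vector.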

Remark 3.3.1: Without loss of generality, only the design of each hidden node as a low-pass filter is considered in the following. Similar design methods from signal processing [106-109] can be used for designing the hidden nodes as high-pass, band-pass, band-stop or other types of filters for the purpose of pre-processing the input data to remove the effects of the input disturbances and the undesired frequency components.

Remark 3.3.2: For practical application, Matlab can be used to develop a look-up table

containing all possible sets of input weights of the SLFN with the characteristics of low-

pass filter, high-pass filter, band-pass filter and other specified filters, respectively.

Based on some observation and understanding of the frequency-spectrum of the input

data, it is then possible to determine what frequency components should be eliminated

or retained, and then a proper set of parameters can be chosen from the look-up table

and assigned to the input weights of the SLFN.

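Suppose that the desired frequency response of the $i$th hidden node of the SLFN can be represented by the following discrete-time Fourier transform (DTFT):

$$H_{id}(\omega) = \sum_{k=-\infty}^{\infty} h_{id}[k]\, e^{-j\omega k} \tag{3.3.3}$$

where $h_{id}[k]$ is the corresponding impulse response in the time domain, which can be expressed as:

$$h_{id}[k] = \frac{1}{2\pi}\int_{-\pi}^{\pi} H_{id}(\omega)\, e^{j\omega k}\, d\omega \tag{3.3.4}$$

It is noted that the unit sample response $h_{id}[k]$ in (3.3.4) is infinite in duration and must be truncated at the point $k = n-1$, in order to yield an FIR filter of length $n$. Truncation of $h_{id}[k]$ to a length $n$ is equivalent to multiplying $h_{id}[k]$ by a rectangular window $w_r[k]$ defined as:

$$w_r[k] = \begin{cases} 1, & k = 0, 1, \dots, n-1 \\ 0, & \text{otherwise} \end{cases} \tag{3.3.5}$$

Since the Fourier transform of $w_r[k]$ is given by

$$W_r(\omega) = \sum_{k=0}^{n-1} w_r[k]\, e^{-j\omega k} = e^{-j\omega(n-1)/2}\, \frac{\sin(\omega n/2)}{\sin(\omega/2)} \tag{3.3.6}$$

the frequency response of the truncated FIR filter can then be computed by using the following convolution:

$$H_i(\omega) = \frac{1}{2\pi}\int_{-\pi}^{\pi} H_{id}(\theta)\, W_r(\omega - \theta)\, d\theta \tag{3.3.7}$$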

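If it is desired that the $i$th hidden node in the SLFN performs as a low-pass filter with the following desired frequency response:

$$H_{id}(\omega) = \begin{cases} e^{-j\omega(n-1)/2}, & |\omega| \le \omega_c \\ 0, & \omega_c < |\omega| \le \pi \end{cases} \tag{3.3.8}$$

where $\omega_c$ is the cut-off frequency of the low-pass filter to separate the low frequency pass-band and the high frequency stop-band, the impulse response of the truncated low-pass filter can be obtained using (3.3.7) as follows:

$$h_i[k] = \frac{1}{2\pi}\int_{-\omega_c}^{\omega_c} e^{-j\omega(n-1)/2}\, e^{j\omega k}\, d\omega = \frac{\sin\left[\omega_c\left(k - (n-1)/2\right)\right]}{\pi\left(k - (n-1)/2\right)}, \quad 0 \le k \le n-1 \tag{3.3.9}$$

It is easy to see from (3.3.9) that

$$h_i[k] = h_i[n-1-k] \tag{3.3.10}$$

Then, the weights for the $i$th hidden node can be obtained as follows:

$$w_{i1} = h_i[0], \quad w_{i2} = h_i[1], \quad \dots, \quad w_{in} = h_i[n-1] \tag{3.3.11}$$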
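The rectangular-window low-pass design described above can be sketched in a few lines. The function name `lowpass_fir_weights`, the filter length and the cut-off frequency are hypothetical choices for illustration; the formula used is the truncated ideal low-pass impulse response, $\sin[\omega_c(k-(n-1)/2)] / [\pi(k-(n-1)/2)]$ for $k = 0, \dots, n-1$, with $\omega_c$ in radians per sample.

```python
import numpy as np

def lowpass_fir_weights(n, omega_c):
    """Rectangular-window low-pass design of length n.
    omega_c: cut-off frequency in radians per sample, 0 < omega_c < pi."""
    t = np.arange(n) - (n - 1) / 2.0
    # np.sinc(x) = sin(pi x)/(pi x), so this equals sin(omega_c t)/(pi t)
    # and evaluates the limit omega_c/pi safely at t = 0.
    return (omega_c / np.pi) * np.sinc(omega_c * t / np.pi)

def magnitude_response(h, omega):
    """|H(omega)| of the FIR filter h at a single frequency."""
    k = np.arange(len(h))
    return np.abs(np.sum(h * np.exp(-1j * omega * k)))

n, omega_c = 21, 0.3 * np.pi              # illustrative values only
w_i = lowpass_fir_weights(n, omega_c)     # input weights of one hidden node

gain_pass = magnitude_response(w_i, 0.05 * np.pi)   # inside the pass-band
gain_stop = magnitude_response(w_i, 0.80 * np.pi)   # inside the stop-band
```

The returned coefficients are symmetric about the centre tap, so the node has linear phase, and the gain is close to one in the pass-band and small in the stop-band.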


Remark 3.3.3: In the above, it has been shown how the rectangular window method can be used to design the input weights so that the hidden layer of the SLFN performs as a pre-processor that filters away the high frequency disturbances added to the input data and significantly reduces both the structural and empirical risks of the SLFN. In fact, many other window methods in signal processing, such as the Kaiser method, the Hamming method and the Bartlett method, can also be used for designing the input weights so that the hidden layer functions as the pre-processor of the input data for the purpose of removing the input disturbance.
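As an illustration of this remark, the sketch below multiplies the same truncated ideal low-pass response by different windows available in NumPy and compares the peak stop-band gain. The filter length, cut-off frequency, stop-band edge and Kaiser shape parameter are arbitrary assumptions, and the helper names are hypothetical.

```python
import numpy as np

def windowed_lowpass(n, omega_c, window):
    """Truncated ideal low-pass response multiplied by a length-n window."""
    t = np.arange(n) - (n - 1) / 2.0
    ideal = (omega_c / np.pi) * np.sinc(omega_c * t / np.pi)
    return ideal * window

def peak_stopband_gain(h, omega_start=0.5 * np.pi):
    """Largest |H(omega)| on a grid from omega_start to pi."""
    omegas = np.linspace(omega_start, np.pi, 512)
    k = np.arange(len(h))
    return np.max(np.abs(np.exp(-1j * np.outer(omegas, k)) @ h))

n, omega_c = 41, 0.3 * np.pi    # illustrative values only
designs = {
    "rectangular": windowed_lowpass(n, omega_c, np.ones(n)),
    "hamming": windowed_lowpass(n, omega_c, np.hamming(n)),
    "bartlett": windowed_lowpass(n, omega_c, np.bartlett(n)),
    "kaiser": windowed_lowpass(n, omega_c, np.kaiser(n, 6.0)),  # beta = 6.0 chosen arbitrarily
}
stopband = {name: peak_stopband_gain(h) for name, h in designs.items()}
```

All four designs remain symmetric, hence linear phase. In this setting the tapered windows trade a wider transition band for a lower stop-band gain than the rectangular window.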

Remark 3.3.4: In order to explain why the proper choice of the input weights can significantly reduce both the structural and the empirical risks of the SLFN, the optimization of the output weight matrix $\boldsymbol{\beta}$ of the SLFN in Figure 3.1 is first examined, where $\boldsymbol{\beta}$ is computed using the standard least squares method with the estimate given in (3.2.12). It is clearly seen that, in the optimization process for determining the optimal value of $\boldsymbol{\beta}$ in the output weight parameter space, the sensitivity of $\boldsymbol{\beta}$ with respect to the change of the hidden layer output matrix $\mathbf{H}$ is not considered. In fact, a very sensitive $\boldsymbol{\beta}$ may bring a very high structural risk to the SLFN when the input data are disturbed by various noises.

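For further analysis, let $\Delta\mathbf{H}$ represent the change of the hidden layer output matrix $\mathbf{H}$, caused by some input disturbance or noise, and let the corresponding change of the output weight matrix $\boldsymbol{\beta}$ be represented by $\Delta\boldsymbol{\beta}$. Then, from (3.2.9), the equation below is obtained

$$(\mathbf{H} + \Delta\mathbf{H})(\boldsymbol{\beta} + \Delta\boldsymbol{\beta}) = \mathbf{T} \tag{3.3.12}$$

Multiplying out the left hand side of (3.3.12) and subtracting $\mathbf{H}\boldsymbol{\beta} = \mathbf{T}$ on both sides, the equation becomes

$$\mathbf{H}\Delta\boldsymbol{\beta} + \Delta\mathbf{H}(\boldsymbol{\beta} + \Delta\boldsymbol{\beta}) = \mathbf{0} \tag{3.3.13}$$

and then

$$\Delta\boldsymbol{\beta} = -\mathbf{H}^{\dagger}\left(\Delta\mathbf{H}(\boldsymbol{\beta} + \Delta\boldsymbol{\beta})\right) \tag{3.3.14}$$

where $\mathbf{H}^{\dagger}$ is the Moore-Penrose generalized inverse of matrix $\mathbf{H}$.

From (3.3.14), the following inequality can be obtained:

$$\|\Delta\boldsymbol{\beta}\| \le \|\mathbf{H}^{\dagger}\|\,\|\Delta\mathbf{H}\|\,\|\boldsymbol{\beta} + \Delta\boldsymbol{\beta}\| \tag{3.3.15}$$

and the sensitivity of the output weight matrix can be obtained as follows:

$$\frac{\|\Delta\boldsymbol{\beta}\|}{\|\boldsymbol{\beta} + \Delta\boldsymbol{\beta}\|} \le \|\mathbf{H}^{\dagger}\|\,\|\Delta\mathbf{H}\| = \|\mathbf{H}^{\dagger}\|\,\|\mathbf{H}\|\,\frac{\|\Delta\mathbf{H}\|}{\|\mathbf{H}\|} \tag{3.3.16}$$

Similar to the definition of the condition number of a square matrix in [1], the generalized condition number of the hidden layer output matrix is defined as follows:

$$K(\mathbf{H}) = \|\mathbf{H}^{\dagger}\|\,\|\mathbf{H}\| \tag{3.3.17}$$

The sensitivity of the output weight matrix can then be expressed as:

$$\frac{\|\Delta\boldsymbol{\beta}\|}{\|\boldsymbol{\beta} + \Delta\boldsymbol{\beta}\|} \le K(\mathbf{H})\,\frac{\|\Delta\mathbf{H}\|}{\|\mathbf{H}\|} \tag{3.3.18}$$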

It is seen from (3.3.18) that, if, in the ideal case, all high frequency disturbance components can be removed by the hidden nodes with the characteristics of low-pass FIR filters, the change of the hidden layer output matrix $\Delta\mathbf{H}$, caused by the high frequency noise components, is reduced to zero, that is, $\|\Delta\mathbf{H}\| = 0$. Then the sensitivity of the output weight matrix is reduced to zero.

However, it should be noted that, in practice, a well-designed real-time FIR filter can remove most disturbance components of the input signal and make the value of $\|\Delta\mathbf{H}\|$ very small (but $\|\Delta\mathbf{H}\| \neq 0$), and thus the output weight matrix $\boldsymbol{\beta}$ is insensitive to the changes of the hidden layer output matrix $\mathbf{H}$.
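The sensitivity bound discussed above, namely that the relative change of the output weights is at most the generalized condition number $\|\mathbf{H}^{\dagger}\|\|\mathbf{H}\|$ times the relative change of the hidden layer output matrix, can be checked numerically. The sketch below assumes a square, invertible hidden layer output matrix so that both linear systems are solved exactly, uses the spectral norm, and relies on random stand-in data; these are assumptions of the illustration only.

```python
import numpy as np

def spectral_norm(A):
    """Largest singular value of a 2-D array."""
    return np.linalg.norm(A, 2)

rng = np.random.default_rng(3)
N = 6                                      # square H: N samples, N hidden nodes
H = rng.standard_normal((N, N))            # stand-in hidden layer output matrix
T = rng.standard_normal((N, 1))            # stand-in desired outputs
dH = 1e-3 * rng.standard_normal((N, N))    # change of H caused by an input disturbance

beta = np.linalg.solve(H, T)               # H beta = T
beta_pert = np.linalg.solve(H + dH, T)     # (H + dH)(beta + dbeta) = T
dbeta = beta_pert - beta

K = spectral_norm(np.linalg.pinv(H)) * spectral_norm(H)   # generalized condition number
lhs = spectral_norm(dbeta) / spectral_norm(beta_pert)     # relative change of the output weights
rhs = K * spectral_norm(dH) / spectral_norm(H)            # bound on that change
```

For a square invertible matrix the generalized condition number coincides with the ordinary 2-norm condition number, and `lhs` does not exceed `rhs`. Making `dH` smaller, which is the role of the filtering hidden layer, tightens the bound proportionally.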


Remark 3.3.5: In fact, not only does the input disturbance increase the sensitivity of the output weight matrix $\boldsymbol{\beta}$ in (3.3.18), but the randomly assigned input weights in both the ELM and the modified ELM in [3], [23-25], [89], [90], [93], [94] also largely increase the sensitivity of the output weight matrix $\boldsymbol{\beta}$. This is because the random choice of the input weights in SLFNs may result in a significant change of the hidden layer output matrix $\mathbf{H}$. A similar analysis as in (3.3.12)~(3.3.18) can be done.

Remark 3.3.6: It is noted that, if the hidden layer output matrix $\mathbf{H}$ is square, the definition of the generalized condition number in (3.3.17) is the same as the ordinary one defined in [1]. Although the condition number of a non-square matrix has been defined in terms of its singular values as seen in [1], the assumption that $\mathbf{H}$ is full-rank is not always valid for the SLFNs, because, in many cases, some singular values of matrix $\mathbf{H}$ are zero. Therefore, the definition of the generalized condition number of matrix $\mathbf{H}$ in (3.3.17) can find a wide application in signal processing and neural computing.


3.4 Design of the Robust Output Weight Matrix

In this section, the non-linear optimization technique is used to design the optimal

output weight matrix such that the structural and the empirical risks are well balanced

and the effects of the structural and the empirical risks are further reduced. For this

purpose, the optimization problem is stated as follows [1], [89], [90], [110-112]:

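$$\text{Minimize} \quad \left\{ \frac{\gamma}{2}\|\boldsymbol{\varepsilon}\|^2 + \frac{d}{2}\|\boldsymbol{\beta}\|^2 \right\} \tag{3.4.1}$$

$$\text{Subject to} \quad \boldsymbol{\varepsilon} = \mathbf{H}\boldsymbol{\beta} - \mathbf{T} \tag{3.4.2}$$

This problem can be solved conveniently by the method of Lagrange multipliers. For this, construct the Lagrange function as:

$$L = \frac{\gamma}{2}\sum_{i=1}^{N}\sum_{j=1}^{m}\varepsilon_{ij}^2 + \frac{d}{2}\sum_{i=1}^{\tilde{N}}\sum_{j=1}^{m}\beta_{ij}^2 - \sum_{k=1}^{N}\sum_{p=1}^{m}\lambda_{kp}\left(\mathbf{h}_k\boldsymbol{\beta}_p - t_{kp} - \varepsilon_{kp}\right) \tag{3.4.3}$$

where $\varepsilon_{ij}$ is the $ij$th element of the error matrix $\boldsymbol{\varepsilon}$, $\beta_{ij}$ is the $ij$th element of the output weight matrix $\boldsymbol{\beta}$, $t_{ij}$ is the $ij$th element of the output data matrix $\mathbf{T}$, $\mathbf{h}_k$ is the $k$th row of the hidden layer output matrix $\mathbf{H}$, $\boldsymbol{\beta}_p$ is the $p$th column of the output weight matrix $\boldsymbol{\beta}$, $\lambda_{kp}$ is the $kp$th Lagrange multiplier, and $\gamma$ and $d$ are constant parameters used to adjust the balance of the structural risk and the empirical risk.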

Remark 3.4.1: It is noted from the cost function defined in (3.4.1) that, compared with the modified ELM algorithm in [89], the norm of the output weight matrix $\boldsymbol{\beta}$ has been weighted by a constant $d$. This is because the structural risk and the empirical risk of the SLFN can then be easily balanced and reduced through the adjustment of the values of $\gamma$ and $d$ in the optimization process. The effects of the ratio $d/\gamma$ on the performance of the SLFN will be discussed in the simulation section.

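Differentiating $L$ in (3.4.3) with respect to $\beta_{ij}$, the equation becomes

$$\frac{\partial L}{\partial \beta_{ij}} = \frac{\partial}{\partial \beta_{ij}}\left(\frac{d}{2}\sum_{q=1}^{\tilde{N}}\sum_{p=1}^{m}\beta_{qp}^2 - \sum_{k=1}^{N}\sum_{p=1}^{m}\lambda_{kp}\left(\mathbf{h}_k\boldsymbol{\beta}_p - t_{kp} - \varepsilon_{kp}\right)\right) \tag{3.4.4}$$

Since

$$\mathbf{h}_k\boldsymbol{\beta}_p = [h_{k1} \;\; h_{k2} \;\; \cdots \;\; h_{k\tilde{N}}]\,[\beta_{1p} \;\; \beta_{2p} \;\; \cdots \;\; \beta_{\tilde{N}p}]^T \tag{3.4.5}$$

(3.4.4) becomes

$$\frac{\partial L}{\partial \beta_{ij}} = d\,\beta_{ij} - \sum_{k=1}^{N}\lambda_{kj}\,h_{ki} \tag{3.4.6}$$

Letting $\partial L / \partial \beta_{ij} = 0$ and using (3.4.6),

$$d\,\beta_{ij} = [h_{1i} \;\; h_{2i} \;\; \cdots \;\; h_{Ni}]\,[\lambda_{1j} \;\; \lambda_{2j} \;\; \cdots \;\; \lambda_{Nj}]^T \tag{3.4.7}$$

$$d\,[\beta_{i1} \;\; \beta_{i2} \;\; \cdots \;\; \beta_{im}] = [h_{1i} \;\; h_{2i} \;\; \cdots \;\; h_{Ni}] \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1m} \\ \vdots & & \vdots \\ \lambda_{N1} & \cdots & \lambda_{Nm} \end{bmatrix} \tag{3.4.8}$$

$$d \begin{bmatrix} \beta_{11} & \cdots & \beta_{1m} \\ \vdots & & \vdots \\ \beta_{\tilde{N}1} & \cdots & \beta_{\tilde{N}m} \end{bmatrix} = \begin{bmatrix} h_{11} & \cdots & h_{N1} \\ \vdots & & \vdots \\ h_{1\tilde{N}} & \cdots & h_{N\tilde{N}} \end{bmatrix} \begin{bmatrix} \lambda_{11} & \cdots & \lambda_{1m} \\ \vdots & & \vdots \\ \lambda_{N1} & \cdots & \lambda_{Nm} \end{bmatrix} \tag{3.4.9}$$

that is,

$$d\,\boldsymbol{\beta} = \mathbf{H}^T\boldsymbol{\lambda} \tag{3.4.10}$$

In addition, differentiating $L$ with respect to $\varepsilon_{ij}$,

$$\frac{\partial L}{\partial \varepsilon_{ij}} = \gamma\,\varepsilon_{ij} + \lambda_{ij} \tag{3.4.11}$$

and letting $\partial L / \partial \varepsilon_{ij} = 0$, it is then possible to obtain the following relationship:

$$\lambda_{ij} = -\gamma\,\varepsilon_{ij} \tag{3.4.12}$$

or

$$\boldsymbol{\lambda} = -\gamma\,\boldsymbol{\varepsilon} \tag{3.4.13}$$

Considering the constraint in (3.4.2), (3.4.13) can be expressed as:

$$\boldsymbol{\lambda} = -\gamma\,(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}) \tag{3.4.14}$$

and using (3.4.14) in (3.4.10) leads to

$$d\,\boldsymbol{\beta} = -\gamma\,\mathbf{H}^T(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}) \tag{3.4.15}$$

Then, the output layer weight matrix is derived as follows:

$$\boldsymbol{\beta} = \left(\frac{d}{\gamma}\mathbf{I} + \mathbf{H}^T\mathbf{H}\right)^{-1}\mathbf{H}^T\mathbf{T} \tag{3.4.16}$$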
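The closed-form output weights can be verified numerically. The sketch below assumes that the derivation leads to the regularized least squares form $\boldsymbol{\beta} = (\frac{d}{\gamma}\mathbf{I} + \mathbf{H}^T\mathbf{H})^{-1}\mathbf{H}^T\mathbf{T}$, consistent with the role of the ratio $d/\gamma$ discussed in this chapter, and checks that this solution satisfies the stationarity condition $d\boldsymbol{\beta} + \gamma\mathbf{H}^T(\mathbf{H}\boldsymbol{\beta} - \mathbf{T}) = \mathbf{0}$. The function name `output_weights`, the random data and the parameter values are assumptions for illustration.

```python
import numpy as np

def output_weights(H, T, d, gamma):
    """Regularized least squares output weights:
    beta = (d/gamma * I + H^T H)^{-1} H^T T."""
    n_hidden = H.shape[1]
    A = (d / gamma) * np.eye(n_hidden) + H.T @ H
    # Solving the linear system avoids forming the explicit inverse.
    return np.linalg.solve(A, H.T @ T)

rng = np.random.default_rng(4)
H = rng.standard_normal((30, 8))    # stand-in hidden layer output matrix
T = rng.standard_normal((30, 2))    # stand-in desired outputs
d, gamma = 1.0, 0.9                 # illustrative values only

beta = output_weights(H, T, d, gamma)

# Stationarity of the cost (gamma/2)||H beta - T||^2 + (d/2)||beta||^2.
residual = d * beta + gamma * H.T @ (H @ beta - T)

def cost(b):
    return 0.5 * gamma * np.sum((H @ b - T) ** 2) + 0.5 * d * np.sum(b ** 2)
```

Since the cost is strictly convex for $d > 0$, the stationary point is the unique minimizer, and any perturbation of `beta` increases `cost`.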

Remark 3.4.2: Based on the discussions in Section 3.3 and Section 3.4, the proposed

FIR-ELM algorithm can be summarized as follows:

Step 1: Assign the input weights according to (3.3.9) ~ (3.3.11);

Step 2: Calculate the hidden layer output matrix using (3.2.10);

Step 3: Calculate the output weight matrix based on (3.4.16).
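The three steps above can be put together in a short sketch. It is not the implementation used for the experiments of this chapter: the function names, the toy sinusoidal inputs, the target values, the noise level, and in particular the choice of a different low-pass cut-off frequency for each hidden node (made here so that the columns of the hidden layer output matrix are not identical) are all assumptions.

```python
import numpy as np

def fir_lowpass(n, omega_c):
    """Rectangular-window low-pass coefficients, cut-off omega_c in rad/sample."""
    t = np.arange(n) - (n - 1) / 2.0
    return (omega_c / np.pi) * np.sinc(omega_c * t / np.pi)

def fir_elm_train(X, T, cutoffs, d=1.0, gamma=0.9):
    """X: (N, n) input vectors, T: (N, m) targets, cutoffs: one cut-off per hidden node."""
    n = X.shape[1]
    W = np.stack([fir_lowpass(n, wc) for wc in cutoffs])   # Step 1: input weights
    H = X @ W.T                                            # Step 2: hidden layer output matrix
    A = (d / gamma) * np.eye(len(cutoffs)) + H.T @ H
    beta = np.linalg.solve(A, H.T @ T)                     # Step 3: output weights
    return W, beta

def fir_elm_predict(X, W, beta):
    return (X @ W.T) @ beta

rng = np.random.default_rng(5)
n = 101
k = np.arange(n)
freqs = np.array([0.03, 0.07, 0.11, 0.15]) * np.pi         # toy tone frequencies (rad/sample)
X_clean = np.stack([np.sin(f * k) for f in freqs])         # one input vector per tone
T = np.linspace(0.2, 0.8, len(freqs)).reshape(-1, 1)       # one target state per tone
cutoffs = np.linspace(0.05, 0.3, 10) * np.pi               # 10 hidden nodes

W, beta = fir_elm_train(X_clean, T, cutoffs)
X_noisy = X_clean + 0.1 * rng.standard_normal(X_clean.shape)
rmse_clean = np.sqrt(np.mean((fir_elm_predict(X_clean, W, beta) - T) ** 2))
rmse_noisy = np.sqrt(np.mean((fir_elm_predict(X_noisy, W, beta) - T) ** 2))
```

Only Step 3 depends on the training targets. Steps 1 and 2 are fixed once the filter specification is chosen, which is why the input weights can be designed off-line.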

3.5 Experiments and Results

To illustrate the FIR-ELM algorithm proposed in this chapter, consider an SLFN-based

classifier, with 20 hidden linear nodes, one output linear node and an input tapped-

delay-line memory, to classify the computer-generated low frequency tones. The input

data vectors to the SLFN are computer-generated low frequency sound clips with

frequencies 100 Hz, 150 Hz, 200 Hz, 300 Hz, …, 900 Hz, respectively, and each of

which is 0.5 seconds long, modulated by an envelope function to create a tone. Figure

3.8 shows the 700 Hz sound clip modulated by the envelope function ( )

( ). The desired output reference values of the SLFN, so as to provide the desired

values of the signal classifier states, for all input data vectors, are generated by a sine

function, ( ) for with equal increments.


The training data are 10 pairs of the input sound clip data vectors and the

corresponding desired output classifier states. The SLFN is then trained with the ELM

in [3], [23-25], [93], [94], the modified ELM in [89] and the FIR-ELM proposed in this

chapter, respectively. To examine the robustness performances of the SLFN-based

signal classifier trained with the above algorithms, the random noise ( )

( ) ( ) is added to all sound clips, as seen in Figure 3.9.

Figure 3.8 A sound clip modulated by the envelope function (panel title: Clear Clip)

Figure 3.9 The disturbed sound clip (panel title: High Noise Clip)


Figure 3.10 shows the classification result of the SLFN-based signal classifier

trained with the ELM algorithm, where the simulation is repeated 50 times, showing the

accuracy of the classification, and Figure 3.11 shows the corresponding RMSEs. It is

clearly seen that, as the input weights of the SLFN are arbitrarily assigned with the

ELM in [3], [23-25], [93], [94], both the structural risk and empirical risk of the SLFN

are very high and the effects of the input noise cannot be reduced through the

minimization of the output weights.

Figure 3.10 Signal classification with the ELM algorithm

Figure 3.11 The RMSE with the ELM algorithm

Figure 3.12 and Figure 3.13 show the classification result and the RMSE, respectively, of the SLFN-based signal classifier trained with the modified ELM algorithm in [89] with the constant parameter $\gamma$ = 0.9 for adjusting the balance of the structural risk and the empirical risk. Since the input weights of the SLFN are arbitrarily assigned, the high structural and empirical risks result in very poor robustness with respect to the input disturbances, and no noticeable performance improvement is seen compared with Figure 3.10 and Figure 3.11 with the ELM algorithm.

Figure 3.12 Signal classification with the modified ELM algorithm

Figure 3.13 The RMSE with the modified ELM algorithm

Figure 3.14 and Figure 3.15 show the classification result and the RMSE,

respectively, of the SLFN-based signal classifier trained with the FIR-ELM algorithm

proposed in this chapter, where the input weights of the SLFN are assigned by using the

rectangular window method with the window length of 1001 and the cut-off frequencies

50 Hz and 1.2kHz, and the constant parameters = 0.9 and = 1, respectively, to

balance and reduce the structural and empirical risks. It is seen that, compared with the classification results in Figure 3.10 and Figure 3.11, trained with the ELM algorithm, and Figure 3.12 and Figure 3.13, trained with the modified ELM algorithm, the accuracy of the classification and the robustness of the SLFN with respect to the input noise have been significantly improved due to the fact that the input weights are assigned using the rectangular window low-pass filtering technique and both the structural and the empirical risks have been greatly reduced.

Figure 3.14 Signal classification with the FIR-ELM algorithm with the rectangular window
Figure 3.15 The RMSE with the FIR-ELM algorithm with the rectangular window

Figure 3.16 and Figure 3.17 show the classification result and the corresponding RMSE of the SLFN trained with the FIR-ELM algorithm, where the input weights of the SLFN are assigned based on the Kaiser window method [106-109], with the window length of 1001, the cut-off frequencies 50 Hz and 1.2 kHz, the transition width 0.5 Hz, and the pass band ripple 0.1 dB, and the constant parameters $\gamma$ and $d$ are chosen to be the same as in Figure 3.14 and Figure 3.15. It is seen that the accuracy of the classification and the robustness of the SLFN with respect to the input disturbance are even better than the ones in Figure 3.14 and Figure 3.15, obtained with the rectangular window method. This is because the low-pass filters designed with the Kaiser window method in practice have better low-pass filtering performance than those designed with the rectangular window method, as is commonly known in signal processing.

Figure 3.16 Signal classification with the FIR-ELM algorithm with the Kaiser window

Figure 3.17 The RMSE with the FIR-ELM algorithm with the Kaiser window

Table 3.1 shows the comparison of the average RMSEs of the SLFN trained with the FIR-ELM (rectangular), the FIR-ELM (Kaiser), the ELM and the modified ELM algorithms, respectively, as the number of the hidden neurons is increased from 20 to 50:

Table 3.1: Comparison of averaged RMSE for sound clip recognition experiment

Hidden     FIR-SLFN (Rectangular)    FIR-SLFN (Kaiser)         ELM                       Modified ELM
neurons    low noise   high noise    low noise   high noise    low noise   high noise    low noise   high noise
20         0.0067      0.0335        0.0056      0.0287        0.0667      0.2206        0.0408      0.2100
30         0.0055      0.0324        0.0047      0.0281        0.0472      0.2199        0.0373      0.1998
40         0.0049      0.0285        0.0040      0.0219        0.0454      0.2161        0.0328      0.1804
50         0.0049      0.0171        0.0030      0.0214        0.0352      0.1851        0.0283      0.1688

It is seen that, as the number of the hidden neurons is increased, the performance

of the SLFN with the FIR-ELM has been greatly improved and the SLFN with the FIR-

ELM has demonstrated highly desirable characteristics for real-world signal processing

applications, compared with the SLFN with the ELM and the modified ELM.

In order to analyse the effect of the ratio $d/\gamma$ in (3.4.16) on the performance of the SLFNs trained with the FIR-ELM algorithm, Figure 3.18 shows the RMSE versus $d/\gamma$ for the SLFN with 20 hidden linear nodes, a single output linear node and an input tapped-delay-line memory, where $d/\gamma$ is changed from 0.1 to 10 with a step size of 0.1. Without loss of generality, more than 20 different groups of input and output training data have been used for deriving Figure 3.18. It is seen that the RMSE tends to be reduced as the value of $d/\gamma$ is increased. However, a large value of $d/\gamma$ will narrow the output range of the neural classifier and thus, some output states of the neural classifier may be lost in the signal classification process. According to our experience, it is reasonable to choose the value of $d/\gamma$ between 0.7 and 2.0, as in the simulation results shown in Figure 3.14 ~ Figure 3.17 and Table 3.1.
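The narrowing of the output range for large $d/\gamma$ can be illustrated with a sweep over the same grid of ratios. The sketch below uses a random stand-in for the hidden layer output matrix and for the targets, and assumes the regularized least squares form of the output weights, so it does not reproduce Figure 3.18. It only shows how the norm of the output weights and the training error move as the ratio grows.

```python
import numpy as np

rng = np.random.default_rng(6)
H = rng.standard_normal((40, 20))   # stand-in hidden layer output matrix (20 hidden nodes)
T = rng.standard_normal((40, 1))    # stand-in desired outputs

ratios = np.arange(1, 101) * 0.1    # d/gamma from 0.1 to 10 in steps of 0.1
beta_norms, train_rmse = [], []
for r in ratios:
    # beta = (d/gamma * I + H^T H)^{-1} H^T T for the current ratio r = d/gamma
    beta = np.linalg.solve(r * np.eye(H.shape[1]) + H.T @ H, H.T @ T)
    beta_norms.append(np.linalg.norm(beta))
    train_rmse.append(np.sqrt(np.mean((H @ beta - T) ** 2)))
beta_norms = np.array(beta_norms)
train_rmse = np.array(train_rmse)
```

The norm of the output weights shrinks monotonically as $d/\gamma$ increases, which is the mechanism that compresses the output range of the classifier, while the error on the undisturbed training data cannot decrease. Any benefit of a larger ratio therefore has to come from reduced sensitivity to disturbed inputs.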

In addition, the simulation of the SLFN classifier with the non-linear sigmoid

hidden nodes trained with the FIR-ELM with the Kaiser window is also presented here.


It has been noted that some non-linear sigmoid hidden nodes can help to reduce the

effects of the disturbances and lower both the structural and empirical risks compared

with the ELM and the modified ELM, as seen in Figure 3.19 and Figure 3.20, where the

non-linear sigmoid function has a large linear region as shown in Figure 3.21. However,

some opposite results have been seen as in Figure 3.22 and Figure 3.23 where the non-

linear sigmoid function has a narrower linear region as seen in Figure 3.24.

Figure 3.18 The RMSE via d/γ for the FIR-ELM algorithm

Figure 3.19 Signal classification using the SLFN classifier with the non-linear hidden

nodes and trained with the FIR-ELM algorithm



Figure 3.20 The RMSE of the SLFN classifier with the non-linear hidden nodes and

trained with the FIR-ELM algorithm

Figure 3.21 The non-linear sigmoid function ( ) ( ( ))⁄


Figure 3.22 Signal classification using the SLFN classifier with the non-linear hidden

nodes and trained with the FIR-ELM algorithm

Figure 3.23 The RMSE of the SLFN classifier with the non-linear hidden nodes and

trained with the FIR-ELM algorithm


Figure 3.24 The non-linear sigmoid function ( ) ( ( ))⁄

3.6 Conclusion

In this chapter, a robust training algorithm, the FIR-ELM, for a class of SLFNs with

linear nodes and an input tapped-delay-line memory has been developed. It has been

seen that the linear FIR filtering technique has been successfully used to design the

input weights, which makes the hidden layer of the SLFN perform as a pre-processor of

the input data, to remove the input disturbance and the undesired signal components.

Then, the non-linear optimization technique has been used to design the optimal output

weight matrix to minimize the output error and balance and reduce the empirical and

structural risks of the SLFNs. The simulation results have shown the excellent

robustness performance of the SLFNs with the proposed FIR-ELM training algorithm

under noisy environments. The further work is to consider the SLFNs with the non-

linear nodes and combine the FIR filtering technique with the pole-placement-like

method in [114-117] to enhance the function of the FIR-ELM algorithm with

application to signal processing, image processing and robust control.


Chapter 4

Classification of Bioinformatics Datasets with Finite Impulse Response Extreme Learning Machine for Cancer Diagnosis

In this chapter, the classification of two binary bioinformatics datasets, leukemia and colon tumor, is studied using the neural network based finite impulse response extreme learning machine (FIR-ELM) developed in Chapter 3. A time series analysis of the microarray samples is first performed to determine the filtering properties of the hidden layer of the neural classifier with FIR-ELM for feature identification. The linear separability of the data patterns in the microarray datasets is then studied. To improve the robustness of the neural classifier against noise and errors, a frequency-domain gene feature selection (FGFS) algorithm is also proposed. The simulation results show that the FIR-ELM algorithm performs very well in the classification of bioinformatics data in comparison with many existing classification algorithms.

4.1 Introduction

The analysis of cancer diagnosis data is one of the most important research fields in

medical science and bioengineering [118], [119]. As the complete treatments of cancers

have not been achieved, early diagnosis plays an important role for doctors to help

patients control the metastatic cancer growth. Recently, the microarray gene expression


datasets, consisting of thousands of gene expressions that can be used for the molecular

diagnosis of cancer, have been established [119], [120]. The sample vectors in these

datasets contain a large amount of information about the origin and development of

cancers. However, because all of these data contain different levels of noise

and measurement errors from the sampling processes, it is difficult for the existing

classification techniques to accurately identify the patterns from the samples [120]. In

order to use the existing classification techniques for pattern classification of the

microarray gene expression datasets, a gene selection algorithm is usually implemented

to reduce the effects of the noise, error, and the high dimensionality of samples [121-

123]. It has been shown in [118] that the gene selection process can help to improve the

performance of the classifier by identifying representative gene expression subsets.

Some of the popular gene selection algorithms include the principal component analysis

(PCA) [121], the singular value decomposition (SVD) [122], the independent

component analysis (ICA) [123], genetic algorithm (GA) [124], and the recursive

feature elimination method [69], [125].

Other related works on the classifiers are the GA-based cancer detection system

proposed in [124], the support vector machine (SVM) in [19], [125], [126] and the

extreme learning machine (ELM) in [66], [70], [127], [128]. In [124], the GA gene

selection algorithm is applied to the microarray dataset to obtain the most informative

gene subsets that are classified using the multilayer perceptron (MLP). In [19], [125],

[126], the support vector machine first maps the input space into the high dimensional

feature space, and then constructs an optimal hyperplane using selected support vectors

to separate the classes. In [66], [70], [127], [128], the ELM simplifies model selection

by randomly generating the hidden layer weights and biases, and performs very fast

batch training using the Moore-Penrose pseudoinverse to deterministically obtain the

output weights. Most importantly, the ELM performs well in several cancer diagnosis

applications [66], [70], [127], [128]. However, all the algorithms mentioned above are used for the classification of a small subset of genes that does not sufficiently represent the whole process of the origin and the development of cancers in general [119]. In addition, the robustness issues with respect to noise and disturbances are not discussed in detail.


It should be noted that there is no standard guideline in the biomedical industry for producing microarrays, and the microarrays produced by different labs may contain different profiles of error or different sets of genes. Furthermore, even samples taken from the same lab may sometimes contain different compositions of cells, which may bias the accuracy of a classifier [119], [120]. Therefore, the motivation of this

chapter is to develop a general microarray classification technique that is capable of

classifying a variety of genes based on the finite impulse response extreme learning

machine (FIR-ELM) developed by Man et al. [9]. The FIR-ELM algorithm implements

a single hidden layer feedforward neural network (SLFN) as the classifier where the

well known filtering methods such as the finite length low-pass filtering, high-pass

filtering and band-pass filtering in digital signal processing, are adopted to train the

input weights in the hidden layer to extract features from the dataset. The readers may

find the details of the filtering techniques from [9], [106-109].

As seen in [9], the hidden layer of the neural classifier with the FIR-ELM is

based on FIR filter designs. The sample vectors from the microarray datasets are thus

treated as time series input pattern vectors. To show the validity of applying FIR filter

theory in microarray gene feature detection, a time series analysis for a binary

microarray gene expression dataset is first explored with the gene expression dataset

denoted as the time series type data. Then the FIR filtering function is applied to the

time series to reveal spectrum features. The filtered time series are then analysed using

the cross-correlation to show the importance of proper filter design selection to achieve

optimal separation of the classes. In addition, in order to provide a quantitative measure

of the gene features within each dataset, a linear separability analysis on the microarrays

is also discussed in detail based on the recent study in [129] on the linear separability of

microarray gene pairs.

It will be seen that, in order to determine the filtering properties of the hidden

layer of the neural classifier with the FIR-ELM, a frequency domain gene feature

selection (FGFS) algorithm is developed for analyzing the frequency characteristics of

all datasets. The outcomes of the FGFS are then used to determine the optimal FIR

filtering strategy for the input weights’ design. In addition, the effects of the time series


with the randomly ordered gene samples, which have not been previously considered in

similar works [130], [131], are also examined in this chapter.

The rest of this chapter is organized as follows. Section 4.2 describes the

characteristics of the time series type patterns of the microarray samples. Section 4.3

presents the analysis on the linear separability of the patterns. Section 4.4 outlines the

FIR-ELM as well as the FGFS algorithm. Section 4.5 shows the experimental results

and the performance analysis of the FIR-ELM compared with other existing algorithms.

Sections 4.6 and 4.7 give the discussions and conclusions respectively.

4.2 Time Series Analysis of Microarrays

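In order to analyse and classify the data patterns in the microarray gene expression datasets using the FIR-ELM, all samples in this chapter are expressed as time series type data. For a binary dataset with $N$ samples, assume that the first $N_1$ samples belong to class 1 and the other $N_2$ samples belong to class 2, with $N = N_1 + N_2$. The gene expressions in the two classes can then be denoted as:

$$X_1 = \begin{bmatrix} x_{1,1}(1) & \cdots & x_{1,1}(n) \\ \vdots & \ddots & \vdots \\ x_{1,N_1}(1) & \cdots & x_{1,N_1}(n) \end{bmatrix} \tag{4.2.1}$$

$$X_2 = \begin{bmatrix} x_{2,1}(1) & \cdots & x_{2,1}(n) \\ \vdots & \ddots & \vdots \\ x_{2,N_2}(1) & \cdots & x_{2,N_2}(n) \end{bmatrix} \tag{4.2.2}$$

$$x_1(i) = \{x_{1,1}(i), \ldots, x_{1,N_1}(i)\}, \quad i = 1, \ldots, n \tag{4.2.3}$$

$$x_2(j) = \{x_{2,1}(j), \ldots, x_{2,N_2}(j)\}, \quad j = 1, \ldots, n \tag{4.2.4}$$

where $X_1$ and $X_2$ are the sample matrices for class 1 and class 2 respectively, $i$ and $j$ are the gene indices, and $n$ is the number of genes in a sample.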

For the purpose of conducting a bivariate time series analysis, the cells in the microarray tests are treated as a black box that outputs two types of time series, $Y_1(t)$ and $Y_2(t)$, to represent non-cancerous and cancerous states, or any other binary phenotypes, as follows:

$$Y_1(t) = \{y_1(t) \mid t = 1, \ldots, n\} \tag{4.2.5}$$

$$Y_2(t) = \{y_2(t) \mid t = 1, \ldots, n\} \tag{4.2.6}$$

The samples from each of the binary classes $X_1$ and $X_2$ can then be aggregated to represent their respective classes as in (4.2.5) and (4.2.6). It is well known that the gene expression values in (4.2.3) and (4.2.4) have a Lorentzian-like distribution with many outliers [120], [132]. Hence the median, which is a better maximum likelihood estimator than the mean under impulsive noise conditions, is preferred for representing the distribution of each gene in a specific class [118], [133]. Each gene in (4.2.3) and (4.2.4) is then aggregated by taking the median to be a temporal datum as

$$y_1(t) = \operatorname{median}\big(x_1(t)\big) \tag{4.2.7}$$

$$y_2(t) = \operatorname{median}\big(x_2(t)\big) \tag{4.2.8}$$
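The median aggregation in (4.2.7) and (4.2.8) can be illustrated with a minimal sketch. It assumes each class is stored as a list of samples (rows) with one value per gene (columns); the function name `aggregate_class` and the toy values are hypothetical and not taken from the thesis.

```python
from statistics import median


def aggregate_class(samples):
    """Collapse a class's samples (rows) into one series of per-gene medians."""
    n_genes = len(samples[0])
    if any(len(row) != n_genes for row in samples):
        raise ValueError("all samples must have the same number of genes")
    # zip(*samples) walks the matrix column by column, i.e. gene by gene.
    return [median(column) for column in zip(*samples)]


# Toy class with 3 samples and 4 genes; the 900.0 reading is an outlier.
class_1 = [
    [10.0, 52.0, 7.0, 30.0],
    [12.0, 50.0, 900.0, 31.0],
    [11.0, 48.0, 9.0, 29.0],
]
y1 = aggregate_class(class_1)
```

In this toy case the outlier in the third gene does not move the aggregated value, which is the property that motivates the median over the mean.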

Figure 4.1 shows the two time series type data plots, $Y_1(t)$ and $Y_2(t)$, for the colon tumor dataset.

It is seen that both pattern classes in Figure 4.1 have a similar trend and are very noisy. In order to determine the input weights of the neural classifier for pattern classification, low pass, high pass, and band pass filters are applied to $Y_1(t)$ and $Y_2(t)$, respectively, where all three filters use a normalized cut-off frequency of 0.4, and the band pass filter uses a bandwidth of ±0.05. As the microarray samples are converted from data vectors to time series, there is no directly derivable sampling frequency. Thus the normalized frequency, in units of cycles per sample, is used.

Figure 4.1 Aggregated time series for the colon dataset

The filtered time series are defined as:

$$y_1^f(t) = \sum_{k=0}^{m-1} h(k)\, y_1(t-k) \tag{4.2.9}$$

$$y_2^f(t) = \sum_{k=0}^{m-1} h(k)\, y_2(t-k) \tag{4.2.10}$$

where $\{h(0), h(1), \ldots, h(m-1)\}$ are the $(m-1)$th order filter coefficients derived from the impulse response of the respective FIR filters. A detailed discussion on the generation of filter coefficients is given in [106-109]. The cross-correlation between the two filtered time series (4.2.9) and (4.2.10) can then be obtained. The cross-correlation may be interpreted as a measure of how effectively the different filters extract important features and reduce the effects of noise. It has been shown in [134] that the generation of weakly correlated class features improves the machine learning performance. Also, Yang et al. [135] show that weak correlation can be used as a decision-making condition to differentiate samples.
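The filtering step in (4.2.9) and (4.2.10) is a discrete convolution of the aggregated series with the filter coefficients. The sketch below assumes that samples before the start of the series are zero, which the thesis does not state; the 3-tap moving-average coefficients are a hypothetical stand-in for a designed FIR filter.

```python
def fir_filter(series, coeffs):
    """Causal FIR filter: out[t] = sum over k of coeffs[k] * series[t - k].

    Assumption: samples before the start of the series are treated as zero.
    """
    out = []
    for t in range(len(series)):
        acc = 0.0
        for k, h_k in enumerate(coeffs):
            if t - k >= 0:
                acc += h_k * series[t - k]
        out.append(acc)
    return out


# Hypothetical 3-tap moving average (a crude low pass filter).
h = [1 / 3, 1 / 3, 1 / 3]
y = [3.0, 6.0, 9.0, 12.0]
y_f = fir_filter(y, h)
```

The first `len(coeffs) - 1` outputs are affected by the zero initial condition, so in practice they may be discarded or the series padded differently.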

[Figure 4.1 plot: panels Y1 (Class 1) and Y2 (Class 2); horizontal axis from 0 to 2000]






Before computing the cross-correlations, the non-stationary disturbances in the two classes need to be removed. A first-order forward finite difference approximation [136] is applied to the time series to remove the trend and attain stationarity. The detrended time series are defined as:

$$\tilde{y}_1(t) = y_1^f(t+1) - y_1^f(t) \tag{4.2.11}$$

$$\tilde{y}_2(t) = y_2^f(t+1) - y_2^f(t) \tag{4.2.12}$$

Remark 4.2.1: In order to avoid introducing random errors which may be accumulated

throughout the differencing calculations, the first gene term is used as the initial state

instead of an arbitrary value. Thus, the time series (4.2.11) and (4.2.12) are reduced in

length by one to produce the forward finite difference approximation.
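The detrending in (4.2.11) and (4.2.12) can be sketched as a first-order forward difference. The function name and toy series are illustrative only.

```python
def forward_difference(series):
    """First-order forward difference: d[t] = series[t + 1] - series[t]."""
    return [series[t + 1] - series[t] for t in range(len(series) - 1)]


trend = [2.0, 5.0, 9.0, 14.0]
detrended = forward_difference(trend)  # one element shorter than the input
```

As noted in Remark 4.2.1, no arbitrary initial value is introduced, so the output is one sample shorter than the input.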

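Figure 4.2 The filtered and detrended time series $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$

[Figure 4.2 plot: panels Y1 (Class 1) and Y2 (Class 2); horizontal axis from 0 to 2000]

Figure 4.2 shows the new time series $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$. The cross-correlation of the two time series can then be obtained using the sample correlation coefficient defined in [136], which is the well known Pearson correlation coefficient. The sample correlation coefficient is denoted as:

$$r = \frac{\sum_{t} \big(\tilde{y}_1(t) - \bar{y}_1\big)\big(\tilde{y}_2(t) - \bar{y}_2\big)}{\sqrt{\Big[\sum_{t} \big(\tilde{y}_1(t) - \bar{y}_1\big)^2\Big]\Big[\sum_{t} \big(\tilde{y}_2(t) - \bar{y}_2\big)^2\Big]}} \tag{4.2.13}$$

where $\bar{y}_1$ and $\bar{y}_2$ are the means of $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$, respectively.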
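A minimal sketch of the sample (Pearson) correlation coefficient in (4.2.13) follows. The function name `pearson` and the toy inputs are illustrative and not part of the thesis.

```python
from math import sqrt


def pearson(a, b):
    """Sample Pearson correlation coefficient of two equal-length series."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("series must have equal length of at least 2")
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_a * var_b)


r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])      # proportional series
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])  # inverse series
```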

The correlation coefficient in (4.2.13) measures the level of linear association between the two time series within [-1, 1], where 0 indicates no linear correlation, 1 indicates perfect positive correlation, and -1 indicates perfect negative correlation. Table 4.1 shows the correlation coefficient of the two time series $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$ after low pass, high pass, and band pass filtering at the same normalized cut-off frequency of 0.4. The correlation coefficient of the non-filtered data is also given as a reference. The results in Table 4.1 show that high pass filtering produces the time series pair that is least correlated among the filtered time series.

Table 4.1: Correlation coefficient of colon dataset binary classes with different FIR filters

Filter type               Low pass   High pass   Band pass   No filter
Correlation coefficient   0.7480     0.6746      0.6816      0.6386

Remark 4.2.2: Although the correlation coefficients of the filtered time series are higher than that of the non-filtered case, they remain useful for comparing the effectiveness of different filter types. The higher correlation coefficients among the filtered time series are mainly due to the similarity of the non-distinctive residue components that remain after the filtering process, as seen in the first half of the plots in Figure 4.2. This is consistent with the motivation of using the filtering process to discover features at specific frequency ranges. Therefore it is adequate to compare the results among the filtered time series only to determine the suitability of each filter.


To visualize the differences between the actual time series (4.2.11) and (4.2.12), the residual, defined as the squared difference between them, can be computed as follows:

$$y_{res}(t) = \big(\tilde{y}_1(t) - \tilde{y}_2(t)\big)^2 \tag{4.2.14}$$

It can be seen from the plot of the residual in Figure 4.3 that a certain region of

genes contains expression values which are very different between the two classes. The

genes with indices from 1000 to 1100 show significant residual magnitudes, with the highest at index 1015. Closer inspection of the two filtered time series, overlaid on each other for gene indices 1000 to 1040 in Figure 4.4, confirms this observation. The three genes 1014, 1015, and 1016 are seen to behave dissimilarly (with opposite signs), while the genes 1025 and 1026 have significant differences in magnitude between the two classes.
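The residual in (4.2.14) and the search for the genes with the largest residuals can be sketched as follows. The helper names and the toy series are hypothetical; they only illustrate how a peak such as the one at index 1015 could be located.

```python
def squared_residual(a, b):
    """Element-wise squared difference between two equal-length series."""
    return [(x - y) ** 2 for x, y in zip(a, b)]


def top_indices(residual, count):
    """Indices of the `count` largest residuals, largest first."""
    order = sorted(range(len(residual)), key=lambda t: residual[t], reverse=True)
    return order[:count]


class_1_series = [1.0, -2.0, 0.5, 3.0]
class_2_series = [1.5, 2.0, 0.0, 3.0]
res = squared_residual(class_1_series, class_2_series)
peak = top_indices(res, 1)
```

Here the second element has opposite signs in the two series and therefore dominates the residual.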

Remark 4.2.3: Generally the samples from both classes in a binary microarray dataset tend to look similar. Therefore the genes with large residual magnitudes contribute to the linear separability of the samples in the microarray dataset. However, the cross-correlation alone is not sufficient to identify these genes, as can be seen from Table 4.1. Hence a more comprehensive test for linear separability is required.

Figure 4.3 Plot of residual between $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$

[Figure 4.3 plot: residual against gene index from 0 to 2000; marked peak at X: 1015, Y: 7.135e+006]

Figure 4.4 Overlaid plot of $\tilde{y}_1(t)$ and $\tilde{y}_2(t)$ for genes 1000 to 1040

4.3 Linear Separability of Microarrays

The time series analysis using the cross correlation in the previous section reveals some

features of the separability of the classes in bioinformatics datasets. In this section,

however, the linear separability will be further investigated to provide a quantitative

measurement of linear separability of the data patterns.

The single gene test was first developed in [137], where each gene was tested using all samples to find the total number of linearly separable genes.

Then the gene pair linear separability analysis was developed in [129], where pairs of

genes are tested using all possible combinations. The genes that are found to be linearly

separable in the gene pair analysis are also guaranteed to include the linearly separable

single genes. Therefore the gene pair analysis provides extra information on genes that

may only show the linear separability characteristic in pairs.

The gene pair analysis algorithm in [129] is used here because of the relatively low computational cost of using an incremental testing approach. First, the samples are separated into their respective classes as in (4.2.1) and (4.2.2). Then each pair of genes in the dataset can be defined as $(g_i, g_j)$, and for a dataset with $n$ genes, there would be $n(n-1)/2$ possible combinations. The pairs of genes can then be projected on the 2D plane. The algorithm states that a pair of genes is linearly separable if there exists a line $\ell$ where all the points of class 1 are located on one side of $\ell$ and all the points of class 2 are located on the other side (no point is allowed to reside on $\ell$ itself). Each gene pair sample is added incrementally and the algorithm stops whenever a newly introduced gene pair sample violates the separability condition.

[Figure 4.4 plot: Class 1 and Class 2 overlaid for gene indices 1000 to 1040; data tips at (1014, -1637), (1015, 2348), (1016, -2018), (1025, 3011), (1026, -3811)]
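The separability condition for one gene pair can be checked with a short brute-force sketch. This is not the incremental algorithm of [129]; it is a simpler substitute that tries a finite set of candidate line directions (perpendicular to segments joining same-class points, and along segments joining points of different classes) and looks for a strict gap between the projected classes. All names and the toy points are hypothetical.

```python
from itertools import combinations


def _separates(normal, class_1, class_2):
    """True if projections onto `normal` leave a strict gap between classes."""
    p1 = [normal[0] * x + normal[1] * y for x, y in class_1]
    p2 = [normal[0] * x + normal[1] * y for x, y in class_2]
    return max(p1) < min(p2) or max(p2) < min(p1)


def linearly_separable_2d(class_1, class_2):
    """Brute-force strict linear separability test for two 2D point sets."""
    normals = []
    # Candidate normals perpendicular to segments joining same-class points.
    for group in (class_1, class_2):
        for (x1, y1), (x2, y2) in combinations(group, 2):
            normals.append((-(y2 - y1), x2 - x1))
    # Candidate normals along segments joining points of different classes.
    for x1, y1 in class_1:
        for x2, y2 in class_2:
            normals.append((x2 - x1, y2 - y1))
    # A zero normal projects everything to 0 and can never pass the strict test.
    return any(_separates(n, class_1, class_2) for n in normals)


# Each point is (expression of gene i, expression of gene j) for one sample.
separable = linearly_separable_2d(
    [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0)],
    [(4.0, 4.0), (5.0, 4.5)],
)
xor_like = linearly_separable_2d(
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 1.0), (1.0, 0.0)],
)
```

The cost grows cubically with the number of samples per gene pair, so this sketch is only practical for small sample subsets such as the 30 samples used in Table 4.3.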

However, as stated in [129], it might be impossible to find linearly separable

gene pairs in medium to large datasets even if they are highly separable. In order to

solve this problem, a new sample selection process is proposed for testing the leukemia

and colon tumor datasets described in Table 4.2, as follows: Choose the number of

samples according to the guidelines provided in [138], which state the minimum

number of genes required for statistical significance. Finally, the test is repeated 20

times to obtain averaged results.

Table 4.2: Summary of leukemia and colon datasets

Dataset       Samples   Genes   Class 1      Class 2
Leukemia      72        7129    47 (ALL)     25 (AML)
Colon Tumor   62        2000    40 (tumor)   22 (normal)

Table 4.3: Linearly separable gene pairs for leukemia and colon datasets

Dataset       Samples (N1+N2)   Mean    Std. Dev.
Leukemia      30 (15+15)        29009   15875
Colon Tumor   30 (15+15)        93      171

Table 4.3 shows the sample selection from each class and the number of linearly

separable gene pairs for each dataset. The leukemia dataset has a large number of

linearly separable gene pairs, therefore it should be more easily classified. The colon

tumor dataset however is found to consist of only a small number of linearly separable


gene pairs. Hence the colon tumor dataset is defined intuitively as ‘harder’ to classify

than the leukemia dataset. The results presented here are descriptive of the original data

itself. These results will later be used as a benchmark in the analysis of the FIR-ELM in

the experiments section.

Remark 4.3.1: The standard deviations for the linearly separable gene pairs given in

Table 4.3 are large when compared to their mean values. This is typical for the

microarray datasets as the gene values are often disturbed by systematic and random

noise and hence follow a Lorentzian-like distribution with wider tails [132]. The

random selection of samples during each iteration also contributes to a wider

distribution of the trials as there may be samples with a large number of outliers.

4.4 Outline of the FIR-ELM

Recently, a new learning scheme for SLFNs introduced by Huang et al. in [3], called the ELM, has been shown to reduce the neural network training process to solving a set of linear equations. The ELM algorithm first initializes the hidden layer weights and biases randomly and then proceeds to compute the output weights deterministically using the Moore-Penrose pseudoinverse. The learning capabilities of the ELM have been shown in [66], [70], [127], [128] to produce good results in terms of classification accuracy. However, the randomly generated weights produce classifiers that are highly sensitive to noise and other disturbances within the data [9], [139].

The FIR-ELM in [9] is a modified version of the ELM with the purpose of

improving the robustness. The hidden layer weights of the SLFN are designed using the

FIR filter theory and the output layer weights are derived using convex optimization

methods. A brief overview of the FIR-ELM is as follows:

4.4.1 Basic FIR-ELM

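For a set of $N$ distinct samples $\{(x_j, t_j) \mid x_j = [x_{j1}, \ldots, x_{jn}]^T \in R^n,\ t_j = [t_{j1}, \ldots, t_{jm}]^T \in R^m\}$, where $x_j$ is an $n \times 1$ input vector and $t_j$ is an $m \times 1$ target vector, the $\tilde{N}$ neuron SLFN with the activation function $g(\cdot)$ can be modeled as

$$o_j = \sum_{i=1}^{\tilde{N}} \beta_i\, g(w_i \cdot x_j + b_i), \quad j = 1, \ldots, N \tag{4.4.1}$$

where $w_i = [w_{i1}, \ldots, w_{in}]^T$ is the $n$-dimensional weight vector connecting the $i$th hidden node and the input nodes, $\beta_i = [\beta_{i1}, \ldots, \beta_{im}]^T$ is the output weights vector connecting the $i$th hidden node and the output nodes, $b_i$ is the bias of the $i$th hidden node, and $o_j = [o_{j1}, \ldots, o_{jm}]^T$ is the output vector with respect to $x_j = [x_{j1}, \ldots, x_{jn}]^T$.

For any bounded non-constant piecewise continuous activation function $g(\cdot)$, it has been proven that SLFNs with $\tilde{N}$ hidden nodes can approximate $N$ samples with zero error such that $\sum_{j=1}^{N} \|o_j - t_j\| = 0$ [25]. Therefore there exist $\beta_i$, $w_i$, and $b_i$ that satisfy (4.4.1), and the equation can be written in the matrix form shown below

$$H\beta = T, \quad H = \begin{bmatrix} g(w_1 \cdot x_1 + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_1 + b_{\tilde{N}}) \\ \vdots & \ddots & \vdots \\ g(w_1 \cdot x_N + b_1) & \cdots & g(w_{\tilde{N}} \cdot x_N + b_{\tilde{N}}) \end{bmatrix} \tag{4.4.2}$$

$$\beta = [\beta_1, \ldots, \beta_{\tilde{N}}]^T \quad \text{and} \quad T = [t_1, \ldots, t_N]^T$$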

The definition of the SLFN up until (4.4.2) is similar to the ELM. However, in addition to the typical SLFN architecture, the FIR-ELM introduces an input tapped delay line with $n-1$ delay units at the front of the SLFN and uses both linear hidden nodes and linear output nodes. The input tapped delay line represents a finite depth memory where the current state and $n-1$ past states of a variable are used as the input to the SLFN. Such an SLFN architecture introduces system dynamics in the training process and is capable of universal approximation [9]. A diagram of the SLFN architecture used in this chapter is given in Figure 4.5, where D is the unit delay element, k is the index for the input sample, $w_{ij}$ are the hidden layer weights, $\beta_{ij}$ are the output layer weights, and $f(\cdot)$ is the output function.


Figure 4.5 A single hidden layer feedforward neural network with linear nodes and

an input tapped delay line

Remark 4.4.1: The FIR-ELM algorithm requires that the hidden layer weights be

assigned using FIR filter design techniques to reduce disturbances in the data. Hence

given that it is possible to have prior knowledge of the frequency responses from the

training datasets, appropriate hidden layer weights can be designed.

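Without loss of generality, a low pass filter for the $i$th hidden layer node can be represented in the time domain as in [9] as

$$h_d(k) = \frac{1}{2\pi} \int_{-\omega_c}^{\omega_c} e^{-j\omega (n-1)/2}\, e^{j\omega k}\, d\omega = \frac{\sin\!\big(\omega_c \big(k - \tfrac{n-1}{2}\big)\big)}{\pi \big(k - \tfrac{n-1}{2}\big)} \tag{4.4.3}$$

where $n$ is the filter length and $\omega_c$ is the cut-off frequency. The filter coefficients in (4.4.3) can then be assigned in the $i$th hidden layer node as shown below

$$w_{i1} = h_d(0), \quad w_{i2} = h_d(1), \quad \ldots, \quad w_{in} = h_d(n-1) \tag{4.4.4}$$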

It is also possible to design other types of filters such as the band pass and high pass

filters depending on the requirement of the dataset.
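The low pass design described above, a truncated and shifted ideal impulse response, can be sketched as follows. The sketch assumes that a normalized cut-off frequency c corresponds to c·π rad/sample and that no window function is applied; both are assumptions, and the function name is hypothetical.

```python
from math import pi, sin


def lowpass_coefficients(n, omega_c):
    """Ideal low pass impulse response truncated to n taps.

    omega_c is the cut-off frequency in rad/sample (0 < omega_c < pi).
    """
    centre = (n - 1) / 2
    coeffs = []
    for k in range(n):
        shift = k - centre
        if shift == 0:
            # Limit of sin(omega_c * s) / (pi * s) as s -> 0.
            coeffs.append(omega_c / pi)
        else:
            coeffs.append(sin(omega_c * shift) / (pi * shift))
    return coeffs


# Hidden node weights for a 5-tap filter with normalized cut-off 0.4.
w_i = lowpass_coefficients(n=5, omega_c=0.4 * pi)
```

The coefficients are symmetric about the centre tap, which gives the filter a linear phase response.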








The optimal output weights for the FIR-ELM are then calculated based on the minimization of the norms of the output error and the output weights matrix, with two risk balancing constants $\gamma$ and $d$ introduced to balance the empirical and structural risk. The output weights can be obtained as

$$\beta = \left(\frac{d}{\gamma} I + H^T H\right)^{-1} H^T T \tag{4.4.5}$$

The FIR-ELM algorithm can be summarized as:

1) Given a training data set $\{(x_j, t_j)\}$, design the hidden layer weights of the SLFN as in (4.4.3).

2) Calculate the hidden layer output matrix $H$.

3) Solve for $\beta$ using (4.4.5).
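Step 3 of the summary above can be sketched as a regularized least squares solve. The sketch assumes the output weights take the ridge form (d/γ · I + HᵀH)⁻¹ HᵀT, which is an assumption about the exact form of (4.4.5); the matrices and the constants `d` and `gamma` below are hypothetical toy values.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [[aug[i][j] / aug[i][i] for j in range(n, len(aug[i]))]
            for i in range(n)]


def output_weights(h, t, d, gamma):
    """Regularized output weights: (d/gamma * I + H^T H)^-1 H^T T."""
    h_t = transpose(h)
    gram = matmul(h_t, h)
    for i in range(len(gram)):
        gram[i][i] += d / gamma  # ridge term balancing the two risks
    return solve(gram, matmul(h_t, t))


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 samples, 2 hidden nodes
T = [[1.0], [-1.0], [0.0]]                # targets in {1, -1, 0} for the toy case
beta = output_weights(H, T, d=0.001, gamma=1.0)
```

A larger ratio d/γ shrinks the output weights more strongly, trading training error for a smaller weight norm.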

4.4.2 Frequency Domain Gene Feature Selection

It is usually hard to define specific frequency specifications to filter the microarray gene expression data, even when prior knowledge of the frequency response is available, as the data contain many frequency components of similar magnitude. Figure 4.6 shows an example of the frequency response for a sample in the colon tumor dataset. Therefore, in order to analyse the frequency profiles of the respective datasets, an exhaustive FIR filter design search algorithm called frequency domain gene feature selection (FGFS) is proposed. In this chapter, a frequency profile is defined as a collection of filter designs and their respective classification performances.

First, the samples in a dataset are divided into training and testing sets, $p_1$ and $p_2$. Then within the training set $p_1$, the samples are split once more into subsets $q_1$ and $q_2$ specifically for filter design selection. The subsets $q_1$ and $q_2$ will be used iteratively to evaluate the suitability of different FIR filter designs for the dataset over a range of normalized cut-off frequencies from 0.1 to 0.9 with a step size of 0.1. A bandwidth of ±0.05 is assigned for the band pass filter. Finally, the frequency profile of the dataset can be generated from the testing accuracies achieved for each filter design using the training samples $p_1$. The best performing FIR filter design (filter type and cut-off frequency) is then selected to train the FIR-ELM using all the samples in $p_1$, and the trained classifier is tested on the samples in $p_2$. A flow chart of the algorithm is given in Figure 4.7.
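The exhaustive search described above can be sketched as a loop over filter types and cut-off frequencies. The scoring function is supplied by the caller; `toy_score` below is a hypothetical stand-in, whereas a real scorer would train the FIR-ELM on the filter-selection training subset and return the performance on the filter-selection test subset.

```python
def fgfs_search(evaluate, filter_types=("low", "high", "band"), cutoffs=None):
    """Exhaustive search over filter designs; returns (profile, best design).

    `evaluate(filter_type, cutoff)` is a caller-supplied scoring function.
    """
    if cutoffs is None:
        cutoffs = [round(0.1 * step, 1) for step in range(1, 10)]  # 0.1 .. 0.9
    profile = {}
    for filter_type in filter_types:
        for cutoff in cutoffs:
            profile[(filter_type, cutoff)] = evaluate(filter_type, cutoff)
    best = max(profile, key=profile.get)
    return profile, best


# Stand-in scorer for illustration only.
def toy_score(filter_type, cutoff):
    bonus = 0.05 if filter_type == "high" else 0.0
    return 0.9 - abs(cutoff - 0.3) + bonus


profile, best = fgfs_search(toy_score)
```

The returned `profile` dictionary plays the role of the frequency profile: one score per filter design, from which the best design is picked.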

Remark 4.4.2: In the above, a gene feature selection algorithm that preserves the

microarray vectors and utilizes all the genes within a sample for classification is

developed. Hence it is different from the conventional gene selection algorithm which

selects subsets of genes. The proposed method is more robust in terms of handling the

noise that may severely affect parts of the gene expression readings, such as

experimental errors which produce outliers and other disturbances which may cause

parts of the sample to be unusable.

Figure 4.6 Frequency response of a sample from the colon dataset

[Figure 4.6 plot: two panels against Normalized Frequency (×π rad/sample) from 0 to 1]

Figure 4.7 An FIR filter design search algorithm for FIR-ELM

[Figure 4.7 flow chart: Input Data (N) is split into Train (p1) and Test (p2); Train (p1) is split into F-train (q1) and F-test (q2); the FIR-ELM is evaluated with low pass, high pass, and band pass FIR filters (0.1 < ωc < 0.9) until all designs are tested, giving the Frequency Profile and the Trained Classifier]

4.5 Experiments and Results

The performance of the FIR-ELM with the FGFS algorithm is investigated in this

section for both the leukemia and colon tumor datasets. The classification results will

then be compared with popular algorithms such as the MLP, ELM, and SVM. A

conventional frequency based gene selection process known as the discrete cosine

transform (DCT) is implemented for the MLP, ELM, and SVM algorithms to compare

the two different approaches to gene feature selection. Lastly the linear separability of

the hidden layer output of the SLFN for ELM and FIR-ELM is discussed.

4.5.1 Biomedical Datasets

Two binary microarray gene expression datasets are investigated, namely, leukemia and

colon tumor from the Kent Ridge biomedical data repository [140]. The leukemia

dataset consists of two classes of acute leukemia: acute lymphoblastic leukemia (ALL), arising from lymphoid precursors, and acute myeloid leukemia (AML), arising from myeloid precursors. There are 72 bone marrow samples in the dataset with

47 ALL and 25 AML cases and each contains 7129 gene probes. For the colon tumor

dataset a total of 62 samples were collected from colon-cancer patients where 40

biopsies are tumors and 22 others are normal samples from healthy parts of the colon.

As conventional classifiers tend to have problems classifying microarray data

due to the high number of variables [119], [121], a frequency transformation based gene

feature selection method known as the DCT will be used to perform feature selection for

the MLP, ELM, and SVM tests. The DCT is a well-known method in pattern

recognition to compress the energy in a sequence and it has been successfully

implemented in cancer classification [141] and character recognition [191], [192]

applications. In [141], the author implemented the DCT with neural networks for the

detection of stomach cancer and achieved the classification accuracy of 99.6%, which is

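among the highest ever reported. In addition, the DCT belongs to the same family of frequency transform based feature selection methods as the FIR filters, with the added advantage of dimension reduction. Therefore the DCT is the preferred feature selection algorithm in this chapter. The DCT for a one dimensional array is defined as

$$y(k) = w(k) \sum_{i=1}^{L} x(i) \cos\!\left(\frac{\pi (2i-1)(k-1)}{2L}\right), \quad k = 1, \ldots, L \tag{4.5.1}$$

$$w(k) = \begin{cases} 1/\sqrt{L}, & k = 1 \\ \sqrt{2/L}, & 2 \le k \le L \end{cases}$$

where $L$ is the length of the array sequence $x$.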

The DCT generates coefficients that will then be used as input data for

evaluating the performance of the MLP, ELM, and SVM algorithms. In order to select

the most relevant coefficients the 90% criterion is employed to select coefficients that

represent 90% of the total energy. Although a lower percentage can be chosen, this

criterion is selected to avoid losing too much information from the dataset and reducing

the classification performance. A summary of properties for the datasets and the DCT

feature selection is presented in Table 4.4.

Table 4.4: Selection of DCT coefficients for leukemia and colon datasets

Dataset       Samples   Genes   DCT Coefficients
Leukemia      72        7129    2325
Colon Tumor   62        2000    613
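The DCT feature selection can be sketched in two steps: compute the transform, then keep enough coefficients to hold 90% of the total energy. The sketch assumes the orthonormal DCT-II and assumes that the retained coefficients are the leading ones in index order; the thesis does not state how the 90% set is chosen, so the second function is only one possible reading.

```python
from math import cos, pi, sqrt


def dct(x):
    """Orthonormal DCT-II of a sequence (O(L^2), for illustration only)."""
    length = len(x)
    out = []
    for k in range(length):  # k is 0-based here
        scale = sqrt(1 / length) if k == 0 else sqrt(2 / length)
        total = sum(x[i] * cos(pi * (2 * i + 1) * k / (2 * length))
                    for i in range(length))
        out.append(scale * total)
    return out


def leading_count_for_energy(coeffs, fraction=0.9):
    """Smallest number of leading coefficients holding `fraction` of energy."""
    total = sum(c * c for c in coeffs)
    running = 0.0
    for count, c in enumerate(coeffs, start=1):
        running += c * c
        if running >= fraction * total:
            return count
    return len(coeffs)


flat = [4.0, 4.0, 4.0, 4.0]  # a constant sequence has all energy in the first term
c = dct(flat)
kept = leading_count_for_energy(c)
```

For a real sample the retained count would be much larger, as the coefficient counts in Table 4.4 indicate.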

4.5.2 Experimental Settings

As the classification of microarray data concerns the critical diagnosis of cancer, the

misclassification rate for each class must also be minimized, hence both the

classification accuracy and minimum sensitivity are usually considered [127], [142].

The sensitivity is defined as the number of correct patterns predicted to be in a class

with respect to the total number of patterns in the class. The minimum sensitivity is

selected from the class with the lowest sensitivity measure within the confusion matrix.

In order to properly select the meta-parameters for each classification algorithm, a

classification performance measurement based on the classification accuracy ( ) and


minimum sensitivity ( ) is used. For each of the meta-parameter

[ ] considered, the optimal value is selected from (4.5.2)

( )

where is the mean of , and is the mean of . All the parameters are

evaluated using the repeated 10-fold stratified cross validation (CV) process with the

training data only. The CV process is repeated 20 times for each considered parameter.
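The classification accuracy and minimum sensitivity defined in this section can be computed from a confusion matrix as in the sketch below. It assumes rows are predicted classes and columns are actual classes, which is the layout implied by the sensitivities in Table 4.6; the function name is hypothetical.

```python
def accuracy_and_min_sensitivity(confusion):
    """`confusion[p][a]` counts samples of actual class a predicted as class p."""
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[c][c] for c in range(n_classes))
    sensitivities = []
    for a in range(n_classes):
        actual_total = sum(confusion[p][a] for p in range(n_classes))
        sensitivities.append(confusion[a][a] / actual_total)
    return correct / total, min(sensitivities)


# Mean counts from the FIR-ELM columns of Table 4.6.
fir_elm = [[46.6, 2.1],
           [0.4, 22.9]]
acc, min_sen = accuracy_and_min_sensitivity(fir_elm)
```

With these mean counts the sketch gives an accuracy of about 96.5% and a minimum sensitivity of about 91.6%, close to the values reported for the FIR-ELM on the leukemia dataset.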

For both the MLP and ELM, the number of neurons is considered from 1 to 200

in increments of 1, the sigmoidal activation function is used in the hidden layer, and the

linear activation function is used in the output layer. The scaled conjugate gradient

algorithm is used to train the MLP. The linear kernel is implemented in the SVM after

testing using several other popular kernels such as the RBF and polynomial gave poor

results. The regularization parameter for the SVM is considered within the range of

to with logarithmic increments of 1. The FIR-ELM filter length is similar to

the dimensions of the gene sample, and the regularization parameters are selected as

and based on the prior knowledge in [9]. Lastly, the targets for both

datasets are defined as [1, -1].

A repeated 2-fold stratified CV is implemented to train the classifier algorithms

after the meta-parameter selection process is completed. The CV cycle is repeated 20

times for each algorithm to obtain the mean classification accuracy. The confusion

matrices show the mean number of correctly classified as well as misclassified samples

for each algorithm.
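The repeated stratified cross validation described above can be sketched with the standard library alone. The splitting scheme below (shuffle each class, then deal indices round-robin into folds) is one simple way to keep class proportions similar across folds and is an assumption, not necessarily the procedure used in the thesis; the seed and function names are hypothetical.

```python
import random


def stratified_folds(labels, n_folds, rng):
    """Split sample indices into n_folds folds with similar class proportions."""
    folds = [[] for _ in range(n_folds)]
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds


def repeated_stratified_cv(labels, n_folds=2, repeats=20, seed=0):
    """Yield (train_indices, test_indices) for every fold of every repeat."""
    rng = random.Random(seed)
    for _ in range(repeats):
        folds = stratified_folds(labels, n_folds, rng)
        for held_out in range(n_folds):
            test = sorted(folds[held_out])
            train = sorted(i for f, fold in enumerate(folds)
                           if f != held_out for i in fold)
            yield train, test


labels = [1] * 40 + [-1] * 22  # class sizes of the colon tumor dataset
splits = list(repeated_stratified_cv(labels))
```

Each of the 20 repeats contributes two train/test splits, and the mean accuracy is then taken over all of them.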

4.5.3 Leukemia Dataset

The frequency profiles of the leukemia dataset for the low pass, high pass, and band pass filters are shown in Figures 4.8, 4.9, and 4.10 respectively, with error bars showing the standard deviation of the classification performance. The optimal filter design based on (4.5.2) is the high pass filter with a mean normalized cut-off frequency of 0.29. However, it is not possible to state which filter type is better due to the large standard deviations in the frequency profile plots. Instead, the results show that at each iteration of testing, the optimal filter design depends on the selection of samples for training. Different filtering criteria may be derived in the meta-parameter selection process based on the training samples. The possibility of using different filter designs to produce similar classification performances indicates that different filter design criteria produce vastly different data patterns which can still be mapped by the output layer of the SLFN. Therefore the selection of the appropriate filter remains subjective and dependent on the classification requirements (e.g. type of noise present).

Figure 4.8 Classification performance for leukemia with low pass filter

[Figure 4.8 plot: classification performance against normalized cut-off frequency ωc from 0 to 1, with error bars]





Figure 4.9 Classification performance for leukemia with high pass filter

Figure 4.10 Classification performance for leukemia with band pass filter

[Figures 4.9 and 4.10 plots: classification performance against normalized cut-off frequency ωc from 0 to 1, with error bars]





The classification performance in Table 4.5 shows that the FIR-ELM has

achieved the best result, with an accuracy of 96.53% and a standard deviation of 1.79%

which is better than the benchmark of the SVM. The worst performing algorithm is the

ELM with an accuracy of 76.90% and the largest standard deviation. The confusion

matrix is shown in Table 4.6, where the ALL cases are labeled as class 1 and AML

cases are labeled as class 2. The FIR-ELM has the most similar sensitivities for both

cases. From the meta-parameter selection process, the number of neurons for the MLP

is 6, the ELM requires 174 neurons, and the SVM regularization parameter is 0.0046.

Table 4.5: Classification performance for leukemia dataset

                MLP      SVM      ELM      FIR-ELM   FIR-ELM (R)
Accuracy (%)    88.01    95.50    76.90    96.53     94.49
Std. Dev. (%)   3.78     2.42     6.39     1.79      1.91
Time (s)        10.87    0.66     0.4      1.12      1.12

(R): FIR-ELM with random gene order

Table 4.6: Confusion matrix for classification of leukemia dataset

              MLP            SVM            ELM            FIR-ELM        FIR-ELM (R)
Prediction    1      2       1      2       1      2       1      2       1      2
1             44.3   5.9     46.2   2.45    39.1   8.65    46.6   2.1     46     3
2             2.6    19.1    0.8    22.6    8.0    16.4    0.4    22.9    1      22
Sen* (%)      94.2   76.4    98.3   90.2    83.1   65.4    99.2   91.5    97.9   88.0

(R): FIR-ELM with random gene order, * Sen: sensitivity
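
As a minimal sketch of how the sensitivities in Table 4.6 relate to the confusion matrix entries, the Python snippet below divides each diagonal entry by its column total. It assumes that the columns of each 2-by-2 block hold the actual classes and the rows hold the predictions, which is an interpretation inferred from the reported numbers rather than stated in the table. The function name `class_sensitivities` is illustrative only.

```python
import numpy as np

def class_sensitivities(confusion):
    """Per-class sensitivity (%) assuming rows = predictions, columns = actual classes."""
    confusion = np.asarray(confusion, dtype=float)
    correct = np.diag(confusion)            # correctly classified samples per class
    actual_totals = confusion.sum(axis=0)   # number of samples of each actual class
    return 100.0 * correct / actual_totals

# First 2-by-2 block of Table 4.6 (leukemia dataset)
first_block = [[44.3, 5.9],
               [2.6, 19.1]]
sens = class_sensitivities(first_block)
```

For this block the sketch gives roughly 94.5% and 76.4%. The small gap to the reported 94.2% for class 1 presumably arises because the tabulated values are averages over repeated trials.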


4.5.4 Colon Tumor Dataset

The frequency profiles for the colon tumor dataset using the low pass, high pass, and

band pass filters are shown in Figures 4.11, 4.12, and 4.13, with error bars showing the

standard deviation of the classification performance. The optimal filter design based on

(4.5.2) is the high pass filter with a mean normalized cut-off frequency of 0.47. Similar

to the leukemia dataset, it is not possible to state which filter type is better due to the

large standard deviations in the frequency profile plots. The classification performance

of the colon tumor dataset is then presented in Table 4.7. The SVM achieves the highest

mean accuracy followed by the FIR-ELM, which is the best performing algorithm

among the neural network based algorithms. However, due to the large standard

deviations of the classification accuracy, it is indeed impossible to declare a best

classifier for the colon tumor dataset. From the meta-parameter selection process, the

number of neurons for the MLP is 42, the number of neurons for the ELM is 142, and

the SVM regularization parameter is 0.0082.

Figure 4.11 Classification performance for colon dataset with low pass filter (horizontal axis: normalized cut-off frequency)

Figure 4.12 Classification performance for colon dataset with high pass filter (horizontal axis: normalized cut-off frequency)

Figure 4.13 Classification performance for colon dataset with band pass filter (horizontal axis: normalized cut-off frequency)


The confusion matrix for the colon tumor dataset is shown in Table 4.8. The

tumor cells are labeled as class 1 and the healthy cells are labeled as class 2. It can be

seen from Table 4.8 that the SVM has the best sensitivity for class 1 while the FIR-

ELM has the best sensitivity for class 2.

Table 4.7: Classification performance for colon dataset

                MLP      SVM      ELM      FIR-ELM   FIR-ELM (R)
Accuracy (%)    69.76    79.76    71.53    76.85     76.61
Std. Dev. (%)   8.52     3.57     6.09     6.18      4.1
Time (s)        5.72     0.41     0.12     22.74     22.74

(R): FIR-ELM with random gene order

Table 4.8: Confusion matrix for classification of colon dataset

              MLP            SVM            ELM            FIR-ELM        FIR-ELM (R)
Prediction    1      2       1      2       1      2       1      2       1      2
1             30.3   9.1     34.5   7.1     39.1   8.7     32.5   6.8     32.5   7
2             9.7    13      5.6    15      8      16.4    7.6    15.2    7.5    15
Sen* (%)      75.8   58.9    86.3   68      78.1   59.6    81.1   69.1    81.3   68.2

(R): FIR-ELM with random gene order, * Sen: sensitivity


4.6 Discussions

Overall, the FIR-ELM has been shown to achieve comparable or

better results in both the leukemia and colon tumor classification problems. While the

design of the FIR filters remains an art and is still widely open to interpretation, the

method proposed in this chapter gives a straightforward suggestion based on the

conventional training of neural networks. For both microarray datasets, the ELM is seen

to be the fastest followed by SVM, FIR-ELM, and MLP. The time recorded represents

20 iterations of training and testing for each algorithm. The results for the randomly

permuted gene order case for both datasets have shown that the classification accuracy

remains similar to that of the original gene order. Based on these results, the FIR-ELM

with FGFS seems to be insensitive to the gene ordering and is capable of learning from

different variants of the datasets. This is consistent with the large standard deviations

observed in Figures 4.8 to 4.13, which indicate that the FGFS algorithm adapts to

the sample characteristics when selecting the optimal filter design.

4.6.1 Linear Separability of the Hidden Layer Output for SLFN

In general, SLFNs utilize the hidden layer as a pre-processor to

map the input data into the desired feature space so that the data points will be easily

separable. The output layer then maps the features to the target classes. In order to

compare the quality of the chosen hidden layer weights for the ELM and the FIR-ELM,

the linear separability gene pair testing algorithm is applied to the outputs

of the hidden layer for both classifiers, where the outputs of the hidden layer are

defined as


] ( )

Note that (4.6.1) omits the activation function from the hidden layer output

defined earlier in (4.4.2). This is because the sigmoid activation functions would

squash the outputs into a much smaller range and therefore discard the original

mappings of the hidden layer weights.


Using the same allocation of samples as in Section 4.3, Table 4.9 shows the

linearly separable gene pair testing results of the hidden layer outputs for ELM and FIR-

ELM. It can be seen from Table 4.9 that the hidden layer of the FIR-ELM reveals more

linearly separable gene pairs compared to the ELM. The empirical results achieved

suggest that the hidden layer design of the FIR-ELM improves the performance in terms

of feature discovery, which is consistent with the improved classification accuracy

obtained for the leukemia and colon tumor datasets. It should be noted that this result

only supports a relative comparison between the two classifiers under our constraints.

However, when the results in Table 4.9 are compared with Table 4.3 which

shows the linearly separable pairs for the original dataset, the positive correlation

between the number of linearly separable pairs and classification accuracy does not hold.

It is seen that the linearly separable pairs for the leukemia dataset have increased while

the linearly separable pairs for the colon tumor dataset decreased. Ideally the number of

linearly separable pairs is expected to increase to indicate the discovery of more features

at the hidden layer output. The discrepancy may be due to the transformation of the original data

into the feature space which inhibits direct comparisons.

Table 4.9: Linearly separable gene pairs for the hidden layer output in ELM and FIR-ELM for leukemia and colon datasets

Algorithm   Dataset        Mean     Std. Dev.
ELM         Leukemia       17       46
            Colon Tumor    3        7
FIR-ELM     Leukemia       35420    19396
            Colon Tumor    33       63

From the results obtained, it can be concluded that the positive correlation

between the number of linearly separable pairs and classification accuracy holds only

when comparing different SLFN training algorithms. The hidden layer output of SLFNs

need not be more linearly separable than the original dataset to achieve good classification


performance. The criterion above could find many applications in the selection of the

optimal hidden layer weights for SLFNs.

4.7 Conclusion

In this chapter, the FIR-ELM has been implemented for the binary classification of two

biomedical datasets. It has been seen that the microarray gene expression samples are

treated as time series to form the input patterns for the classification with the FIR-ELM.

To assign the optimal input weights of the neural classifier, a frequency domain gene

feature selection (FGFS) algorithm has been proposed to evaluate the suitability of

different FIR filter designs. For both the leukemia and colon tumor datasets, the FIR-

ELM with the FGFS has shown a better performance compared to the other neural

networks based algorithms while achieving comparable results with the SVM. It has

been further shown in the simulation section that the FIR-ELM achieves much better

results in terms of gene feature discovery as compared to the ELM. Future work

includes the testing of more filter types for the hidden layer design of the neural

classifier and the extension to multi-class pattern classification.


Chapter 5

Frequency Spectrum Based

Learning Machine

In this chapter, a new robust single hidden layer feedforward network (SLFN) based

pattern classifier is developed. It is shown that the frequency-spectrums of the desired

feature vectors can be specified in terms of the discrete Fourier transform (DFT)

technique. The input weights of the SLFN are then optimized with the regularization

theory such that the error between the frequency components of the desired feature

vectors and the ones of the feature vectors from the outputs of the hidden layer is

minimized. For linearly separable input patterns, the hidden layer of the SLFN plays the

role of removing the effects of the disturbance from the noisy input data and providing

the linearly separable feature vectors for the accurate classification. However, for non-

linearly separable input patterns, the hidden layer is capable of assigning the DFTs of all

feature vectors to the desired positions in the frequency-domain such that the

separability of all non-linearly separable patterns is maximized. In addition, the output

weights of the SLFN are also optimally designed so that both the empirical and the

structural risks are well balanced and minimized under a noisy environment. Two

simulation examples are presented to show the excellent performance and effectiveness

of the proposed classification scheme.


5.1 Introduction

Neural networks consist of many single processing nodes operating in parallel. It is such

a parallel architecture of neural networks that allows engineers and scientists to develop

the data-based systems to process vast amounts of data in manufacturing, transportation,

process control, dynamic system modeling, digital signal and image processing and

information retrieval [1], [91], [92], [95-99], [143-146]. It has been further seen from

the extreme learning machine (ELM) and its applications [3], [18], [23-25], [89], [93],

[94] that the neural networks with a single hidden layer and the randomly assigned input

weights and hidden layer biases can perform accurate universal approximation and

powerful parallel processing for complex non-linear mappings and pattern

classifications with vast amounts of data.

In the area of neural computing, the commonly used weight training method is

the gradient-based backpropagation (BP) algorithm [1], [91], [143]. In order to train an

SLFN to learn a predefined set of input and output data pairs with the BP algorithm, an

input pattern is applied as a stimulus to the hidden layer of the neural network and then

is propagated to the output layer to generate an output pattern. The BP can then be

implemented, based on the computed error between the output pattern of the SLFN and

the desired one, to update the weights from the output layer to the hidden layer such that,

as the second input pattern is applied, the error between the generated output pattern and

the desired one is reduced. Such a training process is repeated until the error converges

to zero (or becomes sufficiently small). The BP training process is therefore time-consuming,

and its slow convergence has limited the use of BP in many practical applications where vast

amounts of data are presented and fast on-line training is required.

Compared with the BP algorithm, the ELM algorithm developed in [3], [18],

[23-25], [89], [93], [94] has revolutionized the training of neural networks in the

following ways: (i) The input weights and the hidden layer biases of the SLFNs can be

randomly assigned if the activation functions of the hidden nodes are infinitely

differentiable; (ii) The SLFNs are simply treated as linear systems and the output

weights of the SLFNs can be analytically determined by using the generalized inverse

of the hidden layer output matrix; (iii) Because of the characteristics of the ELM’s batch


learning, that is, all examples in the training sample set are used in one operation of the

global optimization, the learning speed of the ELM can be much faster than those of the

BP as well as the BP-like algorithms. In view of these remarkable merits, recently the

ELM has received a great deal of attention in the area of computational intelligence with

application and extension to many other areas [18], [147].
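
The three points above can be illustrated with a minimal NumPy sketch of a basic ELM. It is not the implementation used in [3] or [18]: the function names, the uniform range [-1, 1] for the random parameters, the sigmoid activation, and the toy data are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_elm(X, T, n_hidden, rng):
    """Basic ELM: random hidden layer, output weights by generalized inverse."""
    n_inputs = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))  # (i) random input weights
    b = rng.uniform(-1.0, 1.0, size=n_hidden)              # (i) random hidden biases
    H = sigmoid(X @ W + b)                                  # hidden layer output matrix
    beta = np.linalg.pinv(H) @ T                            # (ii) least-squares output weights
    return W, b, beta

def predict_elm(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta

# (iii) all training samples are used in a single batch operation
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
T = np.sin(X.sum(axis=1, keepdims=True))
W, b, beta = train_elm(X, T, n_hidden=10, rng=rng)
train_rmse = np.sqrt(np.mean((predict_elm(X, W, b, beta) - T) ** 2))
```

No iterative weight update appears anywhere in the sketch, which is the source of the speed advantage over BP described above.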

However, from the perspective of engineering applications, engineers may be

concerned with whether the SLFNs with the ELM behave with a strong robustness

property with respect to the input disturbances and the random input weights in practice.

As a matter of fact, it has been noted from both the simulations and experiments that the

SLFNs trained with the ELM in many cases behave with a poor robustness property

with respect to the input disturbances. For instance, when the input weights and the

hidden layer biases of an SLFN are randomly assigned, the changes of the hidden layer

output matrix of the SLFN may be very large. This in turn will result in large changes of

the output weight matrix. According to the statistical learning theory [86], [100-103],

the large changes of the output weight matrix will greatly increase both the structural

and empirical risks of the SLFNs, which will in turn degrade the robustness property

of SLFNs with respect to the input disturbances. Therefore, it is necessary to properly

design the input weights such that the hidden layer of SLFNs plays an important role in

improving the robustness with respect to disturbances, and reducing both structural and

empirical risks of SLFNs. In addition, although the SLFNs with random input weights

and hidden biases can exactly learn distinct observations, it seems that the

potential of the hidden layer for information processing in the SLFNs has not been

fully explored in [3], [18], [23-25], [89], [93], [94].

In order to improve the robustness of the SLFNs trained with the ELM, a new

training algorithm, called FIR-ELM, has been developed recently in [9]. It is seen that

the linear FIR filtering technique is used to design the input weights such that the

hidden layer of the SLFN performs as a pre-processor of the input data for removing the

input disturbance and the undesired signal components. The regularization theory [1] is

then adopted to design the output weight matrix to minimize the output error, balance

and reduce both the empirical and structural risks of the SLFN. The simulation results in

[9] and [11] have shown excellent robustness performance of the SLFNs with the FIR-


ELM under a noisy environment. It has been noted from [9] that both the hidden nodes

and the output nodes in the SLFN are linear. However, in order to make the SLFN with

the linear nodes have the learning capability, a tapped-delay line with a number of unit-

delay elements is added to the input layer of the SLFN. According to the filtering and

system theory [2], [107], [113], [148], such an SLFN with both the linear nodes and the

input tapped-delay line is equivalent to the SLFN with non-linear nodes regarding the

universal learning capability. Furthermore, the design and the analysis of the SLFNs

with the linear nodes as well as the input tapped-delay line are much easier than those of

the SLFNs with non-linear nodes.

In this chapter, the SLFNs with both the linear nodes and the input tapped-delay

line for pattern classifications will be further studied based on [9], and a new training

algorithm called DFT-ELM, will be developed. It will be seen that, unlike the FIR-ELM,

for the input data vectors with patterns, the desired feature vectors are specified in

terms of their frequency-spectrums. The input weights of the SLFN are then trained

with the regularization theory [1], [2] to minimize the error between the frequency

components of the desired feature vectors and the ones of the feature vectors from the

hidden layer of the SLFN.

It will be shown in the simulation section that, if all input patterns are linearly

separable, the hidden layer of the SLFN with the optimal input weights will play a role

of removing the effects of the disturbance from the noisy input data and providing the

linearly separable feature vectors for the accurate pattern classification. However, if the

input patterns are non-linearly separable, the input weights of the SLFN will be trained

in the sense that the hidden layer is capable of assigning the DFTs of all feature vectors

to the “desired positions” in the frequency-domain, specified by the DFTs of the desired

feature vectors, so that the separability of all non-linearly separable patterns is

maximized in the feature space.

In order to make sure that the SLFN can properly learn the maps between the

input and output training data pairs and reduce both the structural and empirical risks,

the regularization technique is also adopted to optimize the output weights. Because

optimizing both the input and the output weights significantly reduces both the

structural and the empirical risks, the SLFN classifier with the DFT-ELM is much more

robust to input disturbances than the SLFNs trained with the regularized ELM (R-ELM)

[18] and the FIR-ELM [9]. This merit will be confirmed in the simulation

section through the performance comparisons among the SLFN classifiers with the

DFT-ELM, the FIR-ELM and the R-ELM, respectively.

The rest of the chapter is organized as follows: In Section 5.2, the SLFN

classifier with both linear nodes and a tapped-delay line is formulated, the

frequency-domain representation of a feature vector is defined, and the relationship between a

feature vector and its frequency-domain representation is specified in terms of the DFT. In

Section 5.3, the optimizations of both the input and the output weights with the

regularization theory are discussed in detail and the effects of the regularization

parameters and the number of the hidden nodes on the performance of the SLFN

classifier are also explored. In Section 5.4, the SLFN based classifiers, trained with the

DFT-ELM, the FIR-ELM and the R-ELM are compared to demonstrate the

effectiveness and strong robustness of the SLFN classifier with the DFT-ELM. Section

5.5 gives the conclusion and some further work.

5.2 Problem Formulation

A single hidden layer feedforward network based pattern classifier is described in

Figure 5.1:

Figure 5.1 A single hidden layer network with linear nodes and an input delay line


where the output layer has $m$ linear nodes, the hidden layer has $\tilde{N}$ linear nodes, $w_{ij}$ (for $i = 1, \dots, \tilde{N}$ and $j = 1, \dots, n$) are the input weights, $\beta_{ij}$ (for $i = 1, \dots, \tilde{N}$ and $j = 1, \dots, m$) are the output weights, $y_i(k)$ (for $i = 1, \dots, \tilde{N}$) are the outputs of the hidden nodes, and the string of $D$s seen at the input layer are $n-1$ unit-delay elements that ensure the input sequence $x(k), x(k-1), \dots, x(k-n+1)$ represents a time series, consisting of both the present and the past observations of the process.

Remark 5.2.1: It is well known from [113] that an SLFN with both linear nodes and a string of unit-delay elements at the input layer, as seen in Figure 5.1, has the capability of universal approximation of any continuous function. It is because of the unit-delay elements added to the input layer that every hidden node in the SLFN performs as an $(n-1)$th-order FIR filter that can approximate any linear or non-linear function by properly choosing the input weights. In addition, the design and analysis of the SLFNs with the linear nodes and the unit-delay elements are much easier than those of the SLFNs with non-linear nodes.

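As seen in Figure 5.1, the input and the output data vectors of the SLFN can be expressed in the forms:

$\mathbf{x}(k) = [\,x(k) \;\; x(k-1) \;\; \cdots \;\; x(k-n+1)\,]^T$   (5.2.1)

$\mathbf{o}(k) = [\,o_1(k) \;\; o_2(k) \;\; \cdots \;\; o_m(k)\,]^T$   (5.2.2)

the output of the $i$th hidden node can be computed as:

$y_i(k) = \sum_{j=1}^{n} w_{ij}\, x(k-j+1) = \mathbf{w}_i^T \mathbf{x}(k)$   (5.2.3)

with

$\mathbf{w}_i = [\,w_{i1} \;\; w_{i2} \;\; \cdots \;\; w_{in}\,]^T$   (5.2.4)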

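and the $i$th output of the network, $o_i(k)$, is of the form:

$o_i(k) = \sum_{p=1}^{\tilde{N}} \beta_{pi}\, y_p(k) = \boldsymbol{\beta}_i^T \mathbf{y}(k)$   (5.2.5)

with

$\boldsymbol{\beta}_i = [\,\beta_{1i} \;\; \beta_{2i} \;\; \cdots \;\; \beta_{\tilde{N}i}\,]^T$   (5.2.6)

$\mathbf{y} = [\,y_1 \;\; y_2 \;\; \cdots \;\; y_{\tilde{N}}\,]^T$   (5.2.7)

Thus, the output data vector $\mathbf{o}(k)$ can be expressed as:

$\mathbf{o}(k) = [\,o_1(k) \;\; o_2(k) \;\; \cdots \;\; o_m(k)\,]^T = [\,\boldsymbol{\beta}_1^T \mathbf{y}(k) \;\; \boldsymbol{\beta}_2^T \mathbf{y}(k) \;\; \cdots \;\; \boldsymbol{\beta}_m^T \mathbf{y}(k)\,]^T = \boldsymbol{\beta}^T \mathbf{y}(k)$   (5.2.8)

with

$\boldsymbol{\beta} = [\,\boldsymbol{\beta}_1 \;\; \boldsymbol{\beta}_2 \;\; \cdots \;\; \boldsymbol{\beta}_m\,]$   (5.2.9)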

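Suppose that there are $N$ input pattern vectors $\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(N)$ and $N$ corresponding desired output data vectors $\mathbf{t}(1), \mathbf{t}(2), \dots, \mathbf{t}(N)$, respectively, for training the SLFN in Figure 5.1, where

$\mathbf{t}(i) = [\,t_{i1} \;\; t_{i2} \;\; \cdots \;\; t_{im}\,]^T$   (5.2.10)

Then, using (5.2.8), it can be seen that

$[\,\mathbf{o}(1) \;\; \cdots \;\; \mathbf{o}(N)\,] = [\,\boldsymbol{\beta}^T \mathbf{y}(1) \;\; \cdots \;\; \boldsymbol{\beta}^T \mathbf{y}(N)\,]$   (5.2.11)

or, in matrix form,

$\mathbf{H}\boldsymbol{\beta} = \mathbf{O}$   (5.2.12)

with

$\mathbf{H} = [\,\mathbf{y}(1) \;\; \mathbf{y}(2) \;\; \cdots \;\; \mathbf{y}(N)\,]^T = \begin{bmatrix} y_1(1) & y_2(1) & \cdots & y_{\tilde{N}}(1) \\ y_1(2) & y_2(2) & \cdots & y_{\tilde{N}}(2) \\ \vdots & \vdots & & \vdots \\ y_1(N) & y_2(N) & \cdots & y_{\tilde{N}}(N) \end{bmatrix}$   (5.2.13)

$\mathbf{O} = [\,\mathbf{o}(1) \;\; \mathbf{o}(2) \;\; \cdots \;\; \mathbf{o}(N)\,]^T$   (5.2.14)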
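
The structure described in this section can be sketched in a few lines of NumPy. The sketch assumes that the input vector at time $k$ is the delay-line content $[x(k), x(k-1), \dots, x(k-n+1)]$, that each hidden node computes a weighted sum of that vector, and that the output layer is linear. The function names and the zero initial condition of the delay line are illustrative assumptions, not part of the original formulation.

```python
import numpy as np

def delay_line_vector(signal, k, n):
    """Return [x(k), x(k-1), ..., x(k-n+1)], with zeros before the first sample."""
    idx = k - np.arange(n)
    values = signal[np.clip(idx, 0, None)]
    return np.where(idx >= 0, values, 0.0)

def linear_slfn_forward(X, W, beta):
    """X: (N, n) patterns, W: (n_hidden, n) input weights, beta: (n_hidden, m)."""
    H = X @ W.T       # row i holds the hidden layer feature vector of pattern i
    O = H @ beta      # linear output layer
    return H, O

# One hidden node with n = 3 taps behaves as an FIR filter on the input signal
signal = np.arange(1.0, 9.0)
w = np.array([0.5, -1.0, 2.0])
node_output = np.array([w @ delay_line_vector(signal, k, 3)
                        for k in range(len(signal))])
fir_output = np.convolve(signal, w)[:len(signal)]   # same values as node_output
```

The agreement between `node_output` and `fir_output` illustrates why each linear hidden node with a tapped-delay line can be analysed with standard FIR filter theory.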

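Remark 5.2.2: The matrix $\mathbf{H}$ in (5.2.13) is the hidden layer output matrix of the SLFN that contains all output feature vectors corresponding to the input data vectors $\mathbf{x}(1), \mathbf{x}(2), \dots, \mathbf{x}(N)$, respectively. For instance, for the $i$th given input data vector $\mathbf{x}(i)$, the corresponding feature vector from the outputs of the hidden layer is the $i$th row of the matrix $\mathbf{H}$:

$[\,y_1(i) \;\; y_2(i) \;\; \cdots \;\; y_{\tilde{N}}(i)\,]$   (5.2.15)

Since the output of the $j$th hidden node, corresponding to the $i$th input pattern vector, can be written as:

$y_j(i) = \mathbf{w}_j^T \mathbf{x}(i)$   (5.2.16)

the $i$th output feature vector in (5.2.15) can then be written as:

$[\,y_1(i) \;\; y_2(i) \;\; \cdots \;\; y_{\tilde{N}}(i)\,] = [\,\mathbf{w}_1^T \mathbf{x}(i) \;\; \mathbf{w}_2^T \mathbf{x}(i) \;\; \cdots \;\; \mathbf{w}_{\tilde{N}}^T \mathbf{x}(i)\,]$   (5.2.17)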

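For further analysis, the frequency components of the $i$th output feature vector in (5.2.17) can be expressed, in terms of its discrete Fourier transform (DFT), as follows [107], [148]:

$Y_i[0] = \sum_{p=1}^{\tilde{N}} y_p(i)$   (5.2.18)

$Y_i[1] = \sum_{p=1}^{\tilde{N}} y_p(i)\, e^{-j\frac{2\pi}{\tilde{N}}(p-1)}$   (5.2.19)

$Y_i[2] = \sum_{p=1}^{\tilde{N}} y_p(i)\, e^{-j\frac{2\pi}{\tilde{N}}2(p-1)}$   (5.2.20)

$\vdots$

$Y_i[\tilde{N}-1] = \sum_{p=1}^{\tilde{N}} y_p(i)\, e^{-j\frac{2\pi}{\tilde{N}}(\tilde{N}-1)(p-1)}$   (5.2.21)

where $Y_i[0], Y_i[1], \dots, Y_i[\tilde{N}-1]$ are the samples of the frequency spectrum of the $i$th output feature vector in (5.2.17) at the equally spaced frequencies $\omega_k = 2\pi k/\tilde{N}$, for $k = 0, 1, \dots, \tilde{N}-1$, respectively.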
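
As a small numerical illustration of the DFT of a feature vector, the sketch below stacks the sums for all frequency indices into a single matrix-vector product. The helper name `dft_matrix` and the example feature vector are assumptions for illustration; the result is compared against NumPy's FFT, which uses the same sign and scaling convention.

```python
import numpy as np

def dft_matrix(n_points):
    """Matrix whose (k, p) entry is exp(-j * 2*pi * k * p / n_points)."""
    k = np.arange(n_points)
    return np.exp(-2j * np.pi * np.outer(k, k) / n_points)

feature = np.array([1.0, 2.0, 0.0, -1.0])   # hypothetical hidden layer feature vector
A = dft_matrix(len(feature))
spectrum = A @ feature                        # spectrum samples at equally spaced frequencies
gram = A.conj().T @ A                         # equals n_points times the identity matrix
```

The last line shows that the conjugate transpose of the transformation matrix times the matrix itself is a scaled identity, a property that makes the inverse transform a simple rescaled conjugate transpose.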

For further processing, (5.2.18)-(5.2.21) can be expressed in the following

matrix form:

[ [ ]

[ ]

[ ]

[ ]]


( )

( ) ( )

( )

( )( ) ]

[ ( )

( )

( )

( )]

( )


[ [ ] [ ] [ ] [ ]]

( )


( )

( ) ( )

( )

( )( ) ]

( )


[ ( ) ( ) ( ) ( )] ( )

(5.2.22) can then be written as:

( )

Using (5.2.3) in (5.2.25) leads to:


[ ( )

( ) ( )

( )]



( )

Then, (5.2.26) becomes

( )

Considering all input data pattern vectors ( ) ( ) ( ), it can be

seen that

[ ] [ ] ( )


( )


[ ] ( )

Let the frequency spectrums of the desired feature vectors be described by

[ ] ( )

Remark 5.2.3: The selection of the frequency spectrums of the desired output feature

vectors in (5.2.32) is mainly based on one’s understanding of the characteristics of the

input patterns. For instance, if all input patterns are linearly separable, the frequency

spectrums of the desired feature vectors can be chosen as those of the filtered input

pattern vectors, by removing all frequency components of the disturbances from the

frequency spectrums of the input pattern vectors. In this case, the hidden layer of the

SLFN with the optimal input weights plays a role of pre-processing and providing the

linearly separable feature vectors for the accurate classification in the output layer. In

practice, the frequency spectrums of the desired output feature vectors in (5.2.32) can

be provided by a reference filter for designing the optimal input weight matrix in

(5.2.27). This point can be seen from the first example in the simulation section.


Remark 5.2.4: However, if the input patterns are non-linearly separable, the frequency

spectrums of the desired feature vectors in (5.2.32) should be assigned to the “desired

location” in the frequency-domain in the sense that, through the optimization of the

input weights of the SLFN, the separability of all feature vectors from the outputs of the

hidden layer is maximized in the feature space. This point will be explored in detail in

the second example of the simulation section.

In the next section the regularization theory in [1], [2] and [149] will be used to

design both the optimal input weight matrix and the optimal output weight matrix .

Also, the effects of the regularization parameters and the number of the hidden nodes in

the performance of the SLFN classifier will be explored in detail.

5.3 Design of the Optimal Input and Output Weights

As described in Section 5.2, the input weight matrix in the hidden layer of the SLFN

should be trained such that the error between the frequency components of the desired

feature vectors and the ones of the feature vectors from the outputs of the hidden layer is

minimized. For this purpose, the design issue can be formulated by the following

regularization problem [1], [2], [9], [149]:

Minimize { ‖ ‖

‖ ‖ } ( )

Subject to ( )

where is the error between the frequency components of the desired feature vectors

and the ones of the corresponding feature vectors from the outputs of the hidden layer,

and are the positive real regularization parameters, ‖ ‖ is the regularizer which,

through the proper choice of the regularization parameters and ,is used to solve

the ill-posed inverse of the data matrix (see the later discussion).

The optimization problem in (5.3.1) with the constraint in (5.3.2) can be solved

conveniently by using the method of Lagrange multipliers [2], [112] and [149]. For this,

construct the Lagrange function as:




∑∑ ( ∑


( )

where is the th element of the error matrix defined in (5.3.2), is the th

element of the input weight matrix , is the th element of , the desired

frequency spectrums of the output feature vectors defined in (5.2.34),

∑ is the th element of , the matrix of the frequency-

spectrum samples of the output feature vectors of the hidden layer, is the th row of

the transformation matrix in (5.2.25), is the th column vector of , the

transpose of the input weight matrix , is the th element of the input pattern

vector ( ), is the th Lagrange multiplier.

Differentiating with respect to , it can be seen that

(∑∑ (∑


) ( )

It is noted that and can be expressed as:

[ ] ( )


[ ] ( )

(5.3.4) can then be written as:


( )





∑ ( )

∑ [ ] [


[ ] [

] [

] ( )

[ ]

[ ] [

] [


( )





] [

] [


( )


( )


In addition, differentiating with respect to ,

( )

Using the Kuhn-Tucker condition,

, [110], [112], the following

relationship can be obtained:

( )


( )

Considering the constraint in (5.3.2), (5.3.14) can be expressed as:

( ) ( )

and using (5.3.15) in (5.3.11) leads to

( )

( )

Considering the fact that

( )

where is the identity matrix, (5.3.16) can then be expressed as

( )

Then, the optimal input weight matrix is derived as follows:



( )


Remark 5.3.1: In conventional regularization theory [1], the regularization parameter

in (5.3.1) is set to 1 and the regularization for solving the ill-posed inverse problem

of the data matrix depends only on the proper choice of the small regularization

parameter in the sense that the regularization term ‖ ‖ makes the augmented

cost function in (5.3.1) smoother around the minimum point (or vertex) of the cost

function in the input weight space. However, the dynamic behaviors of the hidden layer

of the SLFN are also related to the slope of the augmented cost function in (5.3.1) that is

adjusted by the regularization parameter . This point can be indirectly seen from


Remark 5.3.2: The Lagrange multiplier matrix in (5.3.15) describes the sensitivity of

the objective function in (5.3.1) with respect to the constraint in (5.3.2), that is, how

tightly the constraint in (5.3.2) is binding at the optimal point with the optimal input

weights in (5.3.19) [2], [110], [112]. It is seen from (5.3.15) that, as the value of the

regularization parameter becomes smaller, the cost function in (5.3.1) is less

sensitive with respect to the change of the constraint in (5.3.2). Thus, the Lagrange

multiplier matrix plays a role of qualitatively estimating the effects of the input

disturbance, the structural risk and the empirical risk on the robustness of the SLFN

classifier with the DFT-ELM, seen from the output of the hidden layer.

Remark 5.3.3: Also, the effects of the regularization parameter on the robustness of

the output of the hidden layer can be seen from the slope of the regularized cost function

in (5.3.1), because the values of the regularization parameter affect the width or the

steepness of the regularized cost function in (5.3.1). For instance, if the value of the

regularization parameter is very large, the slope of the cost function in (5.3.1) will be

very steep. The hidden layer of the SLFN is thus very sensitive to the change of the

input weights and input disturbances. However, as the value of the regularization

parameter is reduced, the changing rate of the cost function in (5.3.1) becomes

smaller and the hidden layer of the SLFN is thus less sensitive to the change of the input

weights and input disturbances. Therefore, it is necessary to properly choose the

regularization parameter so that the robustness of the hidden layer can be guaranteed.


Remark 5.3.4: The effects of the number of hidden nodes, , on the smoothness of ill-

posed data matrix and the sensitivity of the output weight matrix have been

clearly revealed in (5.3.19). (i) When the number of the hidden nodes is large, the

regularization factor

in (5.3.19) is small, and thus the well-posed matrix term

in (5.3.19) can well approximate the ill-posed data matrix in vicinity of its

singular point; (ii) Although the inverse term ( )

in (5.3.19) may be

sensitive to the changes of the input data because of the input disturbances, the

sensitivity of the optimal output weight matrix is greatly reduced because the inverse

term (


has been weighted by a small factor

in (5.3.19). However, it

should be noted that the larger number of hidden nodes may need a longer training time.

This point can be clearly seen in the simulation section. Therefore, in practice, the

number of hidden nodes must be chosen properly so that the SLFN classifier can

provide a tradeoff between sensitivity and training time.

Next, the problem of how to obtain the optimal output weight matrix to

minimize the error between the desired output pattern and the actual output pattern of

the SLFN classifier is considered. Similar to the above discussion on the

optimization of the input weight matrix, the optimization of the output weight matrix

is stated as follows:

Minimize { ‖ ‖

‖ ‖ } ( )

Subject to ( )

with the corresponding Lagrange function :




( )

( )

where is the th element of the error matrix , is the th element of the output

weight matrix , is the th element of the desired output data matrix , is the th


column of the hidden layer output matrix , is the th column of the output weight

matrix , is the th Lagrange multiplier, and are real positive regularization

parameters and ‖ ‖ is the regularizer of the output layer.

Similar to the discussion from (5.3.4) to (5.3.19), it is required to first compute

the partial derivatives


, and, by solving equations and

, the

optimal output layer weight matrix can then be obtained as:

( )

( )

with the corresponding sensitivity matrix :

( ) ( )

Remark 5.3.5: It is seen from (5.3.23) that the optimal output layer weight matrix

depends on the ratio

instead of values of and . However, the dynamic behavior

of the output layer of the SLFN depends not only on the ratio

, but also on the values

of and . As discussed in Remarks 5.3.1 and 5.3.2, the regularization parameters

and as well as the ratio

should be chosen to be properly small for ensuring the trade-

off between the sensitivity and the smoothness of the SLFN classifier.
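
The dependence on the ratio of the two regularization parameters can be checked numerically. The sketch below assumes that the output weights take the standard regularized least-squares form obtained by minimizing a weighted sum of the squared output error and the squared norm of the output weights. The parameter names `fit_weight` and `reg_weight` are hypothetical labels for the two regularization parameters.

```python
import numpy as np

def regularized_output_weights(H, T, fit_weight, reg_weight):
    """Minimize fit_weight/2 * ||H b - T||^2 + reg_weight/2 * ||b||^2 over b."""
    ratio = reg_weight / fit_weight          # only this ratio enters the solution
    n_hidden = H.shape[1]
    return np.linalg.solve(ratio * np.eye(n_hidden) + H.T @ H, H.T @ T)

rng = np.random.default_rng(1)
H = rng.normal(size=(30, 5))                 # hypothetical hidden layer output matrix
T = rng.normal(size=(30, 2))                 # hypothetical desired outputs
beta_a = regularized_output_weights(H, T, fit_weight=1.0, reg_weight=0.01)
beta_b = regularized_output_weights(H, T, fit_weight=10.0, reg_weight=0.1)
beta_c = regularized_output_weights(H, T, fit_weight=1.0, reg_weight=10.0)
```

Here `beta_a` and `beta_b` coincide because both settings share the same ratio, while the larger ratio used for `beta_c` shrinks the weights.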

Remark 5.3.6: Based on the above discussion, the proposed DFT-ELM

algorithm can be summarized as follows:

Step 1: Define the frequency-spectrum sample matrix of the desired

feature vectors in (5.2.32);

Step 2: Compute the transformation matrix from (5.2.24);

Step 3: Compute the optimal input weight matrix from (5.3.19);

Step 4: Compute the hidden layer output matrix from (5.2.13);

Step 5: Compute the optimal output weight matrix from (5.3.23).
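
The five steps can be sketched end to end as follows. This is a hedged illustration, not the thesis implementation: it assumes that both weight matrices are real and take the standard regularized least-squares form, that the transformation matrix is the usual DFT matrix, and that the desired spectra are DFTs of real-valued desired features. The names `dft_elm_sketch`, `lam_in` and `lam_out` are hypothetical, with each `lam` standing for a ratio of regularization parameters.

```python
import numpy as np

def dft_matrix(n_points):
    k = np.arange(n_points)
    return np.exp(-2j * np.pi * np.outer(k, k) / n_points)

def dft_elm_sketch(X, Y_desired, T, lam_in, lam_out):
    """X: (N, n) inputs, Y_desired: (N, n_hidden) desired spectra, T: (N, m) targets."""
    n_inputs, n_hidden = X.shape[1], Y_desired.shape[1]
    A = dft_matrix(n_hidden)                                    # Step 2
    # Because A @ A.conj().T = n_hidden * I, fitting the spectra is equivalent
    # to fitting the inverse-DFT features with a rescaled regularizer.
    H_desired = np.real(Y_desired @ A.conj().T) / n_hidden
    W = np.linalg.solve(lam_in / n_hidden * np.eye(n_inputs) + X.T @ X,
                        X.T @ H_desired).T                      # Step 3, shape (n_hidden, n)
    H = X @ W.T                                                 # Step 4
    beta = np.linalg.solve(lam_out * np.eye(n_hidden) + H.T @ H,
                           H.T @ T)                             # Step 5
    return W, beta

# Toy problem in which the desired features are exactly realizable
rng = np.random.default_rng(2)
X = rng.normal(size=(40, 6))
W_true = rng.normal(size=(4, 6))
beta_true = rng.normal(size=(4, 1))
H_true = X @ W_true.T
Y_desired = H_true @ dft_matrix(4)            # Step 1: spectra of the desired features
T = H_true @ beta_true
W, beta = dft_elm_sketch(X, Y_desired, T, lam_in=1e-8, lam_out=1e-8)
max_error = np.max(np.abs(X @ W.T @ beta - T))
```

With very small regularization and realizable targets, the recovered network reproduces the training targets almost exactly; larger values of `lam_in` and `lam_out` trade this fit for reduced sensitivity.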


5.4 Experiments and Results

In order to verify the effectiveness of the proposed DFT-ELM, the following two

classification examples are presented in this simulation section:

5.4.1 Example 1: Classification of Low Frequency Sound Clips

Consider 10 given input pattern vectors that contain the samples of the computer-

generated low frequency sound clips with the frequencies of 100 Hz, 150 Hz, 200 Hz,

300 Hz,…, 900 Hz, respectively, and modulated by an envelope function to create tones.

To reflect the characteristics of the signals to be classified sufficiently, each

input pattern vector contains 1000 samples. Figure 5.2 shows the 700 Hz sound clip

modulated by the envelope function ( ) (

). For the classification

of the sound clips, consider the SLFN classifiers, with a linear output node, 30 hidden

nodes and a tapped-delay-line memory with 29 unit-delay elements at each of the input

layers. The desired output pattern values of the SLFNs, representing the signal classifier

states, for the corresponding 10 training input pattern vectors, are generated by a sine

function, ( ) , for with equal increments. The SLFN classifiers

are then trained with the R-ELM, the FIR-ELM and the DFT-ELM, respectively. To

examine the robustness performance of these SLFN classifiers, an additive white

Gaussian noise (AWGN) with the signal-to-noise ratio (SNR) of 10 dB is added to all

sound clips during testing. The sound clip with SNR of 10 dB is shown in Figure 5.3.
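The noise injection used for testing can be sketched as follows. This is a minimal illustration, not the thesis code: the sampling rate `fs`, the helper name `add_awgn` and the use of a plain 700 Hz tone without the envelope function are assumptions made only for this example.

```python
import numpy as np

def add_awgn(signal, snr_db, rng):
    """Add white Gaussian noise so that signal power / noise power equals snr_db (in dB)."""
    signal_power = np.mean(signal ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

rng = np.random.default_rng(0)
fs = 2000.0                               # assumed sampling rate in Hz (not stated in the text)
t = np.arange(1000) / fs                  # 1000 samples per clip, as in the text
clean = np.sin(2 * np.pi * 700.0 * t)     # plain 700 Hz tone; the envelope is omitted here
noisy = add_awgn(clean, snr_db=10.0, rng=rng)

# Empirical SNR of the disturbed clip, which should be close to the 10 dB target
measured_snr_db = 10.0 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

The noise variance is derived from the measured power of each clip, so the same helper gives the 20 dB and 30 dB cases by changing `snr_db`.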


Figure 5.2 A sound clip modulated by the envelope function (panel title: Clear Clip)

Figure 5.3 The disturbed sound clip with SNR of 10 dB (panel title: High Noise Clip)


Figure 5.4 and Figure 5.5 show the classification results of the SLFN classifier

trained with the R-ELM ( ) in [18], where the input weights and the

hidden layer biases are generated randomly within [-1 1], and the logarithmic sigmoid is

used as the activation function of the non-linear nodes in the hidden layer. It is seen that

the output values of the classifier deviate largely from the desired values of the classifier

states. Such unsatisfactory classification results arise mainly because the random input weights, the random hidden layer biases and the AWGN make the classifier with the R-ELM experience both large structural and large empirical risks.

Figure 5.6 and Figure 5.7 show the classification results and the corresponding

RMSEs of the SLFN classifier trained with the FIR-ELM developed in [9], where the

input weights are assigned such that each linear hidden node performs as a low-pass

filter with the Kaiser window of length 1001 and the cut-off frequencies 200 Hz to

4kHz, respectively. In order to balance and reduce the effects of both the structural and

empirical risks, the regularization parameters and are chosen as and

, respectively. Compared with the R-ELM results in Figure 5.4 and Figure 5.5, the effects of the input white Gaussian noise are greatly reduced, because the hidden layer acts as a pre-processor that removes the high-frequency background noise. The classification accuracy with the FIR-ELM is therefore improved, with a much smaller deviation between the classifier outputs and the desired values of the classifier states, as seen from the RMSEs in Figure 5.7.

The classification results of the SLFN classifier with the DFT-ELM are shown

in Figure 5.8 and Figure 5.9, respectively. In this simulation, the frequency spectrum

sample matrix of the desired feature vectors is chosen from the DFTs of the outputs

of the hidden nodes of the SLFN trained with the FIR-ELM in Figure 5.6 and Figure 5.7.

It is noted that, after training with the DFT-ELM, the hidden layer of the SLFN classifier has eliminated the effects of the input white Gaussian noise. Most importantly, with the regularization parameters chosen optimally through the experiments, highly accurate classification performance has been achieved, with a much smaller deviation between the classifier outputs and the desired values of the classifier states than with the R-ELM in Figure 5.4 and Figure 5.5 and with the FIR-ELM in Figure 5.6 and Figure 5.7.

Figure 5.4 Classification using R-ELM with SNR of 10 dB ( and )

Figure 5.5 The RMSE using R-ELM with SNR of 10 dB ( and )

[Plots of Figure 5.4 and Figure 5.5; horizontal axis: sound clip]

Figure 5.6 The classification using FIR-ELM with SNR of 10 dB

( )

Figure 5.7 The RMSE using FIR-ELM with SNR of 10 dB ( )

[Plots of Figure 5.6 and Figure 5.7; horizontal axis: sound clip]

Figure 5.8 Classification using DFT-ELM with SNR of 10 dB



Figure 5.9 The RMSE using DFT-ELM with SNR of 10 dB

( )

[Plots of Figure 5.8 and Figure 5.9; horizontal axis: sound clip]

For further analysis, the performance comparisons of the SLFN classifiers,

trained with the R-ELM, the FIR-ELM and the DFT-ELM, respectively, are carried out,

in terms of the averaged RMSE over 50 iterations and the different SNRs, as shown in

Table 5.1. It is seen that, when the SNR is small (10 dB), the R-ELM is very sensitive to

the noisy environment with the result that both the mean and the standard deviation are

much larger than the ones of the FIR-ELM and the DFT-ELM. It is further noted that,

although the hidden layers of the SLFNs, trained with the FIR-ELM and the DFT-ELM,

have the same low-pass filtering property, the DFT-ELM algorithm is more robust than the FIR-ELM. The reason is that the input weights of the SLFN with the DFT-ELM are optimized, which makes the hidden layer of the SLFN less sensitive to changes in the SNR than that of the FIR-ELM.

Table 5.1: Comparisons of the R-ELM ( ), FIR-ELM ( ) and DFT-ELM ( )

SNR (dB)    R-ELM                FIR-ELM              DFT-ELM
            Mean     Std. Dev.   Mean     Std. Dev.   Mean     Std. Dev.
10          0.1761   0.0389      0.0586   0.0138      0.0146   0.0041
20          0.0519   0.0154      0.0216   0.0052      0.0072   0.0010
30          0.0214   0.0057      0.0066   0.0011      0.0062   0.0003

Figure 5.10 shows the curves of both the RMSE and the classification error of

the SLFN classifier with the DFT-ELM, where the number of the hidden nodes is fixed

to 30, but the regularization parameter ratio

is changed from 0.01 to 0.1. It is

seen that the curves of the RMSE and the classification error intersect at the point with

the ratio of

. Obviously, this intersection is located at the point where the

optimal choices of the regularization parameters are , ,

and , respectively.


The following facts are noted from Figure 5.10: (i) when the regularization parameter ratio is less than 0.02, the classification error is small but the RMSE is relatively large, and the SLFN classifier is sensitive to the input disturbances; (ii) when the regularization parameter ratio is greater than 0.02, the classification error increases but the RMSE is reduced. The small RMSE means that the SLFN classifier is robust with respect to the input disturbances. Therefore, to trade off the classification error against the RMSE, it is better to choose the values and the ratio of the regularization parameters around the intersection of the RMSE and classification error curves, as seen in Figure 5.10.

Figure 5.10 The RMSE and class error via

for the DFT-ELM

Figure 5.11 shows the curves of the RMSE and the classification error of the

SLFN with the DFT-ELM, where the regularization parameters are chosen as

and , that is, the regularization parameter ratio

is fixed

at 0.02, but the number of the hidden nodes is changed from 4 to 50. It is seen that, when the number of hidden nodes is less than 30, the classification error is large. This is because the feature vectors provided by the hidden layer cannot sufficiently represent the input patterns. However, as the number of hidden nodes is increased, the classification performance is gradually improved. This agrees with the discussion in Remark 5.3.3 that, when the number of the hidden nodes is large, both excellent classification results and strong robustness with respect to the input disturbance can be achieved with the DFT-ELM. It is also noted that the decrease in the class error as the number of neurons increases is not monotonic. Increasing the number of neurons increases the bandwidth of the frequency spectrum matrix, which affects the feature mapping of the DFT-ELM hidden layer. One reason for the non-monotonic decrease of the class error may be the nonlinear feature mapping performed at the DFT-ELM hidden layer. Comparing the independent effects of tuning the regularization ratio and the number of hidden nodes, it can be said that finding the optimal number of hidden nodes is more important.

Figure 5.11 The RMSE and classification error via the number of hidden nodes with


5.4.2 Example 2: Classification of Handwritten Digits

In this example, a handwriting recognition experiment, using the SLFN classifier,

trained with the DFT-ELM, is implemented. The training data are taken from the

MNIST handwritten digits database, where 60,000 training samples and 10,000 testing

samples are collected from multiple disjoint sets of writers [82]. The MNIST

handwritten digits database consists of the handwritten digits from 0 to 9, respectively.

Each digit is a gray-scale image of 28x28 pixels with the intensity range of 0 (black) to

255 (white). In order to conduct an unbiased experiment, the training and testing sets are



all randomly selected from the pool of both the training and the testing samples. A

sample set of digits is shown in Figure 5.12.

Figure 5.12 A set of handwritten digits from the MNIST database

In order to classify the handwritten digits with the DFT-ELM algorithm, each

image, as in Figure 5.13(a), is first divided into 14 rows by 14 columns, that is, the

image is segmented into 196 small images, as seen in Figure 5.13(b).

Figure 5.13(a) Image of digit Figure 5.13(b) Segmented image

The mean of the pixel intensities of each segment in Figure 5.13(b) is computed as:

∑ ∑ ( )

( )

where ( ) is the intensity of the th pixel of th segment, (= 2) and (= 2) are

the numbers of rows and columns of pixels in each segment, respectively.
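The segment averaging can be sketched in a few lines of NumPy. This is a minimal illustration, assuming a 28 x 28 array split into 2 x 2 segments as stated above. The function name `segment_means` and the synthetic stand-in image are hypothetical, and the final flattening places the segment means row by row into a single data vector of 196 elements.

```python
import numpy as np

def segment_means(image, seg_rows=2, seg_cols=2):
    """Average each seg_rows x seg_cols block and return the block means row by row."""
    rows, cols = image.shape
    blocks = image.reshape(rows // seg_rows, seg_rows, cols // seg_cols, seg_cols)
    means = blocks.mean(axis=(1, 3))      # 14 x 14 grid of segment means for a 28 x 28 image
    return means.reshape(-1)              # row 1 first, then row 2, ..., row 14

image = np.arange(28 * 28, dtype=float).reshape(28, 28)   # stand-in for an MNIST digit
vector = segment_means(image)
```

For a real MNIST digit, `image` would hold the pixel intensities in the range 0 to 255, and `vector` would be the input pattern presented to the classifier.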

The means of all of the segments, from row 1 to row 14, in Figure 5.13(b), are

then arranged as the elements of the following time series type data vector:

[ ] ( )

A sample set of the time series type data vectors, corresponding to a set of the 10

handwritten digits in Figure 5.12, are given in Figure 5.14(a) to Figure 5.14(j).

Figure 5.14(a) ~ Figure 5.14(j) Encoded sample data set for images 0 to 9 (panel titles: sample digit 0 to sample digit 9; horizontal axis: sample index)

Obviously, the coded time series type data vectors that represent 10 written

digits in Figure 5.14(a)~Figure 5.14(j) are non-linearly separable. Considering the fact

that, for pattern classification purpose, a feature vector is simply used to represent an

input pattern in the feature space, theoretically, one may assign an arbitrary vector in the

feature space to represent an input pattern. Recall the pole-placement method in control

engineering [114] where the poles of a closed-loop continuous system can be assigned

at any desired locations on the left-half of the complex s-plane. Similarly, one may

assign the DFTs of all desired feature vectors at the “desired positions” in the frequency

domain in the sense that, after training with the DFT-ELM, the separability of all feature

vectors from the outputs of the hidden layer of the SLFN classifier is maximized in

feature space.

Based on the above idea concerning the assignment of the DFTs of the desired

feature vectors in frequency domain, the desired feature assignment and the input

weights’ optimization can be summarized in the following three steps:

1) Assume that the SLFN classifier reads each time series type sample data vector in

(5.4.2) in one second, and the virtual frequency components of the data vectors

are then distributed between 0 Hz and 100 Hz.

2) Use the DFTs of 10 sine waves, whose lengths are all equal to the number of

hidden nodes and the frequencies are 2 Hz, 12 Hz, 22 Hz, …, 92 Hz, respectively,

as the frequency spectrums of the desired feature vectors, representing the

corresponding handwritten digits from 0 to 9, respectively, in feature space.

3) Minimize (5.3.1) with the constraint in (5.3.2) to derive the optimized input

weights in (5.3.19).
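Step 2 above can be illustrated with a short sketch. It is only an assumed construction: the virtual sampling rate `fs` of 200 Hz (chosen so that the usable band ends at 100 Hz), the choice of 200 hidden nodes and all variable names are assumptions for this example, and the exact layout of the frequency-spectrum sample matrix in (5.2.32) is not reproduced here.

```python
import numpy as np

n_hidden = 200                       # number of hidden nodes (largest value used in Table 5.2)
fs = 200.0                           # assumed virtual sampling rate in Hz, so Nyquist = 100 Hz
freqs_hz = np.arange(2, 100, 10)     # 2, 12, 22, ..., 92 Hz: one frequency per digit 0..9

n = np.arange(n_hidden)
sines = np.sin(2 * np.pi * freqs_hz[:, None] * n[None, :] / fs)   # shape (10, n_hidden)
spectra = np.fft.fft(sines, axis=1)                               # desired frequency spectra

# Each digit's desired spectrum is concentrated at its own frequency bin.
bin_hz = fs / n_hidden
peak_hz = np.argmax(np.abs(spectra[:, : n_hidden // 2]), axis=1) * bin_hz
```

Because the ten spectra peak at well-separated frequencies, the corresponding desired feature vectors are easy to tell apart, which is the sense in which the features are placed at "desired positions" in the frequency domain.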

Table 5.2 shows the comparison of the classification accuracies of the SLFN

classifiers. It is seen that the SLFN with the DFT-ELM (

) achieves a high degree of classification accuracy as well as very small deviations

consistently over the 50 iterations, as the number of hidden nodes is increased from 50

to 200. The classification performances of the SLFNs with both the R-ELM and the


FIR-ELM are improved with the increase in the number of hidden nodes. However, the

SLFN with the R-ELM shows a much larger deviation. The reason is that the input

weights of the SLFN classifier with the R-ELM are chosen randomly, which makes the

output of the SLFN classifier very sensitive to the changes of the input pattern vectors.

This point has been discussed in detail in both Example 1 of this experiment section and

the reference [9].

Table 5.2: Classification accuracies of the handwritten digit classification

Hidden nodes   R-ELM                      FIR-ELM                    DFT-ELM
               Mean (%)   Std. Dev. (%)   Mean (%)   Std. Dev. (%)   Mean (%)   Std. Dev. (%)
50             68.55      1.59            42.49      0.56            86.81      0.24
100            78.57      0.78            74.80      0.33            86.58      0.24
150            82.80      0.51            84.14      0.29            86.30      0.28
200            85.07      0.47            85.24      0.27            86.52      0.24

Note that the DFT-ELM needs a longer training time than both the R-ELM and the FIR-ELM, especially when the number of hidden nodes is large. This is because the optimization of the input weights of the SLFN with the DFT-ELM takes a longer time during training. However, it is this optimization of the input weights that gives the DFT-ELM its high degree of classification accuracy and very small deviations, as seen in Table 5.2.

Figure 5.15 shows the comparisons of the sample means of the classification

accuracies of the SLFN classifiers with 50 hidden nodes, trained with the DFT-ELM,

the FIR-ELM and the R-ELM, respectively, and tested with 20 samples, as the

regularization parameter ratio is changed from 0.0001 to 100. It is seen that the SLFN with the DFT-ELM achieves the best classification accuracy, up to 86.77%, and the SLFN with the R-ELM achieves the second best, up to 68.56%, whereas the SLFN with the FIR-ELM (where the hidden layer is designed as a band-pass filter with the cut-off frequencies 30 Hz and 70 Hz, respectively) achieves only up to 47%. It is also seen that the performance of the SLFN with the FIR-ELM degrades further as the regularization parameter ratio increases. The reason is that the rate of change of the cost function in (5.3.1) becomes smaller as the regularization parameter ratio increases. In such a case, the dominant factor in determining the output weights is the DC component rather than the feature vectors embedded in the data matrix, as seen in (4.16) of [9]. From the viewpoint of the signals and systems [107], [148], the larger

value of makes the frequency band of the output layer narrower, centered at the DC

component, the useful frequency components of the feature vectors from the outputs of

the hidden layer are then removed by the output layer of the SLFN with the FIR-ELM.

Figure 5.15 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-

ELM and R-ELM via the regularization parameter ratio

Figure 5.16 shows the sample means of the classification accuracies of the

SLFN classifiers with the DFT-ELM ( ) , the FIR-ELM

( ) and the R-ELM ( ), respectively, versus the number

of training samples. It is noted that the classifier with the DFT-ELM performs

significantly better than the ones with both the FIR-ELM and the R-ELM when the

number of training samples is greater than 50 sets. This point agrees with the ones in

Figure 5.16. The classification performance curve for the DFT-ELM also reveals that

















the SLFN with the DFT-ELM is capable of extracting core features with a small sample


Figure 5.16 Classification accuracies of the SLFN classifiers with DFT-ELM, FIR-

ELM and R-ELM via the number of training samples

5.5 Conclusion

In this chapter, a robust SLFN pattern classifier has been developed with the DFT-ELM.

Because of the optimal designs of both the input and the output weights, the SLFN

classifier has demonstrated excellent performance for the classifications of both the

linearly separable patterns and the non-linearly separable patterns. The excellent

classification performance of the SLFN with the DFT-ELM has been evaluated and

compared with those of the R-ELM and the FIR-ELM in the simulation examples. Further work on applying the SLFN classifier with the DFT-ELM to the molecular classification of cancers, biomedical image processing and the fault detection of mechatronic systems is under investigation.






Chapter 6

An Optimal Weight Learning Machine

for Handwritten Digit Image Recognition

An optimal weight learning machine for a single hidden layer feedforward network

(SLFN) with application to handwritten digit image recognition is developed in this

chapter. It is seen that both the input weights and the output weights of the SLFN are

globally optimized with the batch learning type of least squares. All feature vectors of

the classifier can then be placed at the prescribed positions in the feature space in the

sense that the separability of all non-linearly separable patterns can be maximized, and a

high degree of recognition accuracy can be achieved with a small number of hidden

nodes in the SLFN. An experiment for the recognition of the handwritten digit image

from both the MNIST database and the USPS database is performed to show the

excellent performance and effectiveness of the proposed methodology.

6.1 Introduction

Neural network-based pattern classification techniques have been widely used for

handwritten digit image recognition over the last twenty years [82], [150-153]. The

merits of the neural classifiers for image recognition are attributed to (i) their powerful

learning abilities, through training, from a large amount of training data, (ii) their

capability of accurately approximating unknown functions with complex dynamics

embedded in the training data, and (iii) their parallel structures to perform fast and


efficient parallel computing during the training as well as in the process of image


It has been noted that most neural pattern classifiers for the handwritten digits

recognition are designed with multi-layered neural networks, trained with the recursive

gradient-based backpropagation (BP) algorithms, to perform the pre-filtering, feature

extraction and pattern recognition. Because the BP training process is time-consuming and converges slowly [1], [2], [91], [144], [145], these types of neural classifiers are difficult to use in many practical applications where fast on-line training is required. In

addition, most existing neural classifiers require a large number of hidden nodes in their

hidden layers in order to obtain highly separable features in the feature space. Such a

requirement, in fact, will greatly increase the physical size of the neural classifiers’

hardware as well as the training time in practice.

In view of all the above issues, the researchers in the areas of pattern

classification and computational intelligence have been exploring the new types of

neural classifiers with a single hidden layer, a small number of hidden nodes, and the

fast training algorithms to fulfill the industrial requirements such as small size hardware

and easy implementation in industrial environments. One of the state-of-the-art neural

classifiers is a single hidden layered feedforward neural network based classifier

developed in [18], [29], [147] where the input weights and the hidden layer biases are

randomized and the output weights are computed by using a generalized inverse of the

hidden output matrix. Although the neural networks with randomized weights have

been studied by many researchers [18], [29], [147], [154-156], the theoretical

background of randomizing the input weights of the SLFN classifiers can be traced to

the functional approximation with the Monte Carlo method [29]. It is seen that, if a

continuous function can be represented by a limit integral, the limit integral can then be

approximated by a sum of weighted activation functions that are the functions of a set of

random samples of a parameter vector in the limit integral domain. With the increase of

the number of random samples of the parameter vector, the accuracy of the Monte Carlo

approximation can be further improved. Obviously, such a Monte Carlo approximation

can be implemented by using an SLFN, where the activation functions in the Monte

Carlo approximation can be represented by the non-linear hidden nodes, the parameter


vector that has been randomly sampled in the Monte Carlo approximation is actually the

input weights and the hidden layer biases, and the output weights of the SLFN play the

role of the weights in the Monte Carlo approximation. The universal approximation

capability of the SLFNs with the random input weights and the trained output weights

has been reported in [18], [29], [156].

Batch learning has been actually used for many years for training neural

networks [1], [2]. Unlike the sample-by-sample based recursive learning methodologies,

batch learning uses all the training samples at the same time for deriving the optimal

weights. If a given set of the training samples can sufficiently represent the dynamics of

an unknown complex system to be learnt, the weights derived from the batch training

are globally optimal. Thus, the issues of slow convergence and local minima, which occur frequently in recursive learning schemes, are all avoided in the batch learning process. It has been noted that, for the SLFNs with the randomized weights and

hidden layer biases, only the output weights are trained with the batch learning. Thus,

the training speed is extremely fast compared with those of all existing recursive

learning algorithms [1], [2], [82], [91], [144], [145], [150-153]. Recently such a batch

learning technique for the SLFNs with randomized input weights and hidden layer

biases has been called the Extreme Learning Machine (ELM).

However, with regard to the reality of the pattern classifications, the researchers

have noted a number of problems with using the ELM. For instance, no guidance has been proposed on how the upper and the lower bounds of the random input weights and the hidden layer biases should be chosen, and researchers simply use trial and error to determine these bounds. In many cases, the lower and the upper bounds of the random input weights are simply set to -1 and 1, respectively, which results in the feature vectors whose elements are -1 or 1, as the number of the sigmoid hidden nodes is very large. The advantage of this sort of upper

and lower bounds’ selection for the random input weights and hidden layer biases is that

the feature vectors generated by the hidden layer of the SLFN can widely spread out in

the high dimensional feature space in most cases, and the output layer, as the pattern

classifier, can then easily recognize the patterns represented by the corresponding

feature vectors. However, the researchers have still been trying to design the optimal


input weights to maximize the separability of the feature vectors, enhance the

robustness of the feature vectors with respect to the changes of the input disturbances,

and thus further improve the classification/recognition accuracy [9].

In this chapter, an optimal weight learning machine (OWLM) for a class of

SLFN classifiers with both the linear nodes and the input tapped-delay line for

handwritten digit image recognition will be developed. Borrowing the concept of model

reference control from control engineering [114], the SLFN classifier with the non-

linear hidden nodes, trained with the ELM in [18], is used as the reference classifier that

provides the reference feature vectors for the outputs of the hidden layer of the proposed

SLFN classifier to follow. The input weights are then optimized to assign the feature

vectors of the neural classifier to the “desired reference positions” defined by the

reference feature vectors, and maximize the separability of all feature vectors in the

feature space. In terms of the optimization of the output weights, the recognition

accuracy at the output layer of the SLFN classifier can be further improved. It will be

seen that both the input weights and the output weights of the proposed SLFN classifier

are optimized by using the regularization theory [1], [2], [149] with the results that (i)

the error between the reference feature vectors and the feature vectors generated by the

outputs of the hidden layer of the new classifier is minimized globally; (ii) the

singularity of the correlation matrix of the feature vectors is avoided by smoothing the

singular point of the correlation matrix of the feature vectors with the regularization

term in the cost function; (iii) the sensitivity of the proposed SLFN classifier to the

input disturbances is reduced by adjusting the regularization parameters; (iv), most

importantly, only a small number of hidden nodes are required in the new SLFN

classifier to achieve a high degree of recognition accuracy for handwritten digit image


The rest of the chapter is organized as follows: In Section 6.2, the concept of the

Monte Carlo approximation and the basics of the ELM are formulated. In Section 6.3,

the SLFN classifier with both linear nodes and a tapped-delay line is described, the

optimizations of both the input weights and output weights with the regularization

theory are explored in detail, and the effects of the regularization parameters on the

classification performance of the SLFN classifier are also studied. In Section 6.4, the


SLFN classifiers with the OWLM, the ELM, and the R-ELM are compared to

demonstrate the effectiveness of the new SLFN classifier with the OWLM for the

handwritten digit recognition. Section 6.5 gives conclusions and some further work.

6.2 Problem Formulation

Consider a data set, from a handwritten digits database of interest, with the input

pattern vectors ( ), ( ), , ( ) and the output data vectors ( ), ( ), ,

( ), respectively, where

( ) [ ( ) ( ) ( )] ( )


( ) [ ( ) ( ) ( )] ( )


It is assumed that the above input pattern vectors and the output data vectors are

generated by the following continuous vector function:

( ) ( ( )) ( )


( ( )) [ ( ( )) ( ( )) ( ( ))] ( )

The th element of ( ( )) is represented by the following limit integral [29]:

( ( ))

∫ [ ( )] ( ( ))

( )

where is a scalar parameter, is a high dimensional parameter vector, is an

activation function, is an operator, and is the domain of the parameter vector .


In order to compute ( ( )) represented by the limit integral in (6.2.5), the

right side of (6.2.5) can be approximated as follows:

∫ [ ( )] ( ( ))

∫ [ ( )] ( ( ))

( )

where . Considering the complexity of the integrand in (6.2.6), the Monte-Carlo

method [29] can be used to approximate the right side of (6.2.6) as follows:

( ( )) ∫ [ ( )] ( ( ))

[ ( )] ( ( ))

( ( )) ( )


| |

[ ( )] ( )

It is noted that the vectors in (6.2.7) and (6.2.8) are random samples

drawn from uniformly. On the other hand, can be treated as a set of

random variables that are uniformly distributed on [29].

Thus, the function ( ( )) can be approximated as:

( ( ))

[ ∑ ( ( ))

∑ ( ( ))

∑ ( ( ))


[ ( ( ))

( ( ))

( ( ))]


( ( )) ( )


[ ] ( )

[ ] ( )

[ ] ( )


( ( )) [ ( ( )) ( ( )) ( ( ))] ( )

It is easy to confirm, based on the discussion in [29], that the approximation error in

(6.2.9) is of the order of √ , and thus the approximation error will converge to zero

as .
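The convergence behaviour can be illustrated on a one-dimensional integral. This is a minimal sketch, assuming the usual Monte Carlo rate in which the error decreases in proportion to one over the square root of the number of random samples. The integrand, the sample counts and the function names are hypothetical choices for this example and are not taken from [29].

```python
import numpy as np

def mc_integral(n_samples, rng):
    """Monte Carlo estimate of the integral of sin(w) over [0, pi] (exact value 2)."""
    w = rng.uniform(0.0, np.pi, size=n_samples)   # uniform random samples of the parameter
    return np.pi * np.mean(np.sin(w))             # domain length times the sample mean

def rms_error(n_samples, n_repeats, rng):
    """Root-mean-square error of the estimate over independent repetitions."""
    estimates = np.array([mc_integral(n_samples, rng) for _ in range(n_repeats)])
    return np.sqrt(np.mean((estimates - 2.0) ** 2))

rng = np.random.default_rng(0)
err_small = rms_error(100, 200, rng)
err_large = rms_error(10_000, 200, rng)
ratio = err_small / err_large      # about sqrt(10_000 / 100) = 10 if the error scales as 1/sqrt(N)
```

In the SLFN setting the number of random samples corresponds to the number of hidden nodes, which is why a randomized hidden layer needs many nodes to approximate well.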

Remark 6.2.1: The functional approximation using the Monte Carlo method in (6.2.9)

can be implemented by using a multiple-inputs-multiple-outputs (MIMO) single hidden

layer feedforward neural network (SLFN), with the non-linear hidden nodes, and the

randomized input weights and hidden layer biases, as described in Figure 6.1, where the

output layer has linear nodes, the hidden layer has non-linear nodes with the non-

linear activation function ( ) ( ( )) with , (for

and ) are the random input weights, (for ) are the

random biases of the hidden layer, (for and ) are the output

weights to be optimized later, ( ) (for ) are the outputs of the hidden



Figure 6.1 A single hidden layer neural network with both random input weights and

random hidden layer biases

As seen in Figure 6.1, the input and the output data vectors of the SLFN can be

expressed in the forms:

( ) [ ( ) ( ) ( )] ( )

( ) [ ( ) ( ) ( )] ( )

the output of the th hidden node can be computed as:

( ) (∑ ( )

) ( ( ) ) ( )

for , with

[ ] ( )


and the th output of the network, ( ), is of the form:

( ) ∑ ( ( ) )

( ( )) ( )

for , with

[ ] ( )


( ( )) [ ( ( ) ) (

( ) )] ( )


( ) [ ( ) ( ) ( )] ( ( )) ( )

with the output weight matrix

[ ] ( )

Remark 6.2.2: Over the last twenty years, much research has been done using the

SLFNs in Figure 6.1, with both the random input weights and the random hidden layer

biases, for pattern classifications. It has been noted that a good performance can be

achieved only when the number of hidden nodes is large enough [18], [29], [147]. In

[18], [147], batch learning was used to train the output weights of the SLFNs with both

the random input weights and random hidden layer biases, for a given set of training

data pairs. Because the batch training of the output weights is extremely fast and the

global minimum training error can be achieved, such a batch learning technique for the

SLFNs with the randomized input weights and hidden layer biases has been called the

Extreme Learning Machine (ELM). The basics of the ELM are briefly outlined as follows.



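Consider a set of $N$ training pairs with the input pattern vectors $x(1), x(2), \ldots, x(N)$
and the desired output vectors $t(1), t(2), \ldots, t(N)$, respectively.
For the input pattern vectors, the output vectors $o(1), o(2), \ldots, o(N)$ can be
generated based on (6.2.21) as follows:

$$\begin{bmatrix}
g(w_1^T x(1) + b_1) & g(w_2^T x(1) + b_2) & \cdots & g(w_{\tilde{N}}^T x(1) + b_{\tilde{N}}) \\
g(w_1^T x(2) + b_1) & g(w_2^T x(2) + b_2) & \cdots & g(w_{\tilde{N}}^T x(2) + b_{\tilde{N}}) \\
\vdots & \vdots & & \vdots \\
g(w_1^T x(N) + b_1) & g(w_2^T x(N) + b_2) & \cdots & g(w_{\tilde{N}}^T x(N) + b_{\tilde{N}})
\end{bmatrix} \beta = O \tag{6.2.23}$$

$$O = [\,o(1)\;\; o(2)\;\; \cdots\;\; o(N)\,]^T \tag{6.2.24}$$

Assume that

$$o(1) = t(1),\quad o(2) = t(2),\quad \ldots,\quad o(N) = t(N) \tag{6.2.25}$$

(6.2.23) can then be written as:

$$G\beta = T = [\,t(1)\;\; t(2)\;\; \cdots\;\; t(N)\,]^T \tag{6.2.26}$$

where $G$ denotes the hidden layer output matrix on the left-hand side of (6.2.23).
By solving (6.2.26), the optimal output weight matrix is obtained as follows:

$$\beta = G^{+} T \tag{6.2.27}$$

where $G^{+}$ is the Moore-Penrose generalized inverse of the matrix $G$.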
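The batch solution above can be sketched in a few lines of NumPy. This is a minimal illustration, not the thesis implementation: the `tanh` activation, the uniform range [-1, 1] for the random parameters, the synthetic two-class data, and the names `train_elm` and `predict_elm` are all assumptions made for the example.

```python
import numpy as np

def train_elm(X, T, n_hidden, rng):
    """Fit only the output weights; input weights and biases stay random."""
    n_inputs = X.shape[1]
    W = rng.uniform(-1.0, 1.0, size=(n_inputs, n_hidden))  # random input weights (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n_hidden)              # random hidden biases (assumed range)
    G = np.tanh(X @ W + b)                                 # hidden layer output matrix (tanh is an assumption)
    beta = np.linalg.pinv(G) @ T                           # Moore-Penrose least-squares solution
    return W, b, beta

def predict_elm(X, W, b, beta):
    """Network outputs for the rows of X."""
    return np.tanh(X @ W + b) @ beta

# Synthetic two-class problem standing in for real pattern vectors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
labels = (X[:, 0] + X[:, 1] > 0).astype(int)
T = np.eye(2)[labels]                                      # one-hot desired outputs

W, b, beta = train_elm(X, T, n_hidden=30, rng=rng)
accuracy = np.mean(np.argmax(predict_elm(X, W, b, beta), axis=1) == labels)
```

Because only `beta` is solved for, training reduces to a single pseudoinverse, which is why the batch step is fast.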

Figure 6.2 ~ Figure 6.4 show the recognition results of the handwritten digit 1 by

using the SLFN classifiers trained with the ELM, where the numbers of the hidden


nodes are 10, 50, and 100, respectively, and both the training data and testing data are

randomly selected from the MNIST database [82].

Figure 6.2 Recognition of the handwritten digits by using the SLFN classifier with 10

hidden nodes trained with the ELM

Figure 6.3 Recognition of the handwritten digits by using the SLFN classifier with 50

hidden nodes trained with the ELM




Figure 6.4 Recognition of the handwritten digits by using the SLFN classifier with 100

hidden nodes trained with the ELM

It has been seen that, when the number of the hidden nodes is very small, say 10,

the recognition accuracy is only about 36%. However, with the increase of the number

of hidden nodes, the recognition accuracy is gradually improved. For instance, the

recognition accuracies are about 67% and 78% when the numbers of the hidden nodes

are 50 and 100, respectively, as seen in Figure 6.3 and Figure 6.4.

Remark 6.2.3: Unlike conventional feature vectors, which contain the
primary information of the corresponding input pattern vectors, the feature
vectors generated from the output layer of the SLFNs with the ELM have little similarity with
the input patterns. This characteristic of the feature vectors is mainly due to
the randomness of the input weights and the hidden layer biases. In fact, the main
concern in handwritten digit recognition is whether the generated features can properly
represent the corresponding input handwritten digit patterns, in the sense that the input
handwritten digit patterns can be accurately recognized from the output layer of the
neural classifiers under a noisy environment. Thus, from this viewpoint, it is not
important whether or not the input patterns and the corresponding feature vectors have
any similarity.





Remark 6.2.4: In order to avoid the singularities that may occur in the inverse of the

feature vector data correlation matrix for deriving the optimal output weights by using

the batch learning type of least squares method in the ELM, the regularization theory

has been used to smooth the cost function at the singular point of the correlation matrix

of the feature vector data from the output of the hidden layer [18]. Such a modified

ELM is called the Regularized Extreme Learning Machine (R-ELM). However, when

the feature vector data correlation matrix has no singularity, the recognition

performance of the SLFN classifiers with the R-ELM has no significant difference from

the SLFNs with the ELM. This point can be clearly seen from Figure 6.5 ~ Figure 6.7,
where the SLFNs are trained with the R-ELM and the numbers of the hidden nodes are 10,
50, and 100, respectively.
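The effect of the regularization can be illustrated with a small sketch. It assumes the usual ridge-type closed form, in which a positive multiple of the identity is added to the correlation matrix before inversion; the function name `ridge_output_weights`, the parameter name `ratio`, and the synthetic matrices are hypothetical and chosen only for the example.

```python
import numpy as np

def ridge_output_weights(G, T, ratio):
    """Solve (ratio * I + G^T G) beta = G^T T for the output weights."""
    n_hidden = G.shape[1]
    A = ratio * np.eye(n_hidden) + G.T @ G   # regularized correlation matrix
    return np.linalg.solve(A, G.T @ T)

rng = np.random.default_rng(1)
G_full = np.tanh(rng.normal(size=(50, 4)))        # well-conditioned feature matrix
G_singular = np.hstack([G_full, G_full[:, :1]])   # duplicated column: G^T G is singular
T = rng.normal(size=(50, 2))

# With a singular correlation matrix the regularized system is still solvable.
beta_reg = ridge_output_weights(G_singular, T, ratio=0.01)

# Without singularity and with a tiny ratio, the result matches the pseudoinverse solution.
beta_small = ridge_output_weights(G_full, T, ratio=1e-10)
beta_pinv = np.linalg.pinv(G_full) @ T
```

The second comparison mirrors the observation in the remark: when the correlation matrix is not singular, the regularized and unregularized solutions are practically the same.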

Figure 6.5 Recognition of the handwritten digits by using the SLFN classifier with 10

hidden nodes trained with the R-ELM




Figure 6.6 Recognition of the handwritten digits by using the SLFN classifier with 50

hidden nodes trained with the R-ELM

Figure 6.7 Recognition of the handwritten digits by using the SLFN classifier with 100

hidden nodes trained with the R-ELM




Remark 6.2.5: Although the ELM and the R-ELM can work well when the number of the
hidden nodes is large enough, as shown in [18], [29], [147], a few open problems have
been noted recently. For instance, (i) how can the upper and the lower bounds of the
random input weights as well as the random hidden layer biases be chosen in the sense
that the separability of the feature vectors from the outputs of the hidden layer is
maximized? (ii) is it possible to optimize the input weights so that the SLFN classifiers
can achieve a better performance than the SLFNs trained with the ELM or
the R-ELM for image recognition? (iii) can the SLFN classifiers with a small
number of hidden nodes achieve a high degree of classification accuracy in practice?

The first open issue in the above is related to the determination of the integral

domain of the limit integral in (6.2.5) or the domain of the parameter vector .

Obviously, if the domain of the parameter vector cannot be chosen properly, the limit

integral in (6.2.5) cannot represent the function ( ( )) well and thus, the function

( ( )) cannot be properly approximated as seen in (6.2.9).

The second open issue in Remark 6.2.5 is to search for the optimal design of the

input weights so that the SLFN classifiers can perform better than the ones with the

ELM and the R-ELM regarding the recognition accuracy, the robustness with respect to

the input disturbances, and the separability of the feature vectors. In [9], the authors

used the signal processing techniques to design the input weights of the SLFN

classifiers for improving the approximation accuracy, the robustness with respect to the

input disturbances and the separability of the feature vectors. However, the optimal

design of the input weights was not explored.

The third open question in Remark 6.2.5 is to look for the optimal designs of the

SLFN classifiers with a small number of hidden nodes to achieve high recognition

accuracy for practical applications. In the following, the second and the third open

problems in Remark 6.2.5 will be explored through optimizing both the input and output

weights of the SLFNs with a small number of hidden nodes, to achieve a high degree of

accuracy for the handwritten digit image recognition.


6.3 Optimal Weight Learning Machine

Figure 6.8 shows a single hidden layer feedforward neural network with linear nodes
and an input tapped delay line, where the output layer has $m$ linear nodes, the hidden
layer has $\tilde{N}$ linear nodes, $w_{ij}$ (for $i = 1, \ldots, \tilde{N}$ and $j = 1, \ldots, n$) are the input weights,
$\beta_{ij}$ (for $i = 1, \ldots, \tilde{N}$ and $j = 1, \ldots, m$) are the output weights, $y_i(k)$ (for $i = 1, \ldots, \tilde{N}$) are
the outputs of the hidden nodes, and a string of $D$s, seen at the input layer, are unit-delay
elements that ensure the input sequence $x(k), x(k-1), \ldots, x(k-n+1)$ represents
a time series, consisting of both the present and the past observations of the process.

Figure 6.8 A single hidden layer neural network with linear nodes

and an input tapped delay line

Remark 6.3.1: One of the many advantages of using linear hidden nodes in the SLFNs

is that the biases are not required. This is because adding the biases to the linear hidden

nodes is equivalent to shifting the outputs of the hidden nodes up or down. It can be

further seen from the later discussions that the function of the biases can also be

alternatively achieved by the proper adjustment of the positions of the reference feature

vectors in the feature space.

Remark 6.3.2: It has been shown in [9] that an SLFN with both linear nodes and a string
of unit-delay elements at the input layer in Figure 6.8 has the capability of
universal approximation of any continuous function when the number of the hidden nodes
is large enough [113]. On the other hand, because of the $n-1$ unit-delay elements
added to the input layer, every hidden node in the SLFN performs as an $(n-1)$th-order finite
impulse response (FIR) filter that is often used to approximate a linear or a non-linear
function by properly choosing the input weights in signal processing and control
engineering. Also, in practice, the design and analysis of the SLFNs with linear nodes
are much easier than those of the SLFNs with non-linear nodes in Figure 6.1.
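The equivalence between a linear hidden node fed by a tapped delay line and an FIR filter can be checked numerically. The sketch below is illustrative only: the helper name `delay_line_matrix`, the short input sequence and the three tap weights are made up for the example.

```python
import numpy as np

def delay_line_matrix(x, n_taps):
    """Row for time k holds [x(k), x(k-1), ..., x(k-n_taps+1)]."""
    rows = [x[k - n_taps + 1:k + 1][::-1] for k in range(n_taps - 1, len(x))]
    return np.array(rows)

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 1.0])   # input time series
w = np.array([0.5, 0.25, -0.25])                # input weights of one linear hidden node

# Output of the linear hidden node for every time step with a full delay line.
hidden_output = delay_line_matrix(x, len(w)) @ w

# The same sequence obtained by filtering x with an FIR filter whose taps are w.
fir_output = np.convolve(x, w, mode="valid")
```

Both computations give the same sequence, so choosing the input weights of a hidden node amounts to choosing the impulse response of an FIR filter.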

Similar to the discussion in Section 6.2, for the given input pattern vectors

( ) ( ) ( ) and the corresponding desired output data vectors

( ) ( ) ( ), respectively, linear output equations of the SLFN in Figure

6.8 can be obtained as:

( )


[ ( ) ( ) ( )]

[ ( )

( )

( )

( )

( ) ( )

( )

( ) ( ) ]

( )


[ ( ) ( ) ( )] ( )

Remark 6.3.3: The matrix in (6.3.2) contains all feature vectors ,

corresponding to the input data vectors , respectively. For instance, for

the th given input pattern vector ( ), the corresponding feature vector generated by

the outputs of the hidden layer is the th row of the matrix :

[ ( )

( ) ( ) ]

( )


Since the output of the th hidden node, corresponding to the th input pattern

vector, can be written as:

( ) ( ) ( )

the th feature vector in (6.3.4) can then be written as:

[ ( )

( ) ( ) ]

[ ( ) ( ) ( )]

( )

Considering all input training data pattern vectors ( ) ( ) ( ), it

can be seen that,

[ ] [ ] ( )


( )


[ ] ( )

Let the reference feature vectors be described by

[ ] ( )

Conventionally, the selection of the desired feature vectors in (6.3.10) is

mainly based on one’s understanding of the characteristics of the input patterns.

However, in this chapter, borrowing the idea of model reference control from modern

control engineering [114], the feature vectors generated by the SLFNs trained with the

R-ELM will be used as the “desired reference feature vectors” for the ones generated by

the outputs of the SLFN in Figure 6.8 to follow. Through optimizing the input weights

of the SLFN in Figure 6.8, the feature vectors of the SLFN in Figure 6.8 can be placed

at the “desired position” in feature space. The purpose of such feature vectors’


assignment is to further maximize the separability of the feature vectors so that the

classification or recognition accuracy, seen from the output layer of the SLFN, can be

greatly improved, compared with the SLFN classifiers in Figure 6.1, trained with both

the ELM and the R-ELM.

In the following, the modified regularization technique in [9] and the batch

training methodology in [1], [2] will be used to develop an optimal weight learning

machine that optimizes both the input weight matrix and the output weight matrix

in the sense that (i) the error between the reference feature vectors and the feature

vectors generated by the hidden layer of the SLFN classifier in Figure 6.8 can be

minimized and then (ii) the error between the desired output pattern and the actual

output pattern of the SLFN classifier is minimized. For convenience, the learning

machine described below with both the optimal input weights and the optimal output

weights for the SLFN classifier in Figure 6.8 is called the optimal weight learning
machine (OWLM).

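The design of the optimal input weights can be formulated by the following
optimization problem [9]:

$$\text{Minimize}\quad \Big\{ \frac{\gamma_1}{2}\,\|\varepsilon\|^2 + \frac{d_1}{2}\,\|W\|^2 \Big\} \tag{6.3.11}$$

$$\text{Subject to}\quad \varepsilon = Y_d - XW \tag{6.3.12}$$

where $X$ is the input data matrix, $W$ is the input weight matrix, $Y_d$ is the reference
feature vector matrix in (6.3.10), $\varepsilon$ is the error between the reference feature vectors
and the feature vectors generated by the hidden layer of the SLFN in Figure 6.8, $d_1$ and
$\gamma_1$ are the positive real regularization parameters, and $\|W\|^2$ is the regularizer that,
through the proper choice of the regularization parameters $d_1$ and $\gamma_1$, is used to smooth
the cost function at the singular point of the correlation matrix of the feature vectors to
avoid the ill-posed inverse of the data matrix.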

The optimization problem in (6.3.11) with the constraint in (6.3.12) can be

solved by using the method of Lagrange multipliers, with the following Lagrange

function :




∑∑ ( ∑


( )

where is the th element of the error matrix defined in (6.3.12), is the th

element of the input weight matrix , is the th element of the reference feature

vector matrix , defined in (6.3.10), ∑

, and is the th

Lagrange multiplier.

Differentiating with respect to ,

(∑∑ (∑



( )



[ ] [

] ( )


[ ] [ ] [


( )




] [

] [


( )


( )

In addition, differentiating with respect to ,

( )


, the following relationship can be obtained:

( )


( )

Considering the constraint in (6.3.12), (6.3.21) can be expressed as:

( ) ( )

and using (6.3.21) in (6.3.18) leads to

( )

( )

Then, the optimal input weight matrix is derived as follows:

$$W = \Big(\frac{d_1}{\gamma_1}\,I + X^T X\Big)^{-1} X^T Y_d \tag{6.3.24}$$
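The stationarity argument behind this closed form can be sketched compactly in matrix form. The notation here is an assumption made for illustration and may differ from the element-wise derivation above: $X$ is the input data matrix, $W$ the input weight matrix, $Y_d$ the reference feature matrix, $\varepsilon = Y_d - XW$ the feature error, $\lambda$ the Lagrange multiplier matrix, $d_1$ and $\gamma_1$ the positive regularization parameters, and all norms are Frobenius norms.

```latex
\begin{aligned}
L &= \frac{\gamma_1}{2}\,\mathrm{tr}(\varepsilon^T \varepsilon)
   + \frac{d_1}{2}\,\mathrm{tr}(W^T W)
   - \mathrm{tr}\big(\lambda^T(\varepsilon - Y_d + XW)\big) \\
\frac{\partial L}{\partial W} &= d_1 W - X^T \lambda = 0
   \;\;\Rightarrow\;\; W = \frac{1}{d_1}\,X^T \lambda \\
\frac{\partial L}{\partial \varepsilon} &= \gamma_1 \varepsilon - \lambda = 0
   \;\;\Rightarrow\;\; \lambda = \gamma_1 \varepsilon = \gamma_1 (Y_d - XW) \\
d_1 W &= \gamma_1 X^T (Y_d - XW)
   \;\;\Rightarrow\;\; \Big(\frac{d_1}{\gamma_1}\,I + X^T X\Big) W = X^T Y_d
\end{aligned}
```

Under these assumptions the multiplier is proportional to the feature error, and the input weights depend on the two regularization parameters only through their ratio.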


Remark 6.3.4: It is important to note that, in the conventional regularization theory [1],
the regularization parameter $\gamma_1$ is set to 1 and the regularization for solving the ill-posed
inverse problem of the data matrix depends only on the small regularization parameter
$d_1$ for smoothing around the minimum point of the cost function in the input weight
space. However, it has been noted that the value of the regularization parameter $\gamma_1$
affects the width or the sharpness of the regularized cost function in (6.3.11) [9]. For
instance, if the value of the regularization parameter $\gamma_1$ is very large, the slope of the
cost function in (6.3.11) will be very steep. The hidden layer of the SLFN is thus very
sensitive to the change of the input weights and input disturbances. However, if the
value of the regularization parameter $\gamma_1$ is very small, the changing rate of the cost
function in (6.3.11) will be very small, and the hidden layer of the SLFN is thus
insensitive to the changes of the input weights and input disturbances. Therefore, it is
essential to choose the regularization parameter $\gamma_1$ properly so that the sensitivity of the
hidden layer can be controlled and also the separability of the feature vectors at the
outputs of the hidden layer of the SLFN can be maximized.

Remark 6.3.5: As described in optimization theory [110], the Lagrange multiplier

matrix in (6.3.22) describes the sensitivity of the cost function in (6.3.11) with respect

to the constraint in (6.3.12), that is, the Lagrange multiplier matrix determines how

tightly the constraint in (6.3.12) is binding at the optimal point with the optimal input

weights in (6.3.19). Thus, the Lagrange multiplier matrix qualitatively specifies the

effects of the noisy environment on the input weights’ optimization, and the effects of

the structural risk and the empirical risk on the robustness of the SLFN classifier with

the OWLM.

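Similarly, the output weight matrix $\beta$ can be optimized to minimize the error
between the desired output pattern and the actual output pattern of the SLFN classifier.
The optimization is formulated as [9]:

$$\text{Minimize}\quad \Big\{ \frac{\gamma_2}{2}\,\|e\|^2 + \frac{d_2}{2}\,\|\beta\|^2 \Big\} \tag{6.3.25}$$

$$\text{Subject to}\quad e = T - Y\beta \tag{6.3.26}$$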


with the corresponding Lagrange function :




( )

( )

where is the th element of the error matrix , is the th element of the output

weight matrix , is the th element of the desired output data matrix , is the th

column of the hidden layer output matrix , is the th column of the output weight

matrix , is the th Lagrange multiplier, and are the real positive

regularization parameters and ‖ ‖ is the regularizer of the output layer.

Following the discussions from (6.3.14) to (6.3.24), the optimal output layer
weight matrix is obtained as:

$$\beta = \Big(\frac{d_2}{\gamma_2}\,I + Y^T Y\Big)^{-1} Y^T T \tag{6.3.28}$$

with the sensitivity matrix :

( ) ( )

Remark 6.3.6: It is seen from (6.3.28) that the optimal output layer weight matrix $\beta$
depends only on the ratio $d_2/\gamma_2$. However, the sensitivity and the robustness property of the
output layer of the SLFN with respect to the changes of the feature vectors depend on
the value of $\gamma_2$. Generally, the value of $\gamma_2$ should be chosen properly to ensure the
trade-off between the sensitivity and the smoothness of the SLFN classifier.
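The two-stage procedure of this section can be sketched as follows. This is a hedged illustration rather than the thesis code: it assumes both stages use a ridge-type closed form that depends only on a regularization ratio, it uses synthetic data, and the reference features come from a random `tanh` hidden layer as a stand-in for R-ELM features. The names `ridge_solve`, `train_owlm`, `ratio_in` and `ratio_out` are hypothetical.

```python
import numpy as np

def ridge_solve(A, B, ratio):
    """Closed-form minimizer of ||B - A M||^2 + ratio * ||M||^2 over M."""
    return np.linalg.solve(ratio * np.eye(A.shape[1]) + A.T @ A, A.T @ B)

def train_owlm(X, T, Y_ref, ratio_in, ratio_out):
    """Two-stage fit for an SLFN with linear hidden nodes and no biases."""
    W = ridge_solve(X, Y_ref, ratio_in)    # stage 1: input weights track the reference features
    Y = X @ W                              # feature vectors produced by the linear hidden layer
    beta = ridge_solve(Y, T, ratio_out)    # stage 2: output weights fit the desired outputs
    return W, beta

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 8))                    # synthetic input pattern vectors
labels = (X[:, 0] - X[:, 3] > 0).astype(int)
T = np.eye(2)[labels]                            # one-hot desired outputs

# Reference features from a random non-linear hidden layer (stand-in for R-ELM features).
W_rand = rng.uniform(-1.0, 1.0, size=(8, 12))
b_rand = rng.uniform(-1.0, 1.0, size=12)
Y_ref = np.tanh(X @ W_rand + b_rand)

W, beta = train_owlm(X, T, Y_ref, ratio_in=0.01, ratio_out=0.01)
accuracy = np.mean(np.argmax(X @ W @ beta, axis=1) == labels)
```

In this sketch there are more hidden nodes (12) than inputs (8), so the feature correlation matrix in stage 2 is rank deficient and the regularization term is what keeps the linear system solvable.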


6.4 Experiments and Results

In this section, handwritten digit recognition using the SLFN classifiers trained
with the ELM, the R-ELM and the OWLM is implemented. The training data are first

taken from the MNIST handwritten digits database with 60,000 training samples as well

as 10,000 testing samples from multiple disjoint sets of writers [82]. The MNIST

handwritten digits database consists of the handwritten digits from 0 to 9, respectively.

Each digit is a gray-scale image of 28x28 pixels with the intensity range of 0 (black) to

255 (white). For conducting an unbiased experiment, the training and testing sets are all

randomly selected from the pools of the training and the testing samples. A sample set

of handwritten digits is shown in Figure 6.9.

Figure 6.9 A set of handwritten digits from the MNIST database

Before classifying the handwritten digits, each digit image, as in Figure 6.10(a),

is first divided into 14 rows by 14 columns, that is, the image is segmented into 196

small images, as seen in Figure 6.10(b).

Figure 6.10(a) Image of digit Figure 6.10(b) Segmented image

The mean of the pixel intensities of each small segment is computed as:

∑ ∑ ( )




where ( ) is the intensity of the th pixel of the th segment, (= 2) and (= 2)

are the numbers of rows and columns of pixels in each segment, respectively.

The means of all of the segments, from row 1 to row 14, in Figure 6.10(b), are

then arranged as the elements of the following time series type data vector:

[ ]
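The preprocessing step described above (2x2 segment means arranged row by row into a 196-element vector) can be sketched with a reshape. The function name `segment_means` and the synthetic test image are assumptions for the example; a real MNIST digit would be passed in the same way.

```python
import numpy as np

def segment_means(image, seg_rows=2, seg_cols=2):
    """Average each seg_rows x seg_cols block and flatten the result row by row."""
    rows, cols = image.shape
    blocks = image.reshape(rows // seg_rows, seg_rows, cols // seg_cols, seg_cols)
    return blocks.mean(axis=(1, 3)).reshape(-1)

# Synthetic 28x28 "image" whose pixel value equals its row-major index.
image = np.arange(28 * 28, dtype=float).reshape(28, 28)
features = segment_means(image)   # 14 x 14 = 196 segment means
```

For the synthetic image the first segment covers pixels 0, 1, 28 and 29, so its mean is 14.5, which gives a quick sanity check on the block layout.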

In this experiment, the randomly generated input weights and the hidden layer

biases for the SLFNs with both the ELM and the R-ELM are set within [-1, 1]. The

SLFN classifiers with the ELM, the R-ELM and the OWLM are trained with the 1000

digit sets from the training set pool, and tested with the 800 digit sets from the testing

pool of the MNIST database, to evaluate the classifiers’ performances. The

regularization parameters for the OWLM are selected as and

, respectively, while the regularization ratio for R-ELM is set as 0.01.

The sample reference feature vectors for digit 0 from the hidden layer outputs of

the SLFNs with 10, 50, and 100 hidden nodes, respectively, trained with the R-ELM,

are shown in Figure 6.11(a) ~ Figure 6.11(c), respectively:

Figure 6.11 (a) Sample feature of digit 0 with 10 nodes (horizontal axis: sample index)

Figure 6.11 (b) Sample feature of digit 0 with 50 nodes (horizontal axis: sample index)

Figure 6.11 (c) Sample feature of digit 0 with 100 nodes (horizontal axis: sample index)

Figure 6.11(a) ~ Figure 6.11(c) Sample feature vectors for digit 0


It is seen from Figure 6.11(a) ~ Figure 6.11(c) that the amplitudes of the feature
vectors' components are always 1 or -1. The hidden nodes therefore behave like the standard
binary perceptron, and the SLFN classifier with the R-ELM resembles the well-known
threshold networks. A detailed observation shows that the values of the dot
products of the input pattern vectors and the input weights are always greater than 1 or
less than -1, and thus all the hidden nodes work in their saturation regions with
outputs of 1 or -1.

The handwritten digit recognition results of the SLFNs trained with the ELM

and the R-ELM have been presented in Figure 6.2 ~ Figure 6.7, respectively. The

handwritten digit recognition results of the SLFNs trained with the OWLM are given in

Figure 6.12 ~ Figure 6.14, where the numbers of the hidden nodes are 10, 50, and 100,

respectively. It is seen that, compared to the recognition results with the ELM and the

R-ELM in Figure 6.2 ~ Figure 6.7, the recognition accuracy of the SLFN classifier

trained with the OWLM has been greatly improved.

Figure 6.12 Recognition of the handwritten digits by using the SLFN classifier with 10

hidden nodes trained with the OWLM




Figure 6.13 Recognition of the handwritten digits by using the SLFN classifier with 50

hidden nodes trained with the OWLM

Figure 6.14 Recognition of the handwritten digits by using the SLFN classifier with

100 hidden nodes trained with the OWLM




Table 6.1 shows the summary of the comparisons of the recognition accuracies

of the SLFN classifiers with the ELM, R-ELM and the OWLM, where the feature

vectors of the SLFN with the R-ELM are used as the reference feature vectors of the

SLFN with the OWLM, and the input weights of the SLFN with the OWLM are

computed by using (6.3.24). It is seen that the SLFN with the OWLM achieves the

highest degree of recognition accuracy and the smallest standard deviation compared to

those of the SLFNs with both the ELM and the R-ELM, over 50 iterations, as the

number of hidden nodes is increased from 10 to 150. The smallest standard deviations

of the SLFN classifier with the OWLM also show a stronger robustness with respect to

the changes of the digit writing styles, which are equivalent to the bounded input noises.

Obviously, the optimization of both the input weights and the output weights of the
SLFN with the OWLM plays a very important role in greatly improving the recognition
accuracy.


Table 6.1: Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the MNIST dataset

Hidden    ELM                  R-ELM                OWLM
nodes     Mean (%)  Std (%)    Mean (%)  Std (%)    Mean (%)  Std (%)
10        36.55     3.59       37.52     3.51       49.46     3.19
25        54.51     2.37       55.71     2.65       69.58     1.66
50        68.24     1.70       68.86     1.60       79.57     0.76
100       78.58     0.86       78.74     0.71       84.36     0.32
150       82.83     0.55       82.96     0.53       85.16     0.11


Table 6.2 summarizes the recognition accuracies and the standard deviations of

the SLFN classifiers with the ELM, the R-ELM and the OWLM, respectively, for the

dataset from the USPS database [157]. It is seen that, similar to the results in Table 6.1

for the MNIST database, the OWLM achieves the highest degree of classification

accuracy and the smallest standard deviations compared to those of both the ELM and

the R-ELM. Such a consistency of the OWLM in achieving the best classification

performance for both the MNIST and USPS databases in the above has confirmed its

effectiveness for pattern classification and robustness with respect to the changes of the

handwriting styles.

Table 6.2: Classification accuracies of the handwritten digit classification
with the ELM, R-ELM and OWLM for the USPS dataset

Hidden    ELM                  R-ELM                OWLM
nodes     Mean (%)  Std (%)    Mean (%)  Std (%)    Mean (%)  Std (%)
10        36.69     3.07       36.02     3.04       46.66     2.69
25        54.27     2.30       54.38     2.34       68.07     2.30
50        68.29     1.53       67.54     1.75       79.06     0.91
100       78.47     0.85       78.47     0.84       84.00     0.41
150       82.40     0.67       82.66     0.75       85.45     0.24

From the viewpoint of industrial applications, a recognition accuracy up to about

85% is acceptable in many cases. Most importantly, the SLFN classifier with about 150

hidden nodes that is capable of learning the dataset well is generally favorable in

practical applications.

In the experimental results presented in Table 6.1 and Table 6.2 for the MNIST
dataset as well as the USPS dataset, the regularization ratios for both the R-ELM and
the OWLM are set to the same value. In order to study the effects of the
changes of the regularization ratios on the classification performances, Figure 6.15
shows the classification accuracies of the SLFN classifiers trained with the R-ELM and
OWLM, respectively, for the MNIST dataset, where each SLFN classifier has 150
hidden nodes and the regularization ratios are changed from 0.001 to
1000. It is seen that, when the regularization ratios are chosen between
0.1 and 1, the highest classification accuracy of about 86% can be achieved by the
SLFN classifier with the OWLM. However, the highest classification accuracy achieved
by the SLFN classifier with the R-ELM is only about 82.5%. Figure 6.16 shows the
classification accuracies of the SLFN classifiers trained with the R-ELM and OWLM,
respectively, for the USPS dataset. It is seen that, when the regularization ratios
are about 100, the highest classification accuracies can be achieved by
the SLFN classifiers trained with the R-ELM and OWLM, respectively. Again, the
SLFN classifier with the OWLM is confirmed to perform much better than the one with
the R-ELM.
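A sweep of this kind can be sketched as below. It is only an illustration of the procedure on synthetic data, not a reproduction of Figure 6.15 or Figure 6.16: the data, the `tanh` hidden layer with 20 nodes, the train/test split and the names `ridge_solve` and `holdout_accuracy` are all assumptions.

```python
import numpy as np

def ridge_solve(A, B, ratio):
    """Solve (ratio * I + A^T A) M = A^T B."""
    return np.linalg.solve(ratio * np.eye(A.shape[1]) + A.T @ A, A.T @ B)

def holdout_accuracy(ratio, X_tr, y_tr, X_te, y_te, W, b):
    """Train regularized output weights on one split and score them on the other."""
    G_tr = np.tanh(X_tr @ W + b)
    beta = ridge_solve(G_tr, np.eye(2)[y_tr], ratio)
    pred = np.argmax(np.tanh(X_te @ W + b) @ beta, axis=1)
    return float(np.mean(pred == y_te))

rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))                     # synthetic pattern vectors
y = (X[:, :3].sum(axis=1) > 0).astype(int)
X_tr, y_tr, X_te, y_te = X[:300], y[:300], X[300:], y[300:]

W = rng.uniform(-1.0, 1.0, size=(10, 20))          # fixed random hidden layer
b = rng.uniform(-1.0, 1.0, size=20)

ratios = [10.0 ** p for p in range(-3, 4)]         # 0.001, 0.01, ..., 1000
results = {r: holdout_accuracy(r, X_tr, y_tr, X_te, y_te, W, b) for r in ratios}
best_ratio = max(results, key=results.get)
```

Keeping the random hidden layer fixed across the sweep isolates the effect of the regularization ratio from the effect of the random initialization.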

Figure 6.15 Classification accuracy versus regularization ratio for the MNIST dataset

















Figure 6.16 Classification accuracy versus regularization ratio for the USPS dataset

Figure 6.17 and Figure 6.18 show the comparisons of the sample means of the

recognition accuracies of the SLFN classifiers with 150 hidden nodes, trained with the

OWLM, the R-ELM and the ELM, respectively, versus the number of training samples

from both the MNIST and USPS databases. It is noted again that the classifier with the

OWLM performs the best compared to the ones with both the ELM and the R-ELM.

Figure 6.17 Classification accuracies with OWLM, R-ELM and ELM versus the number
of training samples for the MNIST dataset






Figure 6.18 Classification accuracies with OWLM, R-ELM and ELM versus the number
of training samples for the USPS dataset

6.5 Conclusion

In this chapter, a new optimal weight learning machine for the handwritten digit image

recognition has been developed. Because of the optimal designs of both the input and

the output weights, the SLFN classifier with a small number of hidden nodes has

demonstrated an excellent performance for the recognition of the handwritten digit

images. The excellent recognition performance of the SLFN classifier equipped with the

OWLM has been evaluated and compared with those of the SLFN classifiers with both

the ELM and the R-ELM.







Chapter 7

Conclusions and Future Work

In this chapter, the contributions in this thesis are summarized, and some of the

interesting research directions are given as possible topics for future research.

7.1 Summary of Contributions

This thesis has investigated the robust designs of the training algorithms for SLFN

pattern classifiers. The focus of this research is on the methods of optimally designing

both the input and output weights of the SLFNs, in order to minimize the prediction

risks, and reduce the effects of noise and undesired pattern components when learning

from a finite sample set. Different from conventional neural network training algorithms,

the proposed new algorithms have not only demonstrated a fast batch learning

characteristic, but also achieved strong robustness with respect to disturbances and

highly non-linear separability. The key contributions of this thesis are summarized as
follows.


In Chapter 3, the FIR-ELM training algorithm has been proposed by using the

FIR filtering techniques to assign the input weights of the SLFNs, while the output layer

weights have also been optimally designed with the regularization method for balancing

and reducing both the structural and empirical prediction risks. The robustness and

generalization capability of the FIR-ELM have been verified with the excellent

classification results in the classification of the audio clips.


Chapter 4 has investigated the performance of the FIR-ELM in the classification

of bioinformatics datasets for cancer diagnosis. It has been seen that the frequency
domain feature selection method proposed in this chapter plays a very important role in
determining the filtering type for the input weights' design. The simulation results have
further shown that the FIR-ELM is very effective for the pattern classification of both
the leukemia and colon cancer datasets.

In Chapter 5, the DFT-ELM training algorithm has been developed for the

SLFNs. The DFT-ELM addresses the feature assignment in the frequency-domain in

terms of the optimal design of the input weights. Different from the FIR-ELM, the

regularization theory is used to design both the input weights and output weights in

order to balance and reduce the structural and empirical prediction risks in the training

phase. The two-stage optimization process has significantly reduced the structural
sensitivity of the SLFN classifiers, making them even more robust in their implementations,
as seen in the simulation example.

In Chapter 6, an optimal weights training algorithm, called the OWLM, has been

developed for the SLFNs based on the ELM framework. The reason for using the feature

vectors of an ELM based classifier as the reference feature vectors is that, most of the

time, the ELM algorithm can produce separable features as the number of hidden nodes

is large enough. Both the theoretical analysis and the simulation examples have shown

the excellent classification performances with OWLM.

7.2 Future Research

This section lists some interesting research directions for possible future research.

7.2.1 Ensemble Methods

All of the theories developed in this thesis relate to single SLFN classifiers only.
However, there are interesting works on combinations of SLFNs in the literature that
exploit the strength of neural classifier groups in making better classification decisions.
One of the recent works includes the voting based ELM ensemble [64], where
improved performance is obtained for almost every case tested using real-world datasets
when a number of SLFNs trained with the ELM work in parallel to make co-operative

classification decisions. It would be interesting to implement the FIR-ELM, the DFT-

ELM, or the OWLM as an ensemble architecture, in order to investigate the possible

gains in performance. Especially in the cases of FIR-ELM and DFT-ELM, the ensemble

architecture expands the possible design space for the hidden layer weights in terms of

the digital filtering theory.
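The combining step of a voting-based ensemble can be sketched as a plain majority vote over the class labels predicted by the individual classifiers. This is a generic illustration and not necessarily the scheme of [64]; the function name `majority_vote` and the example labels are hypothetical, and ties are resolved in favour of the lowest class index.

```python
import numpy as np

def majority_vote(predictions):
    """Combine labels of shape (n_classifiers, n_samples) by majority vote."""
    predictions = np.asarray(predictions)
    n_classes = predictions.max() + 1
    # Count the votes per class for every sample (one column per sample).
    counts = np.apply_along_axis(np.bincount, 0, predictions, minlength=n_classes)
    return counts.argmax(axis=0)

# Labels predicted by three classifiers for four samples.
votes = [[0, 1, 2, 1],
         [0, 2, 2, 1],
         [1, 1, 2, 0]]
combined = majority_vote(votes)
```

Each row of `votes` could come from an independently trained SLFN, so the same function applies regardless of which training algorithm produced the individual classifiers.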

7.2.2 Analytical Determination of Regularization Parameters

For all of the SLFN training algorithms developed in this thesis, the weight training
is performed with the regularization method, where the regularizer is commonly
defined as the sum of weight magnitudes, as seen in chapters 3, 5, and 6. Two
regularization constants are introduced to balance the structural and empirical risks,
respectively, in the optimization processes. However, throughout the simulation studies,
the optimal values of these regularization constants are determined empirically using
some version of cross validation. It is therefore highly desirable that the optimal values of

the regularization constants can be calculated deterministically in a finite time, such that

the chosen values can automatically balance both the structural and empirical risks

based on the characteristics of the sample datasets, or some a priori information.

7.2.3 Analysis of the Effects of Non-linear Nodes

It can be seen from chapters 3, 5, and 6 that the designs of the training algorithms use

the SLFNs with the linear hidden nodes, as well as the linear output nodes, and an input

tapped delay line memory. This SLFN architecture contains finite depth memory which

makes the learning with dynamics possible. Moreover, the analysis of these

architectures has convenient interpretations in terms of frequency domain and spatial

domain mappings. However, the conventional neural network architecture uses the non-

linear nodes at the hidden layer of the SLFNs, in order to allow the SLFNs with no

dynamics to learn the complex non-linear mappings. It is then interesting to examine the

theoretical interpretation of the SLFNs used in chapters 3, 5, and 6 if non-linear nodes

are used instead. Some ideas of the possible outcomes are investigated in the simulation


section of chapter 3, where it was found that different non-linear nodes produce vastly

different results, even if only the meta-parameters of the activation function are changed.

A more thorough investigation is required to test the effects of different types of

activation functions at the hidden layer of the SLFNs, in order to specify the theoretical

significance in the choice of nodes to be selected.

7.2.4 Multi-class Classification of Real-World Dataset

The analysis of the practical implementation of the FIR-ELM in chapter 4 shows that

the FIR-ELM has good performance in classifying bioinformatics datasets with binary

outcomes. However, recent advances in the field of cancer diagnosis [69], [125], [126]

have shown that it is possible to predict multiple types of cancer from the same

sample datasets. Therefore, an important question to

answer is whether the FIR-ELM can be easily extended to the multi-class case, where

the possibility of implementing a single unit classifier, the OAA classifier, the OAO

classifier, as well as the ensemble type classifier architecture should be considered.
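
As a rough sketch of what the OAA and OAO decompositions involve, the following code builds the binary sub-problems for a K-class task and combines the binary decisions. It is only an illustration under stated assumptions: the helper names, class labels, and classifier outputs are hypothetical, and any binary classifier (for example an FIR-ELM) is assumed to supply the scores and pairwise winners. OAA requires K binary classifiers, while OAO requires K(K-1)/2.

```python
from collections import Counter
from itertools import combinations


def oaa_targets(labels, classes):
    """One-against-all: one +1/-1 target vector per class."""
    return {c: [1 if y == c else -1 for y in labels] for c in classes}


def oao_pairs(classes):
    """One-against-one: one binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def oaa_decide(scores):
    """Pick the class whose binary classifier gives the largest output.

    scores: dict mapping class -> real-valued classifier output.
    """
    return max(scores, key=scores.get)


def oao_decide(pair_winners):
    """Majority vote over pairwise winners (ties are not handled specially).

    pair_winners: dict mapping (class_a, class_b) -> winning class.
    """
    votes = Counter(pair_winners.values())
    return votes.most_common(1)[0][0]


# Hypothetical three-class example (assumed labels and outputs).
classes = ["A", "B", "C"]
labels = ["A", "C", "B", "A"]

targets = oaa_targets(labels, classes)
pairs = oao_pairs(classes)

oaa_choice = oaa_decide({"A": -0.2, "B": 0.7, "C": 0.1})
oao_choice = oao_decide({("A", "B"): "B", ("A", "C"): "C", ("B", "C"): "B"})
```

Which decomposition suits the FIR-ELM better remains an open question; the sketch only shows the bookkeeping involved in each scheme.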

Lastly, it would also be interesting to extend the real-world applications of the

DFT-ELM and the OWLM to more complex problems, in order to further

demonstrate the effectiveness of these learning schemes.



[1] S. Haykin, Neural networks and learning machines (3rd Edition), Pearson,

Prentice Hall, New Jersey, 2009.

[2] S. Kumar, Neural networks, McGraw-Hill Companies, Inc., 2006.

[3] G.-B. Huang, Q.-Y. Zhu, and C.-K. Siew, “Extreme learning machine:

Theory and applications,” Neurocomputing, vol. 70, no. 1-3, pp. 489-501,

Dec. 2006.

[4] G. Cybenko, “Approximation by superpositions of a sigmoidal function,”

Mathematics of Control, Signals, and Systems, vol. 2, pp. 303-314, 1989.

[5] K. I. Funahashi, “On the approximate realization of continuous mappings by

neural networks,” Neural Networks, vol. 2, pp. 183-192, 1989.

[6] K. Hornik, “Approximation capabilities of multilayer feedforward networks,”

Neural Networks, vol. 4, pp. 251-257, 1991.

[7] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, “Multilayer feedforward

networks with a nonpolynomial activation function can approximate any

function,” Neural Networks, vol. 6, pp. 861-867, 1993.

[8] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal

representations by error propagation,” Parallel Distributed Processing:

Explorations in the Microstructure of Cognition, vol. 1, pp. 318-362, 1986.


[9] Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A new robust training

algorithm for a class of single hidden layer neural networks,”

Neurocomputing, vol. 74, pp. 2491-2501, 2011.

[10] Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A modified ELM

algorithm for single-hidden layer feedforward neural networks with linear

nodes,” 6th IEEE Conference on Industrial Electronics and Applications,

ICIEA 2011, pp. 2524-2529, 21-23 Jun. 2011.

[11] K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of Bioinformatics

Dataset using Finite Impulse Response Extreme Learning Machine for

Cancer Diagnosis,” Neural Computing & Applications, Available online 30

Jan. 2012, Doi: 10.1007/s00521-012-0847-z.

[12] K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of microarray

datasets using finite impulse response extreme learning machine for cancer

diagnosis,” 37th Annual Conference on IEEE Industrial Electronics Society,

IECON 2011, pp. 2347-2352, 7-10 Nov. 2011.

[13] W. McCulloch and W. Pitts, “A logical calculus of the ideas immanent in

nervous activity,” Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133,


[14] J. C. Principe, N. R. Euliano, and W. C. Lefebvre, Neural and Adaptive

Systems: Fundamentals through Simulations, John Wiley & Sons, Inc., New

York, 1999.

[15] S. Kleene, “Representation of events in nerve nets and finite automata,” In

C. Shannon and J. McCarthy, editors, Automata Studies, pages 3-42,

Princeton University Press, Princeton, N.J., 1956.


[16] F. Rosenblatt, “The Perceptron: A Probabilistic Model for Information

Storage and Organization in the Brain,” Cornell Aeronautical Laboratory,

Psychological Review, vol. 65, no. 6, pp. 386-408, 1958.

[17] M. Minsky and S. Papert, “Perceptrons: An Introduction to Computational

Geometry,” M.I.T. Press, Cambridge, Mass., 1969.

[18] G.-B. Huang, D. H. Wang, and Y. Lan, “Extreme learning machines: a

survey,” Int. J. Mach. Learn. & Cyber, vol. 2, no. 2, pp. 107-122, 2011.

[19] S. Abe, Support Vector Machines for Pattern Classification, Springer, 2005.

[20] C. Cortes and V. Vapnik, “Support-Vector Networks,” Machine Learning,

vol. 20, pp. 273-297, 1995.

[21] D. Lowe, “Adaptive radial basis function nonlinearities, and the problem of

generalisation,” First IEE International Conference on Artificial Neural

Networks, 1989, (Conf. Publ. No. 313), pp. 171-175, 16-18 Oct. 1989.

[22] S. Chen, C. F. N. Cowan, and P. M. Grant, “Orthogonal least squares

learning algorithm for radial basis function networks,” IEEE Transactions

on Neural Networks, vol. 2, no. 2, pp. 302-309, Mar. 1991.

[23] G.-B. Huang and L. Chen, “Enhanced random search based incremental

extreme learning machine,” Neurocomputing, vol. 71, no. 16-18, pp. 3460-

3468, Oct. 2008.

[24] G.-B. Huang and L. Chen, “Convex incremental extreme learning machine,”

Neurocomputing, vol. 70, no. 16-18, pp. 3056-3062, Oct. 2007.

[25] G.-B. Huang, L. Chen, and C.-K. Siew, “Universal approximation using

incremental constructive feedforward networks with random hidden nodes,”

IEEE Transactions on Neural Networks, vol. 17, no. 4, pp. 879-892, 2006.


[26] D. S. Broomhead and D. Lowe, “Multi-variable Functional Interpolation

and Adaptive Networks,” Complex Systems, vol. 2, no. 3, pp. 269-303,


[27] Y. H. Pao and Y. Takefuji, “Functional-link net computing, theory, system

architecture, and functionalities,” IEEE Comput., vol. 25, no. 5, pp. 76-79,

May 1992.

[28] Y. H. Pao, G. H. Park, and D. J. Sobajic, “Learning and generalization

characteristics of random vector functional-link net,” Neurocomputing, vol.

6, pp. 163-180, 1994.

[29] B. Igelnik and Y. H. Pao, “Stochastic choice of basis functions in adaptive

function approximation and the functional-link net,” IEEE Transactions on

Neural Networks, vol. 6, no. 6, pp. 1320-1329, Nov. 1995.

[30] G.-H. Park, Y.-H. Pao, B. Igelnik, K. G. Eyink, and S. R. LeClair, “Neural-

net computing for interpretation of semiconductor film optical ellipsometry

parameters,” IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 816-

829, Jul. 1996.

[31] B. Igelnik, Y.-H. Pao, S. R. LeClair, and C. Y. Shen, “The ensemble

approach to neural-network learning and generalization,” IEEE Transactions

on Neural Networks, vol. 10, no. 1, pp. 19-30, Jan. 1999.

[32] Z. Meng and Y. H. Pao, “Visualization and self-organization of

multidimensional data through equalized orthogonal mapping,” IEEE

Transactions on Neural Networks, vol. 11, no. 4, pp. 1031-1038, Jul. 2000.

[33] Y. Wang, F. Cao, and Y. Yuan, “A study on effectiveness of extreme

learning machine,” Neurocomputing, vol. 74, no. 16, pp. 2483-2490, Sept.



[34] X. Tang and M. Han, “Partial Lanczos extreme learning machine for single-

output regression problems,” Neurocomputing, vol. 72, no. 13-15, pp. 3066-

3076, Aug. 2009.

[35] G. Zhao, Z. Shen, C. Miao, and Z. Man, “On improving the conditioning of

extreme learning machine: A linear case,” 7th International Conference on

Information, Communications and Signal Processing, ICICS 2009, pp. 1-5,

8-10 Dec. 2009.

[36] H. T. Huynh and Y. Won, “Evolutionary algorithm for training compact

single hidden layer feedforward neural networks,” IEEE International Joint

Conference on Neural Networks, IJCNN 2008, pp. 3028-3033, 1-8 Jun.


[37] Q.-Y. Zhu, A.-K. Qin, P. N. Suganthan, and G.-B. Huang, “Evolutionary

extreme learning machine,” Pattern Recognition, vol. 38, no. 10, pp. 1759-

1763, Oct. 2005.

[38] X.-K. Wei and Y.-H. Li, “Linear programming minimum sphere set

covering for extreme learning machines,” Neurocomputing, vol. 71, no. 4-6,

pp. 570-575, Jan. 2008.

[39] Y. Yuan, Y. Wang, and F. Cao, “Optimization approximation solution for

regression problem based on extreme learning machine,” Neurocomputing,

vol. 74, no. 16, pp. 2475-2482, Sept. 2011.

[40] F.-N. Francisco, H.-M. César, S.-M. Javier, and A. G. Pedro, “MELM-

GRBF: A modified version of the extreme learning machine for generalized

radial basis function neural networks,” Neurocomputing, vol. 74, no. 16, pp.

2502-2510, Sept. 2011.


[41] H. T. Huynh and Y. Won, “Extreme Learning Machine with Fuzzy

Activation Function,” Fifth International Joint Conference on INC, IMS and

IDC, NCM 2009, pp. 303-307, 25-27 Aug. 2009.

[42] X.-Z. Wang, A. Chen, and H. Feng, “Upper integral network with extreme

learning mechanism,” Neurocomputing, vol. 74, no. 16, pp. 2520-2525, Sept.


[43] M.-B. Li, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “Fully

complex extreme learning machine,” Neurocomputing, vol. 68, pp. 306-314,

Oct. 2005.

[44] G.-B. Huang, M.-B. Li, L. Chen, and C.-K. Siew, “Incremental extreme

learning machine with fully complex hidden nodes,” Neurocomputing, vol.

71, no. 4-6, pp. 576-583, Jan. 2008.

[45] J.-S. Lim, “Recursive DLS solution for extreme learning machine-based

channel equalizer,” Neurocomputing, vol. 71, no. 4-6, pp. 592-599, Jan.


[46] F. Han and D.-S. Huang, “Improved extreme learning machine for function

approximation by encoding a priori information,” Neurocomputing, vol. 69,

no. 16-18, pp. 2369-2373, Oct. 2006.

[47] Y. Lan, Y. C. Soh, and G.-B. Huang, “Constructive hidden nodes selection

of extreme learning machine for regression,” Neurocomputing, vol. 73, no.

16-18, pp. 3191-3199, Oct. 2010.

[48] S.-S. Kim and K.-C. Kwak, “Incremental modeling with rough and fine

tuning method,” Applied Soft Computing, vol. 11, no. 1, pp. 585-591, Jan.



[49] J. Deng, K. Li, and G. W. Irwin, “Fast automatic two-stage nonlinear model

identification based on the extreme learning machine,” Neurocomputing, vol.

74, no. 16, pp. 2422-2429, Sept. 2011.

[50] L. Chen, G.-B. Huang, and H. K. Pung, “Systemical convergence rate

analysis of convex incremental feedforward neural networks,”

Neurocomputing, vol. 72, no. 10-12, pp. 2627-2635, Jun. 2009.

[51] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-

ELM: Optimally Pruned Extreme Learning Machine,” IEEE Transactions

on Neural Networks, vol. 21, no. 1, pp. 158-162, Jan. 2010.

[52] Y. Miche, M. Heeswijk, P. Bas, O. Simula, and A. Lendasse, “TROP-ELM:

A double-regularized ELM using LARS and Tikhonov regularization,”

Neurocomputing, vol. 74, no. 16, pp. 2413-2421, Sept. 2011.

[53] J. Yin, F. Dong, and N. Wang, “Modified Gram-Schmidt Algorithm for

Extreme Learning Machine,” Second International Symposium on

Computational Intelligence and Design, ISCID 2009, vol. 2, pp. 517-520,

12-14 Dec. 2009.

[54] H.-J. Rong, Y.-S. Ong, A.-H. Tan, and Z. Zhu, “A fast pruned-extreme

learning machine for classification problem,” Neurocomputing, vol. 72, no.

1-3, pp. 359-366, Dec. 2008.

[55] N.-Y. Liang, G.-B. Huang, P. Saratchandran, and N. Sundararajan, “A fast

and accurate online sequential learning algorithm for feedforward networks,”

IEEE Transactions on Neural Networks, vol. 17, no. 6, pp. 1411-1423, 2006.

[56] Y. Jun and M. J. Er, “An enhanced online sequential extreme learning

machine algorithm,” in Proc. of Control and Decision Conference, CCDC

2008, pp. 2902-2907, Yantai, Shandong, 2-4 Jul. 2008.


[57] Y. Lan, Y. C. Soh, and G.-B. Huang, “A constructive enhancement for

Online Sequential Extreme Learning Machine,” International Joint

Conference on Neural Networks, IJCNN 2009, pp. 1708-1713, 14-19 Jun.


[58] H.-J. Rong, G.-B. Huang, N. Sundararajan, and P. Saratchandran, “Online

Sequential Fuzzy Extreme Learning Machine for Function Approximation

and Classification Problems,” IEEE Transactions on Systems, Man, and

Cybernetics, Part B: Cybernetics, vol. 39, no. 4, pp. 1067-1072, Aug. 2009.

[59] H. T. Huynh and Y. Won, “Online training for single hidden-layer

feedforward neural networks using RLS-ELM,” IEEE International

Symposium on Computational Intelligence in Robotics and Automation,

CIRA 2009, pp. 469-473, 15-18 Dec. 2009.

[60] G. Li, M. Liu, and M. Dong, “A new online learning algorithm for structure-

adjustable extreme learning machine,” Computers & Mathematics with

Applications, vol. 60, no. 3, pp. 377-389, Aug. 2010.

[61] D. H. Wang, “ELM-based multiple classifier systems,” in Proc. of 9th

International Conference on Control, Automation, Robotics and Vision,

Singapore, Dec. 2006.

[62] Y. Lan, Y.-C. Soh, and G.-B. Huang, “Ensemble of online sequential

extreme learning machine,” Neurocomputing, vol. 72, no. 13-15, pp. 3391-

3395, 2009.

[63] N. Liu and H. Wang, “Ensemble based extreme learning machine,” IEEE

Signal Processing Letters, vol. 17, no. 8, pp. 754-757, 2010.

[64] J. Cao, Z. Lin, G.-B. Huang, and N. Liu, “Voting based extreme learning

machine,” Information Sciences, vol. 185, no. 1, pp. 66-77, 15 Feb. 2012.


[65] M. Heeswijk, Y. Miche, E. Oja, and A. Lendasse, “GPU-accelerated and

parallelized ELM ensembles for large-scale regression,” Neurocomputing,

vol. 74, no. 16, pp. 2430-2437, Sept. 2011.

[66] T. Helmy and Z. Rasheed, “Multi-category bioinformatics dataset

classification using extreme learning machine,” IEEE Congress on

Evolutionary Computation, CEC 2009, pp. 3234-3240, 18-21 May 2009.

[67] D. Wang and G.-B. Huang, “Protein sequence classification using extreme

learning machine,” in Proc. IEEE International Joint Conference on Neural

Networks, IJCNN 2005, vol. 3, pp. 1406-1411, 31 Jul.-4 Aug. 2005.

[68] G. Wang, Y. Zhao, and D. Wang, “A protein secondary structure prediction

framework based on the Extreme Learning Machine,” Neurocomputing, vol.

72, no. 1-3, pp. 262-268, Dec. 2008.

[69] R. Zhang, G.-B. Huang, N. Sundararajan, and P. Saratchandran,

“Multicategory Classification Using An Extreme Learning Machine for

Microarray Gene Expression Cancer Diagnosis,” IEEE/ACM Transactions

on Computational Biology and Bioinformatics, vol. 4, no. 3, pp. 485-495,

Jul.-Sept. 2007.

[70] S. Baboo and S. Sasikala, “Multicategory classification using an Extreme

Learning Machine for microarray gene expression cancer diagnosis,” IEEE

International Conference on Communication Control and Computing

Technologies, ICCCCT 2010, pp. 748-757, 7-9 Oct. 2010.

[71] D. Yu and L. Deng, “Efficient and effective algorithms for training single-

hidden-layer neural networks,” Pattern Recognition Letters, vol. 33, no. 5,

pp. 554-558, Apr. 2012.


[72] B. P. Chacko, V. R. V. Krishnan, G. Raju, and P. B. Anto, “Handwritten

character recognition using wavelet energy and extreme learning machine,”

Int. J. Mach. Learn. & Cyber, vol. 3, pp. 149-161, 2012.

[73] F.-C. Li, P.-K. Wang, and G.-E. Wang, “Comparison of the primitive

classifiers with Extreme Learning Machine in credit scoring,” IEEE

International Conference on Industrial Engineering and Engineering

Management, IEEM 2009, pp. 685-688, 8-11 Dec. 2009.

[74] G. Duan, Z. Huang, and J. Wang, “Extreme Learning Machine for Bank

Clients Classification,” International Conference on Information

Management, Innovation Management and Industrial Engineering, 2009,

vol. 2, pp. 496-499, 26-27 Dec. 2009.

[75] W. Deng, Q.-H. Zheng, S. Lian, and L. Chen, “Adaptive personalized

recommendation based on adaptive learning,” Neurocomputing, vol. 74, no.

11, pp. 1848-1858, May 2011.

[76] X.-G. Zhao, G. Wang, X. Bi, P. Gong, and Y. Zhao, “XML document

classification based on ELM,” Neurocomputing, vol. 74, no. 16, pp. 2444-

2451, Sept. 2011.

[77] Y. Sun, Y. Yuan, and G. Wang, “An OS-ELM based distributed ensemble

classification framework in P2P networks,” Neurocomputing, vol. 74, no.

16, pp. 2438-2443, Sept. 2011.

[78] W. Deng, Q.-H. Zheng, and L. Chen, “Real-Time Collaborative Filtering

Using Extreme Learning Machine,” IEEE/WIC/ACM International Joint

Conferences on Web Intelligence and Intelligent Agent Technologies, WI-

IAT 2009, vol. 1, pp. 466-473, 15-18 Sept. 2009.


[79] Q. J. B. Loh and S. Emmanuel, “ELM for the Classification of Music

Genres,” 9th International Conference on Control, Automation, Robotics

and Vision, ICARCV 2006, pp. 1-6, 5-8 Dec. 2006.

[80] I. W. Sandberg, “General structures for classification,” IEEE Transactions

on Circuits and Systems I, vol. 41, pp. 372-376, May 1994.

[81] G.-B. Huang, Y. Chen, and H. A. Babri, “Classification ability of single

hidden layer feedforward neural networks,” IEEE Transactions on Neural

Networks, vol. 11, pp. 799-801, May 2000.

[82] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-Based Learning

Applied to Document Recognition,” in Proc. of the IEEE, vol. 86, no. 11, pp.

2278-2324, 1998.

[83] V. S. Cherkassky and F. Mulier, Learning from Data: Concepts, Theory,

and Methods (2nd Edition), John Wiley & Sons, Inc., New York, 2007.

[84] A. N. Tikhonov, “On solving ill-posed problem and method of regularization,”

Doklady Akademii Nauk USSR, vol. 153, pp. 501-504, 1963.

[85] A. N. Tikhonov and V. Y. Arsenin, “Solution of Ill-Posed Problems,”

Washington, DC: Winston, 1977.

[86] V. N. Vapnik, The Nature of Statistical Learning Theory (2nd Edition), New

York, Springer, 1999.

[87] C. M. Bishop, Neural Networks for Pattern Recognition, Oxford University

Press, Walton Street, Oxford, 1995.

[88] K.-A. Toh, “Deterministic neural classification,” Neural Computation, vol.

20, no. 6, pp. 1565-1595, Jun. 2008.


[89] W. Deng, Q.-H. Zheng, and L. Chen, “Regularized extreme learning

machine,” in Proc. IEEE Symp. CIDM, pp. 389-395, Mar. 30-Apr. 2, 2009.

[90] G.-B. Huang, X. Ding, and H. Zhou, “Optimization method based extreme

learning machine for classification,” Neurocomputing, vol. 74, Issues 1-3,

pp. 155-163, Dec. 2010.

[91] J. Hertz, A. Krogh, and R. G. Palmer, Introduction to the theory of neural

computation, Addison-Wesley Publishing Company, 1991.

[92] Z. Man, H. R. Wu, and M. Palaniswami, “An adaptive tracking controller

using Neural Networks for a class of nonlinear systems,” IEEE Transactions

on Neural Networks, vol. 9, no. 5, pp. 947-955, 1998.

[93] G.-B. Huang, Q. Zhu, K. Mao, C. K. Siew, P. Saratchandran, and N.

Sundararajan, “Can threshold networks be trained directly?,” IEEE

Transactions on Circuits and Systems II, vol. 53, no. 3, pp. 187-191, 2006.

[94] G.-B. Huang and H. A. Babri, “Upper bounds on the number of hidden

neurons in feedforward networks with arbitrary bounded nonlinear

activation functions,” IEEE Transactions on Neural Networks, vol. 9, pp.

224-228, 1998.

[95] I. Mrazova and D. H. Wang, “Improved generalization of neural classifiers

with enforced internal representation,” Neurocomputing, vol. 70, pp. 2940-

2952, 2007.

[96] D. H. Wang, Y.-S. Kim, S. C. Park, C. S. Lee, and Y. K. Han, “Learning

based neural similarity metrics for multimedia data mining,” Soft

Computing, vol. 11, pp. 335-340, 2007.


[97] D. H. Wang and X. H. Ma, “A hybrid image retrieval system with user's

relevance feedback using neurocomputing,” Informatica, vol. 29, pp. 271-

279, 2005.

[98] P. Bao and D. H. Wang, “An edge-preserving image reconstruction using

neural network,” International Journal of Mathematical Imaging and Vision,

vol. 14, pp. 117-130, 2001.

[99] D. H. Wang and P. Bao, “Enhancing the estimation of plant Jacobian for

adaptive neural inverse control,” Neurocomputing, vol. 34, pp. 99-115, 2000.

[100] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and

regression trees, Wadsworth International, Belmont, CA, 1984.

[101] M. Anthony and P. L. Bartlett, Neural network learning: Theoretical

foundations, Cambridge University Press, Cambridge, 1999.

[102] L. Devroye, L. Gyorfi, and G. Lugosi, A probabilistic theory of pattern

recognition, Springer-Verlag, New York, 1996.

[103] R. Duda and P. Hart, Pattern classification and scene analysis, John Wiley,

New York, 1973.

[104] K. Fukunaga, Introduction to statistical pattern recognition, Academic

Press, New York, 1972.

[105] M. Kearns and U. Vazirani, An introduction to computational learning

theory, MIT Press, Cambridge, Massachusetts, 1994.

[106] J. Proakis and D. Manolakis, Digital signal processing (3rd Edition),

Prentice Hall, 1996.


[107] S. M. Kuo, B. H. Lee, and W. Tian, Real-time digital signal processing,

John Wiley & Sons Ltd, 2007.

[108] A. V. Oppenheim, R. W. Schafer, and J. R. Buck, Discrete-time signal

processing, Prentice Hall, 1999.

[109] E. C. Ifeachor and B. W. Jervis, Digital signal processing: A practical

approach (2nd Edition), Prentice Hall, 2002.

[110] S. S. Rao, Engineering optimization: Theory and practice, John Wiley &

Sons, Inc., 1996.

[111] P. S. Iyer, Operations research, Tata McGraw-Hill, 2008.

[112] F. S. Hillier and G. J. Lieberman, Introduction to operations research,

McGraw-Hill, 2005.

[113] I. W. Sandberg, “Approximation theorems for discrete-time systems,” IEEE

Transactions on Circuits and Systems, vol. 38, no. 5, pp. 564-566, 1991.

[114] K. Ogata, Modern Control Engineering, Prentice Hall PTR, Upper Saddle

River, NJ, 2001.

[115] W. J. Rugh, Linear system theory (2nd Edition), Prentice Hall, 1996.

[116] M. S. Santina, A. R. Stubberud, and G. H. Hostetter, Digital control design,

Saunders College Publishing, 1998.

[117] K. J. Astrom and B. Wittenmark, Computer-controlled systems (3rd Edition),

Prentice Hall, 1997.


[118] S. Dudoit and J. Fridlyand, “Introduction to classification in microarray

experiments,” in D. P. Berrar, W. Dubitzky, M. Granzow, (Eds.), A

Practical Approach to Microarray Data Analysis, Norwell, MA, Kluwer,


[119] Y. Lu and J. Han, “Cancer classification using gene expression data,”

Information Systems, vol. 28, no. 4, pp. 243-268, 2003.

[120] W. Huber, A. C. Heydebreck, and M. Vingron, “Analysis of microarray

gene expression data,” In Martin Bishop et al., editor, Handbook of

Statistical Genetics, Chichester, UK, John Wiley & Sons, Ltd, 2003.

[121] J. Misra, W. Schmitt, D. Hwang, L. Hsiao, S. Gullans, and G.

Stephanopoulos, “Interactive Exploration of Microarray Gene Expression

Patterns in a Reduced Dimensional Space,” Genome Res, vol. 12, no. 7, pp.

1112-1120, 2002.

[122] M. E. Wall, A. Rechtsteiner, and L. M. Rocha, “Singular Value

Decomposition and Principal Component Analysis,” in D. P. Berrar, W.

Dubitzky, M. Granzow, (Eds.), A Practical Approach to Microarray Data

Analysis, Norwell, MA, Kluwer, pp. 91-109, 2003.

[123] X. Liao, N. Dasgupta, S. M. Lin, and L. Carin, “ICA and PLS modelling for

functional analysis and drug sensitivity for DNA microarray signals,” in

Proc. Workshop on Genomic Signal Processing and Statistics, 2002.

[124] A. Chen and J.-C. Hsu, “Exploring novel algorithms for the prediction of

cancer classification,” 2nd International Conference on Software

Engineering and Data Mining, SEDM, pp. 378-383, 2010.


[125] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M.

Angelo, C. Ladd, M. Reich, E. Latulippe, J. Mesirov, T. Poggio, W. Gerald,

M. Loda, E. Lander, and T. Golub, “Multiclass Cancer Diagnosis Using

Tumor Gene Expression Signatures,” in Proc. Natl. Acad. Sci. USA, vol. 98,

no. 26, pp. 15149-15154, 2002.

[126] D. Baboo and M. Sasikala, “Multicategory Classification Using Support

Vector Machine for Microarray Gene Expression Cancer Diagnosis,” Global

Journal of Computer Science and Technology, 2010.

[127] J. Sanchez-Monedero, M. Cruz-Ramirez, F. Fernandez-Navarro, J.

Fernandez, P. Gutierrez, and C. Hervas-Martinez, “On the suitability of

Extreme Learning Machine for gene classification using feature selection,”

10th International Conference on Intelligent Systems Design and

Applications, ISDA 2010, pp. 507-512, Nov. 29 2010-Dec. 1 2010.

[128] A. Bharathi and A. Natarajan, “Microarray gene expression cancer diagnosis

using Machine Learning algorithms,” International Conference on Signal

and Image Processing, ICSIP 2010, pp. 275-280, 15-17 Dec. 2010.

[129] G. Unger and B. Chor, “Linear Separability of Gene Expression Data Sets,”

IEEE/ACM Transactions on Computational Biology and Bioinformatics,

vol. 7, no. 2, pp. 375-381, Apr.-Jun. 2010.

[130] T. Pham, D. Beck, and H. Yan, “Spectral Pattern Comparison Methods for

Cancer Classification Based on Microarray Gene Expression Data,” IEEE

Transactions on Circuits and Systems I: Regular Papers, vol. 53, no. 11,

pp. 2425-2430, Nov. 2006.

[131] Y. Liu, J. Shen, and J. Cheng, “Cancer Classification Based on the

“Fingerprint” of Microarray Data,” The 1st International Conference on

Bioinformatics and Biomedical Engineering, ICBBE 2007, pp. 176-179, 6-8

Jul. 2007.


[132] J. P. Brody, B. A. Williams, B. J. Wold, and S. R. Quake, “Significance and

statistical errors in the analysis of DNA microarray data,” in Proc. Natl.

Acad. Sci. USA, vol. 99, no. 20, pp. 12975-12978, 2002.

[133] G. Arce and Y. Li, “Median power and median correlation theory,” IEEE

Transactions on Signal Processing, vol. 50, no. 11, pp. 2768-2776, Nov.


[134] R. Salakhutdinov, “Learning in Markov random fields using tempered

transitions,” in Y. Bengio, D. Schuurmans, J. Lafferty, C. Williams, A.

Culotta (Eds.), Advances in neural information processing systems, 22,

Cambridge, MA, MIT Press, 2009.

[135] L. Yang, H. Yan, Y. X. Dong, and L. Y. Fei, “A kind of correlation

classification distance of whole phase based on weight,” International

Conference on Environmental Science and Information Application

Technology, ESIAT 2010, vol. 3, pp. 668-671, 17-18 Jul. 2010.

[136] C. Chatfield, The Analysis of Time Series: an Introduction. 6th Ed,

Chapman and Hall, 2004.

[137] A. Ben-Dor, A. Bruhn, N. Friedman, I. Nachman, M. Schummer, and Z.

Yakhini, “Tissue Classification with Gene Expression Profiles,” J.

Computational Biology, vol. 7, no. 3/4, pp. 559-583, 2000.

[138] S. Mukherjee, P. Tamayo, S. Rogers, R. Rifkin, A. Engle, C. Campbell, T.

R. Golub, and J. P. Mesirov, “Estimating Dataset Size Requirements for

Classifying DNA Microarray Data,” J. Computational Biology, vol. 10, no.

2, pp. 119-142, 2003.


[139] Y. Miche, P. Bas, C. Jutten, O. Simula, and A. Lendasse, “A methodology

for building regression models using Extreme Learning Machine: OP-ELM,”

in ESANN 2008, European Symposium on Artificial Neural Networks,

Bruges, Belgium, 2008.

[140] J. Li and H. Liu, “Kent ridge bio-medical data set repository,” School of

Computer Engineering, Nanyang Technological University, Singapore,

2004. [Online]. Available:


[141] A. M. Sarhan, “Cancer classification based on microarray gene expression

data using DCT and ANN,” Journal of Theoretical and Applied Information

Technology (JATIT), vol. 6, no. 2, pp. 208-216, 2009.

[142] A. H. Ali, “Self-Organization Maps for Prediction of Kidney Dysfunction,”

in Proc. 16th Telecommunications Forum TELFOR, Belgrade, Serbia, 2008.

[143] S. Haykin, Adaptive filter theory (Third Edition), Prentice Hall, New Jersey,


[144] X. Yao, “Evolving artificial neural networks,” in Proc. of the IEEE, vol. 87,

no. 9, pp. 1423-1447, 1999.

[145] G. Zhang, “Neural networks for classification: A survey,” IEEE

Transactions on Systems, Man, and Cybernetics, Part C, vol. 30, no. 4, pp.

451-462, 2000.

[146] Z. Man, S. Liu, H. R. Wu, and X. Yu, “A new adaptive back-propagation

algorithm based on Lyapunov stability theory for neural networks,” IEEE

Transactions on Neural Networks, vol. 17, no. 6, pp. 1580-1591, 2006.


[147] G.-B. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme Learning Machine

for Regression and Multiclass Classification,” IEEE Transactions on

Systems, Man, and Cybernetics, Part B: Cybernetics, vol. PP, no. 99, pp. 1-

17, 2011.

[148] C. H. Phillips, J. M. Parr, and E. A. Riskin, Signals, Systems, and

Transforms, Prentice Hall, 2003.

[149] F. Girosi, M. Jones, and T. Poggio, “Regularization theory and neural

networks architectures,” Neural Computation, vol. 7, pp. 219-269, 1995.

[150] S. Knerr, L. Personnaz, and G. Dreyfus, “Handwritten digit recognition by

neural networks with single-layer training,” IEEE Transactions on Neural

Networks, vol. 3, pp. 962–968, 1992.

[151] C. Liu, K. Nakashima, H. Sako, and H. Fujisawa, “Handwritten digit

recognition: benchmarking of state-of-the-art techniques,” Pattern

Recognition, vol. 36, pp. 2271–2285, 2003.

[152] E. Kussul and T. Baidyk, “Improved method of handwritten digit

recognition tested on MNIST database,” Image and Vision Computing, vol.

22, pp. 971–981, 2004.

[153] B. Zhang, M. Fu, and H. Yan, “A nonlinear neural network model of

mixture of local principal component analysis: application to handwritten

digits recognition,” Pattern Recognition, vol. 34, pp. 203–214, 2001.

[154] D. J. Albers, J. C. Sprott, and W. D. Dechert, “Routes to chaos in neural

networks with random weights,” Int. J. of Bifurcation and Chaos, vol. 8, no.

7, pp. 1463-1478, 1998.


[155] W. F. Schmidt, M. A. Kraaijveld, and R. P. W. Duin, “Feed forward neural

networks with random weights,” in Proc. of the 11th IAPR International

Conference on Pattern Recognition, vol. 2, pp. 1–4, 1992.

[156] I. Y. Tyukin and D. V. Prokhorov, “Feasibility of random basis function

approximators for modeling and control,” in Proc. of the IEEE International

Conference on Control Applications (CCA) and Intelligent Control (ISIC),

pp. 1391–1399, 2009.

[157] Y. LeCun, B. Boser, J. S. Denker, R. E. Howard, W. Hubbard, L. D. Jackel,

and D. Henderson, “Handwritten digit recognition with a back-propagation

network,” Advances in neural information processing systems, vol. 2, pp.

396-404, 1990.

[158] L. N. Li, J. H. Ouyang, H. L. Chen, and D. Y. Liu, “A Computer Aided

Diagnosis System for Thyroid Disease Using Extreme Learning Machine,” J.

Med. Syst., 2012. doi:10.1007/s10916-012-9825-3

[159] E. Malar, A. Kandaswamy, D. Chakravarthy, and A. G. Dharan, “A novel

approach for detection and classification of mammographic

microcalcifications using wavelet analysis and extreme learning machine,”

Comp. in Bio. and Med., vol. 42, no. 9, pp. 898-905, 2012.

[160] Y. Song, J. Crowcroft, and J. Zhang, “Automatic epileptic seizure detection

in EEGs based on optimized sample entropy and extreme learning machine,”

Journal of Neuroscience Methods, vol. 210, no. 2, pp. 132–146, 2012.

[161] Q. Yuan, W. Zhou, Y. Liu, and J. Wang, “Epileptic seizure detection with

linear and nonlinear features,” Epilepsy and Behavior, vol. 24, no. 4, pp.

415–421, 2012.

[162] L.-C. Shi and B.-L. Lu, “EEG-based vigilance estimation using extreme

learning machines,” Neurocomputing, vol. 102, pp. 135–143, 2013.


[163] S. Decherchi, P. Gastaldo, R. Zunino, E. Cambria, and J. Redi, “Circular-

ELM for the reduced-reference assessment of perceived image

quality,” Neurocomputing, vol. 102, pp. 78-89, 2013.

[164] L. Wang, Y. Huang, X. Luo, Z. Wang, and S. Luo, “Image deblurring with

filters learned by extreme learning machine,” Neurocomputing, vol. 74, no.

16, pp. 2464–2474, 2011.

[165] J. Yang, S. Xie, D. Park, and Z. Fang, “Fingerprint Matching based on

Extreme Learning Machine,” Neural Computing and Applications, vol. 14,

pp. 1–11, 2012.

[166] W. Zong and G.-B. Huang, “Face recognition based on extreme learning

machine,” Neurocomputing, vol. 74, no. 16, pp. 2541–2551, 2011.

[167] I. Marqués and M. Graña, “Face recognition with lattice independent

component analysis and extreme learning machines,” Soft Comput., vol. 16,

no. 9, pp. 1525–1537, 2012.

[168] Y. Yu, T.-M. Choi, and C.-L. Hui, “An intelligent fast sales forecasting

model for fashion products,” Expert Syst. Appl., vol. 38, no. 6, pp. 7373–

7379, 2011.

[169] M. Xia, Y. Zhang, L. Weng, and X. Ye, “Fashion retailing forecasting based

on extreme learning machine with adaptive metrics of inputs,” Knowl.-

Based Syst., vol. 36, pp. 253–259, 2012.

[170] F. L. Chen and T. Y. Ou, “Sales forecasting system based on Gray extreme

learning machine with Taguchi method in retail industry,” Expert Syst.

Appl., vol. 38, no. 3, pp. 1336–1345, 2011.

[171] Y. Xu, Y. Dai, Z. Y. Dong, R. Zhang, and K. Meng, “Extreme learning

machine-based predictor for real-time frequency stability assessment of

electric power systems,” Neural Computing and Applications, vol. 22, no.

3–4, pp. 501-508, 2013.


[172] S. Wu, Y. Wang, and S. Cheng, “Extreme learning machine based wind

speed estimation and sensorless control for wind turbine power generation

system,” Neurocomputing, vol. 102, pp. 163–175, 2013.

[173] J. Tang, D. Wang, and T. Chai, “Predicting mill load using partial least

squares and extreme learning machines,” Soft Comput., vol. 16, no. 9, pp.

1585–1594, 2012.

[174] H. Wang, G. Qian, and X. Q. Feng, “Predicting consumer sentiments using

online sequential extreme learning machine and intuitionistic fuzzy sets,”

Neural Comput Appl., 2012. doi:10.1007/s00521-012-0853-1

[175] W. Zheng, Y. Qian, and H. Lu, “Text categorization based on regularization

extreme learning machine,” Neural Computing and Applications, vol. 22, no.

3–4, pp. 447–456, 2013.

[176] H.-J. R., and G.-S. Zhao, “Direct adaptive neural control of nonlinear

systems with extreme learning machine,” Neural Computing and

Applications, vol. 22, no. 3–4, pp. 577–586, 2013.

[177] Y. Yang, Y. Wang, X. Yuan, Y. Chen, and L. Tan, “Neural network-based

self-learning control for power transmission line deicing robot,” Neural

Computing & Applications, 2012. doi:10.1007/s00521-011-0789-x

[178] A. K. Jain, P. W. Duin, and J. Mao, “Statistical pattern recognition: a

review,” IEEE Trans Pattern Anal Machine Intell, vol. 22, pp. 4–37, 2000.

[179] J. Anderson, A. Pellionisz, and E. Rosenfeld, Neurocomputing 2: Directions

for Research, Cambridge Mass., MIT Press, 1990.

[180] X. Liu, C. Gao, and P. Li, “A comparative analysis of support vector

machines and extreme learning machines,” Neural Networks, vol. 33, pp.

58–66, 2012.

[181] R. A. Fisher, “The use of multiple measurements in taxonomic problems,”

Annals of Eugenics 7, Part II, pp. 179–188, 1936.


[182] H. Kim, B. L. Drake, and H. Park, “Multiclass classifiers based on

dimension reduction with generalized lda,” Pattern Recogn., vol. 40, no. 11,

pp. 2939–2945, 2007.

[183] M. D. Richard, and R. Lippmann, “Neural network classifiers estimate

Bayesian a posteriori probabilities,” Neural Comput., vol. 3, pp. 461–483, 1991.


[184] P. Gallinari, S. Thiria, R. Badran, and F. Fogelman-Soulie, “On the

relationships between discriminant analysis and multilayer perceptrons,”

Neural Networks, vol. 4, pp. 349–360, 1991.

[185] P. L. Bartlett, “The sample complexity of pattern classification with neural

networks: The size of the weights is more important than the size of the

network,” IEEE Trans. Inf. Theory, vol. 44, no. 2, pp. 525–536, Mar. 1998.

[186] R. Wang, S. Kwong, and X. Wang, “A study on random weights between

input and hidden layers in extreme learning machine,” Soft Comput., vol. 16,

no. 9, pp. 1465–1475, 2012.

[187] P. Horata, S. Chiewchanwattana, and K. Sunat, “A comparative study of

pseudo-inverse computing for the extreme learning machine classifier,”

Data Mining and Intelligent Information Technology Applications (ICMiA),

2011 3rd International Conference on, pp. 40–45, 24–26 Oct. 2011.

[188] X.-K. Wei, Y.-H. Li, and Y. Feng, “Comparative Study of Extreme

Learning Machine and Support Vector Machine,” Proceedings of

International Conference on Intelligent Sensing and Information Processing,

pp. 1089–1095, 2006.

[189] Q. Liu, Q. He, and Z. Shi, “Extreme support vector machine classifier,”

Lecture Notes in Computer Science, vol. 5012, pp. 222–233, 2008.


[190] B. Frénay, and M. Verleysen, “Using SVMs with randomised feature spaces:

An extreme learning approach,” in Proc. 18th ESANN, Bruges, Belgium, pp.

315–320, Apr. 28–30, 2010.

[191] A. M. Sarhan, “A Comparison of Vector Quantization and Artificial Neural

Network Techniques in Typed Arabic Character Recognition,” International

Journal of Applied Engineering Research (IJAER), vol. 4, no. 5, pp. 805–817,

May 2009.

[192] A. M. Sarhan, and O. I. Al-Helalat, “A Novel Approach to Arabic Characters

Recognition Using A Minimum Distance Classifier,” In Proceedings of the

World Congress on Engineering, London, U.K, July 2007.



Matlab Codes

1.1 Matlab code for FIR-ELM

function [FIRELMrmse, FIRELMstd, ELMrmse, ELMstd] = ...
    FIRELM(sample, target, filmat, reg_d, reg_r, test_snr, numite)

% Finite Impulse Response Extreme Learning Machine algorithm
%
% [FIRELMrmse, FIRELMstd, ELMrmse, ELMstd] = ...
%     FIRELM(sample, target, filmat, reg_d, reg_r, test_snr, numite)
%
% Inputs:
%   sample     is a M x N sample pattern matrix with:
%              M sample pattern vectors of length N
%
%   target     is a M x 1 class label matrix with:
%              M scalar target labels
%
%   filmat     is a N x n hidden layer matrix where:
%              each column is a N x 1 FIR filter
%              such that n is the number of neurons
%
%   reg_d      is the first regularization parameter
%              such that the regularization ratio is d/r
%
%   reg_r      is the second regularization parameter
%              such that the regularization ratio is d/r
%
%   test_snr   is the SNR value in dB for the sample
%              patterns during testing
%
%   numite     is the number of iterations to run the
%              testing with noisy samples
%
% Outputs:
%   FIRELMrmse is the mean RMSE for the FIR-ELM after
%              numite iterations
%
%   FIRELMstd  is the standard deviation of the FIR-ELM
%              RMSE values over numite iterations
%
%   ELMrmse    is the mean RMSE for the ELM after
%              numite iterations
%
%   ELMstd     is the standard deviation of the ELM
%              RMSE values over numite iterations
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao,
%              “A new robust training algorithm for a class of
%              single hidden layer neural networks,” Neurocomputing,
%              vol. 74, pp. 2491-2501, 2011.
%
% Email:
%
% Copyright (C) 2011 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%

%% Set up experiment parameters
rpt = numite;
Am = sample;
Output = target;
xaxis = [1:size(Am,1)];

%% Neural network parameters
NumData = size(Am,1);
NumInput = size(Am,2);
NumNeuron = size(filmat,2);

%% Train FIR-ELM
% initialize hidden layer weights using designed filter matrix
Nw = filmat;

% initialize random biases matrix
B = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    B(:,i) = 1*rand() - .5;
end

% solve for output layer weights
NG = Am*Nw + B;
d = reg_d;
r = reg_r;
Beta2 = inv((eye(size(NG'*NG)).*(d/r)) + NG'*NG)*NG'*Output;

%% Train ELM
% initialize hidden layer weights using random method
W = zeros(NumInput,NumNeuron);
for j = 1:NumInput
    for i = 1:NumNeuron
        W(j,i) = 2*rand() - 1;
    end
end

% initialize random biases matrix
b = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    b(:,i) = 2*rand() - 1;
end

% solve for output layer weights
G = Am*W + b;
Hinv = pinv(G);
Beta = Hinv*Output;

%% Testing phase
% calculate standard output for FIR-ELM
A = Am;
NG = A*Nw + B;
O2D = NG*Beta2;

% perform testing with noisy samples
for q = 1:rpt

    % Add noise (awgn requires the Communications Toolbox)
    A = awgn(Am,test_snr,'measured');

    % @Random Case@
    G = A*W + b;
    O1 = G*Beta;

    % @FIR Filter@
    NG = A*Nw + B;
    O2 = NG*Beta2;

    % Calculate RMSE of random case
    count = 0;
    error = zeros(1,length(xaxis));
    for t = xaxis
        count = count+1;
        error(count) = Output(count)-O1(count);
    end
    RMSE_Random(q) = sqrt((sum(error.^2))/count);

    % Plot output of random case
    % subplot(2,2,2);
    figure(1)
    plot(xaxis,Output,'-.',xaxis,O1,'o');grid on;
    title('ELM')
    axis([0 11 0.6 2.5]);
    legend('target','output');hold on;
    % subplot(2,2,4);
    figure(2)
    plot(xaxis,error);title('Error ELM');
    axis([0 11 -1 1]);
    grid on;hold on;

    % Calculate RMSE of FIR case
    count = 0;
    error2 = zeros(1,length(xaxis));
    for t = xaxis
        count = count+1;
        error2(count) = Output(count)-O2(count);
    end
    RMSE_FIR(q) = sqrt((sum(error2.^2))/count);

    % Plot output of pole placement case
    % subplot(2,2,1);
    figure(3)
    plot(xaxis,O2D,'-.',xaxis,O2,'o');title('FIR-ELM');grid on;
    axis([0 11 0.6 2.5]);
    legend('target','output');hold on;
    % subplot(2,2,3);
    figure(4)
    plot(xaxis,error2);title('Error FIR-ELM')
    axis([0 11 -1 1]);
    grid on;hold on;

end

% Summary statistics over the rpt noisy test runs
ELMrmse = mean(RMSE_Random);
FIRELMrmse = mean(RMSE_FIR);
ELMstd = std(RMSE_Random);
FIRELMstd = std(RMSE_FIR);

%**************************************************************
% End of Code: FIRELM.m
%**************************************************************
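The output-layer weights in the listing above are obtained by forming an explicit inverse with inv. The following is a minimal sketch, not part of the original code, of the same regularized least-squares solution computed with the backslash operator, which is usually preferred for solving linear systems. The names H, T, lambda, beta_inv and beta_bs and the toy data are assumptions chosen for illustration.

```matlab
% Sketch: regularized least-squares output weights without inv().
% H, T, lambda and the toy data are illustrative assumptions.
rng(0);                          % reproducible toy data
H = rand(20, 5);                 % hidden layer output matrix (M x n)
T = rand(20, 1);                 % target vector (M x 1)
d = 1;                           % first regularization parameter
r = 100;                         % second regularization parameter
lambda = d / r;                  % regularization ratio d/r

n = size(H, 2);
A = lambda * eye(n) + H' * H;    % n x n regularized normal matrix

beta_inv = inv(A) * H' * T;      % form used in the listing
beta_bs  = A \ (H' * T);         % same solution via a linear solve

% Largest elementwise difference between the two solutions
maxDiff = max(abs(beta_inv - beta_bs));
```

On a well-conditioned problem such as this one the two results should agree to numerical precision; the backslash form avoids computing the inverse explicitly.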


1.2 Matlab code for DFT-ELM

function [DFTELMacc] = DFTELM(trainSet, testSet, tt, ts, HT, ...
    reg_d1, reg_r1, reg_d2, reg_r2)

% Discrete Fourier Transform Extreme Learning Machine algorithm
%
% [DFTELMacc] = DFTELM(trainSet, testSet, tt, ts, HT, reg_d1, ...
%     reg_r1, reg_d2, reg_r2)
%
% Inputs:
%   trainSet  is a M1 x N sample pattern matrix with:
%             M1 sample training pattern vectors of
%             length N
%
%   testSet   is a M2 x N sample pattern matrix with:
%             M2 sample testing pattern vectors of
%             length N
%
%   tt        is the training class label matrix with:
%             M1 target label vectors (1-of-C)
%
%   ts        is the testing class label matrix with:
%             M2 target label vectors (1-of-C)
%
%   HT        is a M1 x n target frequency spectrum
%             matrix for the SLFN hidden layer outputs,
%             where n is the number of neurons
%
%   reg_d1    is the first regularization parameter
%             for the hidden layer weights
%             (used in d1/r1)
%
%   reg_r1    is the second regularization parameter
%             for the hidden layer weights
%             (used in d1/r1)
%
%   reg_d2    is the first regularization parameter
%             for the output layer weights
%             (used in d2/r2)
%
%   reg_r2    is the second regularization parameter
%             for the output layer weights
%             (used in d2/r2)
%
% Output:
%   DFTELMacc is the accuracy (%) obtained for one
%             iteration of training and testing
%
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and S-Y. Khoo,
%              “A robust single–hidden layer feedforward network
%              based pattern classifier,” IEEE TRANSACTIONS ON NEURAL
%              NETWORKS AND LEARNING SYSTEMS, 23(12), pp. 1974-1986,
%              2012.
%
% Email:
%
% Copyright (C) 2012 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%

%% Set up experiment parameters
% Neural network parameters
[x,y] = size(trainSet);
[x2,y2] = size(testSet);

NumData = x;
NumInput = y;
NumNeuron = size(HT,2);

%% Train DFT-ELM
% Create A-bar matrix (j is the imaginary unit)
for a = 0:NumNeuron-1
    for b = 0:NumNeuron-1
        Abar(a+1,b+1) = exp(-j*((2*pi)/NumNeuron)*a*b);
    end
end

% Calculate hidden layer target matrix
HTD = HT*Abar;

% Calculate hidden layer weights
reg_d1 = 1;
reg_r1 = 100;
W =

% Actual hidden layer output with new weights
G1 = trainSet*W;
G2 = abs(G1*Abar);
G = G2;

% Activation function -> linear
G = G;

% Calculate output layer weights
reg_d2 = 1;
reg_r2 = 100;
Beta = inv((eye(size(G'*G)).*(reg_d2/reg_r2)) + G'*G)*G'*tt;

%% Testing phase
% Calculate test data output
for i = 1:x2
    NG1(i,:) = testSet(i,:)*W;
end

NG2 = abs(NG1*Abar);

OTELM = NG2*Beta;

% Calculate classification accuracy
count = 0;

for i = 1:x2
    target = find(ts(i,:) == max(ts(i,:)));
    output = find(OTELM(i,:) == max(OTELM(i,:)));

    if(output == target)
        count = count + 1;
    end
end

DFTELMacc = (count/x2) * 100;

%**************************************************************
% End of Code: DFTELM.m
%**************************************************************
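The A-bar matrix in the listing above is filled entry by entry with exp(-j*(2*pi/NumNeuron)*a*b), which is the discrete Fourier transform matrix. The following is a minimal sketch, not part of the original code, that builds the same matrix in vectorized form and checks it against the built-in fft. The size N, the test vector x and the test matrix G1 are illustrative assumptions; 1j is used so that the imaginary unit does not depend on j being unassigned.

```matlab
% Sketch: vectorized construction of the N x N DFT matrix.
% N, x and G1 are illustrative assumptions.
N = 8;                                   % stands in for NumNeuron
k = 0:N-1;
Abar = exp(-1j * (2*pi/N) * (k' * k));   % entry (a+1,b+1) = exp(-j*2*pi*a*b/N)

% Right-multiplying a row vector by Abar gives its DFT.
x = (1:N) .^ 2;
maxErr = max(abs(x * Abar - fft(x)));

% For a matrix, G1*Abar applies the DFT along each row.
rng(0);
G1 = rand(3, N);
rowErr = max(max(abs(G1 * Abar - fft(G1, [], 2))));
```

Both maxErr and rowErr should be at round-off level, which suggests that the products with Abar in the listing could equivalently be computed with fft along the second dimension.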


1.3 Matlab code for OWLM

function [OWLMacc] = OWLM(trainSet, testSet, tt, ts, NumNeuron, ...
    reg_d1, reg_r1, reg_d2, reg_r2)

% Optimal Weights Learning Machine algorithm
%
% [OWLMacc] = OWLM(trainSet, testSet, tt, ts, NumNeuron, ...
%     reg_d1, reg_r1, reg_d2, reg_r2)
%
% Inputs:
%   trainSet  is a M1 x N sample pattern matrix with:
%             M1 sample training pattern vectors of
%             length N
%
%   testSet   is a M2 x N sample pattern matrix with:
%             M2 sample testing pattern vectors of
%             length N
%
%   tt        is the training class label matrix with:
%             M1 target label vectors (1-of-C)
%
%   ts        is the testing class label matrix with:
%             M2 target label vectors (1-of-C)
%
%   NumNeuron is the number of hidden layer neurons set
%             for the SLFN to be trained by OWLM
%
%   reg_d1    is the first regularization parameter
%             for the hidden layer weights
%             (used in d1/r1)
%
%   reg_r1    is the second regularization parameter
%             for the hidden layer weights
%             (used in d1/r1)
%
%   reg_d2    is the first regularization parameter
%             for the output layer weights
%             (used in d2/r2)
%
%   reg_r2    is the second regularization parameter
%             for the output layer weights
%             (used in d2/r2)
%
% Output:
%   OWLMacc   is the accuracy (%) obtained for one
%             iteration of training and testing
%
%
% Publication: Z. Man, K. Lee, D. H. Wang, Z. Cao, and S-Y. Khoo,
%              “An optimal weight learning machine for handwritten
%              digit image recognition,” Signal Processing, 93(6),
%              1624-1638, 2013.
%
% Email:
%
% Copyright (C) 2012 by Kevin Hoe Kwang Lee.
%
% This function is free software; you can redistribute it and/or
% modify it under the terms of the GNU General Public License as
% published by the Free Software Foundation; either version 2 of
% the License, or any later version.
%
% The function is distributed in the hope that it will be useful,
% but WITHOUT ANY WARRANTY; without even the implied warranty of
% MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
% General Public License for more details.
%

%% Set up experiment parameters
% Neural network parameters
[x,y] = size(trainSet);
[x2,y2] = size(testSet);

NumData = x;
NumInput = y;

%% Train OWLM
% Initialize random weights matrix
scaleW = 1; % scaling of weights magnitude
            % eg: 1 = [1 -1], 10 = [10 -10]
W = zeros(NumInput,NumNeuron);
for j = 1:NumInput
    for i = 1:NumNeuron
        W(j,i) = 2*rand() - 1;
    end
end
W = W*scaleW;

% Initialize random biases matrix
scaleb = 1; % scaling of bias magnitude
            % eg: 1 = [1 -1], 10 = [10 -10]
b = zeros(NumData,NumNeuron);
for i = 1:NumNeuron
    b(:,i) = 2*rand() - 1;
end
b = b*scaleb;

% random weights hidden layer output
HELM = trainSet*W + b;

% activation function -> tansig
HELM = tansig(HELM);

% calculate new hidden layer weights
reg_d1 = 1;
reg_r1 = 100;
CW = inv((eye(size(trainSet'*trainSet)).*(reg_d1/reg_r1)) +

% calculate optimized hidden layer output
Hnew = trainSet*CW;

% calculate output layer weights
reg_d2 = 1;
reg_r2 = 100;
Beta = inv((eye(size(Hnew'*Hnew)).*(reg_d2/reg_r2)) +



%% Testing phase
% calculate test data output
NG1 = testSet*CW;
OTELM = NG1*Beta;

% calculate classification accuracy
count = 0;

for i = 1:x2
    target = find(ts(i,:) == max(ts(i,:)));
    output = find(OTELM(i,:) == max(OTELM(i,:)));

    if(output == target)
        count = count + 1;
    end
end

OWLMacc = (count/x2) * 100;

%**************************************************************
% End of Code: OWLM.m
%**************************************************************
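The accuracy loops in the listings above locate the winning class with find(... == max(...)), which returns more than one index when a row contains tied maxima. The following is a minimal sketch, not part of the original code, of the same winner-takes-all accuracy computed with the second output of max, which always returns a single index per row (the first maximum). The toy label matrix ts and output matrix OTELM are assumptions made up for illustration.

```matlab
% Sketch: winner-takes-all accuracy from 1-of-C targets.
% The toy ts and OTELM matrices are illustrative assumptions.
ts    = [1 0 0; 0 1 0; 0 0 1; 1 0 0];    % true classes: 1, 2, 3, 1
OTELM = [0.9 0.1 0.0; ...
         0.2 0.7 0.1; ...
         0.6 0.3 0.1; ...
         0.4 0.5 0.1];                   % network outputs, one row per sample

[~, target] = max(ts, [], 2);            % index of the true class per row
[~, output] = max(OTELM, [], 2);         % index of the predicted class per row

acc = mean(output == target) * 100;      % accuracy in percent, 50 for this data
```

Here the first two samples are classified correctly and the last two are not, so acc is 50.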


List of Publications

Journal Papers

1. Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A new robust training

algorithm for a class of single hidden layer neural networks,” Neurocomputing,

vol. 74, pp. 2491-2501, 2011.

2. K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of Bioinformatics

Dataset using Finite Impulse Response Extreme Learning Machine for Cancer

Diagnosis,” Neural Computing & Applications, Available online 30 Jan. 2012,

Doi: 10.1007/s00521-012-0847-z.

3. Z. Man, K. Lee, D. H. Wang, Z. Cao, and S. Khoo, “An optimal weight learning

machine for handwritten digit image recognition,” Signal Processing, Available

online 1 Aug. 2012, Doi: 10.1016/j.sigpro.2012.07.016.

4. Z. Man, K. Lee, D. H. Wang, Z. Cao, and S. Khoo, “A Robust Single-Hidden

layer Feedforward Neural Network based Signal Classifier,” under minor

revision at IEEE Transactions on Neural Networks and Learning Systems, 2012.

Conference Papers

5. Z. Man, K. Lee, D. H. Wang, Z. Cao, and C. Miao, “A modified ELM algorithm

for single-hidden layer feedforward neural networks with linear nodes,” 6th

IEEE Conference on Industrial Electronics and Applications, ICIEA 2011, pp.

2524-2529, 21-23 Jun. 2011.


6. K. Lee, Z. Man, D. H. Wang, and Z. Cao, “Classification of microarray datasets

using finite impulse response extreme learning machine for cancer diagnosis,”

37th Annual Conference on IEEE Industrial Electronics Society, IECON 2011,

pp. 2347-2352, 7-10 Nov. 2011.