Chapter 1 Soft computing algorithms - Shodhganga :...
Chapter 1
Soft computing algorithms
It is indeed a surprising and fortunate fact that nature can be
expressed by relatively low-order mathematical functions.
Rudolf Carnap
Mitchell [Mitchell, 1997] defines the machine learning process as:
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P, if its performance at tasks in T,
as measured by P, improves with experience E.
The paradigms of the machine learning field can be labeled depending on how E, P and T are defined. In this research work, they are defined as follows. Task T: recognize amyloid motifs within protein sequences. Performance measure P: percentage of amyloid fibril forming motifs correctly classified. Training experience E: a dataset of positive and negative hexamers with given classifications.
One of the main categories of machine learning is supervised learning: the learning procedure that gives the learner direct feedback about the correctness of its performance [Penarroya, 2004]. There is always some kind of tutor involved in this process. The proposed computational approach relies exclusively on supervised learning algorithms. The task can be defined formally as follows: given a set of instances D = {d1, d2, .., dn}, each labeled with one of the classes C = {c1, c2}, the classification task is to generate a theory T from D and C such that, when a new unlabeled instance is given, T can predict the class of this instance. The life cycle of this learning system consists of two phases: training and exploitation. The exploitation phase is simulated by splitting the set of labeled examples into two non-overlapping sets, the training set and the test set. The test set is used to validate that the generated theory T is correct, which means that the learning system has been able to model the concept represented by the instances of D in the training set. Generalization ability is the capacity to build a theory that correctly models the concepts represented by the training set. Good performance on the test set is a sign of generalization.
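The split into two non-overlapping sets can be sketched as follows. This is a minimal illustration only: the function name `train_test_split`, the 25% test fraction and the placeholder instances are assumptions, not the procedure used in this work.

```python
import random

def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle the labeled instances and cut them into two disjoint sets."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    # The first n_test shuffled instances form the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]

# Placeholder instances (identifier, class label); not real hexamer data.
labeled = [("d%d" % i, 1 if i % 2 else -1) for i in range(8)]
train_set, test_set = train_test_split(labeled)
```

A theory T would be built from `train_set` only and its predictions compared against the labels in `test_set`.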
So far, the classification task has been discussed from the viewpoint of constructing a theory that models the concepts represented by examples. It is also important to consider which knowledge representation is used to generate this theory. The knowledge representation could take the form of a set of instances and/or a set of rules corresponding to the attributes of the input data samples. The instances or data samples processed by the learning algorithm have a consistent form: each data instance comprises a finite and fixed set of components. These components are called attributes. An attribute is a feature or property that characterizes the instance. Sometimes the generalization capacity of the learning system deteriorates. This could be due to missing attribute values in an instance or to wrongly labeled instances in the training set. In such scenarios, soft computing algorithms are particularly useful. Soft computing, a term coined by Lotfi Zadeh (1994), is defined as a consortium of methodologies that work synergistically and provide, in one form or another, flexible information processing capabilities for handling ambiguous real-life situations. Unlike conventional (hard) computing, its aim is to exploit the tolerance for imprecision, uncertainty, approximate reasoning and partial truth in order to achieve tractability, robustness, low solution cost and, more importantly, close resemblance to human-like decision making [Mitra and Hayashi, 2006].
The ability of soft computing algorithms to handle imprecision and uncertainty in large and complex search spaces makes them well suited to solving bioinformatics problems [Jena et al., 2009]. The kind of theory that can be generated and the kind of predictions that can be made are affected by the chosen learning algorithm. The following sections discuss the background of the soft computing algorithms used in this research.
1.1 Neural network
A Neural Network (NN) is a mathematical or computational model influenced by the structure and behavior of the biological neurons of the human brain. An NN is viewed as a massively parallel distributed processor, made up of processing units called neurons, which acquires experiential knowledge from its environment. This is achieved through a learning process. The acquired knowledge is then stored in the form of inter-neuron connection strengths, known as synaptic weights. Like a human brain, the network acquires knowledge by learning through experience. The procedure used for learning is called the learning algorithm; it modifies the synaptic weights of the network to achieve the desired objective.
As shown in Figure 1.1, each neuron is a simple processing unit consisting of a summing unit and an activation function. The summing unit calculates the sum of the inputs and a bias value after multiplication with the corresponding weight factors. The output of the summing unit can be expressed as the inner product (dot product) of the input vector and the weight vector. This is sometimes referred to as the induced local field.
Figure 1.1: A sample neuron
The activation function limits the weighted sum to the required range. The activation function may be linear or nonlinear. The output of a neuron in terms of the induced local field v is defined by the activation function, denoted by ψ(v) [Haykin, 2005]. One of the most common activation functions is the sigmoid function, which is defined by
$$\psi(v) = \frac{1}{1 + \exp(-av)} \tag{1.1}$$
where a is the slope parameter of the sigmoid function. The values assumed by the sigmoid function range from 0 to 1.
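The neuron described above (summing unit followed by the sigmoid of equation 1.1) can be sketched in a few lines. The function names and the example numbers are illustrative assumptions.

```python
import math

def induced_local_field(inputs, weights, bias):
    """Summing unit: dot product of inputs and weights, plus the bias."""
    return sum(w * x for w, x in zip(weights, inputs)) + bias

def sigmoid(v, a=1.0):
    """Sigmoid activation of equation 1.1 with slope parameter a."""
    return 1.0 / (1.0 + math.exp(-a * v))

def neuron_output(inputs, weights, bias, a=1.0):
    return sigmoid(induced_local_field(inputs, weights, bias), a)

# Example values chosen so that the induced local field is exactly zero.
out = neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)
```

With an induced local field of zero, the output sits at the midpoint of the sigmoid's range; larger values of a make the transition around that point steeper.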
Neural networks can be broadly classified into two types: feed forward networks and recurrent neural networks. In a feed forward network, data flows strictly forward from the input units to the output units. The processing of data can extend over multiple layers of units, but there are no feedback connections from the outputs of units to the inputs of units in the same layer or previous layers. A multi-layer feed forward network consists of an input layer, one or more hidden layers and an output layer. The neurons in the input layer connect the elements of the input vector to the next layer of the network. The hidden layers do the computation. Usually the activation function of the hidden layers is nonlinear. The output of the neurons in each layer is given as the input to the next layer. The set of outputs from the output layer forms the overall response of the network to the given input pattern. A recurrent network, on the other hand, is a dynamic network which contains feedback connections. Feedback exists in a dynamic system whenever the input applied to a particular element is influenced in part by the output of that element, which results in one or more closed paths for signal transmission.
Perceptron
As suggested by Kecman [Kecman, 2004], the perceptron was one of the first processing elements that was able to learn. The learning paradigm used is an iterative supervised one. In a supervised adaptation scheme, an initial random weight vector $w_1$ is chosen and the perceptron is given a randomly chosen data pair: an input $x_1$ and its desired output $d_1$. The perceptron learning algorithm is an error-correction rule that changes the weights in proportion to the error $e_1 = d_1 - o_1$ between the desired output $d_1$ and the actual output $o_1$. The new weights are calculated using the simple rule $w_2 = w_1 + \Delta w_1 = w_1 + \eta (d_1 - o_1) x_1$, where $w_1$ and $w_2$ are the previous and new weights respectively and η is the learning rate. The next data pair is drawn randomly from the dataset and the whole scheme is repeated. The error is gradually reduced over the iterations and, for linearly separable data, reaches zero. For an input vector x, the perceptron computes
$$u = \sum_{i=1}^{n+1} w_i x_i = w_1 x_1 + w_2 x_2 + \dots + w_n x_n + w_{n+1} x_{n+1} \tag{1.2}$$
and produces an output of +1 if u is positive and -1 if u is negative:
$$o = \mathrm{sign}(u) = \mathrm{sign}\left(\sum_{i=1}^{n+1} w_i x_i\right) \tag{1.3}$$
where sign stands for the signum function, i.e.,
$$o = \mathrm{sign}(u) = \begin{cases} +1 & \text{for } u > 0 \\ 0 & \text{for } u = 0 \\ -1 & \text{for } u < 0 \end{cases} \tag{1.4}$$
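The error-correction rule and equations 1.2 to 1.4 can be illustrated with a short sketch. It makes several assumptions that are not in the text: the weights start at zero instead of random values, the data pairs are visited in a fixed order rather than drawn randomly, and the toy dataset is the logical AND function with the constant input $x_{n+1} = 1$ appended.

```python
def sign(u):
    """Signum function of equation 1.4."""
    return 1 if u > 0 else (-1 if u < 0 else 0)

def perceptron_output(w, x):
    """Equations 1.2 and 1.3: o = sign(sum_i w_i x_i)."""
    return sign(sum(wi * xi for wi, xi in zip(w, x)))

def train_perceptron(data, eta=0.1, max_epochs=100):
    w = [0.0] * len(data[0][0])          # assumption: zero initial weights
    for _ in range(max_epochs):
        mistakes = 0
        for x, d in data:                # assumption: fixed presentation order
            o = perceptron_output(w, x)
            if o != d:
                # w_new = w_old + eta * (d - o) * x
                w = [wi + eta * (d - o) * xi for wi, xi in zip(w, x)]
                mistakes += 1
        if mistakes == 0:                # every pair classified correctly
            break
    return w

# Logical AND with labels in {-1, +1}; the last input is the constant bias term 1.
and_data = [([0, 0, 1], -1), ([0, 1, 1], -1), ([1, 0, 1], -1), ([1, 1, 1], 1)]
w = train_perceptron(and_data)
```

Because AND is linearly separable, the loop stops once an epoch passes without any weight change.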
Multilayer perceptron
A multilayer perceptron is a feed forward neural network comprising multiple layers of perceptrons. With sigmoid activation functions in the hidden layer, it can approximate a nonlinear mapping $R^n \to R^1$. In a multilayer perceptron, $x_{n+1}$ is the constant term equal to 1, called the bias. The bias weights vector b can simply be integrated into the hidden layer weights matrix v as its last column.
A multilayer perceptron is a representative of a nonlinear basis function expansion
$$o = f_a(x, w, v) = \sum_{i=1}^{N} w_i \psi_i(x, v_i) \tag{1.5}$$
where the $\psi_i(x, v_i)$ are a set of given functions such as sigmoid functions, o is the output from the model, and N is the number of hidden layer neurons. The output layer weights vector w and the hidden layer weights matrix v are the free parameters that are subject to learning. The input vector x, bias weights vector b, hidden layer weights matrix v and output weights vector w are as follows:
$$x = [x_1, x_2, \dots, x_n]^T \tag{1.6}$$
$$v = [v_{i,j}]; \quad i = 1, 2, \dots, n; \; j = 1, 2, \dots, J \tag{1.7}$$
$$b = [b_1, b_2, \dots, b_J]^T \tag{1.8}$$
$$w = [w_1, w_2, \dots, w_J, w_{J+1}]^T \tag{1.9}$$
The output layer neurons may also have sigmoid activation functions, mostly for classification tasks.
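A forward pass through such a network, following equation 1.5 with sigmoid basis functions and a linear output neuron, might look like the sketch below. The function name `mlp_forward`, the row-per-neuron weight layout and the example weights are assumptions made for illustration.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def mlp_forward(x, V, w):
    """x: n inputs.
    V: J rows of n + 1 hidden weights (last column holds the bias b_j).
    w: J + 1 output weights (last entry is the output bias w_{J+1})."""
    x_aug = list(x) + [1.0]                       # append x_{n+1} = 1
    y = [sigmoid(sum(v * xi for v, xi in zip(row, x_aug))) for row in V]
    y_aug = y + [1.0]                             # constant input for w_{J+1}
    return sum(wj * yj for wj, yj in zip(w, y_aug))

# Two inputs, two hidden neurons; all hidden weights zero, so each hidden output is 0.5.
V = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w = [1.0, 1.0, 0.5]
o = mlp_forward([0.3, -0.7], V, w)
```

For classification, the returned sum could additionally be passed through a sigmoid, as noted above for output layer neurons.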
Back propagation
In this algorithm, the error signal of each output neuron is calculated as the difference between the desired output and the actual output, and is subsequently used to modify the weight vector of the output layer. Since the desired outputs of the hidden layer neurons are not known, the error signals for the hidden layer neurons are calculated by back propagating the error signals $\delta_o$ of the output layer neurons. The weight change $\Delta v_{ji}$ is obtained by
$$\Delta v_{ji} = \eta f'_j(u_j) x_i \sum_{k=1}^{K} \delta_{ok} w_{kj}; \quad j = 1, \dots, J-1; \; i = 1, \dots, I \tag{1.10}$$
This is the generalization of the delta learning rule and explains how the hidden layer weights are to be modified. In each iteration step, the weight $v_{ji}$ is adjusted using the equation
$$v_{ji} = v_{ji} + \Delta v_{ji} = v_{ji} + \eta f'_j(u_j) x_i \sum_{k=1}^{K} \delta_{ok} w_{kj}; \quad j = 1, \dots, J-1; \; i = 1, \dots, I \tag{1.11}$$
In vector notation,
$$V = V + \eta \delta_y x^T \tag{1.12}$$
where V is a (J−1)×I matrix, and x and $\delta_y$ are I×1 and (J−1)×1 vectors respectively.
The back propagation algorithm is briefly outlined below.
Given is a set of P measured data pairs used for training:
$X = \{x_p, d_p, \; p = 1, \dots, P\}$
input vector, $x = [x_1, x_2, \dots, x_n, +1]^T$
output vector, $d = [d_1, d_2, \dots, d_K]^T$
The algorithm has two sections: feed forward and back propagation.
Feed forward section
1. Choose the learning rate η and predefine the maximally allowed error $E_{des}$.
2. Initialize the weight matrices $V_p$ of size (J−1) × I and $W_p$ of size K × J.
3. Perform the training for p = 1, .., P. Apply the new training pair $(x_p, d_p)$, in sequence or randomly, to the hidden layer neurons.
4. Consecutively calculate the outputs of the hidden and output layer neurons
$$y_{jp} = f_h(u_{jp}); \quad o_{kp} = f_o(u_{kp}) \tag{1.13}$$
5. Find the value of the sum of squared errors cost function $E_p$ for the data pair applied and the given weight matrices $V_p$ and $W_p$ (the $E_p$ on the right-hand side accumulates the error over the epoch)
$$E_p = \frac{1}{2} \sum_{k=1}^{K} (d_{kp} - o_{kp})^2 + E_p \tag{1.14}$$
Back propagation section
6. Calculate the output layer neurons' error signals $\delta_{okp}$
$$\delta_{okp} = (d_{kp} - o_{kp}) f'_{ok}(u_{kp}); \quad k = 1, \dots, K \tag{1.15}$$
7. Calculate the hidden layer neurons' error signals $\delta_{yjp}$
$$\delta_{yjp} = f'_{hj}(u_{jp}) \sum_{k=1}^{K} \delta_{okp} w_{kjp}; \quad j = 1, \dots, J-1 \tag{1.16}$$
8. Calculate the updated output layer weights $w_{kj,p+1}$
$$w_{kj,p+1} = w_{kjp} + \eta \delta_{okp} y_{jp} \tag{1.17}$$
9. Calculate the updated hidden layer weights $v_{ji,p+1}$
$$v_{ji,p+1} = v_{jip} + \eta \delta_{yjp} x_{ip} \tag{1.18}$$
10. If p < P, go to step 3.
11. The learning epoch is completed when p = P. The learning is terminated when $E_p < E_{des}$. Otherwise, go to step 3 and start a new learning epoch with p = 1.
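The eleven steps above can be sketched as a small online training loop. The sketch assumes sigmoid activations in both layers (so that f'(u) = f(u)(1 − f(u))), uniform random initial weights in [−0.5, 0.5], a fixed presentation order, and the logical OR function as toy data; `train_backprop` and its parameters are hypothetical names.

```python
import math
import random

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def train_backprop(pairs, n_hidden=3, eta=0.5, e_des=0.01, max_epochs=5000, seed=0):
    rng = random.Random(seed)
    n_in, n_out = len(pairs[0][0]), len(pairs[0][1])   # inputs already end with +1
    # Step 2: V is (J-1) x I, W is K x J (J counts the hidden bias unit).
    V = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
    W = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    history = []
    for _ in range(max_epochs):
        E = 0.0
        for x, d in pairs:                              # step 3
            # Step 4: hidden outputs y (plus constant 1) and network outputs o.
            y = [sigmoid(sum(v * xi for v, xi in zip(row, x))) for row in V] + [1.0]
            o = [sigmoid(sum(w * yj for w, yj in zip(row, y))) for row in W]
            # Step 5: accumulate the sum of squared errors over the epoch.
            E += 0.5 * sum((dk - ok) ** 2 for dk, ok in zip(d, o))
            # Step 6: output error signals, with f'(u) = o (1 - o).
            delta_o = [(dk - ok) * ok * (1.0 - ok) for dk, ok in zip(d, o)]
            # Step 7: hidden error signals, using the weights before the update.
            delta_y = [y[j] * (1.0 - y[j]) * sum(delta_o[k] * W[k][j] for k in range(n_out))
                       for j in range(n_hidden)]
            # Step 8: update the output layer weights.
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    W[k][j] += eta * delta_o[k] * y[j]
            # Step 9: update the hidden layer weights.
            for j in range(n_hidden):
                for i in range(n_in):
                    V[j][i] += eta * delta_y[j] * x[i]
        history.append(E)
        if E < e_des:                                   # step 11: stopping test
            break
    return V, W, history

# Logical OR; each input vector ends with the constant +1.
or_pairs = [([0.0, 0.0, 1.0], [0.0]), ([0.0, 1.0, 1.0], [1.0]),
            ([1.0, 0.0, 1.0], [1.0]), ([1.0, 1.0, 1.0], [1.0])]
V, W, history = train_backprop(or_pairs)
```

`history` holds the accumulated error of each epoch, which makes it easy to check that training is reducing the cost function.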
The practical implementation aspects of back propagation learning to be considered are the number of neurons in the hidden layers, the type of activation function, weight initialization, the choice of learning rate, the choice of the error stopping function and the momentum term.
1.2 Support Vector Machine
The Support Vector Machine (SVM) was first suggested by Vladimir Vapnik for classification. Owing to developments in its theory and techniques, this classifier has become an area of intense research. SVMs are a general class of learning architectures, influenced by statistical learning theory, that perform structural risk minimization on a nested set structure of separating hyperplanes. For a given set of training data, the SVM learning algorithm generates the hyperplane separating positive and negative examples that is optimal in terms of generalization error. After learning, it yields a set of support vectors characterizing the given classification task. Let the training data be represented by a set of instances of the form:
$$D = \{(x_i, c_i) \mid x_i \in R^p, \; c_i \in \{-1, 1\}\}_{i=1}^{n} \tag{1.19}$$
where $c_i$ is either +1 or -1, indicating the class to which the point $x_i$ belongs. Each $x_i$ is a p-dimensional feature vector of real values. The main objective is to obtain the maximum-margin hyperplane which divides the points with $c_i = +1$ from those with $c_i = -1$.
Any hyperplane in $R^p$, parameterized by a vector w and a constant b, can be represented as the set of points x satisfying
$$w \cdot x + b = 0 \tag{1.20}$$
where · denotes the dot product. The vector w is a normal vector: it is perpendicular to the hyperplane. The parameter $b / \|w\|$ determines the offset of the hyperplane from the origin along the normal vector w.
Given such a hyperplane (w, b) that separates the data, the classifier function is
$$f(x) = \mathrm{sign}(w \cdot x + b) \tag{1.21}$$
This function correctly classifies the training data and, hopefully, the testing data it has not encountered yet. However, a given hyperplane represented by (w, b) is equally expressed by all pairs $(\lambda w, \lambda b)$ for $\lambda \in R^+$. Therefore, the canonical hyperplane is defined as the one that separates the data from the hyperplane by a functional distance of at least 1. That is, only hyperplanes that satisfy the following are considered:
$$x_i \cdot w + b \ge +1 \text{ when } c_i = +1 \tag{1.22}$$
$$x_i \cdot w + b \le -1 \text{ when } c_i = -1 \tag{1.23}$$
or more compactly:
$$c_i (x_i \cdot w + b) \ge 1 \quad \forall i \tag{1.24}$$
All pairs $(\lambda w, \lambda b)$ for a given hyperplane (w, b) define exactly the same hyperplane, but each has a different functional distance to a given data point. To obtain the geometric distance from the hyperplane to a data point, the functional distance is normalized by the magnitude of w:
$$d((w, b), x_i) = \frac{c_i (x_i \cdot w + b)}{\|w\|} \ge \frac{1}{\|w\|} \tag{1.25}$$
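The difference between functional and geometric distance can be checked numerically. In the sketch below, the hyperplane, the data point and the scaling factor λ = 2 are arbitrary example values, and the function names are assumptions.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def functional_distance(w, b, x, c):
    """c (x . w + b): changes when (w, b) is rescaled."""
    return c * (dot(x, w) + b)

def geometric_distance(w, b, x, c):
    """Equation 1.25: functional distance divided by the norm of w."""
    return functional_distance(w, b, x, c) / math.sqrt(dot(w, w))

w, b = [3.0, 4.0], -5.0
x, c = [3.0, 4.0], 1
lam = 2.0
w_scaled, b_scaled = [lam * wi for wi in w], lam * b

f1 = functional_distance(w, b, x, c)
f2 = functional_distance(w_scaled, b_scaled, x, c)
g1 = geometric_distance(w, b, x, c)
g2 = geometric_distance(w_scaled, b_scaled, x, c)
```

Rescaling (w, b) by λ doubles the functional distance here but leaves the geometric distance unchanged, which is why the canonical form fixes the scale through the constraint of equation 1.24.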
Intuitively, the hyperplane that maximizes the geometric distance to the closest data points is chosen (Figure 1.2). This is achieved by minimizing $\|w\|$ subject to the constraints in equation 1.24. One of the main methods of solving this problem is with Lagrange multipliers. The problem is ultimately transformed into:
minimize:
$$W(\alpha) = -\sum_{i=1}^{l} \alpha_i + \frac{1}{2} \sum_{i=1}^{l} \sum_{j=1}^{l} c_i c_j \alpha_i \alpha_j (x_i \cdot x_j)$$
subject to:
$$\sum_{i=1}^{l} c_i \alpha_i = 0$$
$$0 \le \alpha_i \le C \quad (\forall i)$$
where α is the vector of l non-negative Lagrange multipliers to be determined, and C is a constant.
Figure 1.2: Choosing the hyperplane that maximizes the margin
[Source: Cristianini and Taylor, 2000]
Defining the matrix $(H)_{ij} = c_i c_j (x_i \cdot x_j)$ gives a more compact notation:
minimize:
$$W(\alpha) = -\alpha^T \mathbf{1} + \frac{1}{2} \alpha^T H \alpha \tag{1.26}$$
subject to:
$$\alpha^T c = 0 \tag{1.27}$$
$$0 \le \alpha \le C \mathbf{1} \tag{1.28}$$
This minimization problem is a Quadratic Programming problem. Moreover, from the derivation of these equations, the optimal hyperplane can be written as:
$$w = \sum_i \alpha_i c_i x_i \tag{1.29}$$
That is, the vector w is a linear combination of the training data examples. It can also be shown that
$$\alpha_i (c_i (w \cdot x_i + b) - 1) = 0 \quad \forall i$$
In other words, when the functional distance of an example is strictly greater than 1, i.e., when $c_i (w \cdot x_i + b) > 1$, then $\alpha_i = 0$. Therefore, only the closest data points contribute to w. The training examples for which $\alpha_i > 0$ are termed support vectors. Support vectors are the only examples needed to find the optimal hyperplane.
Even if the optimal α from which w is constructed is known, b still has to be determined to fully specify the hyperplane. For this, any "positive" and "negative" support vector, $x^+$ and $x^-$ respectively, can be taken, for which
$$w \cdot x^+ + b = +1$$
$$w \cdot x^- + b = -1$$
Solving these equations gives
$$b = -\frac{1}{2} (w \cdot x^+ + w \cdot x^-) \tag{1.30}$$
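As a worked illustration of equations 1.29 and 1.30, consider a toy problem with only two training points, (1, 1) in the positive class and (−1, −1) in the negative class. For this problem the optimal multipliers can be derived by hand as $\alpha_1 = \alpha_2 = 0.25$; the data and the helper names below are assumptions made for the example, and no quadratic programming solver is involved.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def weight_vector(alphas, labels, points):
    """Equation 1.29: w = sum_i alpha_i c_i x_i."""
    dim = len(points[0])
    return [sum(a * c * x[k] for a, c, x in zip(alphas, labels, points))
            for k in range(dim)]

def offset(w, x_pos, x_neg):
    """Equation 1.30: b = -1/2 (w . x+ + w . x-)."""
    return -0.5 * (dot(w, x_pos) + dot(w, x_neg))

points = [[1.0, 1.0], [-1.0, -1.0]]
labels = [1, -1]
alphas = [0.25, 0.25]          # hand-derived optimum for this toy problem

w = weight_vector(alphas, labels, points)
b = offset(w, points[0], points[1])
margins = [c * (dot(w, x) + b) for x, c in zip(points, labels)]
```

Both points are support vectors, so each lies exactly at functional distance 1 from the resulting hyperplane.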
The need for the constraint $\alpha_i \le C \; (\forall i)$ in equation 1.28 can now be explained. When $C = \infty$, the optimal hyperplane is the one that completely separates the data (assuming one exists). For finite C, the problem changes to finding a "soft-margin" classifier, which allows some of the data to be misclassified. C can be thought of as a tunable parameter. A higher C places more importance on classifying all the training data correctly. A lower C results in a "more flexible" hyperplane that tries to minimize the margin error, i.e., how far $c_i (w \cdot x_i + b)$ falls below 1, for each example. Finite values of C are helpful in scenarios where the data cannot be separated easily.
A hyperplane is expected to separate d-dimensional data perfectly into individual classes. However, since example data is often not linearly separable, the notion of a "kernel induced feature space" has been introduced. This transforms the data into a higher dimensional space where the data is separable. To achieve this, a mapping z = φ(x) is defined that casts the d-dimensional input vector x into a (usually higher) d'-dimensional vector z. The mapping φ() is chosen so that the new training data $\{\phi(x_i), c_i\}$ is separable by a hyperplane (Figure 1.3).
Figure 1.3: Separating the Data in a Feature Space
[Source: Cristianini and Taylor, 2000]
Given a mapping z = φ(x), all occurrences of x are replaced with φ(x) to set up a new optimization problem. Hence $(H)_{ij} = c_i c_j (\phi(x_i) \cdot \phi(x_j))$ in equation 1.26, and $w = \sum_i \alpha_i c_i \phi(x_i)$ in equation 1.29.
Equation 1.21 becomes
$$f(x) = \mathrm{sign}(w \cdot \phi(x) + b) = \mathrm{sign}\left(\left[\sum_i \alpha_i c_i \phi(x_i)\right] \cdot \phi(x) + b\right) = \mathrm{sign}\left(\sum_i \alpha_i c_i (\phi(x_i) \cdot \phi(x)) + b\right)$$
It can be observed that whenever a $\phi(x_a)$ appears, it is always in a dot product with some other $\phi(x_b)$. This implies that if the kernel for the dot product in the higher dimensional feature space is known,
$$K(x_a, x_b) = \phi(x_a) \cdot \phi(x_b)$$
then the mapping z = φ(x) never needs to be dealt with directly. The matrix and the classifier become $(H)_{ij} = c_i c_j K(x_i, x_j)$ and $f(x) = \mathrm{sign}(\sum_i \alpha_i c_i K(x_i, x) + b)$ respectively. With the problem set up in this manner, finding the optimal hyperplane proceeds as usual, except that the hyperplane lies in some unknown feature space. In the original input space, the data samples are separated by some curved, possibly non-continuous contour.
Figure 1.4: Computational pipeline of feature vector representation and SVM classification
Some of the commonly used kernels are:
• linear: $K(x_i, x_j) = x_i^T x_j$.
• polynomial: $K(x_i, x_j) = (\gamma x_i^T x_j + r)^d$, γ > 0.
• radial basis function: $K(x_i, x_j) = \exp(-\gamma \|x_i - x_j\|^2)$, γ > 0.
• sigmoid: $K(x_i, x_j) = \tanh(\gamma x_i^T x_j + r)$.
Here, γ, r, and d are kernel parameters. The computational pipeline, which involves training the decision function on a set of chosen binary-labeled training feature vectors and then classifying a given test sample into either the positive or the negative class, is depicted in Figure 1.4.
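The four kernels listed above and the kernelized classifier $f(x) = \mathrm{sign}(\sum_i \alpha_i c_i K(x_i, x) + b)$ can be written down directly. The sketch below uses plain Python lists; the function names, default parameter values and the two-point example (with hand-derived multipliers of 0.25 and b = 0 for the linear kernel) are assumptions for illustration only.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def linear_kernel(xi, xj):
    return dot(xi, xj)

def polynomial_kernel(xi, xj, gamma=1.0, r=0.0, d=3):
    return (gamma * dot(xi, xj) + r) ** d

def rbf_kernel(xi, xj, gamma=1.0):
    squared_distance = sum((a - b) ** 2 for a, b in zip(xi, xj))
    return math.exp(-gamma * squared_distance)

def sigmoid_kernel(xi, xj, gamma=1.0, r=0.0):
    return math.tanh(gamma * dot(xi, xj) + r)

def decision(x, support_vectors, labels, alphas, b, kernel):
    """Kernelized classifier: sign(sum_i alpha_i c_i K(x_i, x) + b)."""
    s = sum(a * c * kernel(sv, x)
            for sv, c, a in zip(support_vectors, labels, alphas)) + b
    return 1 if s > 0 else (-1 if s < 0 else 0)

# Toy support vectors with hand-derived multipliers (linear kernel case).
svs, cs, als = [[1.0, 1.0], [-1.0, -1.0]], [1, -1], [0.25, 0.25]
prediction = decision([2.0, 0.0], svs, cs, als, 0.0, linear_kernel)
```

Only the `kernel` argument changes when moving from a linear to a nonlinear classifier; the multipliers would, of course, have to be re-optimized for each kernel.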
1.3 Decision tree
A Decision Tree (DT) is a tree-like structure that depicts rules for dividing training data into groups based on the regularities in the data. The most striking feature of DTs is their ability to break down a complex decision making problem into a set of simpler decisions at several levels of the tree, thereby providing a solution which is often easier to interpret. The data is represented as:
$$(x, C) = (x_1, x_2, x_3, \dots, x_n, C) \tag{1.31}$$
The variable C is the target class variable. The vector x consists of the feature values $x_1, x_2, x_3$, etc. that encode an input data sample and are used for the classification or prediction task.
Figure 1.5: Schematic sketch of a sample decision tree
A DT is a hierarchical structure in which each interior node represents a test on an attribute, each branch denotes an outcome of the test, and each leaf node holds a class label. A tree is learned by splitting the source set into subsets based on an attribute value test. This procedure is repeated recursively on each derived subset. The recursion is complete when (i) all samples in the subset at a node have the same value of the target variable, or (ii) splitting no longer adds value to the predictions.
Every path from the root to a leaf in an unpruned tree gives one if-then rule. Each classification rule represents a hyperplane that best divides the fibril forming from the non-fibril forming fragments in the representative space. Every such rule is simplified by eliminating conditions that do not help to differentiate the nominated class from other classes. For each class in turn, all the simplified rules for that class are sifted to remove rules that do not contribute to the accuracy of the set of rules as a whole. The sets of rules for the classes are then ordered to minimize false positive errors, and a default class is chosen. This procedure leads to a production rule predictor that is almost as accurate as a pruned tree, but more comprehensible [Han, 2011]. Figure 1.5 provides a schematic sketch of a decision tree. The whole representative space is divided into two subspaces by the test "F1 ≤ 0.54?". All fragments in the subspace "F1 ≤ 0.54" are labeled Amyloidogenic, and induction of the tree finishes there. The fragments in the subspace "F1 > 0.54" have mixed labels, and they are further divided by the test "F1 ≤ 2.39?" and so on.
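The correspondence between root-to-leaf paths and if-then rules can be sketched with a tiny tree built from the two tests quoted above. The tuple encoding, the function names and, in particular, the two leaves below the second test are hypothetical; the text does not state how the tree continues after "F1 ≤ 2.39?".

```python
# Internal node: (feature, threshold, subtree_if_true, subtree_if_false).
# Leaf: a class label string.
TREE = ("F1", 0.54,
        "Amyloidogenic",
        ("F1", 2.39,
         "Non-amyloidogenic",    # hypothetical leaf, for illustration only
         "Amyloidogenic"))       # hypothetical leaf, for illustration only

def classify(tree, sample):
    """Walk from the root to a leaf, applying one attribute test per node."""
    while not isinstance(tree, str):
        feature, threshold, if_true, if_false = tree
        tree = if_true if sample[feature] <= threshold else if_false
    return tree

def extract_rules(tree, conditions=()):
    """Return one (conditions, class label) rule per root-to-leaf path."""
    if isinstance(tree, str):
        return [(list(conditions), tree)]
    feature, threshold, if_true, if_false = tree
    return (extract_rules(if_true, conditions + ("%s <= %s" % (feature, threshold),))
            + extract_rules(if_false, conditions + ("%s > %s" % (feature, threshold),)))

rules = extract_rules(TREE)
label = classify(TREE, {"F1": 0.3})
```

Each entry of `rules` reads as "if all conditions hold, then predict the class label", which is the form that the rule simplification described above operates on.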
1.4 Ensemble methods
Ensemble methods build a set of classifiers with learning algorithms and classify new data samples by taking a weighted or unweighted vote of their predictions. In other words, an ensemble of classifiers is a set of classifiers whose individual decisions are combined, typically by weighted or unweighted voting, to classify new data samples. Conceptually, ensembles follow a divide-and-conquer paradigm to improve performance.
The main notion behind this approach is that a group of 'weak learners' generating weak hypotheses are brought together to form a 'strong learner' that produces a strong hypothesis. Each classifier, individually, is considered a 'weak learner', while all the classifiers taken together are a 'strong learner'. A weak learner is a classifier which is only slightly correlated with the correct classification. In other words, a weak learner learns with low bias and high variance. For this reason, a large value of k is preferred, where k is the number of iterations of the ensemble technique, which in turn is the number of single classifier models combined to form the strong learner.
In an ensemble approach, a series of k classifiers is iteratively learned (Figure 1.6) to produce the final hypothesis. Studies have shown that ensemble approaches often give higher prediction accuracy than the individual classifiers that make them up. Bagging and boosting are well-known examples of ensemble approaches.
1.4.1 Bagging
Bagging is a method of manipulating the training samples to improve prediction performance. Given a dataset D containing d samples, bagging works as follows. For iteration i (i = 1, 2, .., k), a training set $D_i$ of d samples is sampled with replacement from the original set D. Such a training set is termed a bootstrap replicate of the original training set, or a bootstrap sample, and the technique is called bootstrap aggregation.
Because sampling with replacement is used, some of the original samples of D may not be included in $D_i$, while others may occur more than once. A classifier model $M_i$ is learned for each training set $D_i$. To classify an unknown data sample X, each classifier $M_i$ returns its class prediction, which counts as one vote. The bagged classifier, M*, counts the votes and assigns the class with the most votes to X [Han and Kamber, 2008]. The bagged classifier often has higher classification accuracy than an individual classifier.
Random Forest follows a bagging-based approach.
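Bootstrap sampling and majority voting can be sketched as follows. The base learner here is a deliberately simple, hypothetical threshold stump on one-dimensional toy data (it assumes positive samples have larger values than negative ones); it stands in for whatever classifier $M_i$ is actually used.

```python
import random
from collections import Counter

def bootstrap_sample(D, rng):
    """Draw len(D) samples from D with replacement."""
    return [rng.choice(D) for _ in range(len(D))]

def train_stump(Di):
    """Hypothetical weak learner: threshold halfway between the class means."""
    pos = [x for x, c in Di if c == 1]
    neg = [x for x, c in Di if c == -1]
    if not pos or not neg:                 # bootstrap sample with a single class
        only = 1 if pos else -1
        return lambda x: only
    threshold = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2.0
    return lambda x: 1 if x > threshold else -1

def bagging(D, k, seed=0):
    """Learn k models, each on its own bootstrap sample of D."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(D, rng)) for _ in range(k)]

def bagged_predict(models, x):
    """Each model casts one vote; the class with the most votes wins."""
    votes = Counter(m(x) for m in models)
    return votes.most_common(1)[0][0]

# Toy 1-D data: negatives at 0..4, positives at 6..10.
D = [(float(v), -1) for v in range(5)] + [(float(v), 1) for v in range(6, 11)]
models = bagging(D, k=11)
```

An odd k avoids tied votes in the two-class case.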
Random Forest
Random Forest (RF) is a group of tree-structured predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. An RF is defined as a collection of tree-structured classifiers $\{h(x, \Theta_k), k = 1, ...\}$, where the $\Theta_k$ are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [Breiman, 2001]. RF begins with a conventional machine learning algorithm, the DT, which in ensemble terms corresponds to a weak learner. In a DT, an input is entered at the top. As it traverses down the tree, the data is partitioned into smaller and smaller sets. The RF takes this notion to the next level by combining trees with the notion of an ensemble. Thus, in ensemble terms, the trees are weak learners and the RF is a strong learner. At each node, a subset of the variables is chosen at random, and the variable (and a value for that variable) which optimizes the split is found (Figure 1.7).
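The node-level step described in the last sentence can be sketched as below. The text does not say which criterion "optimizes the split"; the weighted Gini impurity used here is an assumption, as are the function names and the toy data.

```python
import random

def gini(labels):
    """Gini impurity of a list of class labels (assumed split criterion)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_random_split(X, y, m, rng):
    """Choose m variables at random and return the best (impurity, variable,
    threshold) found among them, or None if no valid split exists."""
    features = rng.sample(range(len(X[0])), m)
    best = None
    for f in features:
        for t in sorted({row[f] for row in X}):
            left = [c for row, c in zip(X, y) if row[f] <= t]
            right = [c for row, c in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best is None or score < best[0]:
                best = (score, f, t)
    return best

# Toy data: variable 0 separates the classes perfectly, variable 1 does not.
X = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
y = [0, 0, 1, 1]
split = best_random_split(X, y, m=2, rng=random.Random(0))
```

In a forest, m would normally be smaller than the number of variables, so that different trees and nodes consider different candidate variables.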
1.4.2 Boosting
Boosting is a generic procedure for improving the accuracy of a learning algorithm. Its aim is to combine 'weak' classifiers to obtain a 'strong' classifier. It is an iterative algorithm: in each iteration, a weak classifier is chosen by minimizing an average training error. Formally, in boosting, a weight is assigned to each training sample and a series of k classifiers is iteratively learned. After a classifier $M_i$ is learned, the weights are updated so that the subsequent classifier, $M_{i+1}$, pays more attention to the training samples that were misclassified by $M_i$. The final boosted classifier, M*, combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy [Han, 2008]. The Adaboost algorithm, with conventional classifiers such as SVM, DT and RF as base classifiers, was utilized in this research work.
Figure 1.6: The final hypothesis hfinal is a combination of (h1, ..., hk)
Adaboost
The Adaboost algorithm illustrates yet another method of manipulating the training set. The main idea is to run the weak learning algorithm many times, each time on a different distribution of instances, to generate several different hypotheses. In effect, this technique passes the incorrectly classified samples on to another model of the same weak learning algorithm. Formally, given D, a dataset of d class-labeled samples (X1, y1), (X2, y2), .., (Xd, yd), where $y_i$ is the target class label of sample $X_i$, Adaboost initially assigns each training sample an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the samples from D are sampled to form a training set $D_i$ of size d. Since sampling with replacement is used, the same sample may be selected more than once. Each sample's chance of being selected is based on its weight. A classifier model $M_i$ is derived from the training samples of $D_i$. Its error is then calculated using $D_i$ as a test set. The weights of the training samples are then adjusted according to how they were classified. If a sample was incorrectly classified, its weight is increased; if it was correctly classified, its weight is decreased. A sample's weight reflects how hard it is to classify: the higher the weight, the more often it has been misclassified. These weights are used to generate the training samples for the classifier of the next round. The basic idea is that each new classifier focuses more on the samples misclassified in the previous round. In this way, a series of classifiers is built that complement each other.
Figure 1.7: Representation of a Random Forest
To compute the error rate of model $M_i$, the weights of the samples in $D_i$ that $M_i$ misclassified are summed. Mathematically,
$$error(M_i) = \sum_{j=1}^{d} w_j \times err(X_j) \tag{1.32}$$
where $err(X_j)$ is the misclassification error of sample $X_j$: if the sample was misclassified, $err(X_j)$ is 1; otherwise it is 0. If the performance of classifier $M_i$ is so poor that its error exceeds 0.5, it is abandoned, and a new training set $D_i$ is generated from which a new $M_i$ is derived.
The error rate of $M_i$ affects how the weights of the training samples are updated. If a sample in round i was correctly classified, its weight is multiplied by $error(M_i)/(1 - error(M_i))$. Once the weights of all correctly classified samples are updated, the weights of all samples, including the misclassified ones, are normalized so that their sum remains the same as it was before. To normalize a weight, multiply it by the sum of the old weights and divide by the sum of the new weights. As a result, the weights of misclassified samples are increased and the weights of correctly classified samples are decreased.
The boosting method assigns a weight to each classifier's vote, based on how well the classifier has performed. The lower a classifier's error rate, the more accurate it is and, therefore, the higher its weight for voting should be. The weight of classifier $M_i$'s vote is
$$\log \frac{1 - error(M_i)}{error(M_i)}$$
For each class c, the weights of the classifiers that assigned class c to X are summed. The class with the highest sum is the 'winner' and is returned as the class prediction for sample X.
Using RF as the weak learner in the Adaboost algorithm results in an ensemble-of-ensembles technique, since RF itself follows a bagging-based ensemble approach.
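The bookkeeping of one Adaboost round (equation 1.32, the weight update with normalization, and the weighted vote) can be sketched as follows. The function names and the four-sample example are assumptions; the sketch also assumes 0 < error < 0.5, since a classifier with error above 0.5 is abandoned and an error of exactly 0 would make the vote weight undefined.

```python
import math

def model_error(weights, misclassified):
    """Equation 1.32: sum of the weights of the misclassified samples."""
    return sum(w for w, m in zip(weights, misclassified) if m)

def update_weights(weights, misclassified):
    """Shrink the weights of correctly classified samples by
    error / (1 - error), then rescale so the total weight is unchanged."""
    err = model_error(weights, misclassified)      # assumed 0 < err < 0.5
    factor = err / (1.0 - err)
    new = [w if m else w * factor for w, m in zip(weights, misclassified)]
    scale = sum(weights) / sum(new)
    return [w * scale for w in new]

def vote_weight(err):
    """Weight of a classifier's vote: log((1 - error) / error)."""
    return math.log((1.0 - err) / err)

def boosted_predict(predictions, errors):
    """Sum the vote weights per predicted class and return the winner."""
    totals = {}
    for c, e in zip(predictions, errors):
        totals[c] = totals.get(c, 0.0) + vote_weight(e)
    return max(totals, key=totals.get)

# Four samples with equal initial weight 1/d; only the first is misclassified.
weights = [0.25, 0.25, 0.25, 0.25]
misclassified = [True, False, False, False]
new_weights = update_weights(weights, misclassified)
winner = boosted_predict(["A", "B", "B"], [0.1, 0.4, 0.4])
```

In the example, the misclassified sample ends up carrying half of the total weight, and a single accurate classifier outvotes two weak ones.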
1.5 Summary
This chapter has described the background material on supervised machine learning systems on which this thesis concentrates. The chapter started with a general description of the machine learning paradigms, together with some definitions and mathematical formulations related to the learning task of classifying motif sequences. Finally, the soft computing algorithms utilized in this research work were outlined.