Chapter 1

Soft computing algorithms

It is indeed a surprising and fortunate fact that nature can be

expressed by relatively low-order mathematical functions.

Rudolf Carnap

Mitchell [Mitchell, 1997] defines the machine learning process as:

A computer program is said to learn from experience E with respect to some

class of tasks T and performance measure P, if its performance at tasks in T,

as measured by P, improves with experience E.

The paradigms of the machine learning field can be categorized depending on how E, P and T are defined. In this research work:

Task T: recognize amyloid motifs within protein sequences.

Performance measure P: the percentage of amyloid fibril forming motifs correctly classified.

Training experience E: a dataset of positive and negative hexamers with given classifications.

One of the main categories of machine learning is supervised learning: the learning procedure that gives the learner direct feedback about the correctness of its performance [Penarroya, 2004]. There is always some kind of tutor involved in this process. The proposed computational approach relies exclusively on supervised learning algorithms. The formal definition can be framed as: given a set of instances D = {d_1, d_2, ..., d_n}, each labeled with one of the classes C = {c_1, c_2}, the classification task is to generate a theory T from D and C such that, when an unlabeled new instance is given, T predicts the class of that instance. The life cycle of such a learning system consists of two phases: training and exploitation. The exploitation phase is simulated by splitting the set of labeled examples into two non-overlapping sets, the training set and the test set. The test set is used to validate that the generated theory T is correct, i.e., that the learning system has been able to model the concept represented by the instances D in the training set. Generalization ability is the capacity to build a theory that correctly models the concepts represented by the training set; good performance on the test set is a sign of generalization.
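As an illustration, the exploitation phase could be simulated with a held-out test set along the following lines; this is only a sketch, and the random feature vectors stand in for the hexamer dataset of this work:

    # Sketch: splitting labeled examples into non-overlapping train/test sets.
    import numpy as np
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    X = rng.random((100, 6))          # placeholder feature vectors for 100 hexamers
    y = rng.integers(0, 2, size=100)  # placeholder labels: 1 = amyloidogenic, 0 = not

    # Non-overlapping training and test sets (an 80/20 split is assumed here).
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42)

Good test-set performance of a theory trained only on (X_train, y_train) is then taken as the sign of generalization.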

The classification task has been discussed from the viewpoint of constructing a theory that models the concepts represented by examples. It is also important to consider what knowledge representation is used to generate this theory. The knowledge representation could be in the form of a set of instances and/or a set of rules corresponding to an attribute of the input data samples.

The instances or data samples processed by the learning algorithm have a consistent form: each data instance comprises a finite and fixed set of components. These components are called attributes. An attribute is a feature or property that characterizes the instance. Sometimes it may happen that the generalization capacity of the learning system deteriorates, for example due to missing attribute values in an instance or wrongly labeled instances in the training set. In such a scenario, the role of soft computing

algorithms is very well appreciated. Soft computing, a term coined by Lotfi Zadeh (1994), is defined as a consortium of methodologies that work synergistically and provide, in one form or another, flexible information processing capabilities for handling real-life ambiguous situations. Unlike conventional (hard) computing, its aim is to exploit the tolerance for imprecision, uncertainty, approximate reasoning and partial truth in order to achieve tractability, robustness, low solution cost and, more importantly, close resemblance to human-like decision making [Mitra and Hayashi, 2006].

It has been established that the ability of soft computing algorithms to handle imprecision and uncertainty in large and complex search spaces makes them well suited to solving bioinformatics problems [Jena et al., 2009]. The kind of theory that can be generated and the kind of predictions that can be made are affected by the chosen learning algorithm. The subsequent sections discuss the background of the soft computing algorithms used in this research.

1.1 Neural network

A Neural Network (NN) is a mathematical or computational model influenced by the structure and behavior of the biological neurons of the human brain. An NN is viewed as a massively parallel distributed processor, made up of processing units called neurons, which acquires experiential knowledge from its environment. This is achieved through a learning process. The acquired knowledge is then stored in the form of inter-neuron connection strengths, known as synaptic weights. Like a human brain, the network acquires knowledge by learning through experience. The procedure by which the synaptic weights of the network are modified to achieve the desired objective is called the learning algorithm.

Figure 1.1: A sample neuron

As shown in Figure 1.1, each neuron is a simple processing unit consisting of a summing unit and an activation function. The summing unit calculates the sum of the inputs and a bias value after multiplication by the corresponding weight factors. The output of the summing unit can be expressed as the inner product (dot product) of the input vector and the weight vector; this is sometimes referred to as the induced local field. The activation function limits the weighted sum to the required range and may be linear or nonlinear. The output of a neuron in terms of the induced local field v is defined by the activation function, denoted ψ(v) [Haykin, 2005]. One of the most common activation functions is the sigmoid function, which is defined by

ψ(v) = 1 / (1 + exp(−av))    (1.1)

where a is the slope parameter of the sigmoid function. The range of values assumed by the sigmoid function is from 0 to 1.
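As a minimal sketch, equation (1.1) and the induced local field can be computed directly; the weights, inputs and bias below are illustrative values:

    import numpy as np

    def sigmoid(v, a=1.0):
        # Equation (1.1); a is the slope parameter.
        return 1.0 / (1.0 + np.exp(-a * v))

    w = np.array([0.4, -0.2, 0.7])   # synaptic weights (illustrative)
    x = np.array([1.0, 0.5, -1.5])   # inputs (illustrative)
    b = 0.1                          # bias
    v = np.dot(w, x) + b             # induced local field: dot product plus bias
    print(sigmoid(v))                # output lies strictly between 0 and 1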

Neural networks can be broadly classified into two types: feed forward networks and recurrent neural networks. In a feed forward network, the flow of data from input units to output units is strictly feed forward. The processing of data can extend over multiple layers of units, but there are no feedback connections from output units to input units in the same layer or previous layers. A multi-layer feed forward network consists of an input layer, one or more hidden layers and an output layer. The neurons in the input layer connect the elements of the input vector to the next layer of the network. The hidden layers do the computation; usually the activation function of the hidden layers is nonlinear. The output of the neurons in each layer is given as the input to the next layer, and the set of outputs from the output layer forms the overall response of the network to the given input pattern. A recurrent network, on the other hand, is a dynamic network that contains feedback connections. Feedback exists in a dynamic system whenever the input applied to a particular element is influenced in part by the output of that element, resulting in one or more closed circuits for signal transmission.

Perceptron

As suggested by Kecman [Kecman, 2004], the perceptron was one of the first processing elements that were able to learn. The learning paradigm is an iterative supervised one. In a supervised adapting scheme, an initial random weight vector w_1 is chosen and the perceptron is given a randomly chosen data pair (x_1, d_1) with desired output d_1. The perceptron learning algorithm is an error-correction rule that changes the weights in proportion to the error e_1 = d_1 − o_1 between the actual output o_1 and the desired output d_1. The new weights are calculated using the simple rule w_2 = w_1 + ∆w_1 = w_1 + η(d_1 − o_1)x_1, where w_1 and w_2 are the previous and new weights respectively and η is called the learning rate. The next data pair is drawn randomly from the dataset and the whole scheme is repeated, iteratively driving the error toward zero. The computing scheme of the perceptron for an input vector x is:

u = ∑_{i=1}^{n+1} w_i x_i = w_1 x_1 + w_2 x_2 + ... + w_n x_n + w_{n+1} x_{n+1}    (1.2)

which produces an output of +1 if u is positive and −1 otherwise:

o = sign(u) = sign(∑_{i=1}^{n+1} w_i x_i)    (1.3)

sign stands for the signum function, i.e.,

o = sign(u) = { +1 for u > 0;  0 for u = 0;  −1 for u < 0 }    (1.4)
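A compact sketch of this error-correction loop, assuming a toy linearly separable dataset and a fixed number of epochs as the stopping rule:

    import numpy as np

    def train_perceptron(X, d, eta=0.1, epochs=50, seed=0):
        # Iterative supervised perceptron learning (error-correction rule).
        rng = np.random.default_rng(seed)
        Xb = np.hstack([X, np.ones((len(X), 1))])  # append x_{n+1} = +1
        w = rng.normal(size=Xb.shape[1])           # initial random weight vector
        for _ in range(epochs):
            for i in rng.permutation(len(Xb)):     # draw data pairs randomly
                o = np.sign(Xb[i] @ w) or 1.0      # signum output (0 treated as +1)
                w += eta * (d[i] - o) * Xb[i]      # w_new = w_old + eta*(d - o)*x
        return w

    # Toy data: class +1 above the line x1 + x2 = 1, class -1 below it.
    X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.5], [1.5, 0.2]])
    d = np.array([-1.0, 1.0, 1.0, 1.0])
    w = train_perceptron(X, d)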

Multilayer perceptron

A multilayer perceptron is a feed forward neural network comprising multiple layers of perceptrons. With a sigmoid activation function it can approximate a nonlinear mapping R^n → R^1. In a multilayer perceptron, x_{n+1} is a constant term equal to 1, called the bias. The bias weights vector b can simply be integrated into the hidden layer weights matrix v as its last column.

A multilayer perceptron is representative of the nonlinear basis function expansion

o = f_a(x, w, v) = ∑_{i=1}^{N} w_i ψ_i(x, v_i)    (1.5)

where the ψ_i(x, v_i) are a set of given basis functions such as sigmoid functions, o is the output from the model, and N is the number of hidden layer neurons. The output layer weight vector w and the hidden layer weights matrix v are the free parameters that are the subjects of learning. The input vector x, bias weights vector b, hidden layer weights matrix v and output weights vector w are as follows:

x = [x_1, x_2, ..., x_n]^T    (1.6)

v = [v_ij], i = 1, 2, ..., n; j = 1, 2, ..., J    (1.7)

b = [b_1, b_2, ..., b_J]^T    (1.8)

w = [w_1, w_2, ..., w_J, w_{J+1}]^T    (1.9)

The output layer neurons may have sigmoid activation functions, mostly for classification tasks.
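A minimal sketch of the basis function expansion (1.5) with one sigmoid hidden layer; the layer sizes and random weights are arbitrary assumptions:

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def mlp_forward(x, V, w):
        # o = sum_i w_i * psi_i(x, v_i): sigmoid hidden layer, linear output.
        xb = np.append(x, 1.0)       # augment input with the bias term x_{n+1} = 1
        y = sigmoid(V @ xb)          # hidden layer outputs psi_i(x, v_i)
        yb = np.append(y, 1.0)       # bias input for the output layer
        return w @ yb                # weighted sum over the N hidden neurons

    rng = np.random.default_rng(1)
    n, N = 4, 6                      # input dimension and hidden neurons (arbitrary)
    V = rng.normal(size=(N, n + 1))  # hidden weights matrix, bias as last column
    w = rng.normal(size=N + 1)       # output weights (plus output bias)
    o = mlp_forward(rng.normal(size=n), V, w)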

Back propagation

In this algorithm, the error signal of the output neuron is calculated as the difference between the desired output and the actual output, and is subsequently used to modify the weight vector of the output layer. Since the desired outputs of the hidden layer neurons are not known, the error signals for the hidden layer neurons are calculated by back propagating the error signals δ_o of the output layer neurons. The weight change ∆v_ij is obtained by

∆v_ij = η f′_j(u_j) x_i ∑_{k=1}^{K} δ_ok w_kj,    j = 1, ..., J−1; i = 1, ..., I    (1.10)

This is the generalization of the delta learning rule and explains how the hidden layer weights are to be modified. In each iteration step, the new weight v_ij is adjusted using the equation

v_ij = v_ij + ∆v_ij = v_ij + η f′_j(u_j) x_i ∑_{k=1}^{K} δ_ok w_kj,    j = 1, ..., J−1; i = 1, ..., I    (1.11)

In vector notation,

V = V + η δ_y x^T    (1.12)

where V is a (J−1)×I matrix, and x and δ_y are I×1 and (J−1)×1 column vectors respectively.

The back propagation algorithm is briefly explained below. Given is a set of P measured data pairs used for training:

X = {x_p, d_p}, p = 1, ..., P

input vector x = [x_1, x_2, ..., x_n, +1]^T

desired output vector d = [d_1, d_2, ..., d_K]^T

The algorithm has two sections: feed forward and back propagation.

Feed forward section

1. Choose the learning rate η and predefine the maximally allowed error E_des.

2. Initialize the weight matrices V(J−1, I) and W(K, J).

3. Perform the training with p = 1, ..., P. Apply the training pair (x_p, d_p), in sequence or randomly, to the hidden layer neurons.

4. Consecutively, calculate the outputs of the hidden and output layer neurons:

y_jp = f_h(u_jp);  o_kp = f_o(u_kp)    (1.13)

5. Accumulate the sum-of-squared-errors cost function E_p for the data pair applied and the given weight matrices V and W:

E_p = (1/2) ∑_{k=1}^{K} (d_kp − o_kp)² + E_p    (1.14)

Back propagation section

6. Calculate the output layer neurons' error signals δ_okp:

δ_okp = (d_kp − o_kp) f′_ok(u_kp),  k = 1, ..., K    (1.15)

7. Calculate the hidden layer neurons' error signals δ_yjp:

δ_yjp = f′_j(u_jp) ∑_{k=1}^{K} δ_okp w_kjp,  j = 1, ..., J−1    (1.16)

8. Calculate the updated output layer weights w_kj,p+1:

w_kj,p+1 = w_kjp + η δ_okp y_jp    (1.17)

9. Calculate the updated hidden layer weights v_ji,p+1:

v_ji,p+1 = v_jip + η δ_yjp x_ip    (1.18)

10. If p < P, go to step 3.

11. The learning epoch is completed when p = P. The learning is terminated when E_p < E_des; otherwise go to step 3 and start a new learning epoch with p = 1.

The practical implementation aspects of back propagation learning to be considered are the number of neurons in the hidden layers, the type of activation function, weight initialization, the choice of learning rate, the choice of the error stopping function and the momentum term.
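The eleven steps above can be condensed into a short NumPy sketch for a single sigmoid hidden layer; the layer sizes, learning rate, stopping threshold and the XOR toy data are illustrative assumptions, not the settings of this work:

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def backprop_train(X, D, J=5, eta=0.5, E_des=1e-2, max_epochs=5000, seed=0):
        # On-line back propagation with sigmoid hidden and output layers.
        rng = np.random.default_rng(seed)
        P, n = X.shape
        K = D.shape[1]
        Xb = np.hstack([X, np.ones((P, 1))])            # inputs augmented with +1
        V = rng.normal(scale=0.5, size=(J - 1, n + 1))  # hidden weights, (J-1) x I
        W = rng.normal(scale=0.5, size=(K, J))          # output weights, K x J
        for _ in range(max_epochs):
            E = 0.0
            for p in range(P):
                x = Xb[p]
                y = np.append(sigmoid(V @ x), 1.0)      # step 4: hidden outputs + bias
                o = sigmoid(W @ y)                      # step 4: output layer outputs
                E += 0.5 * np.sum((D[p] - o) ** 2)      # step 5: accumulate E_p
                d_o = (D[p] - o) * o * (1 - o)          # step 6: delta_okp
                d_y = y[:-1] * (1 - y[:-1]) * (W[:, :-1].T @ d_o)  # step 7: delta_yjp
                W += eta * np.outer(d_o, y)             # step 8: update output weights
                V += eta * np.outer(d_y, x)             # step 9: update hidden weights
            if E < E_des:                               # step 11: stopping criterion
                return V, W
        return V, W

    # Toy example: the XOR mapping, a classic nonlinearly separable task.
    X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
    D = np.array([[0.0], [1.0], [1.0], [0.0]])
    V, W = backprop_train(X, D)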

1.2 Support Vector Machine

The Support Vector Machine (SVM) was first suggested by Vladimir Vapnik for classification. Owing to developments in its theory and techniques, this classifier has become an area of intense research. SVMs are a general class of learning architectures, influenced by statistical learning theory, that perform structural risk minimization on a nested set structure of separating hyperplanes. For a given set of training data, the SVM learning algorithm generates the hyperplane that is optimal in terms of generalization error in separating positive and negative data examples. After learning, it obtains a set of support vectors characterizing the given classification task. Let the training data, represented by a set of instances, be of the form:

D = {(x_i, c_i) | x_i ∈ R^p, c_i ∈ {−1, 1}}, i = 1, ..., n    (1.19)

where c_i is either +1 or −1, indicating the class to which the point x_i belongs. Each x_i is a p-dimensional feature vector of real values. The main objective is to obtain the maximum-margin hyperplane that divides the points with c_i = +1 from those with c_i = −1.

Any hyperplane in R^p, parameterized by a vector w and a constant b, can be represented as the set of points x satisfying:

w · x + b = 0    (1.20)

where · denotes the dot product. The vector w is a normal vector: it is perpendicular to the hyperplane. The parameter b/‖w‖ determines the offset of the hyperplane from the origin along the normal vector w.

Given such a hyperplane (w, b) that separates the data, this yields the decision function:

f(x) = sign(w · x + b)    (1.21)

which correctly classifies the training data and, hopefully, the test data it has not encountered yet. However, a given hyperplane represented by (w, b) is equally expressed by all pairs (λw, λb) for λ ∈ R⁺. Therefore, the canonical hyperplane is defined to be the one for which the data points are separated from the hyperplane by a functional distance of at least 1. That is, only those hyperplanes that satisfy the following are considered:

x_i · w + b ≥ +1 when c_i = +1    (1.22)

x_i · w + b ≤ −1 when c_i = −1    (1.23)

or more compactly:

c_i(x_i · w + b) ≥ 1 ∀i    (1.24)

All pairs (λw, λb) for a given hyperplane (w, b) define exactly the same hyperplane, but each has a different functional distance to a given data point. To obtain the geometric distance from the hyperplane to a data point, the distance is normalized by the magnitude of w:

d((w, b), x_i) = c_i(x_i · w + b) / ‖w‖ ≥ 1/‖w‖    (1.25)

Intuitively, the hyperplane that maximizes the geometric distance to the closest data points is chosen (Figure 1.2). This is achieved by minimizing ‖w‖. One of the important methods of solving this is with Lagrange multipliers, whereby the problem is ultimately transformed into:

minimize:

W(α) = −∑_{i=1}^{l} α_i + (1/2) ∑_{i=1}^{l} ∑_{j=1}^{l} c_i c_j α_i α_j (x_i · x_j)

subject to:

∑_{i=1}^{l} c_i α_i = 0

0 ≤ α_i ≤ C (∀i)

where α is the vector of l non-negative Lagrange multipliers to be determined, and C is a constant.

Figure 1.2: Choosing the hyperplane that maximizes the margin [Source: Cristianini and Taylor, 2000]

Let the matrix (H)_ij = c_i c_j (x_i · x_j), and introduce more compact notation:

minimize:

W(α) = −α^T 1 + (1/2) α^T H α    (1.26)

subject to:

α^T c = 0    (1.27)

0 ≤ α ≤ C1    (1.28)

Moreover, from the derivation of these equations, the optimal hyperplane can

be written as:

w =∑

αicixi (1.29)

That is, the vector w is a linear combination of the training examples. Interestingly, it can also be shown that α_i(c_i(w · x_i + b) − 1) = 0 ∀i. This is another way of saying that when the functional distance of an example is strictly greater than 1, i.e., c_i(w · x_i + b) > 1, then α_i = 0; therefore only the closest data points contribute to w. The training examples for which α_i > 0 are termed support vectors, and they are the only ones needed for finding the optimal hyperplane.

Even if the optimal α from which we construct w is known, b has to be determined to fully specify the hyperplane. For this, any "positive" and "negative" support vectors, x⁺ and x⁻ respectively, can be taken, for which we know

(w · x⁺) + b = +1

(w · x⁻) + b = −1

Solving the above results in

b = −(1/2)(w · x⁺ + w · x⁻)    (1.30)

The need for the constraint α_i ≤ C (∀i) in equation (1.28) is stated as follows. When C = ∞, the optimal hyperplane will be the one that completely separates the data (it is assumed that one exists). For finite C, the problem changes to finding a "soft-margin" classifier, which allows misclassification of some of the data. C can be thought of as a tunable parameter: a higher C places more importance on correctly classifying all the training data, while a lower C results in a "more flexible" hyperplane that tries to minimize the margin error, i.e., how badly c_i(w · x_i + b) < 1, for each example. Finite values of C are helpful in scenarios where the data cannot be separated easily.

A hyperplane is expected to distinguish d-dimensional data perfectly into individual classes. However, since example data are often nonlinearly separable, the notion of a "kernel-induced feature space" has been introduced. This transforms the data into a higher dimensional space where they can be separated. To achieve this, a mapping z = φ(x) that casts the d-dimensional input vector x into a usually higher d′-dimensional vector z is defined. A φ(·) is chosen so that the new training data {φ(x_i), c_i} are separable by a hyperplane (Figure 1.3).

Figure 1.3: Separating the Data in a Feature Space [Source: Cristianini and Taylor, 2000]

Given a mapping z = φ(x), substitute all occurrences of x with φ(x) to set up a new optimization problem. Hence, in equation (1.26), (H)_ij = c_i c_j (φ(x_i) · φ(x_j)), and in equation (1.29), w = ∑_i α_i c_i φ(x_i).

Equation (1.21) becomes

f(x) = sign(w · φ(x) + b)
     = sign([∑_i α_i c_i φ(x_i)] · φ(x) + b)
     = sign(∑_i α_i c_i (φ(x_i) · φ(x)) + b)

It can be observed that any time a φ(x_a) appears, it is always in a dot product with some other φ(x_b). This implies that the mapping z = φ(x) never needs to be dealt with directly, provided the kernel for the dot product in the higher dimensional feature space is known:

K(x_a, x_b) = φ(x_a) · φ(x_b)

The corresponding matrix and classifier equations are (H)_ij = c_i c_j K(x_i, x_j) and f(x) = sign(∑_i α_i c_i K(x_i, x) + b) respectively. Finding the optimal hyperplane may then proceed as usual, provided the problem is set up in the above manner, though the hyperplane will now lie in some unknown feature space. In the original input space, the data samples will be separated by some curved, possibly non-continuous contour. Some of the commonly used kernels are:

• linear: K(x_i, x_j) = x_i^T x_j
• polynomial: K(x_i, x_j) = (γ x_i^T x_j + r)^d, γ > 0
• radial basis function: K(x_i, x_j) = exp(−γ ‖x_i − x_j‖²), γ > 0
• sigmoid: K(x_i, x_j) = tanh(γ x_i^T x_j + r)

Here, γ, r and d are kernel parameters. The computational pipeline, which involves training the decision function on a series of chosen binary-labeled training feature vectors and classifying a given test sample x_i into either the positive or the negative class, is depicted in Figure 1.4.

Figure 1.4: Computational pipeline of feature vector representation and SVM classification
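As a brief sketch, such a pipeline might be realized with scikit-learn's SVC; the RBF kernel and the values of C and gamma below are arbitrary choices, not those tuned in this work:

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    X_train = rng.random((80, 6))           # placeholder training feature vectors
    c_train = rng.choice([-1, 1], size=80)  # binary labels c_i in {-1, +1}

    # Soft-margin SVM with an RBF kernel: C trades margin width against
    # misclassification of training data; gamma is the kernel parameter.
    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X_train, c_train)

    x_test = rng.random((1, 6))
    print(clf.predict(x_test))    # sign of sum_i alpha_i c_i K(x_i, x) + b
    print(len(clf.support_))      # number of support vectors found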

1.3 Decision tree

A Decision Tree (DT) is a tree-like structure that depicts rules for dividing training data into groups based on regularities in the data. In fact, the most striking feature of DTs is their ability to break down a complex decision-making problem into a set of simpler decisions at several levels of the tree, thereby providing a solution that is often easier to interpret. The data are represented by:

(x, C) = (x_1, x_2, x_3, ..., x_n, C)    (1.31)

where C is the target class variable and the vector x consists of the input data samples encoded by the feature values x_1, x_2, x_3, etc., which are used for the classification or prediction task.

Figure 1.5: Schematic sketch of a sample decision tree

A DT is a hierarchical structure in which each interior node represents a test on an attribute, each branch denotes an outcome of the test, and each leaf node determines a class label. A tree learns by splitting the source set into subsets, based on an attribute value test; this procedure is repeated recursively on each derived subset. The recursion completes when (i) all instances in the subset at a node have the same value of the target variable, or (ii) splitting no longer adds value to the predictions. Every path from the root to a leaf in an unpruned tree gives one if-then rule. Each classification rule represents a hyperplane that best divides the

fibril forming or non-fibril forming fragments in the representative space. Every such rule is simplified by eliminating conditions that do not seem helpful for differentiating the nominated class from the other classes. For each class in turn, all the simplified rules for that class are sifted to remove rules that do not contribute to the precision of the set of rules as a whole. The sets of rules for the classes are then arranged and sorted so as to minimize false positive errors, and a default class is chosen. This procedure yields a production rule predictor that is almost as accurate as a pruned tree, but more comprehensible [Han, 2011]. Figure 1.5 provides a schematic sketch of a decision tree. The whole representative space is first divided into two subspaces with the test "F1 ≤ 0.54?". All fragments in the subspace where F1 ≤ 0.54 are labeled amyloidogenic, and induction of the tree finishes there. The fragments in the subspace where F1 > 0.54 have mixed labels, and they are further divided by the test "F1 ≤ 2.39?", and so on.
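A minimal sketch of inducing such a tree and reading off its if-then rules with scikit-learn; the two features F1 and F2 and the toy labeling are placeholders:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_text

    rng = np.random.default_rng(0)
    X = rng.random((60, 2)) * 3.0        # placeholder features F1 and F2
    y = (X[:, 0] <= 0.54).astype(int)    # toy rule mirroring the sketch: 1 iff F1 <= 0.54

    tree = DecisionTreeClassifier(max_depth=3).fit(X, y)

    # Each root-to-leaf path of the tree corresponds to one if-then rule.
    print(export_text(tree, feature_names=["F1", "F2"]))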

1.4 Ensemble methods

Methods that build a set of classifiers from learning algorithms and classify new data samples by taking a weighted or unweighted vote of their predictions are termed ensemble methods. In other words, an ensemble of classifiers is a set of classifiers whose individual decisions are integrated in some way, by weighted or unweighted voting, to classify new data samples. Conceptually, ensembles follow a divide-and-conquer paradigm to improve performance.

The main notion behind this approach is that a group of 'weak learners' generating weak hypotheses is brought together to form a 'strong learner' that produces a strong hypothesis. Each classifier individually is a 'weak learner', while all the classifiers taken together form a 'strong learner'.

Basically, a weak learner is a classifier whose predictions are correlated, but not strongly correlated, with the correct classification. In other words, a weak learner learns with low bias and high variance. Because of this, a large value of k is preferred, where k represents the number of iterations in the ensemble technique, which in turn is the number of single classifier models combined to form a strong learner.

In an ensemble approach, a series of k classifiers is iteratively learned (Figure 1.6) to produce the final hypothesis. Studies have shown that ensemble approaches often give considerably higher prediction accuracy than the individual classifiers that make them up. Bagging and boosting are well-known examples of ensemble approaches.

1.4.1 Bagging

Bagging is a method of manipulating the training samples to improve prediction performance. Given a dataset D containing d samples, bagging works as follows. For iteration i (i = 1, 2, ..., k), a training set D_i of d samples is sampled with replacement from the original set D. Such a training set, termed a bootstrap sample, replicates the original training set; the technique is called bootstrap aggregation. Because sampling with replacement is used, some of the actual samples of D may not be included in D_i while others may occur more than once. A classifier model M_i is learned from each training set D_i. To classify an unknown data sample X, each classifier M_i returns its class prediction, counted as one vote. The bagged classifier M* counts the votes and assigns to X the class with the maximum number of votes [Han and Kamber, 2008]. The bagged classifier therefore often has higher classification accuracy than an individual classifier.
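A short sketch of bootstrap aggregation with unweighted majority voting, assuming hypothetical {0, 1}-labeled data and decision stumps as the base learner:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def bagged_predict(X_train, y_train, X_test, k=25, seed=0):
        # Train k models on bootstrap samples; classify by majority vote.
        rng = np.random.default_rng(seed)
        d = len(X_train)
        votes = np.zeros(len(X_test))
        for _ in range(k):
            idx = rng.integers(0, d, size=d)         # draw d samples with replacement
            m = DecisionTreeClassifier(max_depth=1)  # weak base learner (a stump)
            m.fit(X_train[idx], y_train[idx])
            votes += m.predict(X_test)               # each classifier casts one vote
        return (votes > k / 2).astype(int)           # class with the most votes wins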

Random Forest follows a bagging-based approach.


Random Forest

Random Forest (RF) is a group of tree-structured predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest. An RF is defined as a collection of tree-structured classifiers {h(x, Θ_k), k = 1, ...}, where the Θ_k are independent identically distributed random vectors and each tree casts a unit vote for the most popular class at input x [Breiman, 2001]. RF begins with a conventional machine learning algorithm, the DT, which in ensemble terms corresponds to a weak learner. In a DT, an input is entered at the top and, as it traverses down the tree, the data get partitioned into smaller and smaller sets. The RF takes this notion to the next level by integrating such trees into an ensemble: the trees are weak learners and the RF is a strong learner. At each node, some subset of the variables is chosen at random, and a variable (and a value for that variable) that optimizes the split is found (Figure 1.7).

Figure 1.7: Representation of a Random Forest
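A brief sketch using scikit-learn's RandomForestClassifier; the data are placeholders and the hyperparameter values are arbitrary. The max_features argument controls the random subset of variables examined at each split:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(0)
    X = rng.random((100, 6))                 # placeholder feature vectors
    y = rng.integers(0, 2, size=100)         # placeholder binary labels

    # Each tree is grown on a bootstrap sample; at every node only a random
    # subset of the features (max_features) is considered for the split.
    rf = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                random_state=0).fit(X, y)
    print(rf.predict(X[:5]))                 # each tree casts a unit vote; majority wins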

1.4.2 Boosting

Boosting is a generic procedure by which the accuracy of a learning algorithm is improved. Its aim is to integrate 'weak' classifiers into a 'strong' classifier. Boosting is an iterative algorithm: in each iteration, a weak classifier is chosen by minimizing an average training error. Formally, in boosting, a weight is assigned to each training sample and a series of k classifiers is iteratively learned. After a classifier M_i is learned, the weights are updated to make the subsequent classifier, M_{i+1}, focus more on the training samples that were misclassified by M_i. The final boosted classifier, M*, combines the votes of the individual classifiers, where the weight of each classifier's vote is a function of its accuracy [Han, 2008]. The Adaboost algorithm with conventional classifiers such as SVM, DT and RF as base classifiers was utilized in this research work.

Figure 1.6: The final hypothesis h_final is a combination of (h_1, ..., h_k)

Adaboost

Yet another method of manipulating the training set is illustrated by the Adaboost algorithm. The main objective is to run the weak learning algorithm many times, each time on a different distribution of instances, to generate several different hypotheses; incorrect classifications are forwarded to another model of the same weak learning algorithm. Formally, given D, a dataset of d class-labeled samples of the form (X_1, y_1), (X_2, y_2), ..., (X_d, y_d), where y_i is the target class label of sample X_i, Adaboost assigns each training sample an equal weight of 1/d. Generating k classifiers for the ensemble requires k rounds through the rest of the algorithm. In round i, the data from D are sampled to form a training set D_i of size d. Since sampling with replacement is used, the same sample may be selected more than once, and each sample's chance of being selected is based on its weight. A classifier model M_i is derived from the training samples of D_i, and its error is then calculated

using D_i as a test set. The weights of the training samples are then adjusted according to how they were classified: if a sample was incorrectly classified, its weight is increased, and vice versa. A sample's weight reflects how hard it is to classify: the higher the weight, the more often it has been misclassified. These weights are used to generate the training samples for the classifier of the next round. The basic idea is that, when a classifier is built, it should focus more on the tuples misclassified in the previous round. In this way, a series of classifiers that complement each other is built.

To compute the error rate of model M_i, the sum of the weights of the samples in D_i that M_i misclassified is calculated. Mathematically,

error(M_i) = ∑_{j=1}^{d} w_j × err(X_j)    (1.32)

where err(X_j) is the misclassification error of tuple X_j: if the sample was misclassified, err(X_j) is 1; otherwise it is 0. If the performance of classifier M_i is so poor that its error exceeds 0.5, it is abandoned, and a new training set D_i is generated from which a new M_i is derived.

The error rate of M_i affects how the weights of the training tuples are updated. If a sample in round i was correctly classified, its weight is multiplied by error(M_i)/(1 − error(M_i)). Once the weights of all correctly classified samples are updated, the weights of all samples, including the misclassified ones, are normalized so that their sum remains the same as it was before. To normalize a weight, it is multiplied by the sum of the old weights and divided by the sum of the new weights. The net effect is that the weights of misclassified samples increase while those of correctly classified samples decrease.

Boosting assigns a weight to each classifier's vote based on how well the classifier has performed: the lower the error rate, the more accurate the classifier, and hence the higher its voting weight should be. The weight of classifier M_i's vote is

log((1 − error(M_i)) / error(M_i))

For each class c, the weights of the classifiers that assigned class c to X are summed. The class with the highest sum is the 'winner' and is returned as the class prediction for sample X.
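The rounds described above (weighted sampling, weight update, normalization and weighted voting) can be sketched compactly, under the simplifying assumptions of {−1, +1} labels and decision stumps as the weak learner:

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_fit(X, y, k=10, seed=0):
        # Adaboost as described above; labels y must be in {-1, +1}.
        rng = np.random.default_rng(seed)
        d = len(X)
        w = np.full(d, 1.0 / d)               # equal initial weights of 1/d
        models, alphas = [], []
        for _ in range(k):
            idx = rng.choice(d, size=d, p=w)  # sample by weight, with replacement
            m = DecisionTreeClassifier(max_depth=1).fit(X[idx], y[idx])
            miss = m.predict(X) != y
            err = np.sum(w[miss])             # equation (1.32)
            if err >= 0.5 or err == 0.0:      # abandon a too-poor model; the
                continue                      # err == 0 guard (added here) avoids log(inf)
            w[~miss] *= err / (1 - err)       # shrink correctly classified weights
            w /= w.sum()                      # renormalize so the sum is unchanged
            models.append(m)
            alphas.append(np.log((1 - err) / err))  # classifier's voting weight
        return models, alphas

    def adaboost_predict(models, alphas, X):
        votes = sum(a * m.predict(X) for m, a in zip(models, alphas))
        return np.sign(votes)                 # class with the highest weighted sum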

Utilizing RF as the weak learner in the Adaboost algorithm results in an ensemble-of-ensembles technique, since RF itself follows a bagging-based ensemble approach.

1.5 Summary

This chapter has described the foundations of the supervised machine learning systems on which this thesis concentrates. The chapter started with a general description of the machine learning paradigms and with some definitions and mathematical formulations related to the learning task of classifying motif sequences. Finally, the soft computing algorithms utilized in this research work were outlined.
