Telecomunication and Computer Engineering...A sequential approach in forecasting the S&P500 index:...

A sequential approach in forecasting the S&P500 index: Combining Genetic Algorithm and Random Forests Ivo Miguel Fouto Pires Thesis to obtain the Master of Science Degree in Telecomunication and Computer Engineering Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves Prof. Nuno Cavaco Gomes Horta Examination Committee Chairperson: Prof. Ricardo Jorge Fernandes Chaves Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves Member of the Committee: Prof. João Miguel Duarte Ascenso October 2018


A sequential approach in forecasting the S&P500 index: Combining Genetic Algorithm and Random Forests

Ivo Miguel Fouto Pires

Thesis to obtain the Master of Science Degree in

Telecomunication and Computer Engineering

Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves, Prof. Nuno Cavaco Gomes Horta

Examination Committee

Chairperson: Prof. Ricardo Jorge Fernandes Chaves

Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves

Member of the Committee: Prof. João Miguel Duarte Ascenso

October 2018


Acknowledgments

First, I would like to thank Prof. Rui Neves, my thesis supervisor, for his weekly help and feedback during the course of this thesis.

I would also like to thank my family, friends and colleagues for their support and advice, not only during the development of this work but throughout my academic path.

Finally, I would like to thank Instituto Superior Técnico, Taguspark campus, for the continuous support, the continuous emotional and intellectual growth and, above all, for providing the facilities that enabled us, the students, to carry out the hard work along this path.


Resumo

The financial market, owing to its noisy, non-stationary and deterministically chaotic characteristics, has gained considerable attention from the Machine Learning community. This thesis proposes to investigate the predictability of the S&P500 index through the development of an expert system. Based on the predicted behaviour, we aim to form a profitable trading strategy that obtains daily profits with a low associated investment risk. The suggested system uses a new approach based on the combination of two algorithms: first a feature selection method, the Genetic Algorithm (GA), and then a Machine Learning algorithm, the Random Forest (RAF). Daily prices and volume, together with a user-specified set of technical indicators, are used as the initial input of the system.

First, a GA-based approach selects the parameters used in the computation of the technical indicators. From the initial group of technical indicators, it also elects those that manage to extract useful information from the historical financial data, thereby reducing the size of the initial group while preserving the essence of the data. Subsequently, using the selected technical indicators together with the daily market information, a RAF is built to forecast the behaviour of the market; this forecast is evaluated so that the investor can adopt the most sensible market position.

Finally, an evaluation is carried out to determine whether the established objectives are met. The proposed solution is tested with daily data from five financial markets with distinct inherent characteristics. Four fitness functions were considered in the Genetic Algorithm to evaluate the performance of the different solutions found, and the most robust results are produced by strategies based on a fitness function that measures the ratio between the rate of return and the risk of the investment obtained by the generated trading signal. The results demonstrate that this approach achieves better results than the Buy and Hold (B&H) strategy in the majority of the tested markets.

Keywords: Financial Market Forecasting, Learning algorithm, Ensemble learning, Efficient Market Hypothesis, Genetic Algorithms, Random Forest


Abstract

Owing to its noisy, non-stationary and deterministically chaotic features, the stock market has attracted a great deal of attention from the Machine Learning (ML) community. In this thesis we investigate the predictability of the S&P500 index by developing an expert system. Based on the forecasted behaviour, we aim to establish a profitable trading strategy that achieves daily profits with low associated risk. The suggested system uses a novel approach that combines a feature selection method, a Genetic Algorithm (GA), with an ML algorithm, more precisely a Random Forest (RAF) learner. The system takes as input daily prices and volume together with a user-specified set of technical indicators.

First, a GA is used to select the computation parameters of the technical indicators and to elect, from the initial group of technical indicators, those that retrieve useful information from historical stock data, thus reducing the number of features while preserving the fundamentals of the stock data. Then, using the selected technical indicators coupled with the daily stock information, a RAF learner produces a forecast of the market's behaviour, which is evaluated so that the trader can take a sensible market position.

Finally, an evaluation is carried out to determine whether the objectives we set can be fulfilled. The proposed approach is tested with daily data from five financial markets with different inherent characteristics. Four different fitness functions are used by the GA to evaluate the performance of the candidate solutions, and the most robust results are produced by a fitness function that measures the risk return ratio (i.e., the ratio between the Rate of Return (ROR) and the Maximum Drawdown (MDD)) obtained by the generated trading signal. The results show that this approach outperforms the Buy and Hold (B&H) strategy in the majority of the tested markets.

Keywords: Stock Market Forecast, Learning algorithm, Ensemble learning, Efficient Market Hypothesis, Genetic Algorithm, Random Forests


Contents

Acknowledgments
Resumo
Abstract
List of Figures
List of Tables
Acronyms

1 Introduction
  1.1 Background
  1.2 Motivation
  1.3 Thesis Goals
  1.4 Proposed Solution
  1.5 Thesis Contributions
  1.6 Outline

2 Related Work
  2.1 Financial Concepts
  2.2 The S&P500 Index
  2.3 Market Analysis
    2.3.1 Fundamental Analysis
    2.3.2 Technical Analysis
  2.4 Machine Learning
    2.4.1 Feature Selection
    2.4.2 Simple Prediction Algorithms
    2.4.3 Ensemble Prediction Algorithms
  2.5 Related Work
    2.5.1 Works on Feature Selection
    2.5.2 Works on Simple Prediction Algorithms
    2.5.3 Works on Ensemble Prediction Algorithms

3 Implementation
  3.1 Architecture Description
  3.2 Layer 1: Presentation Layer
  3.3 Layer 2: Data Layer
  3.4 Layer 3: Prediction & Broker Layer
    3.4.1 Trend Analysis Module
    3.4.2 Data Preparation Module
    3.4.3 Random Forest Module
    3.4.4 Stock Exchange Module

4 Evaluation
  4.1 Financial Data
  4.2 Datasets Characteristics
  4.3 Evaluation Metrics
  4.4 Case Study I - Performance of the Whole System
  4.5 Case Study II - Influence of the Genetic Algorithm Module
  4.6 Case Study III - Influence of the Market Trend Feature
  4.7 Evaluation Conclusions

5 Conclusions and Future Work
  5.1 Summary
  5.2 Achievements
  5.3 Future Work

A Technical Analysis
  A.1 Technical Indicators Description
    A.1.1 Trend Following
    A.1.2 Momentum Oscillators
    A.1.3 Volume Indicators
    A.1.4 Volatility Indicators
  A.2 Technical Indicators Parameters Table

B Implementation
  B.1 K-fold Cross Validation Scheme

C Evaluation Plots
  C.1 Normal distribution graph
  C.2 Apple stocks candlesticks

D Return Plots
  D.1 Full System return plots
  D.2 System without GA return plots
  D.3 System without Trend feature return plots

Bibliography


List of Figures

2.1 Graph showing the MACD indicator coupled with the S&P500 index action
2.2 Graph showing the CCI indicator coupled with the S&P500 index action
2.3 Graph showing the MFI indicator coupled with the S&P500 index action
2.4 Graph showing the ATR indicator coupled with the S&P500 index action
2.5 Graph showing the BBANDS indicator coupled with the S&P500 index action
3.1 Diagrammatic representation of the layered working of the autonomous trading system
3.2 Diagrammatic representation of the data preparation submodules
3.3 Genetic Algorithm Evolution Process Overflow
3.4 Chromosome Representation
3.5 Diagrammatic representation of the random forest module performance
4.1 Returns obtained by the system with the different fitness functions and the B&H in the S&P500 index
4.2 Returns obtained by the system with the different fitness functions and the B&H in the AT&T stock
4.3 Returns obtained by the system without the GA module using the different fitness functions and the B&H in the S&P500 index
4.4 Returns obtained by the system without the GA module using the different fitness functions and the B&H in the AT&T stock
4.5 Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the S&P500 index
4.6 Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the AT&T stock
B.1 Diagram which represents a 3-fold cross validation scheme
C.1 Bell shaped histogram of a normal distribution
C.2 Candlestick chart for the AAPL stocks
D.1 Returns obtained by the system using the different fitness functions and the B&H in the Apple stock
D.2 Returns obtained by the system using the different fitness functions and the B&H in the Amazon stock
D.3 Returns obtained by the system using the different fitness functions and the B&H in the Coca-Cola stock
D.4 Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Apple stock
D.5 Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Amazon stock
D.6 Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Coca-Cola stock
D.7 Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Apple stock
D.8 Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Amazon stock
D.9 Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Coca-Cola stock


List of Tables

2.1 Overview over different approaches to forecast Stock Market
4.1 Implemented system parameters
4.2 Financial Data Characteristics
4.3 Financial Data Returns Characteristics
4.4 Results from the B&H and the different fitness functions’ strategies tested with the full system
4.5 Results from the B&H and the different fitness functions’ strategies tested without the GA
4.6 Results from the B&H and the different fitness functions’ strategies tested without the Trend Label module
A.1 Parameters used in the computation of the different Technical Indicators


List of Acronyms

ADX Average Directional Index

A/D Advance/Decline Line

AF Acceleration Factor

AI Artificial Intelligence

ANN Artificial Neural Network

ATR Average True Range

BBANDS Bollinger Bands

CCI Commodity Channel Index

DM Directional Movement

DNN Deep Neural Network

DT Decision Trees

EMH Efficient Market Hypothesis

EMA Exponential Moving Average

GA Genetic Algorithm

GBT Gradient-Boosted-Tree

IEEE Institute of Electrical and Electronics Engineers

MDP Mean of the Daily Profit

MA Moving Average

MACD Moving Average Convergence/Divergence

MDD Maximum Drawdown

MFI Money Flow Index

ML Machine Learning

NDT Neural-based Decision Tree

OBV On Balance Volume

PP Probability of Winning


PPO Percentage Price Oscillator

PSAR Parabolic Stop and Reversal

SMA Simple Moving Average

SVM Support Vector Machine

RAF Random Forest

ROC Rate of Change

ROI Return on Investment

ROR Rate of Return

ROR/day Rate of Return per day

RSI Relative Strength Index

RRR Risk Return Ratio

SMTP Simple Mail Transfer Protocol

STO Stochastic Oscillator

WILLR Williams %R


Chapter 1

Introduction

Because of the large amount of money that flows through the stock market, it has attracted all sorts of investors, ranging from individuals to more established players such as trading companies and banks. This has made the stock market a hot topic among researchers in Financial Engineering.

Since Financial Engineering relies on mathematical techniques to solve financial problems, its scope of research has broadened considerably with the continuous evolution of computer technology (Lyuu, 2001). Nowadays, given the large and ever-growing amount of financial information, it is hard to imagine creating new trading strategies and investment analyses without resorting to efficient and complex algorithms.

When studying the financial market, identifying the right moments to buy or to sell an investor’s holdings is crucial to achieving a profitable strategy that meets the investor’s financial demands.

1.1 Background

Predicting the direction or trend of the stock market has always been a challenge. The market is very hard to predict because of the large number of macroeconomic factors involved, such as political and general economic conditions, the movement of other stock markets and traders’ expectations, which make it a highly complex, evolutionary and non-linear dynamic system.

Stock market prediction has been one of the most active research themes in the autonomous learning systems community in recent years. The number of researchers, among both academics and industry professionals, interested in analysing the market efficiently has recently increased. This trend is driven by the high returns that can be achieved over very short time frames.

The main goal is to produce Artificial Intelligence (AI) models for building systems capable of trading stocks autonomously, recognising different investment opportunities with a higher level of confidence of achieving profitable returns than human investors. Ensemble prediction is a modern technique for developing forecasting systems in which base learning algorithms such as Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Decision Trees (DT) are combined to enhance prediction accuracy, which, in turn, when accompanied by precise buy/sell signals, can yield high profits. Such systems exploit historical evidence of the market’s behaviour and seek to output a strong signal that indicates the foreseen trend.

Nevertheless, according to the Efficient Market Hypothesis (EMH) (Fama, 1970), market prices follow a random walk, which makes their behaviour impossible to predict. Some studies, however, have challenged the EMH by showing that it is in fact possible to predict the future behaviour of the market (Patel et al., 2015b). Indeed, if researchers can predict the market’s trend with a probability even slightly above fifty percent, which is already a very good accuracy in this setting, they may increase the expected return on the investment made (Gorgulho et al., 2011).

When it comes to evaluating stocks and making investment decisions, stock investors rely on two quite different methods: fundamental analysis and technical analysis (Murphy, 1999). In fundamental analysis, investors look at the fundamental value of each company using its financial statements, such as the income statement and the balance sheet, as well as its competition. These are only a few of the many statistics involved, and this data is difficult to collect and sometimes delayed. Technical analysis, on the other hand, bases its predictions on the study of technical indicators built from time series of stock prices or volumes, which makes it more accurate, timely and easy to obtain (Pinto et al., 2015).

Unfortunately, neither method will always produce a perfect prediction, because of the market randomness discussed above. Hence, the number of studies that strive to construct a system adaptable to the non-stationary market has been increasing.

1.2 Motivation

As mentioned above, stock markets exchange hefty amounts of capital every day, which, from a financial point of view, is a strong motivation to identify the precise moments at which to issue an impactful trading signal.

However, since our work belongs to the Machine Learning (ML) realm, the challenge of efficiently and correctly identifying price patterns in financial markets, given their noisy, non-stationary and deterministically chaotic features, is also a compelling motivation, one that pushes this work to make a difference relative to past studies.

Lastly, an additional motivation follows from the use of a novel strategy that assembles two algorithms, the Genetic Algorithm (GA) and the Random Forest (RAF), in order to detect profitable trading points. By using the GA as the first step of our system, one can assess how effectively it discovers the features that convey the most useful information to the RAF, improving its performance and prediction potential and strengthening the learner’s ability to generalise the patterns found in the fast-changing world of stock markets.


1.3 Thesis Goals

With the growing interest in achieving an autonomously adaptable system that is able to predict the behaviour or trend of the stock market while making informed decisions about the traded stocks, we believe that developing this novel system will raise awareness of the technologies employed, such as ensemble algorithms and random forests, which are not used very often but are able to perform very well. We believe this study may broaden researchers’ scope for future developments.

This study has three main goals, two of which are closely related. Apart from successfully predicting the market’s trend, which is crucial to obtaining any sort of result, the two closely related goals are maximising the return on the investment made while minimising the associated risk. Besides these monetary goals, the third main goal of this thesis is to show that the EMH, presented previously, cannot be accepted as a tenet. This last goal can, by itself, be a major contribution for new researchers in the field of autonomous systems designed to predict stock markets and trade stocks efficiently, since it can provide evidence about the behaviour of the studied market.

1.4 Proposed Solution

In this thesis, we focus on the creation of a novel sequential approach to an adaptable prediction system. The system is based on two learning algorithms chained together: the first is the Genetic Algorithm (GA) and the second, which is itself already an ensemble of decision trees, is the Random Forest (RAF), which ultimately outputs the forecast.
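The first stage of this chain can be illustrated with a minimal sketch of a genetic algorithm that searches for a feature subset. It is not the implementation used in this thesis: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), all parameter values and the function names are assumptions chosen for illustration. In particular, `toy_fitness` and its `USEFULNESS` scores are hypothetical stand-ins for the step in which a Random Forest would be trained on the selected features and scored, and the sketch evolves only a selection mask, not the indicator parameters.

```python
import random
from typing import Callable, List, Tuple

Mask = List[int]  # 1 = keep the feature, 0 = drop it


def evolve_feature_mask(
    n_features: int,
    fitness: Callable[[Mask], float],
    population_size: int = 20,
    generations: int = 30,
    mutation_rate: float = 0.05,
    seed: int = 0,
) -> Tuple[Mask, float]:
    """Evolve a binary feature mask that maximises `fitness` (n_features >= 2)."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(n_features)]
        for _ in range(population_size)
    ]

    def tournament() -> Mask:
        # Pick two distinct individuals and keep the fitter one.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        # Elitism: the best individual always survives unchanged.
        children = [max(population, key=fitness)[:]]
        while len(children) < population_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randint(1, n_features - 1)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children

    best = max(population, key=fitness)
    return best, fitness(best)


# Hypothetical per-feature scores standing in for "train the learner on the
# selected features and measure its performance".
USEFULNESS = [0.9, -0.2, 0.5, -0.4, 0.1, -0.3]


def toy_fitness(mask: Mask) -> float:
    return sum(u for u, g in zip(USEFULNESS, mask) if g)


best_mask, best_score = evolve_feature_mask(len(USEFULNESS), toy_fitness)
print(best_mask, best_score)
```

In a sequential system of this kind, the fitness call is where the second algorithm would be plugged in, so the cost of each generation is dominated by training and evaluating the learner.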

A simple technical analysis is employed to forecast the movement of the stock market. Strategies based on technical analysis usually embody a set of technical indicators, each of which tries to give a perspective on the future of the market under analysis. Technical indicators reflect the past behaviour of the market, since they are based on mathematical formulas that extract information from financial time series. Consequently, this type of analysis is precise, easy to obtain and ideal for systems that try to predict the stock market’s behaviour with a prediction window suited to the problem at hand (Pinto et al., 2015).
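To make the idea of indicators as formulas over a price series concrete, the following sketch computes two common ones, the Simple Moving Average (SMA) and the Rate of Change (ROC), from a list of closing prices. The window lengths and the sample prices are made-up values for illustration, and the function names are hypothetical; the indicator set and parameters actually used by the system are described later in the thesis.

```python
from typing import List, Optional


def simple_moving_average(prices: List[float], window: int) -> List[Optional[float]]:
    """Mean of the last `window` prices; None until enough history exists."""
    if window <= 0:
        raise ValueError("window must be positive")
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(prices[i + 1 - window : i + 1]) / window)
    return out


def rate_of_change(prices: List[float], period: int) -> List[Optional[float]]:
    """Percentage change against the price `period` days earlier."""
    if period <= 0:
        raise ValueError("period must be positive")
    out: List[Optional[float]] = []
    for i in range(len(prices)):
        if i < period:
            out.append(None)
        else:
            past = prices[i - period]
            out.append(100.0 * (prices[i] - past) / past)
    return out


closes = [100.0, 102.0, 101.0, 105.0, 107.0]  # made-up closing prices
sma3 = simple_moving_average(closes, 3)
roc2 = rate_of_change(closes, 2)
print(sma3)
print(roc2)
```

Each indicator depends on a parameter (here `window` and `period`), which is the kind of value a search procedure can tune instead of relying on fixed defaults.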

1.5 Thesis Contributions

The main contributions of this thesis are:

– The formulation of a financial market prediction problem as a binary classification problem.

– The combination of a GA, which performs data dimensionality reduction and parameter optimisation, with the RAF algorithm to identify the best trading points in the stock market.

– The use of fitness functions to evaluate the GA’s individuals that take into consideration not only the solution’s accuracy, but also the returns obtained, the daily profit and risk of the investment and the number of days spent with capital invested.

– A framework capable of estimating the performance of a binary classifier given the solution for the proposed problem formulation.
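Two of the ideas in this list can be sketched in a few lines: turning a price series into binary labels, and scoring a trading signal by the ratio between the Rate of Return (ROR) and the Maximum Drawdown (MDD). The sketch rests on assumptions that are not taken from the thesis: the label is 1 when the next close is higher than the current one, a signal of 1 means holding the asset for one day and 0 means staying in cash, transaction costs are ignored, and a signal with no drawdown is scored as infinite when it is profitable. All function names are hypothetical.

```python
from typing import List


def next_day_labels(closes: List[float]) -> List[int]:
    """1 if the next close is higher than the current one, else 0 (assumed rule)."""
    return [1 if closes[i + 1] > closes[i] else 0 for i in range(len(closes) - 1)]


def equity_curve(closes: List[float], signals: List[int], capital: float = 1.0) -> List[float]:
    """Hold the asset on days with signal 1, stay in cash on days with signal 0."""
    equity = [capital]
    for i, signal in enumerate(signals):
        daily_return = closes[i + 1] / closes[i] - 1.0
        capital *= 1.0 + (daily_return if signal == 1 else 0.0)
        equity.append(capital)
    return equity


def rate_of_return(equity: List[float]) -> float:
    return equity[-1] / equity[0] - 1.0


def max_drawdown(equity: List[float]) -> float:
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak, worst = equity[0], 0.0
    for value in equity:
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


def risk_return_ratio(equity: List[float]) -> float:
    ror, mdd = rate_of_return(equity), max_drawdown(equity)
    if mdd == 0.0:
        # Assumed convention for a curve that never falls below its peak.
        return float("inf") if ror > 0 else 0.0
    return ror / mdd


closes = [100.0, 110.0, 99.0, 108.9]  # made-up closing prices
labels = next_day_labels(closes)
buy_and_hold = equity_curve(closes, [1, 1, 1])
perfect = equity_curve(closes, labels)
print(labels, risk_return_ratio(buy_and_hold), risk_return_ratio(perfect))
```

A fitness of this form rewards a signal for avoiding losing days as well as for capturing gains, which accuracy alone does not measure.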

1.6 Outline

This document describes the research and work developed and is organized as follows:

– Chapter 1 presents the motivation, background, proposed solution, thesis goals and contributions.

– Chapter 2 addresses the theory supporting the work developed, such as the most relevant techniques and algorithms employed, and also describes previous work in the field.

– Chapter 3 presents the architecture of the ensemble system, detailing each of its components and describing its implementation and the technologies chosen.

– Chapter 4 describes the evaluation process, the metrics used to test the system and the corresponding results.

– Chapter 5 summarizes the conclusions of this work, presenting its accomplishments and making suggestions for future work.


Chapter 2

Related Work

In this chapter, the most important techniques and financial concepts related to the forecasting ensemble method are presented. First, we describe some general financial concepts, briefly describe the index studied, and present the purpose of market analysis together with some technical indicators that are important for stock market forecasting. Second, we present feature selection methods and how they benefit our system. A subsequent section goes through the autonomous learning algorithms that, using the technical indicators presented earlier, can predict the behaviour of the market. Next, the recently adopted ensemble method that contributes to the system developed in this thesis is described. Finally, related work using those algorithms is analysed, to better understand how ensembles of base learners can achieve better results than simple learning methods used individually.

2.1 Financial Concepts

The stock market is a vital component of a free-market economy, which is characterized by a voluntary and decentralized order of agreements through which individuals make economic decisions. In practice, a pure free-market economy is unattainable because of constraints such as taxation and the prohibition of specific exchanges. At its core, the stock market connects a collection of markets and exchanges where securities are issued and traded: stocks of public companies, commonly known as equities; bonds, which are debt securities in which the issuer owes the holders a debt and is bound to repay it within an established period; and a plethora of other securities.

Regardless of a financial market’s features, it can be characterized primarily by its trend, as a bull or a bear market (Edwards et al., 2007). A bull market arises when the stock price reaches higher highs or is expected to rise (uptrend); it is usually associated with increasing investor confidence and an increased prospect of future capital gains. At the opposite end of the spectrum, a bear market is characterized by a general decline in the financial market over a period of time, identified by the stock price making lower lows (downtrend). It is accompanied by a change in investor sentiment, from high confidence to widespread fear and pessimism. There is also a third market condition, in which there is neither an uptrend nor a downtrend and prices tend to sway within bounds; the market is then classified as sideways.

Faced with one of the market scenarios described above, an investor has to make a rational decision on how to invest. With this in mind, one of three known market positions can be adopted: long, short or neutral. In a long position, the holder owns the security (such as an equity or a bond) and will profit if the price of the security rises. In contrast, when adopting a short position, the investor sells a security that he does not own, expecting its price to drop; in this case the investor/seller borrows the asset. The resulting position is said to be “covered” when the seller repurchases the security to deliver it back to the broker who lent it initially. The investor profits if the price of the security declines, since the cost of repurchasing it is lower than the income received from the initial short sale. Finally, a neutral position is taken when the investor considers the market unstable, with no clear trend; the most prudent decision is then to stay out of the market and make no investment.
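
As a minimal sketch of the arithmetic behind long and short positions, the following Python snippet computes the profit of each position from an entry price and an exit price. The function names are ours, the prices are invented, and transaction fees and borrowing costs are ignored, an assumption made only to keep the example short.

```python
def long_profit(entry_price: float, exit_price: float, quantity: int = 1) -> float:
    """Profit of a long position: buy at entry_price, sell at exit_price."""
    return (exit_price - entry_price) * quantity


def short_profit(entry_price: float, exit_price: float, quantity: int = 1) -> float:
    """Profit of a short position: sell borrowed units at entry_price,
    repurchase them at exit_price to cover the position."""
    return (entry_price - exit_price) * quantity


# The price falls from 100 to 90: the long position loses what the short one gains.
print(long_profit(100.0, 90.0))
print(short_profit(100.0, 90.0))
```

A neutral position corresponds to not calling either function: no capital is invested and no profit or loss is realised.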

Depending on the market conditions (bull, bear or sideways), different positions are recommended in order to gain an advantage over the market. When the market is considered to be bullish, one should adopt a long position, since prices tend to rise. When the market is facing a bear trend, an investor should adopt a short position and profit from the eventual drop in the securities’ prices. Finally, when the market is sideways and is struggling to break out, the wisest decision is to stay neutral and avoid any investment position.

2.2 The S&P500 Index

The S&P500 is a very well-known stock index whose main goal is to have a price that offers a quick look at the stock market and the economy; it is the main scope of our work.¹ The Standard & Poor’s 500 index, also known as S&P500, measures the value of the stocks of the 500 largest corporations by market capitalization. It is a leading indicator of U.S. equities and is seen as a benchmark for the U.S. stock market. Because it is composed of large-capitalization companies, which form a big portion of the total market’s value, the S&P500 is considered representative of the market.

In contrast to the DJIA (Dow Jones Industrial Average), which uses a price-weighting methodology, the S&P500 weights the incorporated companies by market capitalization. Under this strategy, larger companies are given a larger weight and therefore a higher standing within the index.

¹ http://www.investopedia.com/terms/s/sp500.asp, last accessed May 1st, 2017

2.3 Market Analysis

As stated previously, predicting the precise moments to enter or exit the stock market remains a challenge, and multiple studies have tried to ease this task [(Cao et al., 2005), (Cheng et al., 2010)]. These studies rely on analytical methods, such as fundamental analysis or technical analysis, and on machine learning methods to gain an advantage over the market.

2.3.1 Fundamental Analysis

Fundamental analysis is the process of examining the underlying forces of the economy, industry groups or companies with the aim of forecasting future price movements and profiting from them. The examinations carried out by fundamentalists vary with the scope of the research. At the company level, they may involve studying a company’s financial data and competition, whereas at a broader level, such as an industry, fundamental analysts may look at the supply and demand forces for its products (Suresh, 2013).

Fundamental analysis focuses on understanding a company by studying its wealth, its health and its potential to grow, together with an intensive study of macro-economic indicators, in order to derive the true market price of its assets (Murphy, 1999). Accordingly, companies with stronger fundamentals may expect an uptrend in their asset’s price, while companies with weaker fundamentals may see their asset’s price fall.

The true market value of a company’s asset is described by fundamental analysts as the “intrinsic value” of a stock, and the actual market value of a stock is expected to ultimately gravitate towards it. Fundamentalists use this knowledge to forecast future stock prices: when the actual stock price and the intrinsic value do not match, the stock is either undervalued or overvalued. When the stock is undervalued, meaning that its current trading price is lower than what it is really worth, an investor should adopt a buying position, since its price is expected to rise eventually. Conversely, when the stock price is above its intrinsic value, the stock is overvalued and the decision maker should sell it, as the price will tend towards its true value.

2.3.2 Technical Analysis

Technical analysis is a security analysis discipline focused on the study of past market data, primarily price and volume, with the purpose of forecasting the future price of financial assets or the future behaviour of the market. Through a common technique, the study of charts, technical analysts seek to identify price patterns and market trends in financial markets, and attempt to take advantage of the market by exploiting those patterns. Market technicians also resort to market indicators to help assess whether an asset is trending and, if it is, the probability of its continuation and its direction; different positions can be adopted as a consequence.

There are three premises on which the technical approach is based (Murphy, 1999):

1. Market Action Discounts Everything: a technician believes that everything that can potentially affect the price, such as the company’s fundamentals, the current political situation or the average trader’s psychology, is already reflected in the market price. This implies that price action reflects the variations in supply and demand, which are considered the economic fundamentals of a market: if demand exceeds supply, the price should rise, and vice versa. The technician turns this statement around to conclude that if prices are rising, whatever the reasons, demand exceeds supply and the company’s fundamentals must be bullish; conversely, if prices are falling, the fundamentals must be bearish.

2. Prices Move in Trends: the concept of trend is essential to the technical approach, since the study of the market’s price charts aims to identify trends in the early stages of their development, whatever their direction (up, down or sideways), with the main objective of predicting the precise moments to invest. Technicians believe that a trend in motion is more likely to continue than to reverse or to move unpredictably. In fact, most of the techniques used in this discipline are trend-following, meaning that they are designed to identify and follow existing trends.

3. History Repeats Itself: technicians believe that investors are largely influenced by the behaviour of the investors who preceded them. Owing to this repeated behaviour, recognizable and predictable price patterns reappear on charts. Chart patterns that have been identified and categorized over the years portray specific past price conditions, which can reveal the bullish or bearish psychology of the market. In technical analysis, the key to understanding the future lies in the knowledge collected from previous market conditions.

As stated previously, this method of studying market conditions also relies on technical indicators which, by applying a formula to a stock’s price and volume series, try to give context about future market developments and aid in judging the real value of the shares. Used alone, a technical indicator can mislead the decision maker since, although accurate, it may produce signals that are delayed relative to the asset’s price chart. Therefore, it is more suitable to employ technical indicators as a set rather than to select a single one (Gorgulho et al., 2011).

Technical indicators branch into four different types, depending on the information they convey (Murphy, 1999):

– Trend Following - used to understand and identify trends in the market. A trend is perceived when a consistent change in price is detected that corresponds to the traders’ expectations about the market. Traders learn the asset’s behaviour by understanding the present trend in the market.

– Momentum Oscillators - used to identify the strength of the price trend’s movement and how buyers/sellers are reacting to the price development, by measuring the directional velocity at which the price varies. They forecast sudden changes in the stock’s movement.

– Volatility Indicators - used to measure the rate of price movement regardless of its direction; their value is largely influenced by the change in the highest and lowest historical prices. With this kind of indicator, traders learn about the range of buying/selling in a given market and are able to determine points where the market may change direction.

– Volume Indicators - used to measure the strength of a trend, or to confirm its direction, based on mathematical calculations over an asset’s raw volume. Large shifts in price can be triggered by an increase in trading volume.

In summary, Trend Following and Volatility indicators can have a major influence on one’s investment strategy. As mentioned above, these indicators determine whether a trend’s cycle has begun or has just ended, which is valuable information from the investor’s point of view, since it identifies entry and exit points in the market. When an entry point is identified (the asset’s price is surging) the investor should adopt a long position, and when an exit point is identified (the asset’s price starts to deteriorate) a short position should be adopted. Momentum Oscillators and Volume indicators, in turn, help the trader determine the magnitude of the investment, as they indicate whether the observed price trend is going to endure or will face a shift in direction.

Next, a series of indicators within each group, which are used by the autonomous learning system presented in this thesis, are described in detail. In the formulas, open, close, high, low and volume refer to the asset’s opening price, closing price, highest price achieved, lowest price achieved and transaction volume during a period of time (for the purpose of this thesis, the time period is one day).

A more detailed explanation of how technical analysis and technical indicators work can be found in [(Murphy, 1999), (Edwards et al., 2007), (Suresh, 2013)].

2.3.2.1 Trend Following

As stated previously, this type of indicator helps investors determine turning points in the market, since it identifies a trend’s cycle, which is valuable information when establishing an investment strategy.

Among the vast range of trend following indicators, the Moving Average (MA) and the Moving Average Convergence/Divergence (MACD) form a very popular initial set in recently developed trading algorithms, such as those in [(Nair et al., 2010), (Booth et al., 2014)].

Moving Average (MA)

This technical indicator is widely used because it smooths out price movements, attenuating the volatility of stock prices. It is classified as a trend following indicator, since it helps identify the current price trend and the resistance to a change in an established trend (Murphy, 1999).

Moving averages are computed by taking the average of a subset of n of the stock’s closing prices, where n is the period of the moving average. The window is then shifted forward by one period, creating a new subset of values that is averaged in turn, and so on, which produces a series of averages. How the averages are calculated depends on the type of moving average considered, as briefly explained below. Because of their smoothing nature, and because they are computed over past data points, moving averages lag behind the latest available data point and are therefore considered lagging indicators.

The Moving Average comes in two classes, which are its most common forms:

– Simple Moving Average (SMA) - the arithmetic average of the stock’s past closing prices over a defined number of time periods, that is, their sum divided by the number of time periods, as given by Equation 2.1.

$$\mathrm{SMA}(n)_d = \frac{1}{n} \sum_{t=d-n+1}^{d} \mathrm{close}_t \tag{2.1}$$

In Equation 2.1, d refers to the current day and n refers to the number of periods to be analysed, while close_t is the closing price of the asset on a specific day t.

– Exponential Moving Average (EMA) - computed in a similar way to the SMA, but with exponentially larger weights assigned to newer data, as given by Equation 2.2.

$$\mathrm{EMA}(n)_t = \left[\mathrm{EMA}(n)_{t-1} \times (1 - \alpha)\right] + \left(\mathrm{close}_t \times \alpha\right), \quad \text{with } \alpha = \frac{2}{n+1} \tag{2.2}$$

Here n refers to the number of periods to be analysed, close_t is the closing price of the asset on a specific day t, and α is the degree of weighting decrease, a constant smoothing factor that, as the formula shows, depends on the number of periods n of the indicator. Since the initial value of EMA(n) is undefined, the corresponding SMA is used in its place.
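
The two moving averages can be sketched in a few lines of Python. This is only an illustration: the function names `sma` and `ema` and the price series are ours, the SMA is taken over a window of exactly n closing prices, and the EMA adds the weighted previous value to the weighted current close, seeded with the first SMA as described above.

```python
from typing import List, Optional


def sma(close: List[float], n: int) -> List[Optional[float]]:
    """Simple moving average; None until n closing prices are available."""
    out: List[Optional[float]] = [None] * len(close)
    for d in range(n - 1, len(close)):
        out[d] = sum(close[d - n + 1 : d + 1]) / n
    return out


def ema(close: List[float], n: int) -> List[Optional[float]]:
    """Exponential moving average with alpha = 2 / (n + 1), seeded with the first SMA."""
    alpha = 2.0 / (n + 1)
    out: List[Optional[float]] = [None] * len(close)
    if len(close) < n:
        return out
    out[n - 1] = sum(close[:n]) / n          # initial value: the corresponding SMA
    for t in range(n, len(close)):
        out[t] = out[t - 1] * (1 - alpha) + close[t] * alpha
    return out


prices = [10.0, 11.0, 12.0, 13.0, 20.0]      # invented prices with a final jump
print(sma(prices, 3))
print(ema(prices, 3))
```

On the last day, where the price jumps, the EMA moves further towards the new price than the SMA does, which illustrates the shorter lag of the exponential weighting.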

Comparing the equations of the two moving averages, it is easy to see that an EMA gives more weight to the more recent prices, thus reducing the lag of the indicator. An SMA, on the other hand, can be unduly affected by old data points dropping out of the average, thus lagging to an unwanted extent (Edwards et al., 2007).

From the analytical point of view, both classes of moving average affect the trader’s decision in the same way. When an MA is in an upswing, it indicates that the associated asset is facing an uptrend; conversely, when an MA is declining, it indicates that the associated asset is facing a downtrend.

Parabolic Stop and Reversal (PSAR)

This technical indicator is a method devised to find trends in the prices of markets or securities. It tends to follow the price action actively, and is therefore classified as a trend following indicator (Murphy, 1999). It is used essentially to determine the direction of an asset’s momentum and the moment when this momentum has begun to decay and could face an imminent change in direction. The concept behind the PSAR indicator is that time is the enemy: unless an asset can continue to generate profit over time, it is better to liquidate the position.

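$$\mathrm{SAR}_{t+1} = \mathrm{SAR}_t + \alpha \left(EP - \mathrm{SAR}_t\right) \tag{2.3a}$$

$$\alpha_{t+1} = \alpha_t + \text{initial value}, \quad \alpha_0 = \text{initial value} \tag{2.3b}$$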

Equation 2.3 describes the calculation of the indicator, where EP (extreme point) is a record kept during each trend of the highest or lowest point the price has reached, and α is the acceleration factor.

Every time a new EP is reached, the Acceleration Factor (AF) is updated according to Equation 2.3b. The initial value depends on the market studied: in more unsteady markets it is preferable to use a lower AF, so that the indicator is less sensitive to small decreases in price.
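
The update rule of Equations 2.3a and 2.3b can be sketched for a single uptrend as follows. This is a simplified illustration and not a full PSAR implementation: the function name `psar_uptrend`, the starting SAR and the default initial value of 0.02 are assumptions of ours, and the handling of trend reversals is left out entirely. Practical implementations also usually cap the acceleration factor, which this section does not describe and the sketch does not do.

```python
from typing import List


def psar_uptrend(highs: List[float], start_sar: float, initial_af: float = 0.02) -> List[float]:
    """Apply Equations 2.3a and 2.3b over one uptrend (no reversal handling)."""
    sar = start_sar
    ep = highs[0]            # extreme point: highest high reached in this trend
    af = initial_af          # alpha_0 = initial value
    values = [sar]
    for high in highs[1:]:
        sar = sar + af * (ep - sar)      # Equation 2.3a
        values.append(sar)
        if high > ep:                    # a new extreme point was reached
            ep = high
            af = af + initial_af         # Equation 2.3b
    return values


# Invented highs; a large initial value is used only to make the steps visible.
print(psar_uptrend([10.0, 12.0, 12.0], start_sar=8.0, initial_af=0.5))
```

Each new extreme point increases the acceleration factor, so the SAR closes the gap to the price faster the longer the trend keeps making new highs.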

Average Directional Index (ADX)

When plotted by itself, the ADX is used to measure the strength or weakness of a trend. Since it is a non-directional indicator, it only quantifies the trend’s strength, on a scale from 0 to 100, regardless of whether the market is bullish or bearish (Murphy, 1999). However, when its components, the positive directional indicator (+DI) and the negative directional indicator (-DI), are taken into account, it can be regarded as a trend following indicator, since it helps the investor choose the right strategy for the market’s conditions.

The ADX is a complex indicator, not only from the evaluation perspective, as can be seen in Subsection A.1.1, but also from the analytical point of view. To assess its value, its computation has to be broken down into two parts, as shown in Equations 2.4. First, the +DI and -DI values are assessed, which starts with determining the Directional Movement (+DM and -DM). Then, the ADX value itself is computed.

$$\begin{aligned} \mathrm{UpMove} &= \text{Today's High} - \text{Yesterday's High} \\ \mathrm{DownMove} &= \text{Yesterday's Low} - \text{Today's Low} \end{aligned} \tag{2.4a}$$

$$+DM = \begin{cases} \mathrm{UpMove} & \text{if } \mathrm{UpMove} > \mathrm{DownMove} \text{ and } \mathrm{UpMove} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.4b}$$

$$-DM = \begin{cases} \mathrm{DownMove} & \text{if } \mathrm{DownMove} > \mathrm{UpMove} \text{ and } \mathrm{DownMove} > 0 \\ 0 & \text{otherwise} \end{cases} \tag{2.4c}$$

$$+DI = 100 \times \frac{\mathrm{EMA}_n(+DM)}{\mathrm{ATR}} \tag{2.4d}$$

$$-DI = 100 \times \frac{\mathrm{EMA}_n(-DM)}{\mathrm{ATR}} \tag{2.4e}$$

$$ADX = 100 \times \mathrm{EMA}_n\left(\left|\frac{(+DI) - (-DI)}{(+DI) + (-DI)}\right|\right) \tag{2.4f}$$

Here EMA is the exponential moving average with period n, as described in Section 2.3.2.1, and ATR refers to the Average True Range (ATR) indicator, which is described in detail in the following sections.
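
A possible Python sketch of Equations 2.4 is given below. Since the ATR is only defined later, the sketch assumes the usual true range (the largest of high minus low, and the absolute distances of the high and of the low from the previous close) smoothed with the same EMA; this, the function names and the handling of zero denominators are assumptions of ours and not necessarily the choices made in this work.

```python
from typing import List


def ema(values: List[float], n: int) -> List[float]:
    """EMA with alpha = 2 / (n + 1), seeded with the SMA of the first n values."""
    if len(values) < n:
        raise ValueError("need at least n values")
    alpha = 2.0 / (n + 1)
    out = [sum(values[:n]) / n]
    for v in values[n:]:
        out.append(out[-1] * (1 - alpha) + v * alpha)
    return out


def adx(high: List[float], low: List[float], close: List[float], n: int) -> List[float]:
    """ADX following Equations 2.4a to 2.4f; needs at least 2n daily bars."""
    if len(close) < 2 * n:
        raise ValueError("need at least 2n bars")
    plus_dm, minus_dm, true_range = [], [], []
    for t in range(1, len(close)):
        up_move = high[t] - high[t - 1]                       # Equation 2.4a
        down_move = low[t - 1] - low[t]
        plus_dm.append(up_move if up_move > down_move and up_move > 0 else 0.0)
        minus_dm.append(down_move if down_move > up_move and down_move > 0 else 0.0)
        # Assumed true range definition (the ATR is not defined in this section).
        true_range.append(max(high[t] - low[t],
                              abs(high[t] - close[t - 1]),
                              abs(low[t] - close[t - 1])))
    atr = ema(true_range, n)
    plus_di = [100.0 * p / a if a else 0.0 for p, a in zip(ema(plus_dm, n), atr)]
    minus_di = [100.0 * m / a if a else 0.0 for m, a in zip(ema(minus_dm, n), atr)]
    dx = [abs(p - m) / (p + m) if (p + m) else 0.0 for p, m in zip(plus_di, minus_di)]
    return [100.0 * v for v in ema(dx, n)]                    # Equation 2.4f


# Invented bars that rise by one point every day: a steady, one-directional trend.
highs = [float(i) for i in range(1, 9)]
lows = [h - 0.5 for h in highs]
closes = [h - 0.25 for h in highs]
print(adx(highs, lows, closes, 3))
```

In this invented steady uptrend only +DM is ever non-zero, so the ratio inside Equation 2.4f equals 1 and the ADX sits at the top of its 0 to 100 scale.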

Moving Average Convergence/Divergence (MACD)

The Moving Average Convergence/Divergence (MACD) is both a trend following indicator and a momentum oscillator. Its purpose is to show the relationship between two MAs, typically a 26-day and a 12-day exponential moving average of the closing prices, thus highlighting changes in the trend of a stock (Murphy, 1999). An exponential moving average of the MACD with a 9-day period, named the “signal line”, is then calculated and overlaid on the original MACD plot, acting as a trigger for buy/sell signals.

The indicator’s value is computed as shown in Equations 2.5, where the EMAs are computed as described in Section 2.3.2.1.

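$$\mathrm{MACD}_{s,l} = \mathrm{EMA}_s - \mathrm{EMA}_l \tag{2.5a}$$

$$\mathrm{MACD}_{signal} = \mathrm{EMA}_g\left(\mathrm{MACD}_{s,l}\right) \tag{2.5b}$$

$$\mathrm{MACD}_{histogram} = \mathrm{MACD}_{s,l} - \mathrm{MACD}_{signal} \tag{2.5c}$$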

Where s is the number of periods of the short-term EMA, l stands for the number of periods of the

long-term EMA and g is the number of periods considered for the signal plot.
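
Equations 2.5 translate into the following Python sketch. The helper `ema`, the function name `macd` and the way the three series are aligned in time are our own choices for illustration; the default periods are the 12, 26 and 9 days mentioned above.

```python
from typing import List, Tuple


def ema(values: List[float], n: int) -> List[float]:
    """EMA with alpha = 2 / (n + 1), seeded with the SMA of the first n values."""
    if len(values) < n:
        raise ValueError("need at least n values")
    alpha = 2.0 / (n + 1)
    out = [sum(values[:n]) / n]
    for v in values[n:]:
        out.append(out[-1] * (1 - alpha) + v * alpha)
    return out


def macd(close: List[float], short: int = 12, long: int = 26,
         sig: int = 9) -> Tuple[List[float], List[float], List[float]]:
    """Return (MACD line, signal line, histogram), all aligned on the same days."""
    ema_short = ema(close, short)[long - short:]   # drop values before the long EMA exists
    ema_long = ema(close, long)
    line = [a - b for a, b in zip(ema_short, ema_long)]          # Equation 2.5a
    signal = ema(line, sig)                                      # Equation 2.5b
    line = line[sig - 1:]                                        # align with the signal line
    histogram = [m - s for m, s in zip(line, signal)]            # Equation 2.5c
    return line, signal, histogram


# Tiny invented series with short periods, only to show the mechanics.
print(macd([1.0, 2.0, 3.0, 4.0, 5.0], short=1, long=3, sig=3))
```

A positive histogram means the MACD line is above its signal line, which is the situation usually read as a buy trigger; a negative histogram is read as a sell trigger.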

Figure 2.1 shows an example of the MACD indicator applied to the closing price action of the S&P500 index.

Figure 2.1: MACD application (series plotted: closing price, MACD value, signal and histogram; dates from 14/06/16 to 14/02/18)

2.3.2.2 Momentum Oscillators

As stated previously, this type of indicator gives a deeper insight into the momentum of the prevailing trend, helping investors determine the intensity of their investment, which is valuable information when establishing a profitable investment strategy.

From this vast group of indicators, the Relative Strength Index (RSI) and the Stochastic Oscillator (STO) form a recognizable set used in recently developed trading algorithms, such as the one in (Qin et al., 2013).

Relative Strength Index (RSI)

The Relative Strength Index (RSI) is classified as a momentum oscillator. It is computed over a given period of time, in which it compares the magnitude of the asset’s recent gains and losses to assess the speed and magnitude of the stock’s directional price movement (Murphy, 1999). Its primary goal is to gauge overbought or oversold conditions of securities.

The indicator’s value is computed as shown in Equations 2.6, where the EMAs are computed as described in Section 2.3.2.1.

$$U = \mathrm{close}_{now} - \mathrm{close}_{previous}, \quad D = 0 \tag{2.6a}$$

$$D = \mathrm{close}_{previous} - \mathrm{close}_{now}, \quad U = 0 \tag{2.6b}$$

$$RS = \frac{\mathrm{EMA}_n(U)}{\mathrm{EMA}_n(D)} \tag{2.6c}$$

$$RSI = \begin{cases} 100 - \dfrac{100}{1 + RS} & \text{if } \mathrm{EMA}_n(D) \neq 0 \\ 100 & \text{if } \mathrm{EMA}_n(D) = 0 \end{cases} \tag{2.6d}$$

U and D stand for the upward and downward changes, which are computed every trading period (a day, in this work). Equation 2.6a applies on up days, which are characterized by the price closing higher than the day before. When the price closes lower than the day before, the upward and downward changes are computed using Equation 2.6b.
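
The RSI computation of Equations 2.6 can be sketched in Python as follows. The helper `ema` and the function name `rsi` are ours, the prices are invented, and a day on which the close does not change is treated as having U = D = 0, which is an assumption since that case is not covered by the equations.

```python
from typing import List


def ema(values: List[float], n: int) -> List[float]:
    """EMA with alpha = 2 / (n + 1), seeded with the SMA of the first n values."""
    if len(values) < n:
        raise ValueError("need at least n values")
    alpha = 2.0 / (n + 1)
    out = [sum(values[:n]) / n]
    for v in values[n:]:
        out.append(out[-1] * (1 - alpha) + v * alpha)
    return out


def rsi(close: List[float], n: int) -> List[float]:
    """RSI following Equations 2.6a to 2.6d."""
    ups, downs = [], []
    for previous, now in zip(close, close[1:]):
        change = now - previous
        ups.append(change if change > 0 else 0.0)        # Equation 2.6a
        downs.append(-change if change < 0 else 0.0)     # Equation 2.6b
    out = []
    for avg_up, avg_down in zip(ema(ups, n), ema(downs, n)):
        if avg_down == 0:
            out.append(100.0)                            # second case of Equation 2.6d
        else:
            rs = avg_up / avg_down                       # Equation 2.6c
            out.append(100.0 - 100.0 / (1.0 + rs))       # first case of Equation 2.6d
    return out


print(rsi([1.0, 2.0, 3.0, 4.0], 3))       # only gains
print(rsi([10.0, 11.0, 10.0, 11.0], 3))   # gains twice as large as losses on average
```

With only gains the indicator saturates at 100, while gains that are on average twice the losses give RS = 2 and an RSI of about 66.7.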

Stochastic Oscillator (STO)

The Stochastic Oscillator is a momentum oscillator that tries to predict price turning points by comparing a security’s closing price to its price range over a certain period of time (Murphy, 1999). The indicator’s underlying theory is that in a market facing an upward trend prices tend to close near their highs, and in a market facing a downward trend they tend to close near their lows.

The indicator starts by computing the range between the stock’s highest and lowest price during a certain period. The position of the closing price within that range is then expressed as a percentage between 0% and 100%, labelled Stochastic %K. If the closing price of a stock is found to be near either extreme of the range, a turning point in the price of the asset may be imminent. An exponential moving average of Stochastic %K, usually with a 3-day period, is then computed, which is called Stochastic %D. Sometimes, if the price is highly volatile, a third indicator is required: an EMA of the %D indicator, which smooths out the oscillator’s sensitivity to market movements.

When the market has upward momentum, prices tend to achieve higher highs, and the closing price is usually close to the upper extreme of the period’s trading range. When the momentum starts to fade, the closing prices start to recede from the upper end of the trading range. The stochastic indicator reacts to these price changes, turning down at or before the final highest price.

These values are computed using the formulas in Equations 2.7.

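$$\%K = 100 \times \frac{\text{closing price} - \mathrm{Low}_n}{\mathrm{High}_n - \mathrm{Low}_n} \tag{2.7a}$$

$$\%D = \mathrm{EMA}_m\left(\%K\right) \tag{2.7b}$$

$$\text{Smoothing } \%D = \mathrm{EMA}_m\left(\%D\right) \tag{2.7c}$$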

Here n is the number of periods over which the highest and lowest prices are observed, and m is the number of periods used to compute the EMA, which is usually 3 days.
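
The following Python sketch illustrates Equations 2.7a and 2.7b. The function names and the invented price bars are ours, and returning 50 when the high and low of the window coincide is an assumption introduced only to avoid a division by zero.

```python
from typing import List, Tuple


def ema(values: List[float], n: int) -> List[float]:
    """EMA with alpha = 2 / (n + 1), seeded with the SMA of the first n values."""
    if len(values) < n:
        raise ValueError("need at least n values")
    alpha = 2.0 / (n + 1)
    out = [sum(values[:n]) / n]
    for v in values[n:]:
        out.append(out[-1] * (1 - alpha) + v * alpha)
    return out


def stochastic(high: List[float], low: List[float], close: List[float],
               n: int, m: int = 3) -> Tuple[List[float], List[float]]:
    """Return (%K, %D) following Equations 2.7a and 2.7b."""
    k = []
    for t in range(n - 1, len(close)):
        low_n = min(low[t - n + 1 : t + 1])      # lowest low of the last n periods
        high_n = max(high[t - n + 1 : t + 1])    # highest high of the last n periods
        span = high_n - low_n
        # Assumption: a window with zero range is mapped to the midpoint of the scale.
        k.append(100.0 * (close[t] - low_n) / span if span else 50.0)
    return k, ema(k, m)


highs = [10.0, 12.0, 12.0, 14.0]
lows = [8.0, 8.0, 10.0, 10.0]
closes = [9.0, 11.0, 10.0, 14.0]
print(stochastic(highs, lows, closes, n=2, m=3))
```

The third value in Equation 2.7c would be obtained by applying `ema` once more to the %D series.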

Williams %R (WILLR)

The Williams %R, popularly known as %R, is a momentum oscillator that tries to signal a market reversal by comparing the asset’s closing price today with the highest/lowest price over the previous trading periods (Murphy, 1999). Using the %R indicator, a trader can not only determine the market’s turning points but also get a hint of whether a security is overbought or oversold.

Readings of the WILLR indicator are unusual, since its values range over a negative scale, from -100 to 0, which is the opposite of the more common 0 to 100 scale found in most technical indicators. Although its readings fluctuate over negative values, their interpretation is straightforward: a value of -100 means that the present price is closing near the lowest low of the trading period considered, while a reading of 0 means that today’s price is close to the highest high of that period.

Equation 2.8 shows how these readings are computed.

$$\%R = \frac{\mathrm{close}_{today} - \mathrm{highest}_{n\,days}}{\mathrm{highest}_{n\,days} - \mathrm{lowest}_{n\,days}} \times 100 \tag{2.8}$$
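
Equation 2.8 can be illustrated with a short Python sketch. The function name `williams_r` and the price bars are ours, and reporting -50 for a window with zero range is an assumption made only to avoid a division by zero.

```python
from typing import List


def williams_r(high: List[float], low: List[float], close: List[float], n: int) -> List[float]:
    """Williams %R over a rolling window of n periods (Equation 2.8)."""
    out = []
    for t in range(n - 1, len(close)):
        highest = max(high[t - n + 1 : t + 1])
        lowest = min(low[t - n + 1 : t + 1])
        span = highest - lowest
        # Assumption: a window with zero range is reported as the midpoint, -50.
        out.append(100.0 * (close[t] - highest) / span if span else -50.0)
    return out


highs = [10.0, 12.0, 12.0, 14.0]
lows = [8.0, 8.0, 10.0, 10.0]
closes = [9.0, 11.0, 10.0, 14.0]
print(williams_r(highs, lows, closes, 2))
```

The last reading is 0 because the final close coincides with the highest high of its window. For the same window, %R equals the Stochastic %K minus 100, which explains the negative scale.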

Momentum

The Momentum indicator is perhaps the simplest indicator used, since it only determines the strength of a trend’s movement by measuring the directional velocity at which the price changes. By examining the speed of price changes, a trader can assess how buyers/sellers are reacting to price developments, which can help foresee sudden trend changes.

By measuring the price’s directional change over a certain period of time, the momentum indicator helps in recognising trend lines. A rising momentum line above zero indicates that an uptrend is firmly developing; conversely, a momentum line ranging below zero indicates that a downtrend is developing. There is a third scenario, in which the line starts to level off, which indicates to technicians that the ongoing trend is slowing down: the current price of the issue is about the same as it was in the previously considered trading period.

The indicator’s value is computed as shown in Equation 2.9.

$$\mathrm{Momentum}_t = \mathrm{close}_t - \mathrm{close}_{t-n} \tag{2.9}$$

Since close_t is the closing price of the present day and close_{t−n} is the closing price n days ago, the present momentum value represents the evolution of the price over the past n days.
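
As a small illustration of Equation 2.9, the Python sketch below computes the momentum series for an invented list of closing prices; the function name `momentum` is ours.

```python
from typing import List


def momentum(close: List[float], n: int) -> List[float]:
    """Difference between each close and the close n periods earlier (Equation 2.9)."""
    return [close[t] - close[t - n] for t in range(n, len(close))]


closes = [100.0, 102.0, 101.0, 105.0, 104.0]
print(momentum(closes, 2))
```

All values are positive here, which corresponds to the momentum line staying above zero during an uptrend.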

Rate of Change (ROC)

The Rate of Change (ROC) is the percentage difference between the current closing price and the closing price n time periods ago. Computing this difference allows us to assess the velocity at which the stock’s price is changing, which is taken as the momentum of the stock (Murphy, 1999).

In order to extract more information from this indicator, the ROC is plotted against a zero line, which lets a technician distinguish positive values from negative ones. To traders, positive readings indicate that the stock price is rising, so the stock has upward momentum, while negative values indicate that the stock price is falling, so the stock has downward momentum. However, abrupt movements of the stock’s price in either direction, usually readings above +30 or below -30, are generally interpreted as indicating that the stock is overbought or oversold (Gorgulho et al., 2011).

Equation 2.10 presents the formula of the ROC indicator.

$$\mathrm{ROC}_t = \frac{\mathrm{close}_t - \mathrm{close}_{t-n}}{\mathrm{close}_{t-n}} \times 100 \tag{2.10}$$

Using this indicator, traders can use the forecast information to formulate profitable trading strategies. If ROC readings fall within the range from 0 to +30, an investor can conclude that the stock has upward momentum and should adopt a long position, since the price is more likely to rise. However, if the reading goes beyond the maximum threshold of +30, the stock becomes overbought, which may indicate that a price correction could happen soon, making short selling the best strategy to adopt.

On the other hand, readings between 0 and -30 indicate that the stock has downward momentum, characterized by a descending stock price, so the investor is better advised to sell his positions. However, a reading below the minimum threshold of -30 probably indicates an oversold condition, which may signal that a price reversal is under way, so a long position is best suited to these market conditions.
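
The ROC formula and the four reading ranges discussed above can be sketched as follows. The function names, the label strings and the treatment of a reading of exactly zero as "neutral" are assumptions of ours; only the ±30 thresholds come from the text.

```python
from typing import List


def roc(close: List[float], n: int) -> List[float]:
    """Percentage change between each close and the close n periods earlier."""
    return [100.0 * (close[t] - close[t - n]) / close[t - n]
            for t in range(n, len(close))]


def roc_signal(value: float, threshold: float = 30.0) -> str:
    """Map a ROC reading to the position suggested in the text."""
    if value > threshold:
        return "short"      # overbought: a price correction may follow
    if value > 0:
        return "long"       # upward momentum
    if value < -threshold:
        return "long"       # oversold: a price reversal may follow
    if value < 0:
        return "sell"       # downward momentum
    return "neutral"        # assumption: the text does not cover a reading of exactly 0


readings = roc([100.0, 110.0, 150.0], 1)
print(readings)
print([roc_signal(r) for r in readings])
```

The second invented reading exceeds +30, so the sketch switches from a long to a short suggestion, mirroring the overbought case described above.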

Commodity Channel Index (CCI)

The Commodity Channel Index (CCI) has grown in popularity thanks to its versatility. Since it helps decision makers identify turning points in the stock’s price while also helping them gauge the strength of the market’s trend, this indicator can be classified as a momentum oscillator (Edwards et al., 2007).

The CCI readings fluctuate above and below zero. Normal oscillations occur between -100 and +100, which are the default levels used to determine the asset’s condition regarding the market. Values that rise above the +100 level imply an overbought condition, while values that fall below the -100 level imply an oversold condition. As with the other overbought/oversold indicators studied, this means that there is a larger probability of a price correction to more representative levels.


Technicians can find the indicator’s values through Equations 2.11.

$$ CCI = \gamma \, \frac{p_t - SMA(p_t)}{\sigma(p_t)} \tag{2.11a} $$

$$ p_t = \frac{H + L + C}{3} \tag{2.11b} $$

$$ \gamma = \frac{1}{0.015} \tag{2.11c} $$

Here $p_t$ stands for the typical price, the mean of the high, low and close prices of a trading day (Equation 2.11b), and γ is a scaling factor that makes the indicator’s values more readable, so that between 70 and 80 percent of the values fall within the aforementioned range. The SMA used is explained in Section 2.3.2.1.
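The computation in Equations 2.11 can be sketched as follows. This is only a sketch under stated assumptions: the window length `n` is not given in the text, and σ is taken here as the population standard deviation of the typical price over the same window (some formulations of the CCI use the mean absolute deviation instead).

```python
from statistics import mean, pstdev

GAMMA = 1 / 0.015  # scaling factor from Equation 2.11c


def typical_price(high, low, close):
    """Equation 2.11b."""
    return (high + low + close) / 3.0


def cci(highs, lows, closes, n=20):
    """CCI per Equation 2.11a over a sliding window of n periods.

    Assumptions: window length n, and sigma as the population standard
    deviation of the typical price inside the window.
    """
    tp = [typical_price(h, l, c) for h, l, c in zip(highs, lows, closes)]
    values = []
    for t in range(n - 1, len(tp)):
        window = tp[t - n + 1 : t + 1]
        sd = pstdev(window)
        # A flat window has no dispersion; report a neutral reading.
        values.append(0.0 if sd == 0 else GAMMA * (tp[t] - mean(window)) / sd)
    return values
```

A flat price series yields a reading of zero, while a price sitting above its recent average yields a positive reading.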

Figure 2.2 shows an example of the CCI indicator applied to the closing price action of the S&P500 index.

[Plot: S&P500 closing price together with the CCI and its thresholds, Threshold Max (200) and Threshold Min (-200), from 24/05/16 to 24/01/18]

Figure 2.2: CCI application

Advance/Decline Line (A/D)

The Advance/Decline Line (A/D) is a stock market indicator used to measure the number of individual stocks participating in a market rise or fall. Investors use it to assess the strength of the ongoing trend and its likelihood of reversing, so it can be thought of as a momentum oscillator (Murphy, 1999).

Since market indexes such as the S&P500, which is the core focus of our thesis, represent a group of stocks, they do not convey well the whole condition of the trading day and the market’s performance.


However, by applying this indicator a technician can gain deeper insight into how particular stocks have performed during the day. The A/D indicator shows whether most securities are participating in the direction of the market trend.

Readings of the Advance/Decline line portray the cumulative sum of the daily difference between the number of advancing stocks and the number of declining stocks in a stock market index. Thus, the plot moves up when there are more advancing than declining stocks, and moves down when there are more declining than advancing stocks. The formula for its computation is given in Equation 2.12.

$$ \text{A/D Line}_t = \#\text{ of Advancing Stocks} - \#\text{ of Declining Stocks} + \text{A/D Line}_{t-1} \tag{2.12} $$
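Since Equation 2.12 is a running sum of the daily net advances, it can be sketched with a cumulative sum. The function name and the `start` parameter (the value of the line before the first day) are illustrative assumptions.

```python
from itertools import accumulate


def advance_decline_line(advancing, declining, start=0):
    """Cumulative A/D line (Equation 2.12).

    advancing[t] and declining[t] are the daily counts of rising and
    falling stocks; start is the assumed value before the first day.
    """
    net = [a - d for a, d in zip(advancing, declining)]
    # accumulate(..., initial=start) also yields the start value itself,
    # which is dropped so that the output has one reading per day.
    return list(accumulate(net, initial=start))[1:]


line = advance_decline_line([300, 200, 250], [200, 300, 250])
```

In this example the line rises on the first day, falls back on the second and stays flat on the third, when advancing and declining stocks are balanced.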

Percentage Price Oscillator (PPO)

The Percentage Price Oscillator measures the price momentum of a stock or of a market as a whole, making it a useful momentum oscillator for assessing whether price trend reversals will occur in the near future.

The indicator’s computation can be found over the Equation 2.13, where EMA is the simpler form of

the Exponential Moving Average, and its definition can be found on the Section 2.3.2.1.

$$ PPO = \frac{EMA_n - EMA_m}{EMA_m} \times 100 \tag{2.13} $$

Here n and m stand for the periods of the EMAs used and should differ from each other, since otherwise the numerator vanishes and the indicator is identically zero. Usually, it is used with a 9-day and a 26-day moving average. The key idea behind this indicator is to compare the short-term and the long-term moving averages while staying unaffected by sudden price movements.
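A minimal sketch of Equation 2.13 is given below. The EMA definition belongs to Section 2.3.2.1 and is not reproduced here, so the sketch assumes the common smoothing factor 2 / (period + 1), seeded with the first observation. It also assumes that n is the short period and m the long one.

```python
def ema(values, period):
    """Exponential moving average.

    Assumption: smoothing factor 2 / (period + 1), seeded with the
    first observation.
    """
    alpha = 2.0 / (period + 1)
    out = [values[0]]
    for v in values[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


def ppo(closes, n=9, m=26):
    """PPO in percent (Equation 2.13); n short period, m long period."""
    short, long_ = ema(closes, n), ema(closes, m)
    return [(s - l) / l * 100.0 for s, l in zip(short, long_)]


rising = [float(i) for i in range(1, 61)]
readings = ppo(rising)
```

On a steadily rising series the short EMA stays above the long EMA, so the PPO is positive. Calling `ppo` with equal periods returns zeros only.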

2.3.2.3 Volume Indicators

Volume, or trading volume, is a term in capital markets referring to the number of assets or shares that are traded in a stock or in an entire market during a certain period of time. Indicators of this type, however, tend to couple some price influence with the market’s trading volume to grasp the strength of a trend in the market, adding to the trader’s knowledge.

Some of the most common indicators that fall in this category are the On Balance Volume (OBV)

and the Money Flow Index (MFI), as can be found on the studies presented by (Booth et al., 2014) and

(Maragoudakis and Serpanos, 2010).

Money Flow Index (MFI)

The Money Flow Index is an oscillator computed over an n-day period, ranging from 0 to 100, that shows the money flow on positive days as a percentage of the total money flow on positive and negative days, where positive and negative stand for rising and falling days of an asset in the market. Hence, this indicator is best suited to identify price reversals and price extremes, making its analysis a root for a variety of trading signals (Murphy, 1999).

It is important to clarify the meaning of money flow. In financial market analysis, it stands for the dollar volume, i.e. the total value of shares traded, which on an up day represents the enthusiasm of the buyers, while on a down day it represents the enthusiasm of the sellers. A disproportion in one of the directions is interpreted as an extreme point of the indicator’s reading, possibly resulting in a price reversal. Thus, overbought and oversold levels are usually used with this indicator to help identify unsustainable price extremes.

This indicator’s computation can be decomposed into a number of smaller equations, as seen in

Equation 2.14, where pt stands for the typical price for each day which is the average of the highest,

lowest and close price of the trading day, as found in Equation 2.11b.

$$ \text{money flow}_t = p_t \times volume_t \tag{2.14a} $$

$$ \text{money ratio}_t = \frac{\text{positive money flow}_t}{\text{negative money flow}_t} \tag{2.14b} $$

$$ MFI_t = 100 - \frac{100}{1 + \text{money ratio}_t} \tag{2.14c} $$

Here the positive and negative money flow stand for the total money flow over the days of the n-day period on which the typical price is higher or lower, respectively, than the previous day’s typical price. Days on which the typical price is unchanged from the previous day are discarded.
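Equations 2.14 can be sketched as follows. The default period `n=14` and the handling of windows without any negative money flow (where the money ratio is undefined) are assumptions made for illustration only.

```python
def money_flow_index(highs, lows, closes, volumes, n=14):
    """MFI per Equations 2.14 over a sliding n-day window."""
    tp = [(h + l + c) / 3.0 for h, l, c in zip(highs, lows, closes)]
    flow = [p * v for p, v in zip(tp, volumes)]  # Equation 2.14a
    values = []
    for t in range(n, len(tp)):
        positive = negative = 0.0
        for i in range(t - n + 1, t + 1):
            if tp[i] > tp[i - 1]:
                positive += flow[i]
            elif tp[i] < tp[i - 1]:
                negative += flow[i]
            # Unchanged typical price: the day is discarded.
        if negative == 0:
            # Assumption: the ratio is undefined here; report the upper
            # extreme if there was buying, else a neutral 50.
            values.append(100.0 if positive > 0 else 50.0)
        else:
            ratio = positive / negative  # Equation 2.14b
            values.append(100.0 - 100.0 / (1.0 + ratio))  # Equation 2.14c
    return values
```

A window made only of rising days gives 100, a window made only of falling days gives 0, and balanced flows give readings near 50.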

Figure 2.3 shows an example of the MFI indicator applied to the closing price action of the S&P500 index.

[Plot: S&P500 closing price together with the MFI and its thresholds, Threshold Max (70) and Threshold Min (30), from 25/05/16 to 25/01/18]

Figure 2.3: MFI application


On Balance Volume (OBV)

The On Balance Volume indicator, by relating the price action with the volume in the stock market, tries to show whether volume is flowing into or out of a security. This analysis relies heavily on the tenet that volume changes precede price changes (Gorgulho et al., 2011).

The idea behind this indicator is that volume moves sharply on days where the price is moving

towards the dominant direction, for instance when in a strong uptrend more volume will be expected on

up days rather than on down days. Its values can be formulated as follows in Equation 2.15.

$$ OBV_t = OBV_{t-1} + \begin{cases} volume & \text{if } close_t > close_{t-1} \\ 0 & \text{if } close_t = close_{t-1} \\ -volume & \text{if } close_t < close_{t-1} \end{cases} \tag{2.15} $$

When analysing this indicator, technicians look for divergences between the indicator value and the market’s value to predict price movements or to confirm price trends. The main idea behind the OBV is that when prices are going up, the indicator value should also go up, and when prices make a new rally high, the OBV should follow. If the indicator does not rise above its previous high, this is considered a bearish divergence, suggesting a weak move.
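The recurrence of Equation 2.15 can be sketched in a few lines. The `start` parameter, i.e. the value of the OBV on the first day, is an assumption, since only the differences between readings matter for the analysis.

```python
def on_balance_volume(closes, volumes, start=0):
    """OBV per Equation 2.15; start is the assumed first-day value."""
    obv = [start]
    for t in range(1, len(closes)):
        if closes[t] > closes[t - 1]:
            step = volumes[t]    # up day: volume is added
        elif closes[t] < closes[t - 1]:
            step = -volumes[t]   # down day: volume is subtracted
        else:
            step = 0             # unchanged close: OBV stays flat
        obv.append(obv[-1] + step)
    return obv


obv = on_balance_volume([10, 11, 11, 9], [100, 200, 300, 400])
```

In this example the OBV rises by the day's volume on the up day, stays flat on the unchanged day and drops by the day's volume on the down day.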

2.3.2.4 Volatility Indicators

In financial markets, volatility refers to the variance of the cumulative returns of a financial instrument within a time horizon. It is based on historical prices over the specified period, the last observation being the current price of the security. Hence, a stock whose price moves wildly, with large and unpredictable fluctuations, is considered highly volatile, while a stock that maintains a stable price action, i.e. a low standard deviation over a certain time horizon, has lower volatility.

From an investor’s point of view, volatility can be both beneficial and harmful. When investing in a highly volatile security, an investor could benefit from the opportunities presented by buying assets cheaply and selling them when overpriced. However, when the investor depends on the returns achievable by selling the security, its higher volatility means a greater chance of losing the initial investment.

One must not misinterpret the volatility concept, since it does not measure the direction of the price

trend, measuring only the dispersion of the price changes. Accordingly, when choosing from two instru-

ments with the same expected return but with different volatilities, an investor should elect the security

that presents the smallest volatility, since it can be shown as the safest investment in the long run.


Average True Range (ATR)

The Average True Range indicator is a unique indicator that reflects the degree of interest or disin-

terest in the movement of a stock’s price.

From the analysis of past ATR readings, technicians can state that stronger stock movements, in either direction, are accompanied by larger ranges, especially at the beginning of the movement. On the other hand, when the stock does not present stimulating movements, which are characterized by narrower swings, the ATR presents relatively small ranges. As such, large or increasing ranges suggest that traders are prepared to continue to buy or sell short an asset throughout the trading day, while small or decreasing ranges may suggest that investors’ interest is fading.

The computations needed to obtain the ATR value can be decomposed as shown in Equations 2.16, where the EMA is defined in Section 2.3.2.1 and n is usually a 14-day period, although other values can also be used.

$$ \text{true range}_t = \max(high_t, close_{t-1}) - \min(low_t, close_{t-1}) \tag{2.16a} $$

$$ ATR_t = EMA_n(\text{true range}_t) \tag{2.16b} $$

From Equation 2.16a, the true range value is the largest of the following cases:

– The most recent period’s high less the most recent period’s low.

– The absolute value of the difference between the most recent period’s high and the previous close.

– The absolute value of the difference between the most recent period’s low and the previous close.
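The following sketch computes the true range of Equation 2.16a and smooths it as in Equation 2.16b. Since the EMA definition of Section 2.3.2.1 is not reproduced here, the smoothing factor 2 / (n + 1), seeded with the first true range, is an assumption.

```python
def true_ranges(highs, lows, closes):
    """True range per Equation 2.16a, defined from the second day on."""
    return [
        max(highs[t], closes[t - 1]) - min(lows[t], closes[t - 1])
        for t in range(1, len(closes))
    ]


def atr(highs, lows, closes, n=14):
    """ATR per Equation 2.16b.

    Assumption: EMA with smoothing factor 2 / (n + 1), seeded with the
    first true range.
    """
    tr = true_ranges(highs, lows, closes)
    alpha = 2.0 / (n + 1)
    out = [tr[0]]
    for v in tr[1:]:
        out.append(alpha * v + (1 - alpha) * out[-1])
    return out


highs = [10.0, 12.0, 11.0, 15.0]
lows = [8.0, 9.0, 10.5, 14.0]
closes = [9.0, 11.0, 10.6, 14.5]
tr = true_ranges(highs, lows, closes)
```

On the last day of this example the price gaps up above the previous close, so the true range is set by the distance between the high and the previous close rather than by the day's own high-low range.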

Figure 2.4 shows an example of the ATR indicator applied to the closing price action of the S&P500 index.

[Plot: S&P500 closing price together with the ATR, from 04/05/16 to 04/03/18]

Figure 2.4: ATR application


Bollinger Bands (BBANDS)

This technical analysis tool can be used to measure the size of price swings relative to previous trades. The BBANDS indicator consists of three bands: a middle band that follows the price action and two price bands, one above and one below it. These bands mimic the price movement, expanding and contracting as the volatility of the stock increases or decreases, respectively. By definition, prices are considered high at the upper band and low at the lower band, which can be helpful for rigorously identifying patterns in the price action.

Readings for each of the bands mentioned above can be computed as presented in Equations 2.17.

$$ \text{Middle band} = SMA_n(close) \tag{2.17a} $$

$$ \text{Upper band} = SMA_n(close) + K \times \sigma_n \tag{2.17b} $$

$$ \text{Lower band} = SMA_n(close) - K \times \sigma_n \tag{2.17c} $$

In these formulas, SMA refers to the simple moving average with period n, computed as described in Section 2.3.2.1, $\sigma_n$ refers to the standard deviation of prices over the last n-day period, and K is the multiplier that sets the distance of the outer bands from the middle band in standard deviations.
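Equations 2.17 can be sketched as a sliding-window computation. The defaults `n=20` and `k=2.0`, as well as the use of the population standard deviation, are assumptions chosen for illustration, since the text does not fix them.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, n=20, k=2.0):
    """Return (lower, middle, upper) tuples per Equations 2.17.

    Assumptions: population standard deviation, defaults n=20 and k=2.
    """
    bands = []
    for t in range(n - 1, len(closes)):
        window = closes[t - n + 1 : t + 1]
        middle = mean(window)      # Equation 2.17a
        spread = k * pstdev(window)
        bands.append((middle - spread, middle, middle + spread))
    return bands


bands = bollinger_bands([1.0, 2.0, 3.0, 4.0, 5.0], n=5, k=2.0)
```

The bands are symmetric around the middle band and collapse onto it when the prices in the window are constant, which matches the contraction under low volatility described above.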

Figure 2.5 shows an example of the BBANDS indicator applied to the closing price action of the S&P500 index.

[Plot: S&P500 closing price together with the Upper Band, Middle Band and Lower Band, from 1/3/17 to 3/3/18]

Figure 2.5: BBANDS application


2.4 Machine Learning

The fundamental ambition of Machine Learning (ML) is the development of automated systems capable of processing large volumes of data in order to extract meaningful and possibly useful information (data mining), as well as leveraging the gathered information to support the resolution of real world problems (decision support). Such problems may be difficult for people to solve, since they are prone to making mistakes when trying to establish relationships between features (Wallace, 2007). The combination of both of these components, i.e., information extraction and application, is called data classification, which is the main purpose of the developed work: to properly classify future information regarding stock markets.

In the context of data classification, the researchers aim at developing mechanisms that, having

studied a large amount of data divided into classes/labels, are able to automatically label/classify unseen

data.

Currently the growth rate of this type of technology is lower than expected, since there are still some human-related problems that need to be addressed for ML to mature to its full potential. One problem typically associated with machine learning systems is humans’ inability to supervise the system’s activity, or their justified reluctance to base important and sometimes crucial decisions on recommendations provided by a system they do not fully understand (Wallace, 2007). The main element that needs to be monitored is the information considered by the machine, a task that humans cannot tackle due to its size and digital format. As this problem becomes harder to resolve, since the growth of the available information is unpredictable, other solutions that ease this process for humans need to be developed. To that end, the research fields of data reduction (feature selection) and data visualisation have recently surfaced, focusing on alleviating this difficulty.

Every dataset considered by ML algorithms is composed of instances, where each instance is represented by the same set of features. These features are used to outline the problem at hand and can be of different types: continuous, categorical or binary. Depending on the tool developed, three different types of learning can be used (Kotsiantis et al., 2007):

– Supervised Learning - the main goal of supervised learning is to build a concise model from

properly labelled instances in terms of predictor features, being these features/attributes the most

informative to the induced model (Kotsiantis et al., 2007).

– Unsupervised Learning - in contrast with the supervised learning, the main goal of unsupervised

learning (clustering) is to deduce useful, but unknown, classes of items from unlabelled instances,

i.e., instances that have not been pre-classified in any way (Kotsiantis and Pintelas, 2004). The

classes of items are found through the exploration of inter-relationships among the instances.

– Reinforcement Learning - this type of learning sets itself apart from the methods mentioned above, since the ML algorithm is never supplied with any type of feature. Instead, the training information provided to the autonomous system takes the form of reinforcement values, which measure how well the algorithm is performing (Gosavi, 2003). This type of learning follows a trial-and-error approach, in which the learner is not given any information on how to act, but rather must discover which actions yield the best reward by trying each action in turn.

Stock market prediction has received increased attention from both academics and industry professionals, since it poses major challenges to researchers due to the stock market’s many uncertainties, such as political events, general economic conditions, investors’ expectations, etc.

When trying to overcome the challenges imposed by predicting stock markets’ behaviour, a wise solution is to model an autonomous supervised system that learns from training and experience. By processing large amounts of past financial data as well as other market uncertainties, which may seem uncorrelated and noisy from the human point of view, learners can detect data patterns and predict the future market direction with the main objective of maximizing the returns while reducing the investment risk.

As stated previously (in Section 2.3), there are two major philosophical attitudes to analyse the

market, the technical and the fundamental analysis. Through the evaluation of the related work [(Patel

et al., 2015a), (Patel et al., 2015b), (Kumar and Thenmozhi, 2006)], one can aver that researchers

usually tend to rely on the use of the former approach to develop autonomous systems, attempting to

forecast the closing price’s direction change in financial indices.

Each of the technical indicators incorporates several trading signals, which will influence the trader’s market position. In the light of the information carried by the technical indicators, researchers are responsible for developing different investment models that apply real individual constraints to these signals, which dictates the prediction accuracy of the algorithm.

2.4.1 Feature Selection

Approximately 80% of the computer resources are consumed by cleaning and pre-processing data to be used in machine learning methods (Piramuthu, 2004). To properly extract patterns from data, a clean set of features is needed, since it is one of the primary sources of the system’s knowledge. That said, many input features are redundant: they add nothing new to the description of the concept, reduce the system’s prediction accuracy and possibly add more noise to the information.

Nowadays, due to the increasing demand for learning algorithms, there is a great need for a pre-processing step in data mining that reduces the dimension of the feature space while preserving most of the relevant features, i.e. those that contribute useful information for describing the targeted problem.

Feature Selection aims at reducing the initial feature set by choosing a subset of the system’s input variables and eliminating redundant or irrelevant features, which deliver little or no predictive information. In particular, feature selection enhances the prediction accuracy of the autonomous prediction system, improves learning performance by reducing the dimension of the original set, and increases the comprehensibility of the learned results (ElAlami, 2009).

The algorithms used to perform feature selection can be settled into two broad classes (Sebbana

and Nock, 2002):


– Filter model - relies on a function that evaluates the properties of the data, which makes it independent of any learning algorithm.

– Wrapper model - relies on inductive algorithms, which learn from observation and from patterns identified in earlier knowledge to induce general rules (AlMana and Aksoy, 2014), to estimate a subset’s value.

Furthermore, the search carried by a feature selection algorithm can, by itself, be divided into two

comprehensive categories:

– Heuristic search - by a heuristic measure, such as information gain, Gini index, discrepancies

measure and chi-square test, the feature selection algorithm estimates the quality of the feature.

– Exhaustive search - aims at finding, by searching all possible combinations, a subset containing only the essential variables, i.e. those that are enough to construct a consistent model.

The stock market, which is the subject of our work, is influenced by many factors, such as politics and the global economy, and as a result it is challenging to accurately predict its behaviour. In the literature, many basic factors, such as technical indicators or financial ratios, have been shown to be important for the stock’s movement. However, the related work has reached no consensus on which features are important for stock prediction, since there is no exact answer. Moreover, the same forecast model can perform differently depending on the variables used (Tsai and Hsiao, 2010).

In the light of the above, it is crucial to implement a feature selection mechanism that selects a set of variables which are informative or have high discriminative power.

2.4.2 Simple Prediction Algorithms

Many prediction algorithms have surfaced recently. Decision Trees (DT) and Genetic Algorithm (GA)

are just some of the various machine learning algorithms which are widely used for predicting stock and

stock price index movement. Each of the aforementioned algorithms has its own way of learning the

market patterns, which will be detailed over this section.

2.4.2.1 Genetic Algorithm (GA)

Darwin’s theory of evolution, based on the natural selection of the fittest individuals, supports the main idea behind the Genetic Algorithm (GA), also known as a population-based algorithm.

GA belongs to the class of evolutionary algorithms and can efficiently search complex data sets to

find nearly optimal solutions by performing random searches through a given set of candidate solutions,

called population.

From an initial population, the algorithm randomly generates successive fixed-size sets of solutions, which are examined by a fitness function to elect those with higher probabilities of being reserved to breed and create the next generation. This procedure is called selection. Each solution present in the several generations is represented as a chromosome, which is a feature (gene) vector with as many positions as there are features to be included in the problem.

To create new generations of chromosomes, the two best known operations are performed:

– Crossover - occurs between two parent chromosomes, which exchange parts of their chromosomes to generate a pair of new individuals, called offspring.

– Mutation - occurs when some positions of a chromosome are altered; the positions are chosen at random with a very small probability.

The main components of the GA, which are inspired by biological operators such as Selection,

Crossover and Mutation, are used to direct the search into solutions with good performance at the

desired task, where each of the operators can be performed through different processes (Mitchell, 1998).

When choosing the right process to be used on the Selection procedure, one can choose from three

different possibilities:

– Tournament Selection - in this selection method, the individuals chosen are those with a higher probability of surviving the tournament, i.e. individuals with higher fitness values are more likely to be selected for the next phase than weaker individuals (Hirabayashi et al., 2009). Each individual tournament consists of sampling k individuals from the population, which are then compared with each other in order to select the fittest. The tournament is repeated until the whole initial population has been analysed. The selection pressure is determined by the tournament size: for larger tournaments, the chances for weaker individuals to proliferate are smaller.

– Truncation Selection - in this procedure, the individuals are sorted by their fitness value, and the top performers are selected to breed (Gorgulho et al., 2011).

– Roulette Selection - in this selection method, selection probabilities are assigned to the individuals according to their fitness values, so that higher fitness values increase the probability of being chosen. Then k spins of the roulette are performed, in which k individuals are sampled according to their selection probability.
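The three selection procedures above can be sketched as follows. The function names and signatures are illustrative assumptions; `fitness` is any function mapping an individual to a number, and roulette selection additionally assumes that fitness values are non-negative with a positive total.

```python
import random


def tournament_selection(population, fitness, k, rng):
    """Sample k individuals and return the fittest of them."""
    contenders = rng.sample(population, k)
    return max(contenders, key=fitness)


def truncation_selection(population, fitness, n_parents):
    """Sort by fitness and keep the n_parents top performers."""
    return sorted(population, key=fitness, reverse=True)[:n_parents]


def roulette_selection(population, fitness, k, rng):
    """Perform k spins; selection odds are proportional to fitness.

    Assumption: fitness values are non-negative and not all zero.
    """
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=k)


rng = random.Random(42)
population = [1, 5, 3, 9, 7]
parents = truncation_selection(population, fitness=lambda x: x, n_parents=2)
winner = tournament_selection(population, fitness=lambda x: x, k=3, rng=rng)
```

With a tournament size equal to the population size the fittest individual always wins, which illustrates how larger tournaments increase the selection pressure.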

After the Selection has been made, it is time to cross over the best selected chromosomes to originate new offspring. The purpose of the Crossover is to generate new individuals that combine the best genes of the parents and thus yield a better performing solution. A plethora of methods can be used to perform this procedure; three possible processes are presented below:

– One-Cut Point Crossover - this method chooses a random position on the parents’ chromosomes; the genes from the beginning of the chromosome up to this point are then swapped between the parents to generate two new children (Gorgulho et al., 2011).

– Two-Cut Point Crossover - very similar to the method above, but instead of choosing a single random point in the parent individuals, two random points are chosen; all the bits between these two points are then swapped between the parents, generating two new offspring (Hirabayashi et al., 2009).


– Single Arithmetic Recombination - this crossover method defines a linear combination of two chromosomes (Koksoy and Yalcinoz, 2008). Two randomly selected individuals may produce two offspring through a linear combination of the gene values of each parent, as given in Equations 2.18:

$$ C_i^{t+1} = a \times C_i^t + (1 - a) \times C_j^t \tag{2.18a} $$

$$ C_j^{t+1} = (1 - a) \times C_i^t + a \times C_j^t \tag{2.18b} $$

Where Ct+1 is an individual from the new generation, Ct is an individual from the old generation

and a is a weight which determines the dominant individual on the crossover (between 0 and 1).
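The three crossover methods can be sketched on chromosomes represented as Python lists. The function names are illustrative assumptions, and the arithmetic recombination applies Equations 2.18 gene by gene to the whole chromosome, which is one possible reading of the formula.

```python
import random


def one_cut_point_crossover(p1, p2, rng):
    """Swap the genes from the start up to one random cut point."""
    cut = rng.randrange(1, len(p1))
    return p2[:cut] + p1[cut:], p1[:cut] + p2[cut:]


def two_cut_point_crossover(p1, p2, rng):
    """Swap the genes lying between two random cut points."""
    a, b = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]


def arithmetic_recombination(p1, p2, a):
    """Equations 2.18 applied gene-wise, with weight a in [0, 1]."""
    c1 = [a * x + (1 - a) * y for x, y in zip(p1, p2)]
    c2 = [(1 - a) * x + a * y for x, y in zip(p1, p2)]
    return c1, c2


rng = random.Random(7)
child_a, child_b = one_cut_point_crossover([0] * 6, [1] * 6, rng)
mix_a, mix_b = arithmetic_recombination([0.0, 4.0], [4.0, 8.0], a=0.25)
```

In the cut-point methods every gene of the parents ends up in exactly one of the two children, whereas the arithmetic recombination produces new gene values lying between those of the parents.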

Lastly, the Mutation takes place. This operator probabilistically alters one gene or a set of genes in order to maintain genetic diversity between generations. By analysing the related work [(Hirabayashi et al., 2009), (Koksoy and Yalcinoz, 2008)], we can infer that there is no optimal way to perform the mutation, so the function chosen will depend on the problem at hand.

One can consider two classes of mutation procedures that can be applied:

– Bit Mutation - this mutation procedure is recommended when the chromosome’s gene values are bits. The mutation is performed by flipping the chromosome’s 0s to 1s and vice-versa.

– Function-based Mutation - when the chromosome’s features are real values, a mutation based on an analytical function can be performed, generating new random values for each selected gene. There are many choices for the function that performs the mutation; the simplest, for instance, is a function that randomly generates a new value to replace the selected gene.
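Both classes of mutation can be sketched as follows. The per-gene mutation probability `p_mut` and the `low`/`high` bounds of the replacement value are illustrative assumptions, and the uniform redraw stands for the simplest function-based mutation mentioned above.

```python
import random


def bit_mutation(chromosome, p_mut, rng):
    """Flip each bit independently with probability p_mut."""
    return [1 - g if rng.random() < p_mut else g for g in chromosome]


def random_reset_mutation(chromosome, p_mut, low, high, rng):
    """Replace each real-valued gene, with probability p_mut, by a new
    value drawn uniformly from [low, high] (an assumed, simple choice)."""
    return [
        rng.uniform(low, high) if rng.random() < p_mut else g
        for g in chromosome
    ]


rng = random.Random(3)
mutated_bits = bit_mutation([0, 1, 1, 0, 1], p_mut=0.1, rng=rng)
mutated_reals = random_reset_mutation([0.2, 0.8], 0.1, 0.0, 1.0, rng)
```

A small `p_mut` keeps most genes untouched, so that mutation adds diversity without destroying the solutions found by selection and crossover.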

By performing these operations repeatedly until a stop condition is met, the search for the optimal solution of a complex space becomes easier to follow, since every stage of the algorithm is outlined. The stop condition can be defined, among others, as a fixed number of generations to be run by the algorithm or as a fitness value that must be reached by the best performing individual of the population.
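To show how the operators and the stop condition fit together, the sketch below runs a small GA on a toy problem, maximising the number of 1s in a bit string. The toy fitness function, the parameter values, the tournament size of 3 and the use of elitism (copying the best individual unchanged into the next generation) are all assumptions made for illustration, not the configuration used in this work.

```python
import random


def run_ga(n_genes=20, pop_size=30, max_generations=200, p_mut=0.02, seed=0):
    """Minimal GA on a toy problem: maximise the number of 1s."""
    rng = random.Random(seed)
    fitness = sum  # toy fitness: count of 1s in the chromosome
    pop = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    for generation in range(max_generations):
        best = max(pop, key=fitness)
        if fitness(best) == n_genes:  # stop: target fitness reached
            return best, generation
        next_pop = [best[:]]          # elitism (an assumption)
        while len(next_pop) < pop_size:
            # Selection: two tournaments of size 3.
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            # Crossover: one cut point, keeping a single child.
            cut = rng.randrange(1, n_genes)
            child = p1[:cut] + p2[cut:]
            # Mutation: independent bit flips.
            child = [1 - g if rng.random() < p_mut else g for g in child]
            next_pop.append(child)
        pop = next_pop
    # Stop: fixed number of generations exhausted.
    return max(pop, key=fitness), max_generations


best, generations_used = run_ga()
```

Both stop conditions described above appear in the loop: the run ends early when the best individual reaches the target fitness, and otherwise after the fixed number of generations.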

2.4.2.2 Decision Trees (DT)

The Decision Tree learning algorithm is one of the most popular techniques for classification, since

it is very efficient, easy to understand and, at the same time, accomplishes classification accuracies

comparable to other prediction models (Patel et al., 2015a). In its core, a decision tree is a mapping of

observations regarding a variable to conclude the variable’s predictive value.

The final classification model learnt is represented as a tree composed of two main components, nodes and arcs. An inner node corresponds to a variable, which typically has many values, each expressed as an arc to a child node, describing a downward path to a conclusion. A leaf node embodies the predicted value of a variable given the values of the inner nodes along the path from the root to that leaf.

To construct the final decision tree, one must first find the root node by evaluating a set of attributes with a statistical test that determines how well each attribute, alone, classifies the whole training data. The attribute that best classifies the training examples on its own is then used as the root node of the tree, from which a descendant is created for each of its possible values. The training examples are then assigned to the appropriate descendant node. These steps are repeated until all the training data is classified adequately (Huang et al., 2008).

Two metrics are regularly used to perform the attribute selection mentioned previously (Huang et al., 2008). The first is the Information Gain, which measures the reduction in entropy of the training data achieved by learning the value of an attribute. The information gain can be computed as follows:

Gain (T, V ) = E(T )− I(T, V ) (2.19)

where $I(T, V)$ is the average amount of information still needed to identify the class of an example in $T$ once the value of the attribute $V$ is known, and can be defined as:

$$ I(T, V) = \sum_i \frac{|T_i|}{|T|} \times E(T_i) \tag{2.20} $$

where $T_i$ is the subset of $T$ where the attribute $V$ has value $i$. $E(T)$ is the entropy of the training data $T$, which can be estimated as:

$$ E(T) = -\sum_{i=1}^{C} p_i \log_2(p_i) \tag{2.21} $$

where C is the sum of examples for attribute V and pi is the number of examples of T that belong

to the i-th class. Having that said, the attribute to be selected will be the one to efficiently maximise the

difference presented in the Equation 2.19, which means having the greatest entropy reduction.
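As an illustration of Equations 2.19 to 2.21, the following minimal sketch computes the entropy and the information gain for a categorical attribute. The toy data (a hypothetical `signal` attribute, a `noise` attribute and a `trend` class) is an assumption made purely for the example and is not taken from the thesis dataset.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """E(T) = -sum_i p_i * log2(p_i), with p_i the proportion of class i."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Gain(T, V) = E(T) - I(T, V) for one attribute column `values`."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    # I(T, V): entropy of each subset T_i weighted by |T_i| / |T|
    remainder = sum(len(s) / total * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


# Hypothetical toy data: `signal` separates the classes, `noise` does not.
trend = ["rise", "rise", "fall", "fall"]
signal = ["up", "up", "down", "down"]
noise = ["a", "b", "a", "b"]

print(information_gain(signal, trend))  # attribute that explains the class
print(information_gain(noise, trend))   # attribute that carries no information
```

Under this criterion the `signal` attribute would be chosen for the split, since it removes all the entropy of the class, whereas `noise` removes none.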

The second metric widely used for attribute selection is the Gain Ratio, which expresses the proportion of the information generated by a split that is useful for classification:

$$\mathrm{GainRatio}(T, V) = \frac{\mathrm{Gain}(T, V)}{\mathrm{IntrinsicInformation}(T, V)} \tag{2.22}$$

where IntrinsicInformation(T, V), which measures the amount of potential information generated by dividing T into C branches, is defined as:

$$\mathrm{IntrinsicInformation}(T, V) = -\sum_{i=1}^{C} \frac{|T_i|}{|T|} \times \log_2\left(\frac{|T_i|}{|T|}\right) \tag{2.23}$$

In the above equation, T_1 through T_C are the C subsets of examples resulting from partitioning T by the values of V.
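The effect of the normalisation in Equation 2.22 can be seen in a short sketch. The attribute names (`signal`, `day_id`) and the toy data are hypothetical; returning 0 when the intrinsic information is zero is a convention assumed here for attributes with a single value.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def gain_ratio(values, labels):
    """GainRatio(T, V) = Gain(T, V) / IntrinsicInformation(T, V)."""
    total = len(labels)
    subsets = {}
    for value, label in zip(values, labels):
        subsets.setdefault(value, []).append(label)
    weights = [len(s) / total for s in subsets.values()]
    gain = entropy(labels) - sum(
        w * entropy(s) for w, s in zip(weights, subsets.values())
    )
    intrinsic = -sum(w * log2(w) for w in weights)
    # Assumed convention: an attribute with one value produces no split.
    return gain / intrinsic if intrinsic > 0 else 0.0


trend = ["rise", "rise", "fall", "fall"]
signal = ["up", "up", "down", "down"]   # two branches
day_id = ["d1", "d2", "d3", "d4"]       # one branch per example

print(gain_ratio(signal, trend))
print(gain_ratio(day_id, trend))
```

Both attributes have the same information gain on this data, but the gain ratio of `day_id` is halved because splitting into four branches generates more intrinsic information, which is how the metric penalises attributes with many values.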

To properly construct prediction models from data, two broad classes of tree-based methods can be

employed (Loh, 2011):

– Classification trees - designed for dependent variables that take a finite number of unordered

classes, with the prediction error being measured as a misclassification rate.


– Regression trees - designed for dependent variables that take continuous or ordered discrete values, with the prediction error typically calculated as the squared difference between the observed and predicted values.
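The difference between the two classes of trees can be sketched at the level of a single leaf. The helper names below are illustrative assumptions; the majority class and the mean are the usual leaf predictions, and the squared error is averaged over the examples.

```python
from collections import Counter


def classification_leaf(labels):
    """A classification leaf predicts the most frequent class it contains."""
    return Counter(labels).most_common(1)[0][0]


def misclassification_rate(observed, predicted):
    """Fraction of examples whose predicted class differs from the observed one."""
    return sum(o != p for o, p in zip(observed, predicted)) / len(observed)


def regression_leaf(targets):
    """A regression leaf predicts the mean of the target values it contains."""
    return sum(targets) / len(targets)


def mean_squared_error(observed, predicted):
    """Average squared difference between observed and predicted values."""
    return sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)


# Hypothetical examples falling into one leaf
classes = ["up", "up", "down"]
values = [1.0, 2.0, 3.0]

class_prediction = classification_leaf(classes)
value_prediction = regression_leaf(values)
print(class_prediction, misclassification_rate(classes, [class_prediction] * 3))
print(value_prediction, mean_squared_error(values, [value_prediction] * 3))
```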

Many learning algorithms have been developed using these two kinds of decision trees, such as C4.5 (Quinlan, 2014), ID3 (Quinlan, 1986) and CART (Tsai and Hsiao, 2010).

2.4.3 Ensemble Prediction Algorithms

Ensembles of prediction algorithms have repeatedly proven to be more effective than single forecasting systems in predicting the behaviour of the stock market [(Maragoudakis and Serpanos, 2010); (Kwon and Moon, 2004)], which is the subject of our work. Recent machine learning research has leaned towards demonstrating the usefulness of this kind of prediction algorithm on real problems, such as image recognition (Bosch et al., 2007) and microarray analysis (Tan and Gilbert, 2003).

The intuition behind ensemble learning is that integrating several single approaches enhances the performance of the final classifier, improving its accuracy, reliability and comprehensibility. Many combination schemes and ensemble methods have surfaced recently (Valentini and Masulli, 2002); these can be grouped and analysed according to the classification criterion used. For example, one may classify an ensemble by the architecture scheme adopted, which can be conditional, serial or parallel (Lam, 2000), or by whether or not it changes the base learners, which makes the ensemble generative or non-generative.

Starting with the latter criterion, there are two broad categories of ensembles: non-generative and generative ensemble prediction algorithms. Non-generative ensembles, as the name suggests, do not actively produce new base learners; they only try to combine a given set of base learning algorithms in a suitable way. A large set of approaches is used to combine the learning machines, for instance majority voting, fuzzy aggregation methods or simple operators such as Minimum or Maximum. The methodology to be used depends on the prerequisites of the system, such as its adaptability to the inputs and the output requirements of each learner.

Conversely, generative ensembles actively try to improve the diversity and the accuracy of the ensemble by generating sets of base learners, acting either on the base learning algorithm or on the structure of the data. As in the previous class, a plethora of methods can be applied: resampling or feature selection methods, if the objective is to rework the structure and the features of the available input data; mixture of experts methods, to select specialised base learning algorithms for a specific input subset; or randomized methods, to adjust the base learners which constitute the ensemble algorithm.

Of the aforementioned methods, resampling is one of the most widely used to produce newer and more accurate ensemble algorithms [(Booth et al., 2014); (Wang et al., 2011); (Patel et al., 2015a)]. Two popular techniques for constructing ensembles that rely on resampling are bagging and boosting (Dietterich, 2000). Both techniques use bootstrapping to increase the number of diversified training sets fed to the simple prediction algorithms and, as a result of widening their scope, the number of hypotheses created also increases.

In the bagging technique, the ensemble is built by drawing random bootstrap samples (samples drawn with replacement) from the training dataset and training one base learner on each sample.

In boosting methods, a predictor is invoked at each iteration using a different distribution or an uneven weighting of the training sample. When weighting is used, a set of weights is maintained over the original training dataset. At each iteration, the examples with the least accurate predictions receive higher weights so that the base learning algorithm focuses on the hardest examples, whereas the weights of the examples predicted more accurately are decreased, since they do not require the focus of the learning algorithm in the next iteration.
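The reweighting step can be made concrete with a small sketch. The update rule below follows the AdaBoost scheme, which is only one possible boosting algorithm and is assumed here for illustration; the function name `reweight` and the example values are hypothetical.

```python
from math import exp, log


def reweight(weights, correct):
    """One AdaBoost-style update of the example weights.

    weights: current weights, summing to 1.
    correct: booleans, True where the current predictor was right.
    Returns the normalised new weights and the predictor's vote `alpha`.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * log((1 - error) / error)
    # Misclassified examples are scaled up, correct ones are scaled down.
    updated = [w * exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    norm = sum(updated)
    return [w / norm for w in updated], alpha


weights = [0.25, 0.25, 0.25, 0.25]
correct = [True, True, True, False]   # the last example was misclassified
new_weights, alpha = reweight(weights, correct)
print(new_weights)
```

After the update the single misclassified example carries half of the total weight, so the next base learner is pushed to concentrate on it.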

Since the bagging method performs a random selection over the initial dataset, it successfully exploits the presence of noisy data to produce more diverse classifiers. Boosting, on the other hand, is quite sensitive to classification noise, given that misclassified examples see their weights increased, which in turn increases the error rate of the individual classifiers (Dietterich, 2000).

When facing a regression problem, the aggregation can be performed by averaging the outputs of the single predictors, whilst in classification problems a majority or weighted vote is needed to aggregate them.
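Bagging and the two aggregation rules described above can be sketched as follows. The base learners `fit_mean` and `fit_majority` are deliberately trivial, hypothetical stand-ins for real prediction algorithms, and the data is invented for the example.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) examples from data uniformly, with replacement."""
    return [rng.choice(data) for _ in data]


def bagging_fit(data, fit, n_models, seed=0):
    """Train one base learner per bootstrap sample."""
    rng = random.Random(seed)
    return [fit(bootstrap_sample(data, rng)) for _ in range(n_models)]


def aggregate_regression(models, x):
    """Average the outputs of the single predictors."""
    return sum(model(x) for model in models) / len(models)


def aggregate_classification(models, x):
    """Majority vote over the outputs of the single predictors."""
    return Counter(model(x) for model in models).most_common(1)[0][0]


# Hypothetical base learners that ignore x and predict a constant.
def fit_mean(sample):
    mean = sum(y for _, y in sample) / len(sample)
    return lambda x: mean


def fit_majority(sample):
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label


returns = [(0, 1.0), (1, 2.0), (2, 3.0), (3, 4.0)]
trend = [(i, "up") for i in range(9)] + [(9, "down")]

reg_models = bagging_fit(returns, fit_mean, n_models=11)
clf_models = bagging_fit(trend, fit_majority, n_models=11)
print(aggregate_regression(reg_models, x=0))
print(aggregate_classification(clf_models, x=0))
```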

These ensemble methods tend to have more impact on base predictors that are sensitive to small changes in the training data, called unstable algorithms, such as neural networks and decision trees (Valentini and Masulli, 2002).

Ensemble prediction algorithms may adopt different architectural schemes to fuse the base learners, which will impact the final behaviour of the system. When designing the final ensemble method, one can adopt a conditional, serial/hierarchical or parallel scheme (Lam, 2000). In the conditional approach, a primary classifier is used first; when its prediction is not viable for a given pattern, another prediction algorithm is adopted. When used with only two classifiers, this approach can be computationally efficient, since a faster learner can be placed first in the chain to quickly predict the easier patterns, while a slower and more complex classifier is used afterwards to classify the remaining patterns. However, if the two machine learning algorithms have the same performance, a question of sequence arises: the first classifier would leave the most difficult patterns to an equally capable classifier, which results in a loss of accuracy and an increase in the error rate of the ensemble.
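A two-classifier conditional scheme can be sketched as below. Deciding that a prediction is "not viable" through a confidence threshold is an assumption made for this example, as are the names `fast_rule`, `slow_rule` and `min_confidence`.

```python
def conditional_predict(x, fast, slow, min_confidence=0.8):
    """Use the fast classifier when it is confident, otherwise the slow one.

    Both classifiers return a (label, confidence) pair.
    The second element of the result records which classifier decided.
    """
    label, confidence = fast(x)
    if confidence >= min_confidence:
        return label, "fast"
    label, _ = slow(x)
    return label, "slow"


# Hypothetical classifiers over a single numeric input.
def fast_rule(x):
    # Cheap rule: the sign of x, with confidence growing with |x|.
    return ("up" if x > 0 else "down"), min(1.0, abs(x))


def slow_rule(x):
    # Stand-in for a slower, more complex model.
    return ("up" if x >= 0.5 else "down"), 1.0


print(conditional_predict(2.0, fast_rule, slow_rule))  # easy pattern
print(conditional_predict(0.1, fast_rule, slow_rule))  # deferred pattern
```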

Secondly, one can use a serial/hierarchical approach, in which classifiers are applied successively and each classifier reduces the complex problem to a simpler set of candidate classes, so that the individual experts become increasingly focused. Naturally, an important criterion must be met in this type of ensemble: each preceding expert must guarantee that the subset of solutions it extracts from the initial set contains the true class among the classes available to the final predictor.

Finally, an ensemble architecture can follow a parallel approach, where multiple classifiers operate in parallel over the same dataset to classify patterns and their outputs are then merged to yield a final decision. This method, however, demands more computational power, since multiple classifiers need to perform prediction tasks concurrently and a fusion operation is necessary afterwards. The combination procedure can be implemented differently depending on the type of the classifiers' outputs, as stated above. One key advantage of this scheme is its modularity, since the learning algorithms can be enhanced independently with little or no change to the merging process.

One form of ensemble prediction algorithm, called Random Forest (RAF), has gained popularity over the last years, being employed in a variety of problems, as stated before, such as image recognition (Bosch et al., 2007), ecological prediction (Prasad et al., 2006) and microarray classification (Tan and Gilbert, 2003). Since this algorithm has proven to be useful and efficient in solving many prediction problems, the following subsection presents a detailed explanation of it.

2.4.3.1 Random Forest (RAF)

The Random Forest (RAF) algorithm was introduced by Leo Breiman (Breiman, 2001). A random

forest is a collection of uncorrelated tree-based classifiers which was designed to yield highly accurate

forecasts while not overfitting the training data (Ali et al., 2012).

When growing a Random Forest, random samples are drawn from the training data, using the bagging method, to grow multiple individual decision trees. With the initial dataset defined as T = {(xN, yN)}, the main objective is to find a function capable of efficiently mapping the initial feature space (X = (x1, ..., xn)) into the output space (Y = (y1, ..., yn)). Specifically, each tree is grown using the following procedure:

Let M denote the initial number of features and n the number of examples in the training dataset.

1. Randomly select n examples, with replacement, from the training dataset to make a bootstrap sample.

2. Having M features initially, m ≪ M features are selected at random at each node, and the best split on these m features is used to split the node. During the forest growth, the value of m is constant.

3. Each tree is grown to the largest extent possible, and no pruning is used.

After a large number of trees have been generated, the ensemble forecast is produced by aggregating (through majority vote for classification or averaging for regression) the predicted value of each individual tree in the forest.

By keeping each tree unpruned and by selecting at each node the best split among the features of a random subset, random forests keep the overall bias low while retaining their prediction strength and inducing diversity among the trees (Breiman, 2001).
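The growing procedure and the majority vote can be sketched in plain Python. This is a simplified illustration, not the implementation used in this work: the Gini impurity is assumed as the split criterion (the procedure above only requires "the best split"), a node becomes a leaf when none of its m sampled features improves the impurity, and the toy data is invented.

```python
import random
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels, feature_ids):
    """Return the (feature, threshold) that most reduces impurity, or None."""
    best, best_score = None, gini(labels)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best, best_score = (f, t), score
    return best


def grow_tree(rows, labels, m, rng):
    """Grow an unpruned tree, sampling m candidate features at every node."""
    if len(set(labels)) == 1:
        return labels[0]                                  # pure leaf
    split = best_split(rows, labels, rng.sample(range(len(rows[0])), m))
    if split is None:                                     # no useful split found
        return Counter(labels).most_common(1)[0][0]
    f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return (f, t,
            grow_tree([r for r, _ in left], [y for _, y in left], m, rng),
            grow_tree([r for r, _ in right], [y for _, y in right], m, rng))


def predict_tree(node, row):
    """Follow the splits down to a leaf (inner nodes are tuples)."""
    while isinstance(node, tuple):
        f, t, left, right = node
        node = left if row[f] <= t else right
    return node


def grow_forest(rows, labels, n_trees, m, seed=0):
    """Step 1 for every tree: bootstrap n examples with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in idx], [labels[i] for i in idx], m, rng))
    return forest


def predict_forest(forest, row):
    """Majority vote over the individual trees."""
    return Counter(predict_tree(tree, row) for tree in forest).most_common(1)[0][0]


# Toy data with M = 2 features; each node considers m = 1 of them.
rows = [[0, 1], [1, 3], [2, 0], [3, 2], [10, 12], [11, 10], [12, 13], [13, 11]]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
forest = grow_forest(rows, labels, n_trees=15, m=1, seed=1)
print(predict_forest(forest, [-1, -1]), predict_forest(forest, [20, 20]))
```

For a regression forest the vote in `predict_forest` would be replaced by the average of the tree outputs, as described above.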

When comparing the performance of random forests with that of other machine learning algorithms, one can highlight numerous advantages (Wang et al., 2011). Since the final model can be summarised as a series of logical conditions (tree nodes), there is no implicit assumption that the relationships between the features and the target classes follow a particular linear or nonlinear function. Consequently, tree methods are well suited to problems where there is little or no prior knowledge about which features are related and how, which is the case in data mining tasks.

Besides yielding high prediction accuracy, the tree-based structure of each base learner enhances the interpretability of the model and makes it a non-parametric method, which sets it apart from other popular machine learning methods (Qi, 2012).

As stated previously, stock market data is vast and unpredictable, and the relationships between technical indicators and predictions are uncertain, which makes this problem a good fit for the random forest prediction model.

2.5 Related Work

This section introduces some of the existing solutions proposed so far. Some are not directly related to the subject under study, but helped to determine the technologies to be employed as well as the overall structure to follow.

The section is divided into works related to Feature Selection (explained in detail in Section 2.4.1), Simple Prediction Algorithms (see Section 2.4.2) and Ensemble Prediction Algorithms (see Section 2.4.3). Finally, a table gives a brief overview of the studied related work.

2.5.1 Works on Feature Selection

M.E. ElAlami (ElAlami, 2009) proposed a feature selection algorithm based on a Genetic Algorithm (GA), which aims at optimising the output classes of an Artificial Neural Network (ANN). As the ANN is trained on a given training set, weights are assigned to both the input-hidden and hidden-output layers; each output node is then assigned a function that relates the inputs to the extracted weights.

The proposed goal was achieved by maximising the output function of each class, using the GA to find the optimal values for the input features. In the end, the overall set of relevant features for a given dataset is the union of the subsets of relevant features of the individual output functions.

A variety of optimization algorithms have surfaced, which makes the choice harder for authors; genetic algorithms, however, have recently received a lot of credit due to their ability to solve difficult optimization problems. Population-based algorithms, unlike traditional optimization algorithms, work with a coded version of the parameters and search from a population of points instead of from a single point. However, as the GA follows a filter model, a fitness function must be defined to properly evaluate the quality of each hypothesis. The choice of the fitness function is a decisive design issue when using this sort of optimization algorithm. Usually, the function chosen is the one that was initially targeted for optimization; in this solution, the fitness function changes depending on the output class of the trained ANN.
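The population-based search and the role of the fitness function can be illustrated with a generic GA over binary feature masks. This sketch is not ElAlami's algorithm: the operators (tournament selection, one-point crossover, bit-flip mutation, elitism), the parameter values and the `toy_fitness` function are all assumptions chosen for the example.

```python
import random


def evolve(fitness, n_features, pop_size=20, generations=30,
           mutation_rate=0.05, seed=0):
    """Search for a feature mask (list of 0/1) that maximises `fitness`.

    Returns the best mask found and the best fitness seen per generation.
    Requires n_features >= 2 so that a crossover point exists.
    """
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        next_pop = [pop[0][:]]                      # elitism: keep the best mask
        while len(next_pop) < pop_size:
            # Tournament selection of two parents
            p1 = max(rng.sample(pop, 3), key=fitness)
            p2 = max(rng.sample(pop, 3), key=fitness)
            cut = rng.randrange(1, n_features)      # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


def toy_fitness(mask):
    """Hypothetical fitness: features 0 and 3 help, every other feature costs."""
    return 2 * (mask[0] + mask[3]) - sum(mask)


best, history = evolve(toy_fitness, n_features=6)
print(best, history[-1])
```

In a real feature selection task the fitness function would evaluate a prediction model trained on the features selected by the mask, which is where most of the computational cost lies.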

The work of Zhengui Li and Linkai Luo (Xu et al., 2013) studied two recursive feature elimination (RFE) methods based on two well-known prediction algorithms, the Support Vector Machine (SVM) and the Random Forest (RAF). The study investigated the stability, which refers to the sensitivity of the overall result to variations within the training dataset, and the prediction accuracy of the subsets of features selected by both algorithms. Stability is a neglected issue in the development of prediction algorithms, which concentrate on achieving the highest possible prediction accuracy, yet it is important when analysing high-dimensional data such as the stock market (Yu et al., 2008).

After extensive tests with around 55 technical indicators and 391 days of data for each of nine stocks, the authors concluded that both algorithms perform well in terms of stability and prediction accuracy, even though the stability indexes are poor when fewer than 20 features are selected. In short, the SVM performed better than the RAF because redundant or uncorrelated features do not affect the former; recursive feature elimination is therefore unnecessary with the SVM but needed with the RAF.

Manoj Thakur and Deepak Kumar (Thakur and Kumar, 2018) proposed a new hybrid decision support system for algorithmic trading in financial markets, more precisely on the NASDAQ, DOW JONES, S&P500, NIFTY50 and NIFTY BANK indexes. The proposed hybrid system integrates a Weighted Multicategory Generalised Eigenvalue Support Vector Machine (WMGEPSVM) and a Random Forest (RAF) (RAF-WMGEPSVM) to yield buy, hold and sell signals from a set of technical indicators and oscillators. In the authors' solution, the RAF is employed to discover the optimal set of technical indicators that could enhance the prediction performance of the WMGEPSVM model.

To test the effectiveness of the hybrid system in achieving profitable investments, the authors started by testing how the system would cope without the Random Forest model for dimensionality reduction (i.e., using all the input features). Compared to other solutions, such as the Buy and Hold (B&H) strategy, the Balanced Multicategory Support Vector Machine (BMSVM) and the OVA-Multi-class Least Squares Twin SVM (MLSTSVM), the proposed solution achieved slightly better Rate of Return (ROR), Maximum Drawdown (MDD) and Probability of Winning (PP) in most of the markets tested. Afterwards, all the previously mentioned models were coupled with the RAF to understand its impact on the results, and this brought notable improvements in the overall performance of the models (reducing their computational complexity and improving the financial performance of the trading systems). Overall, the proposed trading system was found to outperform the trading systems based on the other classifiers considered in bullish, bearish and sideways market scenarios.

2.5.2 Works on Simple Prediction Algorithms

An enhanced decision tree algorithm named Neural-based Decision Tree (NDT) was proposed by Xiongmin Li and Christine W. Chan in “Application of an enhanced decision tree learning approach for prediction of petroleum production” (Li and Chan, 2010). Its objective was to identify the relationships among the main variables in predicting an oil well's production, which is a critical issue for decision-making in the petroleum industry.


In the presented solution two prediction models were employed, each with its own objective: the neural network was used first to determine the dependence between the variables and their values, and the prediction for the well was subsequently assigned to the decision tree, which used the neural network outputs. By integrating the ANN into the architecture of the solution, the authors expected to achieve a more accurate prediction.

The authors found three main advantages of using the NDT rather than an ANN. First, the proposed model generates decision trees that are easier for petroleum engineers to understand, revealing the underlying relationships among the geoscience attributes, which ultimately simplifies decision making. Secondly, apart from always converging, the NDT is faster at learning new information than the ANN. Finally, the tree learned by the NDT model conveys more information about the variables, which may help in selecting the parameters to be used and in assessing the interdependent relationships among the attributes.

A novel genetic algorithm model, focused on the prediction of corporate failure using financial information, was proposed by Kyung-Shik Shin and Yong-Joo Lee (Shin and Lee, 2002). Although many Neural Network systems have proven effective in predicting company bankruptcy, there are major disadvantages in building and using this kind of model. First, the existence of several network architectures, learning methods and parameters makes the choice among them difficult. Secondly, neural networks have an inherent characteristic, often called the “black box”, whereby the final user cannot readily comprehend the rules produced by the final NN model. Therefore, the authors proposed a Genetic Algorithm approach which can be applied to forecast company failure while producing rules easily understood by users.

The key difference in the authors' solution for extracting bankruptcy rules is the use of GAs to find upper or lower thresholds for the variables that determine the financial health of the company. The main goal of this approach is to express a rule that yields the highest hit ratio when those thresholds are exceeded. The proposed solution attains promising results for bankruptcy forecasting, showing that GAs are indeed effective at extracting rules thanks to their ability to learn non-linear relationships among the system's input variables. It has, however, a major drawback: the model only produces predictions when all the rules are fired, whereas the NN makes predictions on every case, except when the predictions are confined.

2.5.3 Works on Ensemble Prediction Algorithms

Christopher Krauss, Xuan Anh Do and Nicolas Huck (Krauss et al., 2017) analysed the effectiveness of base learners, namely Deep Neural Network (DNN), Gradient-Boosted-Tree (GBT) and Random Forest (RAF), against simple ensemble algorithms, namely equal-weighted (ENS1), performance-based (ENS2) and rank-based (ENS3), in the context of statistical arbitrage. A statistical arbitrage strategy was then developed using these algorithms, with a very simple approach to determining the market positions that traders should adopt: the authors ranked the stocks, with the top k stocks being those where traders should adopt a long position and the bottom k stocks those where traders should adopt a short position.
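The ranking rule can be written in a few lines. The tickers and scores below are hypothetical, and the score is assumed to be any model output for which a higher value indicates a more attractive stock.

```python
def long_short_portfolio(predictions, k):
    """Rank the stocks by predicted score; go long the top k, short the bottom k.

    predictions: dict mapping a ticker to its predicted score.
    Returns (long_positions, short_positions).
    """
    if k < 1 or 2 * k > len(predictions):
        raise ValueError("need at least 2 * k stocks")
    ranked = sorted(predictions, key=predictions.get, reverse=True)
    return ranked[:k], ranked[-k:]


# Hypothetical tickers and scores
scores = {"AAA": 0.9, "BBB": 0.2, "CCC": 0.6, "DDD": 0.4, "EEE": 0.1}
long_positions, short_positions = long_short_portfolio(scores, k=2)
print(long_positions, short_positions)
```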

After applying the developed strategy to the S&P500 index, the authors found that with k=10 the equal-weighted ensemble (ENS1) produced returns significantly higher than those of the other algorithms employed. After transaction costs, the ENS1 achieved returns of 0.25 percent per day, which corresponds to 73 percent on an annualized basis.

In “Automated trading with performance weighted random forests” (Booth et al., 2014), Ash Booth, Enrico Gerding and Frank McGroarty proposed a novel automated trading expert system, based on performance-weighted ensembles of Random Forests, which aims at predicting the price return over seasonal events, such as the turn-of-the-month, exchange holiday and weekend effects. This solution employs a three-layer model in which each layer has a distinct function. In the first layer, a regression random forest is generated every d days on a moving window of training data to form an ensemble that predicts the return of an investment made on a seasonality trade. The experts' outputs are then merged using an exponential weighting algorithm to produce useful trading signals before passing to the final layer. In the final layer, the risk management layer, the decisions made previously are analysed, weak signals are eliminated, and a new technique, called maximum drawdown, is used to liquidate positions in stocks which are difficult to predict.

When compared against other models, such as equal-weighted random forests, simple random forests, a buy-and-hold strategy and a naïve seasonality strategy, both weighted models of random forests outperformed all of the other models.

Manish Kumar and M. Thenmozhi (Kumar and Thenmozhi, 2006) made an extensive study of prediction algorithms, aimed at examining the predictability of the direction of a market index, the S&P CNX NIFTY Market Index of the National Stock Exchange. After an empirical evaluation based on out-of-sample data, the authors concluded that the Support Vector Machine and the Random Forest outperformed the remaining models (Neural Network, Logit Model and Discriminant Analysis), achieving a higher hit ratio in the one-period-ahead forecast of the NIFTY Index financial series. Because the SVM minimises the generalization error by implementing the structural risk minimisation principle, rather than minimising the training error, it performs better than both the Random Forest and the Neural Network, which make use of the empirical risk minimisation principle.

In “Evaluating multiple classifiers for stock price direction prediction” (Ballings et al., 2015), the authors benchmarked a set of ensemble prediction algorithms (Random Forest, AdaBoost and Kernel Factory) against single classifier methods (Neural Networks, Logistic Regression, Support Vector Machine and K-Nearest Neighbour), with the goal of predicting the direction of stock prices one year ahead. The tests were performed on 5767 publicly listed European companies, covering a broad range of industries, and used the area under the receiver operating characteristic curve (AUC) as the performance metric (whose values can range from 0.5 to 1, where 0.5 means the predictions are no better than random predictions and 1 means that the predictions are perfect).
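For reference, the AUC can be computed directly from its pairwise interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, counting ties as one half. The sketch below assumes binary labels encoded as 0 and 1; the example scores are invented.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    # A positive ranked above a negative counts 1, a tie counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


print(auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))  # → 0.75
print(auc([0, 1], [0.2, 0.9]))                   # → 1.0
```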

From the results gathered, the authors concluded that all the classifiers performed better than a random prediction, meaning that the AUC score was above 0.5 in all the results. The Random Forest predictor was the top performer, achieving an AUC score of 0.9037, followed by the Support Vector Machine with a score of 0.8395. The authors then assessed the stability of the results through the interquartile range (IQR), which measures the difference between the 75th and 25th percentiles: the higher the value, the larger the deviation and the less stable the algorithm, and vice-versa. The dominance of the Random Forest model was confirmed again, as it achieved the lowest IQR score of all the tested algorithms.
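The stability measure can be sketched with the Python standard library. The score lists are hypothetical, and the "inclusive" quantile method is one of several common percentile conventions, assumed here for the example.

```python
from statistics import quantiles


def interquartile_range(values):
    """Difference between the 75th and the 25th percentile of the values."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical AUC scores of two algorithms over five repeated runs
stable = [0.89, 0.90, 0.90, 0.91, 0.92]
unstable = [0.70, 0.80, 0.85, 0.90, 0.95]

print(interquartile_range(stable) < interquartile_range(unstable))
```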

A brief overview of the studied related work can be found in Table 2.1.


Table 2.1: Overview of different approaches to forecast the stock market

| Work Reference | Algorithm Used | Markets Tested | Period of Simulation | Best Performance | Results Obtained |
| --- | --- | --- | --- | --- | --- |
| (Ballings et al., 2015) | Support Vector Machine; K-nearest Neighbour; Logistic Regression; Neural Networks; Random Forest; Kernel Factory; AdaBoost | 5767 publicly listed European companies | 2009 through 2014 | Random Forest model with an AUC score of 0.9037 | All the classifiers tested achieved an AUC score above 0.5, which means that they are better than random predictions; however, the RAF model outperformed all other models tested, achieving higher predictability and stability scores |
| (Krauss et al., 2017) | Deep Neural Network; Gradient-Boosted-trees; Random Forests; Different Ensembles | S&P500 | December 1992 through October 2015 | 73% on annualised basis | Equal-weighted ensemble outperformed each individual model, but RAF is the best simple learner |
| (Kumar and Thenmozhi, 2006) | Support Vector Machine; Random Forests | S&P CNX NIFTY Market Index | 1 January 2000 through 31 May 2005 | SVM achieved 68.44% Hit Ratio, 1.04% over RAF | SVM outperformed RAF |
| (Patel et al., 2015a) | Support Vector Regression; SVR-ANN; SVR-RAF; SVR-SVR | CNX Nifty; S&P BSE Sensex | January 2003 through December 2012 | Mean Absolute Percentage Error of 2.58 with Mean Square Error of 358157.63 | Two stage approaches perform better than single stage solutions, having the SVR-ANN as the best performer |
| (Patel et al., 2015b) | Artificial Neural Network; Support Vector Machine; Random Forests; Naïve-Bayes | Reliance Industries; Infosys Ltd.; CNX Nifty; S&P500 BSE Sensex | 2003 through 2012 | Continuous-valued data: RAF achieved 83.56% accuracy. Trend-deterministic data: naïve-Bayes achieved 90.19% accuracy | RAF showed highest performance when facing continuous-valued data; however, when trend deterministic data is used, naïve-Bayes outperformed the other algorithms |
| (Booth et al., 2014) | Random Forests | DAX | 2000 through 2012 | Annualised return of 0.09 with 1.27 Sharpe Ratio | The model proposed outperformed all other models in terms of both profitability and prediction accuracy |
| (Thakur and Kumar, 2018) | Weighted Multicategory Generalised Eigenvalue Support Vector Machine (WMGEPSVM) coupled with Random Forest (RAF) (RAF-WMGEPSVM) | NASDAQ; DOW JONES; S&P500; NIFTY50; NIFTY BANK | January 2007 through December 2015 | 63.13% ROR with -5.77 MDD and 0.57 PP on the NIFTY BANK index | The proposed trading system outperformed all other classifiers considered in terms of ROR, MDD and PP in all market scenarios |


Chapter 3

Implementation

The proposed solution intends to develop an expert trading algorithm to trade over the S&P500 index, leveraging profitable investment decisions. In a first stage, a Genetic Algorithm (GA) will be used to filter out redundant and noisy features of the historical stock data. Then, a Random Forest (RAF) will predict the behaviour of the market, optimising the investment solution.

First, the overall architecture of the system is depicted, showing the different layers used to structure the solution. Subsequently, a more precise description of the implemented modules is given, describing the technologies used according to the research made in the previous chapter.

3.1 Architecture Description

By combining different technologies, the developed system aims to detect the market's entry and exit points, with the main goal of maximising the return on the initial investment while taking into consideration the number of days with capital invested in the market and the market's trend.

This thesis presents a trading system which presents a three-layer architecture composed by: a user

presentation, a data and prediction and broker layers. While each of these layers is presented in detail

below, an overview of the system can be found in Figure 3.1.

Thanks to the layered architecture, the modules contained within each layer can be maintained and extended independently, as clear interfaces between modules and layers are kept. This gives the researcher an extra level of flexibility and control over the system’s behaviour and makes the system extensible, which is a recognised principle of software engineering and systems design.

Figure 3.1: Diagrammatic representation of the layered working of the autonomous trading system. Presentation Layer: User Interface (Configuration Parameters, Results). Data Layer: Financial Data. Prediction & Broker Layer: Trend Analysis, Data Preparation, Random Forest, Stock Exchange.

To obtain a market prediction accompanied by sensible investments, the execution of the developed system can be summarised in the following steps:

1. The user starts by entering the configuration data. First, the user specifies the market data to be used, which is loaded into the financial data module. Secondly, the user selects the desired set of technical indicators from those supported by the system (shown in Table A.1); these are passed to the data preparation module, where they are encoded in the GA’s individuals and used by the system to predict the market’s movement. Thirdly, the user selects the fitness function used to assess the performance of a GA individual, which, like the previous parameter, is used by the data preparation module. Lastly, the user sets the number of levels that the algorithm should use to assess the market’s direction (an input to the trend analysis module).

2. Once the system is configured as the user intended, the trend analysis module analyses the input financial data to determine the market’s trend (sideways, downtrend or uptrend). It returns a dataset containing both the raw financial data (opening, highest, lowest and closing prices, and volume) and the determined market trend.

3. This combined information is then fed to the data preparation module, which generates two datasets, the train and test datasets. These datasets are formed by joining the selected technical indicators with the information conveyed by the previous module.

4. With both datasets built from features extracted from the unprocessed dataset, the random forest module performs two tasks. First, until the GA stopping criterion is met, it issues a performance metric, which depends on the fitness function selected by the user and helps determine the best performing individual. Then, once the stopping criterion is met, it produces a set of trading signals that are fed to the last module in the chain.

5. Finally, the predictions made by the random forest module are used as input to the stock exchange module, where the appropriate trading positions are adopted and the corresponding returns are generated.

6. At the end of the system’s execution, the results returned by the random forest and stock exchange modules are saved to files, to be analysed later by the user.

The trading expert presented in this thesis was fully developed in the Python programming language, version 2.7.10 (van Rossum and Team, 2015). Each layer and its modules are described in detail in the next sections.

3.2 Layer 1: Presentation Layer

This is the top layer of the system and is responsible for the interaction between the user and the system. In the developed system, the user specifies the execution parameters of the algorithm and the market to be studied through a command-line interface, implemented using the argparse Python library¹.

When starting the system, the user provides the path to the financial data file to be processed, which is a required argument. A set of optional parameters can also be specified: the desired set of technical indicators (by default, all the technical indicators supported by the system), the fitness function (by default, the prediction accuracy) and the number of levels used to determine the ongoing trend of the analysed market (by default, three levels: uptrend, downtrend and sideways), which the system uses to enhance the prediction of the market’s movement.
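The command-line interface described above can be sketched with argparse as follows. This is only an illustration: the option names (`--indicators`, `--fitness`, `--trend-levels`), the fitness identifiers and the file name are assumptions, since the thesis does not list them; only the required path and the three defaults (all indicators, accuracy, three trend levels) come from the text. The sketch is written for Python 3, whereas the thesis used Python 2.7.10.

```python
import argparse

# Hypothetical identifiers: the thesis does not give the actual option names.
FITNESS_CHOICES = ["accuracy", "ror_per_day", "mdp", "rrr"]


def build_parser():
    """Build a parser with one required path and three optional settings."""
    parser = argparse.ArgumentParser(description="Trading system configuration (sketch)")
    # Required: path to the CSV file holding the raw financial data.
    parser.add_argument("data_path", help="path to the financial data CSV file")
    # Optional: subset of technical indicators; None stands for "all supported".
    parser.add_argument("--indicators", nargs="+", default=None,
                        help="technical indicators to use (default: all)")
    # Optional: GA fitness function; defaults to the prediction accuracy.
    parser.add_argument("--fitness", choices=FITNESS_CHOICES, default="accuracy")
    # Optional: number of trend levels; 3 means uptrend, downtrend and sideways.
    parser.add_argument("--trend-levels", type=int, default=3)
    return parser


# Example invocation with a made-up file name.
args = build_parser().parse_args(["sp500.csv", "--fitness", "rrr"])
print(args.data_path, args.fitness, args.trend_levels, args.indicators)
```

Keeping every setting except the data path optional mirrors the defaults described in the text, so a run with only the file path uses all indicators, accuracy and three trend levels.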

On completion of the system’s execution, the program outputs two files:

– A CSV (comma-separated values) file containing the investment decisions made per day, as well as the investment earnings. The Return on Investment (ROI) is also reported at the end.

– An image containing plots of the trading signals generated by the system together with the cumulative ROI.

The following sections describe the modules’ implementation in more detail and show how these parameters influence the system.

3.3 Layer 2: Data Layer

The data layer is responsible for maintaining and managing the financial market data used by the trading system and is composed of a single module, the financial data module. In its simplest form, this module can be seen as a dynamic data structure in which rows and columns can be added, deleted and edited.

¹ https://docs.python.org/2.7/library/argparse.html, last accessed October 5th, 2017

The financial data module acquires and stores raw daily financial data of the user-specified market. As imported, the raw financial data is composed of the following attributes:

– Date - the date of the trading day to which the remaining information corresponds.

– Open - the price at which the stock first trades when the exchange opens on a given day.

– High - the highest price at which the stock traded during the trading period.

– Low - the lowest price at which the stock traded during the trading period.

– Close - the price at which the stock last traded on a given trading day. The closing price is the most up-to-date price of a security until trading resumes on the next trading day.

– Adjusted Close - the closing price amended to reflect factors such as the company’s dividends and corporate actions (from stock splits to rights offerings) that could have occurred between the current day’s close and the next day’s open.

– Volume - the number of shares exchanged during the trading period.

From an analyst’s point of view, a stock’s adjusted closing price gives a more accurate representation of a company’s equity value than the simple market price, so it is often preferred when performing a detailed analysis of historical returns. Since the main objective is to predict the market’s future behaviour, adjusted closing prices are more suitable than regular closing prices. This module therefore discards the regular closing price, which reduces the number of redundant and noisy features.

The raw financial data is provided to the system as a CSV (comma-separated values) file. The information in the CSV file is imported into a data frame using the Pandas Python library², which provides high-performance, easy-to-use data structures and data analysis tools. A data frame is a 2-dimensional data structure that can be seen as a set of observations (rows) with multiple variables (columns). It supports row and column operations (such as slicing, date shifting and lagging, or handling of missing data) and complex computations (such as moving window statistics and linear regressions), which help extract valuable information from complex datasets.

² https://pandas.pydata.org, last accessed September 15th, 2017

The Open, High, Low, Adjusted Close and Volume features of the raw financial data are carried on to the prediction and broker layer of the system, where further transformations are applied to this initial dataset to enhance its prediction capabilities.
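As a rough illustration of this import step, the sketch below loads a CSV file into a Pandas data frame and discards the regular closing price. The column names (in particular `Adj Close`) and the in-memory sample with synthetic values are assumptions made for the example; the thesis does not specify the exact file header.

```python
import io

import pandas as pd

# In-memory stand-in for the user's CSV file; the values are synthetic.
RAW_CSV = """Date,Open,High,Low,Close,Adj Close,Volume
2017-01-02,100.0,102.0,99.0,101.0,100.5,1000
2017-01-03,101.0,103.0,100.0,102.0,101.5,1200
"""


def load_financial_data(source):
    """Load daily data into a data frame and drop the regular closing price."""
    frame = pd.read_csv(source, parse_dates=["Date"], index_col="Date")
    # Keep Open, High, Low, Adjusted Close and Volume; discard Close.
    return frame.drop(columns=["Close"])


data = load_financial_data(io.StringIO(RAW_CSV))
print(list(data.columns))
```

Indexing the frame by date keeps later operations such as shifting and rolling windows aligned with trading days.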


3.4 Layer 3: Prediction & Broker Layer

The core of the system lies within this layer. It is responsible for most of the transformations applied to the raw financial data (imported into the system as described in Section 3.3), for the feature selection process, which selects the best set of features to be fed to the ML algorithm, and, finally, for the market simulation that uses the market predictions to estimate the ROI of the trading algorithm.

As can be seen in the overview of the system in Figure 3.1, this layer is composed of multiple chained modules, so that the output of the first module is the input of the second, and so forth. However, the more complex modules, namely the random forest and the stock exchange, feed performance and financial metrics, respectively, back up the chain. These metrics influence the behaviour of the data preparation module, as described in detail below.

Each of the aforementioned modules is composed of submodules, which are described in the following subsections.

3.4.1 Trend Analysis Module

The trend analysis module is computationally the simplest module of the layer. It adds a new feature, the market’s trend, to the initial dataset it receives as input, and it also adds the labels that the Machine Learning (ML) algorithm uses to learn historical market behaviour and to forecast the market’s performance.

Market’s Trend Analysis

The ongoing trend in the analysed market can be assessed through simple comparisons of the market’s price signal with a trend-following technical indicator. The indicator chosen to determine the market’s trend was the Moving Average (MA), whose full description and computation can be found in Section 2.3.2.1. It acts as a smoothed version of the price action, filtering out sudden local price changes.

To determine the trend that the price was following, it was fundamental to use an MA that closely trails the price action. We therefore decided to use a 50-day moving average, which is considered to be a fast moving average. A moving average is said to be fast when the number of periods over which the mean is computed is small, which reduces the lag that the computation of the indicator introduces.

The value of the trend feature can be downtrend (-1), uptrend (1) or sideways (0), depending on the relation between the price signal and the moving average. To assess the value of the feature for a given trading day, one starts by computing the arithmetic difference between the market’s price and the price’s MA:

– If the price signal makes a clear crossing above the moving average signal, so that the difference between the two values is positive, the market is said to be in an uptrend and the trend feature’s value is 1.

– If the price signal makes a clear crossing below the moving average signal, so that the difference between the two values is negative, the market is in a downtrend and the trend feature’s value is -1.

– In a third scenario the market is in a sideways trend, which can arise when the market is highly volatile. This type of trend is characterised by the price swaying within bounds, as described in Section 2.1, and crossing the moving average signal multiple times in both directions. Therefore, to compute the value of the trend feature, it was necessary to count the number of crossings of the price action over the MA signal within a given period of time. Given the behaviour of the market price under study, we considered a 7-day period to be the most sensible value. So, if the price crosses the moving average signal more than a fixed number of times within a 7-day time frame, the market’s trend is sideways and the value of the feature is 0.

To compute the MA value, we relied on a third-party Python library named TA-Lib³, which is widely used by developers of market analysis systems to perform technical analysis of financial markets.
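The trend rules above can be sketched as follows. To keep the example self-contained, the moving average is computed with a Pandas rolling mean instead of TA-Lib, and the crossing threshold `max_crossings=2` is a hypothetical value, since the text only speaks of "a fixed number of times". Treating the warm-up days without a defined MA as sideways is also an assumption of this sketch.

```python
import numpy as np
import pandas as pd


def trend_feature(price, ma_window=50, cross_window=7, max_crossings=2):
    """Return 1 (uptrend), -1 (downtrend) or 0 (sideways) per trading day."""
    ma = price.rolling(ma_window).mean()
    # Sign of the arithmetic difference between the price and its MA.
    side = np.sign(price - ma)
    # A crossing happens when that sign flips from one day to the next.
    crossed = ((side * side.shift(1)) < 0).astype(int)
    # Count the crossings inside the trailing window (7 days in the text).
    crossings = crossed.rolling(cross_window, min_periods=1).sum()
    trend = side.copy()
    # More than `max_crossings` (assumed threshold) marks a sideways market.
    trend[crossings > max_crossings] = 0
    # Days without a defined MA (warm-up period) are treated as sideways here.
    return trend.fillna(0).astype(int)


# Synthetic, steadily rising prices with a short MA window for the demo.
rising = pd.Series(range(1, 21), dtype=float)
print(trend_feature(rising, ma_window=5).tolist())
```

A short `ma_window` is used in the demo only so that a 20-point synthetic series produces defined values; the thesis uses a 50-day window.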

Label Creation

When using supervised learning algorithms, it is important to formulate a meaningful set of targets (labels) for the analysed dataset, so that a learner can build a concise model relating the data’s features to its labels by identifying common patterns between occurrences. Since the main objective of this thesis is to develop a trading system capable of forecasting the behaviour of the studied market, the most logical set of labels is the one that conveys the notion of market movement to the system. In its simplest form, the market’s movement can be regarded as positive or negative, depending on the stock’s closing price of the current day in comparison with the previous day’s closing price.

For this thesis, as mentioned in the chapter’s introduction, we propose a binary classifier, i.e., the system forecasts whether the variation of the signal that represents the market’s price is positive or negative. Since the price variation can only be positive or negative, the label can be treated as a binary variable, i.e., label ∈ {0, 1}, where 1 represents a positive (or null) variation of the price signal and 0 represents a negative variation. This assignment is defined in Equation 3.1.

\[
\text{label} =
\begin{cases}
1 & \text{if } \dfrac{close_{t} - close_{t-1}}{close_{t-1}} \geq 0 \\[4pt]
0 & \text{if } \dfrac{close_{t} - close_{t-1}}{close_{t-1}} < 0
\end{cases}
\tag{3.1}
\]

Where $close_{t-1}$ represents the stock’s closing price of the previous day and $close_{t}$ represents the closing price of the current trading day.

³ https://www.ta-lib.org, last accessed September 15th, 2017

Once the labels and the new feature are created and added to the raw financial data received by this module, the trend analysis module outputs the resulting dataset, which is passed to the data preparation module, whose operation is outlined in the following subsection.
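The label assignment of Equation 3.1 can be illustrated with a few lines of Pandas. The function name and the synthetic prices are assumptions of this sketch. Dropping the first day, which has no previous close, is also a choice made here; the thesis does not state how that day is handled.

```python
import pandas as pd


def make_labels(close):
    """Label each day 1 if the relative price change is >= 0, else 0."""
    variation = (close - close.shift(1)) / close.shift(1)
    labels = (variation >= 0).astype(int)
    # The first day has no previous close, so its label is undefined; drop it.
    return labels.iloc[1:]


# Synthetic closing prices: up, unchanged, down.
close = pd.Series([100.0, 101.0, 101.0, 99.5])
labels = make_labels(close)
print(labels.tolist())
```

Note that an unchanged price is labelled 1, in agreement with the "greater than or equal to zero" condition of the equation.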

3.4.2 Data Preparation Module

The data preparation module receives a data frame that works as a blank canvas for the system: from this point on, only features derived from the computed technical indicators are added to it. The module outputs two data frames, one containing the training data and the other the test data. The initial data frame is composed of the five main features of the daily financial data (open, high, low and adjusted close prices, and traded volume), together with the trend feature and the daily labels described in the previous subsection.

As stated previously, this module is responsible for computing the selected technical indicators, which help the system to better perceive the market’s behaviour. To reduce the number of features processed by the learning algorithm, removing potentially redundant, irrelevant and noisy features and thereby enhancing the forecast’s accuracy, we opted to use a Genetic Algorithm (GA), whose efficiency was demonstrated in the works of Sikora and Piramuthu (2007), Kim and Han (2000) and Tsai and Hsiao (2010). The GA is described in detail in the next section.

Considering all the computations performed in this module and its importance for the success of the trading system, multiple independent submodules were defined, both to give a deeper insight into how the module works as a whole and to make it as resilient as possible to further development changes. A closer look at the module’s architecture can be found in Figure 3.2, where each submodule, as well as its inputs and outputs, is presented.

Figure 3.2: Diagrammatic representation of the data preparation submodules. Data Preparation Module: Genetic Algorithm, Technical Analysis, Split Train & Test (inputs: Trend Information, Financial Data; internal links: Indicators parameters, Data to split). Random Forest Module: Random Forest Classifier, fed with the Train & Test Financial Data with indicators optimised.

3.4.2.1 Genetic Algorithm

At its core, the genetic algorithm is an evolutionary algorithm that tries to find an optimal solution in a vast search space, as presented in Section 2.4.2.1. The set of candidate solutions maintained by the GA is called a population and each solution is called an individual. Through repeated comparisons of the individuals’ fitness values, the algorithm tends to converge to the solution with the best performance in the population. However, most of the development effort lies in choosing the individual’s structure that best suits the needs of the system, the inner processes of the algorithm (selection, mutation and crossover), and the configuration parameters that affect the algorithm’s performance, such as the population size and the number of generations.

As the search space of the optimisation problem that the GA tries to solve is very large (it includes all the possible combinations of the technical indicators’ parameters), the population has to be large enough to allow the search space to be properly explored. However, as the number of individuals grows, the computational effort required by the algorithm also increases, so a balance between population size and computational cost has to be struck.

To develop this module and to make it possible to implement several of the strategies shown in Section 2.4.2.1, the system’s genetic algorithm was built on the DEAP framework (Fortin et al., 2012), a Python framework for evolutionary algorithms.

The components of the developed algorithm are explained next.

Genetic Algorithm Importance

The genetic algorithm is of key importance for the optimisation of the developed trading system, since it addresses two problems:

– Which are the best features to be used by the ML algorithm - also known as feature selection, this is the problem of selecting the best N features to be used by the algorithm in order to boost the learner’s performance, “cleaning” those that are noisy and redundant, as explained in Section 2.4.1.

– Which are the best parameters for the technical indicators - as different indicators can have multiple parameters to be optimised (as shown in Table A.1), the best values for these parameters cannot be inferred without testing all the possible solutions. The genetic algorithm tackles this problem by searching for the individual that outperforms the rest of the population.

Genetic Algorithm Evolution Process

From a macroscopic point of view, the evolution process of the genetic algorithm can be represented

by a flowchart, as seen in the Figure 3.3.

Figure 3.3: Genetic Algorithm Evolution Process Overflow. Flowchart nodes: Initialisation, Evaluation, the decision “Is termination criterion met?” (Yes: Termination; No: Selection, Crossover, Mutation).

In the aforementioned flowchart, one can distinguish five different processes, whose behaviour can be described as follows:

– Initialisation - creates a population of randomly generated individuals.

– Evaluation - computes the fitness value of each individual, according to the fitness function in use.

– Selection - once the fitness values are computed, the algorithm chooses pairs of parent chromosomes to generate offspring, according to the selection procedure adopted.

– Crossover - the selected parents swap genes at randomly generated crossover points.

– Mutation - selected offspring have one or more genes randomly changed.

The processes carried out by the algorithm (evaluation, selection, crossover and mutation), as well as the chromosome structure used by the developed solution, are reviewed in the following sections.
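The five processes listed above can be put together in a minimal, dependency-free loop. This sketch does not use DEAP and is not the thesis implementation. The population size, number of generations, tournament size of 3 and the toy fitness function (`sum`) are placeholders chosen for the example; only the 50% crossover probability, the 20% mutation probability and the gene range of 2 to 100 are taken from the text.

```python
import random


def evolve(fitness, n_genes, pop_size=20, generations=15, seed=0,
           cx_prob=0.5, mut_prob=0.2, low=2, high=100):
    """Run a plain generational GA and return the fittest individual seen."""
    rng = random.Random(seed)
    # Initialisation: random individuals with integer genes in [low, high].
    population = [[rng.randint(low, high) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):  # termination criterion: generation budget
        # Selection: each slot is filled by a copy of a 3-way tournament winner.
        parents = [max(rng.sample(population, 3), key=fitness)[:]
                   for _ in range(pop_size)]
        # Crossover: adjacent pairs swap a random slice with probability cx_prob.
        for a, b in zip(parents[::2], parents[1::2]):
            if rng.random() < cx_prob:
                i, j = sorted(rng.sample(range(n_genes + 1), 2))
                a[i:j], b[i:j] = b[i:j], a[i:j]
        # Mutation: with probability mut_prob, redraw one random gene.
        for child in parents:
            if rng.random() < mut_prob:
                child[rng.randrange(n_genes)] = rng.randint(low, high)
        population = parents
        # Evaluation: keep track of the fittest individual seen so far.
        best = max(population + [best], key=fitness)
    return best


# Toy fitness (sum of the genes) standing in for the random forest metric.
best = evolve(sum, n_genes=6)
print(best, sum(best))
```

In the actual system the fitness value would come from the random forest module, so each evaluation is far more expensive than in this toy example.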

Chromosome Representation

The chromosome is represented as an array of integers, where each gene, depending on its position, has a different meaning for the genetic algorithm. As mentioned in Section 3.2, the user chooses the set of indicators that the system uses to predict the market’s behaviour, and the size of this set determines the size and split of an individual’s structure.

A gene in the first part of the chromosome holds a parameter used in the computation of a technical indicator, while a gene in the second part dictates the presence of a certain indicator in the data frame that is fed to the ML algorithm.

While the size of the second part is given by the number of technical indicators specified by the user, the size of the first part is determined by the indicators’ parameters to optimise. Since an external library (see Section 3.4.1) performs the computation, each technical indicator has a different number of parameters, as indicated in Table A.1, which have to be explicitly passed to the library; this can result in first parts of different sizes.

The structure of the chromosome is presented in Figure 3.4, where its split can be seen to depend on the set of indicators chosen by the user.

Figure 3.4: Chromosome Representation. The chromosome split separates the Indicators Parameters genes (TI p1, TI p2, TI p3, TI p1, TI p1, TI p2) from the Indicators Presence genes (TI1 presence, TI2 presence, TI3 presence).

By using the same gene structure throughout an individual and by determining the chromosome’s split dynamically, we keep the solution simple to develop and modular. The value of each gene is an integer randomly drawn between 2 and 100, since some technical indicators cannot be computed with parameter values below 2, and values above 100 would introduce too much lag in the indicator’s value, possibly making it useless.

With the structure of the chromosome defined, the evolution of the genetic algorithm depends on the evaluation of the population’s individuals, which is detailed in the next sections.
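A possible encoding of this two-part chromosome is sketched below. The indicator names and their parameter counts in `PARAM_COUNTS` are hypothetical stand-ins for the contents of Table A.1, and the presence gene is returned as the raw integer because the text does not state how its value is mapped to "present" or "absent".

```python
import random

# Hypothetical parameter counts; the real ones are listed in Table A.1.
PARAM_COUNTS = {"SMA": 1, "MACD": 3, "RSI": 1}


def random_chromosome(indicators, rng, low=2, high=100):
    """Create the parameter genes followed by one presence gene per indicator."""
    n_params = sum(PARAM_COUNTS[name] for name in indicators)
    return [rng.randint(low, high) for _ in range(n_params + len(indicators))]


def decode(chromosome, indicators):
    """Split a chromosome into per-indicator parameters and presence genes."""
    split = len(chromosome) - len(indicators)  # dynamic split point
    params, presence = chromosome[:split], chromosome[split:]
    decoded, start = {}, 0
    for name, flag in zip(indicators, presence):
        count = PARAM_COUNTS[name]
        decoded[name] = {"params": params[start:start + count],
                         "presence_gene": flag}
        start += count
    return decoded


indicators = ["SMA", "MACD", "RSI"]
example = [14, 12, 26, 9, 20, 55, 7, 81]  # 5 parameter genes + 3 presence genes
print(decode(example, indicators))
```

Because the split point is derived from the indicator list, the same functions work for any subset of indicators chosen by the user.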


Individuals Evaluation Step

As mentioned above, once the population is defined, and every time it changes, its individuals have to be evaluated to assess their fitness value. This process depends on the fitness function specified by the user when running the system.

The fitness function takes a candidate solution to the problem and assigns it a real value, which determines how “fit” the solution is for the problem at hand. Not only does this process have to be implemented as efficiently as possible (so as not to create a bottleneck in the system, since it is repeated until the algorithm reaches a stopping criterion), but it should also clearly distinguish the best from the worst individuals, giving them the highest and lowest scores, respectively.

As the genetic algorithm converges towards the best solution for the problem at hand, the values assigned by the function also tend to increase, thus steering the system’s performance in the right direction.

The definition of the fitness function is critical to the performance of the system, and different functions may lead the system to evolve differently, so each developed fitness function was tested and analysed. As previously mentioned, the main goal of the trading system is to maximise the ROI, but related trading concepts, such as the return on the investment made, daily profits and the amount of time spent with capital invested, also had to be taken into consideration when developing the different fitness functions. Accuracy, a popular metric for assessing the quality of forecasting algorithms, was also used as a fitness function. These concepts are captured by the following fitness functions:

1. Accuracy - The classification accuracy expresses the number of correct predictions as a fraction of the total number of predictions made, and can be computed as follows:

\[
\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{\text{Total Number of Predictions}}
\tag{3.2}
\]

Where Total Number of Predictions is equal to the sum of True Positives, False Positives, True Negatives and False Negatives.

2. Rate of Return per day (ROR/day) - The rate of return per day expresses the profit obtained, as a percentage, per day in which the algorithm decided that it was best to invest, and is computed as seen in Equations 3.3:

\[
\text{ROR} = \frac{\text{Final Capital} - \text{Initial Capital}}{\text{Initial Capital}} \times 100
\tag{3.3a}
\]

\[
\text{ROR/day} = \frac{\text{ROR}}{\text{Days Investing}}
\tag{3.3b}
\]

3. Mean of the Daily Profit (MDP) - The mean daily profit is the mean of the profit obtained per investing day. The value of the metric can be determined analytically by the set of Equations 3.4:

\[
\text{Profit}_{t} =
\begin{cases}
\dfrac{close_{t} - close_{t-1}}{close_{t-1}} & \text{if } position_{t-1} = \text{Long} \\[4pt]
\dfrac{close_{t-1} - close_{t}}{close_{t}} & \text{if } position_{t-1} = \text{Short} \\[4pt]
0 & \text{if } position_{t-1} = \text{Neutral}
\end{cases}
\tag{3.4a}
\]

\[
\text{Mean Daily Profit} = \frac{\text{Profit}_{t_1} + \dots + \text{Profit}_{t_N}}{N}
\tag{3.4b}
\]

Where $close_{t}$ refers to the present day’s closing price, $close_{t-1}$ refers to the previous day’s closing price and $position_{t-1}$ indicates the position adopted by the system on the previous day, which, as stated in Equation 3.4a, determines how the daily profit is calculated.

4. Risk Return Ratio (RRR) - The risk return ratio assesses the performance of a given individual based on the computed ROR, determined as presented in Equation 3.3a, and on the Maximum Drawdown (MDD) of the investment represented by the individual, a metric used to ascertain an investment’s financial risk. The MDD can be interpreted as the maximum possible loss within a specific recorded period of time (between $t_x$ and $t_y$); its computation is given in Equation 3.5a, where Max stands for a local maximum and Min for a subsequent minimum. Once the investment’s return reaches a new high, the objective is to track the percentage change between that high and the lowest subsequent trough. A large percentage change means that the asset in which the investment has been made is highly volatile.

The value of this fitness function can be computed by the set of Equations 3.5:

\[
\text{Max Drawdown} = \text{Max}\, ROR_{t_x} - \text{Min}\, ROR_{t_y}
\tag{3.5a}
\]

\[
\text{RRR} = \frac{\text{ROR}}{\text{Max Drawdown}}
\tag{3.5b}
\]
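The four fitness functions can be sketched in plain Python as below. Function names, the position strings and the sample values are assumptions of this example. Three further choices are also assumptions rather than facts from the thesis: accuracy counts every correct prediction (true positives and true negatives), the mean daily profit divides by the number of evaluated days, and the risk return ratio returns infinity when no drawdown occurs.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def rate_of_return(initial_capital, final_capital):
    """ROR as a percentage of the initial capital (Equation 3.3a)."""
    return (final_capital - initial_capital) / initial_capital * 100


def ror_per_day(initial_capital, final_capital, days_investing):
    """ROR divided by the number of days with capital invested (Equation 3.3b)."""
    return rate_of_return(initial_capital, final_capital) / days_investing


def daily_profit(close_prev, close_now, position_prev):
    """Profit of one day given the position held on the previous day (3.4a)."""
    if position_prev == "long":
        return (close_now - close_prev) / close_prev
    if position_prev == "short":
        return (close_prev - close_now) / close_now
    return 0.0  # neutral position


def mean_daily_profit(closes, positions):
    """Mean of the daily profits; N is taken as the number of evaluated days."""
    profits = [daily_profit(closes[t - 1], closes[t], positions[t - 1])
               for t in range(1, len(closes))]
    return sum(profits) / len(profits)


def max_drawdown(ror_series):
    """Largest drop from a running peak to a subsequent trough (3.5a)."""
    peak, worst = ror_series[0], 0.0
    for value in ror_series:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst


def risk_return_ratio(ror_series):
    """Final ROR divided by the maximum drawdown (3.5b)."""
    drawdown = max_drawdown(ror_series)
    # Assumption: a series without any drawdown gets an infinite ratio.
    return float("inf") if drawdown == 0 else ror_series[-1] / drawdown


# Synthetic cumulative ROR series (in percent) used only for illustration.
print(max_drawdown([0.0, 10.0, 4.0, 12.0, 9.0]),
      risk_return_ratio([0.0, 10.0, 4.0, 12.0, 9.0]))
```

Each function returns a single real value that increases with the quality of the individual, which is what the evaluation step requires of a fitness function.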

Parents Selection Step

Once the evaluation process used to assess an individual’s fitness value is defined, individuals, also known as parents, are chosen in a probabilistic manner to generate new offspring through the recombination of genes belonging to both parents. How the selection is carried out depends on the chosen method.

In the developed system, selection is performed with the Tournament Selection process. To select the individuals to mate, it runs several tournaments between individuals of the population, and the winner of each tournament is selected to perform crossover at a later point in the execution (a deeper insight into this selection method can be found in Section 2.4.2.1).

With this procedure, one of the key parameters that influences its behaviour is the tournament size, which dictates a weak individual’s chance of performing crossover. The bigger the tournament, the smaller the chance that weaker individuals proliferate. To determine the tournament size for the developed trading system, we analysed how stable the price signal was and how well it adhered to trends. From this analysis, we concluded that weaker individuals could be allowed to spawn new offspring, and thus chose to use small tournament sizes.
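The effect of the tournament size can be seen in a small stand-alone sketch. It is written in plain Python rather than with DEAP, the function and variable names are illustrative, and contestants are drawn without replacement, which is a choice of this example.

```python
import random


def tournament_select(population, fitness, n_parents, tournament_size, rng):
    """Pick n_parents winners, each the fittest of a random tournament."""
    return [max(rng.sample(population, tournament_size), key=fitness)
            for _ in range(n_parents)]


rng = random.Random(1)
population = [[3], [9], [5], [1]]  # toy individuals; fitness is the gene sum
# A tournament as large as the population always returns the fittest individual.
print(tournament_select(population, sum, 5, 4, rng))
# A small tournament lets weaker (but not the weakest) individuals win sometimes.
print(tournament_select(population, sum, 5, 2, rng))
```

With small tournaments the selection pressure is lower, which is the behaviour the text argues for when allowing weaker individuals to reproduce.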

Crossover Step

The basis of the crossover operand implemented in this system, which has a 50% probability of

generating new offsprings, is the same as the original Two-Point Crossover method, which behaviour can

be reviewed over the Section 2.4.2.1, that was previously implemented in the DEAP library, however due

to the developed chromosome structure, as seen in the Figure 3.4, a new function that could simulate

the performance of the default method had to be implemented.

Since each split of the chromosome optimises a different problem, the default crossover method cannot be used directly: it swaps the individuals’ genes that fall between two randomly selected crossover points without being aware of the defined chromosome topology, which would corrupt the chromosome structure and may cause the algorithm to fail.

As stated previously, the custom function relies on the Two-Point Crossover method that DEAP provides as a native function. The implemented crossover operator starts by splitting the chromosome structure into two sub-structures, one responsible for the optimisation of the indicators’ parameters and the other responsible for the selection of the processed indicators. Then, the selected crossover operator is applied to each sub-chromosome, which yields four partial offspring (two for each part of the chromosome) that are bound together to form two offspring.
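The split-and-rejoin idea described above can be sketched as follows. This is a hedged illustration rather than the thesis implementation: the two-point crossover is a plain-Python stand-in for DEAP's `tools.cxTwoPoint`, the function names are hypothetical, and we assume the first `n_params` genes hold the indicators' parameters while the remaining genes hold the selection flags.

```python
import random


def two_point_crossover(a, b, rng=random):
    """Swap the genes lying between two random cut points."""
    lo, hi = sorted(rng.sample(range(len(a) + 1), 2))
    child_a = a[:lo] + b[lo:hi] + a[hi:]
    child_b = b[:lo] + a[lo:hi] + b[hi:]
    return child_a, child_b


def split_aware_crossover(parent_a, parent_b, n_params, rng=random):
    """Cross each sub-chromosome separately, then bind the halves together again."""
    # Assumed layout: [parameter genes | selection genes].
    params_a, select_a = parent_a[:n_params], parent_a[n_params:]
    params_b, select_b = parent_b[:n_params], parent_b[n_params:]
    new_params_a, new_params_b = two_point_crossover(params_a, params_b, rng)
    new_select_a, new_select_b = two_point_crossover(select_a, select_b, rng)
    # Four partial offspring become two complete offspring.
    return new_params_a + new_select_a, new_params_b + new_select_b
```

Because the cut points are drawn independently inside each sub-chromosome, a parameter gene can never be swapped into a selection position, which is the property the default operator cannot guarantee.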

Once this process ends, the newly generated offspring undergo mutation, which is described next.

Mutation Step

The mutation operator implemented in this system has, at its core, the same behaviour as the Uniform Integer Mutation method, which can be seen as a Function-based Mutation (explained in Section 2.4.2.1), since it uniformly draws an integer from an interval (defined by specified lower and upper bounds) to replace individuals’ features.

The overall mutation rate is set to 0.2, which means that each new candidate solution generated by the aforementioned crossover operator has a 20% probability of undergoing mutation.

Due to the chromosome topology, we encountered the same problem as with the crossover operator. Consequently, the default mutation operator could not be used and a custom function had to be implemented.

The mutation operator was developed in much the same way as the crossover operator described in the previous section; the only distinction between the two lies in the process applied. Therefore, in order to mutate an individual’s features, mutation had to be applied to each of its sub-topologies. As mentioned before, the custom function bases its behaviour on the Uniform Integer Mutation, which is a default mutation process of the DEAP library.
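A minimal sketch of such a split-aware mutation is given below. It is an illustration under stated assumptions: the uniform integer mutation is a plain-Python stand-in for DEAP's `tools.mutUniformInt`, the function names and the bounds are hypothetical, the chromosome layout is assumed to be parameter genes followed by selection genes, and the default `indpb` of 0.10 is the independent gene mutation probability listed in Table 4.1. The overall 20% mutation probability would be applied by the caller when deciding whether an offspring is mutated at all.

```python
import random


def uniform_int_mutation(genes, low, up, indpb, rng=random):
    """Replace each gene, with probability indpb, by an integer drawn uniformly from [low[i], up[i]]."""
    return [
        rng.randint(low[i], up[i]) if rng.random() < indpb else gene
        for i, gene in enumerate(genes)
    ]


def split_aware_mutation(individual, n_params, param_bounds, select_bounds,
                         indpb=0.10, rng=random):
    """Mutate each sub-chromosome with its own bounds, then rejoin the two parts."""
    param_low, param_up = param_bounds
    select_low, select_up = select_bounds
    params = uniform_int_mutation(individual[:n_params], param_low, param_up, indpb, rng)
    select = uniform_int_mutation(individual[n_params:], select_low, select_up, indpb, rng)
    return params + select
```

Keeping separate bounds per sub-chromosome ensures that a selection gene is never replaced by a value that is only valid as an indicator parameter.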

Termination Condition

The termination condition of the implemented genetic algorithm can affect the quality and speed of the search, so it should prevent premature termination and avoid needless computation. From an evolutionary algorithm point of view, one can benefit from three standard termination conditions:

– A fixed number of generations.

– The population’s highest fitness value shows no improvement for a defined number of generations.

– The population’s highest fitness value reaches a defined threshold, which is specific to the studied problem.

Due to the characteristics of the market, it is very hard to define a threshold that corresponds to good enough performance. If the defined threshold is too low, the algorithm terminates prematurely and the trading system is limited in performance. On the other hand, if the threshold is too high, the system may never generate an individual whose fitness value reaches the defined bound. Hence, only the first two standard termination conditions suit the developed system. Since the algorithm’s performance in simpler executions was very stable and efficient, using a fixed number of generations seemed to be the best choice, as the execution of the GA will not deteriorate the global performance of the trading system.
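The two termination conditions that suit the system can be sketched as a single check. This is only an illustrative helper: the name `should_stop` and the `patience` argument are hypothetical, and the default of 20 generations is the value listed in Table 4.1. Leaving `patience` as `None` reproduces the fixed-generation rule that was adopted.

```python
def should_stop(best_fitness_history, max_generations=20, patience=None):
    """Return True when the GA should stop.

    best_fitness_history: best fitness of each completed generation, oldest first.
    max_generations: fixed generation budget.
    patience: optional stagnation limit; None disables the stagnation rule.
    """
    if len(best_fitness_history) >= max_generations:
        return True
    if patience is not None and len(best_fitness_history) > patience:
        best_before = max(best_fitness_history[:-patience])
        best_recent = max(best_fitness_history[-patience:])
        # Stop when the last `patience` generations brought no improvement.
        return best_recent <= best_before
    return False
```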

When the termination condition is fulfilled, the genetic algorithm terminates its execution and saves the individual with the highest fitness score of the overall population, which is then fed to the next submodule, the Technical Analysis submodule.

3.4.2.2 Technical Analysis

By applying technical analysis to the studied market, this submodule is responsible for forming a new data frame containing the technical indicators’ values coupled with the previously created features and labels, which the system’s learner will use to grasp the market’s characteristics. Note that the technical indicators available in this system were specified in Section 2.3.2.

One of the key decisions to be made regarding the technical analysis performed by the system is where in the execution it should be computed. Some systems start by computing every possible parameter value for each supported technical indicator. However, since the developed system tries to reduce the larger set of user-specified indicators to a smaller, yet more accurate, set, computing all of the indicators’ possibilities beforehand seemed to be a waste of computational power.

This submodule brings major value to the performance of the system, since it is used together with the genetic algorithm, essentially in the Evaluation step of the execution, to give a glimpse into the system’s global performance. Its execution starts by taking an individual, which carries both the indicators’ parameters and a set of integers used to select the indicators that will be computed. Then, through TA-Lib (previously described in Section 3.4.1), technical analysis is applied to the selected financial market using the parameters carried by the individual, computing the technical indicators’ values.

As mentioned, not all specified technical indicators are computed; which ones are computed depends on the feature values of the second sub-structure of the individual, i.e., the split of the chromosome that does not convey the parameters of the indicators. Once the set of selected indicators is obtained, it is split into complex and non-complex (simple) indicators, in accordance with the number of parameters necessary to compute each technical indicator (shown in detail in Table A.1). Then, the technical indicators are computed, producing a data frame per indicator. These data frames are then coupled with the original data frame, which holds the financial data’s features and labels, resulting in a final data frame that is later split into training and testing subsets.
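The decoding of an individual into the indicators to be computed can be sketched as follows. Everything named here is hypothetical: the catalogue of three indicators and their parameter counts is invented for illustration (the actual list is the one in Table A.1), the chromosome is assumed to hold all parameter genes first and one selection flag per indicator afterwards, and an indicator is treated as simple when it needs exactly one parameter.

```python
# Hypothetical catalogue: indicator name -> number of integer parameters it needs.
CATALOGUE = {"SMA": 1, "RSI": 1, "MACD": 3}


def decode_individual(individual, catalogue=CATALOGUE):
    """Return (simple, complex) dicts mapping selected indicators to their parameters."""
    n_params = sum(catalogue.values())
    params, flags = individual[:n_params], individual[n_params:]
    selected, cursor = {}, 0
    for name, flag in zip(catalogue, flags):
        width = catalogue[name]
        if flag:  # a non-zero selection gene means "compute this indicator"
            selected[name] = tuple(params[cursor:cursor + width])
        cursor += width
    simple = {n: p for n, p in selected.items() if len(p) == 1}
    complex_ = {n: p for n, p in selected.items() if len(p) > 1}
    return simple, complex_
```

Each returned entry would then be handed to the technical analysis library, so only the selected indicators are ever computed.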

3.4.2.3 Train & Test Data Split

As the last submodule of the data preparation module, it is responsible for the final data frame operation, whose result is then outputted.

This submodule is responsible for splitting both the features and labels into two subsets, addressed as the train and test sets, respectively. Whilst the former holds all the information that helps the ML algorithm to formulate meaningful correlations between the training features and labels, the latter conveys the features that, in conjunction with the test labels, are used to assess the fitness of the algorithm. Usually, developers rely on machine learning metrics such as accuracy, precision and recall to assess this fitness; however, in the presented trading system the performance of the algorithm is determined by the fitness functions presented previously.

For these tasks we relied on the Scikit-learn Python library (Pedregosa et al., 2011), which provides a method that, depending on the desired percentage of testing data, splits the inputted data into four subsets (features and labels for each of the train and test phases).
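The shape of this four-way split can be illustrated with a small plain-Python sketch. It does not call Scikit-learn; it only mirrors the order in which that library's `train_test_split` returns its outputs (train features, test features, train labels, test labels). The function name is hypothetical, and keeping the rows in chronological order, with the most recent ones used for testing, is an assumption of this sketch.

```python
def chronological_split(features, labels, test_fraction=0.2):
    """Split time-ordered rows into train/test features and labels (four subsets)."""
    if len(features) != len(labels):
        raise ValueError("features and labels must have the same length")
    n_test = int(round(len(features) * test_fraction))
    cut = len(features) - n_test
    # The most recent rows are kept for testing (assumption of this sketch).
    return features[:cut], features[cut:], labels[:cut], labels[cut:]
```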

3.4.3 Random Forest Module

This module, due to the machine learning nature of the developed trading system, can be regarded as the “brain” of the system, in the sense that it is responsible for the Random Forest (RAF) algorithm used to foresee how the market will adjust its behaviour when facing price changes.


As stated in the Section 2.4.3.1, Random Forest is an agglomerate of tree predictors such that each

tree sprouts from a random vector sampled independently. The generalization error depends on the

strength of each of the base learners and the correlation between them (Breiman, 2001). In the light of

this fact, the randomly generated trees are grown to their full extent with the aim of keeping full prediction

accuracy of each decision tree, whilst inducing diversity among them.

The RAF module, depending on the point of execution of the system, can operate in one of two modes when it receives the preprocessed data from the data preparation module, as shown by the diagram in Figure 3.5:

– While the GA, whose operation was detailed in the previous section, is still in its converging phase, which means that the best solution for the problem has not been found yet, this module is responsible for yielding the fitness value of an individual.

– On the other hand, when the GA ends its execution by reaching the termination criterion, this module is responsible for yielding the prediction obtained, which will then be used by the stock exchange module to make profitable trading decisions.

[Figure 3.5, diagram labels: Data Preparation Module (Genetic Algorithm; Train & Test; financial data with indicators optimised); train data and test data; System’s Performance Assessment (Random Forest classifier, Performance Estimation, performance metrics); Stock Movement Prediction (Random Forest classifier; train features, train labels, test features; stock behaviour prediction).]

Figure 3.5: Diagrammatic representation of the random forest module performance

In Figure 3.5, one can see that the random forest module is, in fact, two separate modules. The System’s Performance Assessment module, which is responsible for the first aforementioned mode, is composed of two submodules, the random forest classifier and the performance estimation. The Stock Movement Prediction module is responsible for the second mode and is composed only of the random forest classifier, since it is only executed when the genetic algorithm meets the stopping criterion. The inputted train and test data consists of four data subsets, holding the train/test features and labels used by the learning algorithm.

In the following sections, these submodules are described in detail, which will give a better understanding of the split. But first we must clarify their common element, the random forest classifier submodule.


Random Forest Classifier

The random forest classifier relied on the Scikit-learn python library (Pedregosa et al., 2011), to

assure reliability in terms of computation and stability, since the main focus of this thesis is not the

implementation of an algorithm itself. The implementation of the algorithm follows the theory that was

previously described in the Section 2.4.3.1.

The random forest is a supervised learning algorithm and, like all supervised learning algorithms, it learns from properly labelled data the correlation between features that will impact the final classification. The execution of a supervised learning algorithm can be split into two distinct phases:

– Learn phase - during this phase the learner builds its knowledge from the information provided by the learning dataset, which can be divided into learning features and labels. In this execution step it is also crucial to find the algorithm’s optimal parameters, which allow it to have the best performance in the next step.

– Test phase - in this execution step the performance of the algorithm is assessed. Previously unseen data features are supplied to the learner, which has to yield a forecast that is then evaluated to determine how well the algorithm handles unseen information. The metrics used to assess the performance of the algorithm may vary with the studied problem, e.g., in stock exchange systems the developers may use ROI or accuracy, while in weather forecasting it makes no sense to use the return on investment.

In order to optimise the performance of the random forest classifier, some parameters had to be tuned. When creating the classifier, it is important to specify the number of trees per random forest, their depth and the minimum number of samples at a leaf node, since these influence the prediction’s accuracy.

The number of trees parameter, also referred to as the number of estimators, specifies the exact number of trees that are grown per random forest. This parameter can have an impact on the performance of the global system: a greater number of trees tends to improve the algorithm’s predictions but requires more computational power, so a trade-off must be met. The second parameter specifies how deep each tree should grow. The deeper the tree, the more splits occur and the more information about the data is kept; however, as with the previous parameter, a trade-off exists, since greater values may lead the algorithm to overfit the training data, which is not ideal. The last parameter determines when a node should be considered a leaf node. As this parameter increases, the classifier may become underfitted, which means that the algorithm could not “learn” enough information about the studied market.

3.4.3.1 System’s Performance Assessment

This module, whose execution occurs essentially during the train phase, was created to assess the performance of a genetic algorithm individual, since the quality of the selected parameters and features depends on the impact that they have on the random forest learner. Depending on the fitness function selected (all described in Section 3.4.2.1), the fitness value can be determined either by the performance estimation submodule or by the stock exchange module, which is responsible for taking proper investment decisions according to the forecasts made. However, in order to minimise the number of modules that can assign a fitness value to a GA individual, the assignment itself is only made in the performance estimation submodule.

Performance Estimation

As stated above, when the Accuracy of the machine learning algorithm, which depends on the learning process and on the tuning of its parameters, is chosen as the fitness function, the system uses this submodule to determine its value. To assess the accuracy of the trained algorithm, we decided to rely on the cross validate method provided by the Scikit-learn library (Pedregosa et al., 2011), which implements a strategy called cross validation that is widely used to estimate an estimator’s performance on a given dataset. There are many schemes for carrying out cross validation; in this thesis we use the K-fold cross validation type.

The K-fold cross validation process consists of the following steps:

1. The initial dataset is split into K mutually exclusive subsets of equal size.

2. The chosen ML model is trained on all the mutually exclusive subsets except one, which is used as a test set.

3. The accuracy of the model is computed using the testing subset that was not used to train the model.

4. The process is repeated K times, until every mutually exclusive subset has been used as a test set and its accuracy has been computed.

Once the process has terminated, the mean of all the computed accuracies is calculated and serves as the estimated accuracy of the RAF model.

An example of an execution of a K-fold cross validation scheme can be found in Figure B.1, which starts by dividing the initial dataset into three mutually exclusive folds (a red, a blue and a green one). To determine the accuracy of the model there are three iterations, in each of which one fold is used as the test set for a model trained on the remaining folds. After all the iterations are made, the accuracy of the model is estimated as the mean of the accuracies obtained on the red, blue and green test folds.
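The K-fold procedure described above can be sketched in plain Python as follows. This is an illustration of the scheme, not of the Scikit-learn routine used by the system: the function names are hypothetical, the folds are assumed to be contiguous blocks, and a trivial majority-class learner stands in for the random forest so that the example stays self-contained.

```python
from collections import Counter


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, mutually exclusive folds."""
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size


def cross_validated_accuracy(features, labels, k, fit, predict):
    """Train on k-1 folds, score on the held-out fold, and average the k accuracies."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train_idx], [labels[i] for i in train_idx])
        hits = sum(predict(model, features[i]) == labels[i] for i in test_idx)
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def fit_majority(train_features, train_labels):
    """Stand-in learner: remember the most common training label."""
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, row):
    """Stand-in predictor: always answer with the remembered label."""
    return model
```

In the three-fold example above, `cross_validated_accuracy(features, labels, 3, fit_majority, predict_majority)` would return the mean of the three per-fold accuracies.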

As soon as the validation process is finished, the value that represents the accuracy of the random forest is returned to the genetic algorithm submodule, which assigns it as the fitness value of the individual that shaped the training and validation phase of the model.


3.4.3.2 Stock Movement Prediction

After the termination of the genetic algorithm, the individual with the best performance is used by the random forest algorithm to forecast the market behaviour, and this forecast is used by the stock exchange module to invest in the studied market.

In order to make this prediction, the trading system uses the features of the test set, which is a subset of the original dataset, as can be seen in Figure 3.5. This dataset is composed of the raw financial data (such as open, high, low and close prices and transactional volume) and of the selected technical indicators that were present in the GA’s best performer.

Once the learner is trained, the stock behaviour prediction is made. The prediction is an array of the same size as the test subset, holding the daily market variations that determine the positions to be adopted by the broker. The trading system predicts whether the market will face a positive variation (1) (the price signal undergoes a positive change, which is the result of the stock price rising) or a negative variation (-1) (the price signal undergoes a negative change, which is the result of the stock price falling), which results in a trading signal that is outputted to the next module.

3.4.4 Stock Exchange Module

The stock exchange module is responsible for simulating the trading experience in a real world market environment: it receives as input the trading signal outputted by the random forest module and acts in the market accordingly. The trading module starts with an initial capital (500,000 dollars by default) to be invested according to the received trading signal. At its core, the module stores:

– The initial capital.

– The number of stocks held by the system.

– The current capital available to make trading decisions.

– The total earnings, which is the sum of the user holdings and his cash.

– The number of days on which the trading system has invested in the market.

– The daily profit (see Equation 3.4a), the Rate of Return (ROR) (computed as in Equation 3.3a) and the cumulative Return on Investment (ROI).

The forecast of the market takes values within {0, 1}, as was mentioned in Section 3.4.3.2. From it a trading signal is created (by taking the difference between consecutive predictions), which is then transformed into investment actions by the implemented stock exchange module.

This way, the trading signal is interpreted as follows:

– A value of 1 represents a buy signal. If the trading system is not holding any stocks (neutral position), the buy signal makes the system buy as many stocks as possible with the current capital. On the other hand, if the system is short in the current market situation, the buy signal tells the system to buy back the stocks and adopt a neutral position.


– A value of -1 represents a sell signal. If the trading system is not holding any stocks (neutral position), the sell signal makes the system adopt a short position in the current market situation. On the other hand, if the system is holding stocks (long position), the sell signal makes the system sell all its stock holdings and adopt a neutral position.

– A value of 0 represents a hold signal. Every position held by the trading system must remain the same.
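The derivation of the trading signal and its interpretation can be sketched as below. This is a hedged illustration: the function names are hypothetical, the first day is assumed to be a hold because there is no earlier prediction to compare with, and the two cases the text does not cover (a buy signal while long, a sell signal while short) are assumed to leave the position unchanged.

```python
def trading_signal(predictions):
    """Difference between consecutive predictions in {0, 1} gives a signal in {-1, 0, 1}."""
    if not predictions:
        return []
    signal = [0]  # assumption: no earlier prediction to compare with, so hold
    for previous, current in zip(predictions, predictions[1:]):
        signal.append(current - previous)
    return signal


# (current position, signal) -> next position
NEXT_POSITION = {
    ("neutral", 1): "long",
    ("short", 1): "neutral",
    ("long", 1): "long",        # assumption: already long, nothing to buy
    ("neutral", -1): "short",
    ("long", -1): "neutral",
    ("short", -1): "short",     # assumption: already short, nothing to sell
}


def follow_signal(signal, position="neutral"):
    """Return the position held after each day's signal has been applied."""
    history = []
    for value in signal:
        if value != 0:  # 0 is a hold: keep the current position
            position = NEXT_POSITION[(position, value)]
        history.append(position)
    return history
```

For example, the predictions 0, 1, 1, 0 would produce the signal 0, 1, 0, -1: hold, buy, hold, sell.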

To simulate a real trading environment, a transaction cost is applied to each transaction (buy or sell). This cost can be charged differently depending on the market and on the broker used for the transaction: it can be a percentage of the total transactional value or an absolute cost per stock. In this trading system the transaction cost is set to 0.1% of the total transactional value.
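As a worked example of the proportional cost, the sketch below buys and then sells a long position with a 0.1% fee on each transaction. The function names are hypothetical, and trading only whole shares is an assumption of this sketch.

```python
import math

COST_RATE = 0.001  # 0.1% of the transacted value, as stated in the text


def buy_max(cash, price, cost_rate=COST_RATE):
    """Buy as many whole shares as the cash covers once the fee is included."""
    shares = math.floor(cash / (price * (1 + cost_rate)))
    spent = shares * price * (1 + cost_rate)
    return shares, cash - spent


def sell_all(shares, cash, price, cost_rate=COST_RATE):
    """Sell every share; the fee is deducted from the proceeds."""
    proceeds = shares * price * (1 - cost_rate)
    return 0, cash + proceeds
```

Starting from the default capital of 500,000 dollars and a price of 100 dollars, `buy_max` acquires 4995 shares; selling them back at the same price leaves roughly 499,001 dollars, so a round trip at an unchanged price costs about 0.2% of the traded value.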

All the financial performance metrics used to evaluate the market performance of a GA individual, such as ROR per investing day (whose formula can be seen in Equations 3.3), Mean Daily Profit (computed as in Equations 3.4) and Risk Return Ratio (computed as in Equations 3.5), are also computed in the stock exchange module.


Chapter 4

Evaluation

In this chapter the results from the system described in the previous chapter are presented. First,

the financial markets and metrics used to evaluate the system’s performance are described. Next, the

achieved results for different fitness functions and execution scenarios are presented and analysed.

To assess the performance of the system, different testing scenarios were proposed. Firstly, the performance of the whole system (i.e., using all the modules developed, as seen in Figure 3.1) was estimated, which worked as a benchmark for the other evaluations, since this experiment uses the full potential of the trading system. Secondly, the system was tested without the GA module, whose description can be found in Section 3.4.2.1, to determine its importance in the prediction effectiveness of the system. Lastly, the importance of the Market’s Trend Analysis feature, whose detailed explanation can be found in Section 3.4.1, was gauged.

The experiments were conducted using a MacBook Pro with a quad-core Intel i7 processor (2.6GHz) and 16GB of RAM, running the macOS High Sierra operating system. Under these experimental conditions, the execution time of each GA generation (which, depending on the number of individuals, can become the bottleneck of the system) is approximately 16 minutes. As the implemented system depends on multiple Python libraries (as presented in Chapter 3) whose behaviour is not deterministic, and as each gene of a GA individual is randomly drawn, each experiment was run ten times in order to obtain an average of the performance of the system, whose statistics are presented next.

For the reader’s convenience, all the parameters used in each execution of the system, which are detailed in the previous chapter, are summarised in Table 4.1. It should be noted, however, that the parameters of the Random Forest module are given as ranges of values, since their values are found by performing a cross-validation over different samples of random forests with different parameter settings. This procedure was performed to find the RAF parameters that enhance its performance during the train phase of a given random forest.
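The idea of drawing candidate random forest settings from the ranges of Table 4.1 can be sketched as follows. This is an assumption-laden illustration: the dictionary keys are the Scikit-learn keyword names that we take to correspond to the table rows, the function name is hypothetical, and sampling each value uniformly is our choice rather than something stated in the text. Each sampled setting would then be scored by cross-validation and the best one kept.

```python
import random

# Ranges and choices as listed in Table 4.1; the key names are assumed mappings.
SEARCH_SPACE = {
    "n_estimators": (100, 1000),
    "max_features": ["log2", "sqrt"],
    "max_depth": (10, 1000),
    "min_samples_split": (10, 1000),
    "min_samples_leaf": (10, 1000),
    "criterion": ["gini", "entropy"],
}


def sample_settings(space, rng=random):
    """Draw one candidate setting: integers from (low, high) ranges, items from lists."""
    settings = {}
    for name, domain in space.items():
        if isinstance(domain, tuple):
            low, high = domain
            settings[name] = rng.randint(low, high)
        else:
            settings[name] = rng.choice(domain)
    return settings
```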

The return plots for the experiments presented next are displayed in the appendix Section D, except for two markets (the S&P500 index and the AT&T stock), which are displayed in the section below, since presenting the plots for all the tested markets there would be too extensive. However, all the important information about the obtained results is conveyed by the presented tables.


Table 4.1: Parameters of the implemented system

Parameter                                              Value            Component
Population size                                        200              DEAP (GA)
Mutation probability                                   20%              DEAP (GA)
Independent mutation probability of the genes          10%              DEAP (GA)
Crossover probability                                  50%              DEAP (GA)
Number of generations                                  20               DEAP (GA)
Tournament size                                        3                DEAP (GA)
Number of trees/random forest                          [100, 1000]      Scikit-learn (RAF)
Number of features to consider at each split           {log2, sqrt}     Scikit-learn (RAF)
Maximum number of levels in a tree (tree depth)        [10, 1000]       Scikit-learn (RAF)
Minimum number of samples required to split a node     [10, 1000]       Scikit-learn (RAF)
Minimum number of samples required at each leaf node   [10, 1000]       Scikit-learn (RAF)
Function to measure the quality of a split             {gini, entropy}  Scikit-learn (RAF)

4.1 Financial Data

When training and testing the implemented trading system, we used daily financial data composed of the daily prices (open, high, low, close and adjusted close) and transactional volume for different markets over a fixed period of time, from 02/01/2008 to 14/03/2018, using 80% of the data to train the system and 20% of the data to test it.

In order to assess the robustness of the system against different market characteristics, the experiments were performed in the following markets:

– S&P500 index, whose description can be found in Section 2.2, comprising 2569 trading day samples: train from 02/01/2008 to 26/04/2016 (2096 trading days) and test from 27/04/2016 to 14/03/2018 (473 trading days).

– Amazon.com Inc. (AMZN ticker symbol) stocks, which fall within the Consumer Discretionary industry sector (Internet and Direct Marketing Retail sub-industry sector), comprising 2569 trading day samples: train from 02/01/2008 to 26/04/2016 (2096 trading days) and test from 27/04/2016 to 14/03/2018 (473 trading days).

– Apple Inc. (AAPL ticker symbol) stocks, which fall within the Information Technology industry sector (Technology Hardware, Storage and Peripherals sub-industry sector), comprising 2569 trading day samples: train from 02/01/2008 to 26/04/2016 (2096 trading days) and test from 27/04/2016 to 14/03/2018 (473 trading days).

– The Coca-Cola Company (KO ticker symbol) stocks, which fall within the Consumer Staples industry sector (Soft Drinks sub-industry sector), comprising 2569 trading day samples: train from 02/01/2008 to 26/04/2016 (2096 trading days) and test from 27/04/2016 to 14/03/2018 (473 trading days).


– AT&T Inc. (T ticker symbol) stocks, which fall within the Telecommunication Services industry sector (Integrated Telecommunication Services sub-industry sector), comprising 2569 trading day samples: train from 02/01/2008 to 26/04/2016 (2096 trading days) and test from 27/04/2016 to 14/03/2018 (473 trading days).

This way, the experiments are run on the S&P 500 index and on stocks of companies that are representative of the index’s sectors, guaranteeing diversity as well as variation in market characteristics and trend patterns in the tests performed. All the financial data was gathered from the Yahoo Finance platform.

As stated previously, transaction costs can vary (depending on the traded stock and on the brokerage firm used to trade); however, in order to keep all the tests consistent, a transaction cost of 0.1% of the transacted value was applied in the developed trading system.

4.2 Datasets Characteristics

As mentioned previously, tests were performed in financial markets with different characteristics and trend patterns; therefore, an analytical study was carried out to grasp how the different markets’ nature could influence the forecast and the investment decisions made by the trading system. To study the experimental markets, we started by analysing the highest and lowest prices at which the stocks traded (to assess their range of values), the average closing price and the average candlestick size (to determine the market’s volatility), which can be computed using either the opening and closing prices or the highest and lowest achieved prices. Candlesticks are popular in the study of financial markets, since they help the investor determine the relationship between the stock’s opening and closing prices and assess how the lowest and highest achieved prices compare to them. An example of a candlestick chart, for AAPL stocks during the first three months, can be found in Figure C.2.

The characteristics of the different markets can be found in Table 4.2, which lists the averages of the closing prices and of the candlestick sizes (for both cases, opening/closing price and high/low price candlesticks) together with the highest and lowest closing prices of each market. All the computations were performed over the whole period of time (train and test) for each of the mentioned markets.
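The statistics listed above can be computed as in the following sketch. It is an illustration only: the function name, the dictionary keys and the row format (one dict of prices per trading day) are hypothetical.

```python
from statistics import mean


def characterise_market(rows):
    """rows: one dict per trading day with 'open', 'high', 'low' and 'close' prices."""
    rows = list(rows)
    closes = [row["close"] for row in rows]
    return {
        "average_close": mean(closes),
        "lowest_close": min(closes),
        "highest_close": max(closes),
        # Signed candlestick body: positive on rising days, negative on falling days.
        "average_close_minus_open": mean(row["close"] - row["open"] for row in rows),
        # Full candlestick range, never negative.
        "average_high_minus_low": mean(row["high"] - row["low"] for row in rows),
    }
```

Since the close-open size is signed, rising and falling days cancel each other out, which is why that average can be close to zero or even negative while the high-low average stays positive.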

Characterisation of market’s returns

In order to assess the behaviour of the studied markets and the possibility of the implemented trading system benefiting from it, i.e., by finding more profitable investment solutions, which may result in higher investment returns, two statistical metrics were used, the Skewness and the Kurtosis, as presented in Table 4.3. Both metrics were used to weigh how the distribution of the market’s returns compares with the normal distribution, which is assumed to be perfectly symmetric, i.e., the left and right tails of the distribution’s histogram are even and well balanced, as can be seen in Figure C.1.

Table 4.2: Characterisation of the studied markets

Stock    Average        Lowest         Highest        Average Candlestick  Average Candlestick
         Closing Price  Closing Price  Closing Price  Size (Close-Open)    Size (High-Low)
S&P500   1635.71        676.53         2872.87         0.36                18.72
AAPL       76.65         11.17          181.72        -0.01                 1.42
AMZN      369.81         35.03         1598.39         0.07                 7.85
KO         36.14         18.93           48.53         0.01                 0.48
T          33.13         21.72           43.47        -0.01                 0.54

The Skewness metric measures the lack of symmetry of a dataset’s distribution with respect to its centre point, which results in a histogram with a longer right or left tail than that of the normal distribution, which has a skewness of zero (Kim and White, 2004). Accordingly, when the skewness has a negative/positive value the data is said to be skewed to the left/right, presenting a histogram with a longer left or right tail, respectively. On the other hand, the Kurtosis metric refers to the degree of sharpness of the peak of a probability distribution curve, illustrating its ‘peakiness’ or ‘flatness’ and ascertaining how the observations of the market’s returns are clustered around the centre of their distribution (Kim and White, 2004). As stated previously, the normal distribution is taken as the reference; therefore, a positive value of kurtosis means that the distribution of the market’s returns is more peaked than the normal distribution, with fatter tails and a higher probability of extreme outcomes, whereas a negative kurtosis represents a histogram with a flatter peak and thinner tails, with a lower probability of extreme returns.

Table 4.3: Characterisation of returns over the studied markets

Stock     Skewness   Kurtosis
S&P500    -0.11      11.29
AAPL      -0.23       7.11
AMZN       1.02      12.12
KO         0.68      14.44
T          0.84      15.51

Applied to the returns of the studied markets, a positively skewed distribution may indicate frequent small losses and a few large gains, whereas a negatively skewed distribution may indicate frequent small gains and a few large losses.
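As an illustration of how the two metrics of Table 4.3 can be obtained from a series of closing prices, a minimal Python sketch follows. It assumes simple day-over-day returns, population (biased) moments, and kurtosis reported as excess over the normal distribution; the thesis does not state which estimator was used, so these choices, the function names and the price series are assumptions made only for this example.

```python
def central_moment(values, order):
    """Population central moment of the given order."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardised moment; zero for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardised moment minus 3, so the normal distribution scores zero."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def daily_returns(closes):
    """Simple day-over-day returns from a list of closing prices."""
    return [(b - a) / a for a, b in zip(closes, closes[1:])]

# Hypothetical closing prices, for illustration only (not thesis data).
closes = [100.0, 101.0, 100.5, 102.0, 97.0, 98.0, 98.5, 99.0]
returns = daily_returns(closes)
print(round(skewness(returns), 3), round(excess_kurtosis(returns), 3))
```

A single large daily loss among small gains, as in the hypothetical series above, pulls the skewness below zero, which is the pattern described for negatively skewed returns.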


4.3 Evaluation Metrics

As mentioned before, the primary goal of the implemented system is to maximise the return on investment while minimising the risk associated with each trading decision and keeping track of the number of days spent invested in a specific market. Minimising the days in the market matters because capital that is no longer invested becomes available for other investments, which widens the investor’s portfolio. For that reason, a set of financial metrics, briefly explained below, was used to evaluate the performance of the trading decisions made throughout the experiments:

– The Rate of Return (ROR), presented in Equation 4.1, evaluates the financial returns obtained by the trading system.

\[ \mathrm{ROR} = \frac{\text{Final Capital} - \text{Initial Capital}}{\text{Initial Capital}} \times 100 \tag{4.1} \]

– The Rate of Return per day (ROR/day), presented in Equation 4.2, evaluates the judgement of the trading system when its main goal is to maximise the investment’s returns while minimising the number of days in the market.

\[ \mathrm{ROR/day} = \frac{\mathrm{ROR}}{\text{Days Investing}} \tag{4.2} \]

– The Mean of the Daily Profit (MDP), presented in Equations 4.3, measures the average daily profit per day invested in the financial market. The daily profit (also given by Equation 3.4a) is the investment’s return on a trading day and depends on the position adopted on the previous day: when the system holds a neutral position (out of the market, with zero stocks) the daily profit is zero, and when it is invested in the market (in a long or short position) the daily profit is the relative change in the daily closing price of the stock, a price rise being a gain for a long position and a loss for a short one.

\[ \mathrm{Profit}_t = \begin{cases} \dfrac{\mathrm{close}_t - \mathrm{close}_{t-1}}{\mathrm{close}_{t-1}} & \text{if } \mathrm{position}_{t-1} = \text{Long} \\[2ex] \dfrac{\mathrm{close}_{t-1} - \mathrm{close}_t}{\mathrm{close}_t} & \text{if } \mathrm{position}_{t-1} = \text{Short} \\[2ex] 0 & \text{if } \mathrm{position}_{t-1} = \text{Neutral} \end{cases} \tag{4.3a} \]

\[ \text{Mean Daily Profit} = \frac{\mathrm{Profit}_{t_1} + \dots + \mathrm{Profit}_{t_N}}{N} \tag{4.3b} \]

– The Risk Return Ratio (RRR), determined by Equations 4.4, assesses how the developed system copes with the dual goal of maximising the return on investment while minimising the overall associated risk. As this metric is the ratio between the Rate of Return (ROR) and the Maximum Drawdown (MDD), higher values represent more profitable investments with low associated MDD, which suggests that the investment carried less financial risk.

\[ \text{Max Drawdown} = \max \mathrm{ROR}_{t_x} - \min \mathrm{ROR}_{t_y} \tag{4.4a} \]

\[ \mathrm{RRR} = \frac{\mathrm{ROR}}{\text{Max Drawdown}} \tag{4.4b} \]

Note that the equations of these evaluation metrics were already presented in the previous chapter (Section 3.4.2.1, on the fitness functions used to evaluate the GA’s individuals); they are repeated here for the reader’s convenience.
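The evaluation metrics above can be sketched in a few lines of Python. This is only an illustrative reading of Equations 4.1 to 4.4, not the code used in this work: the function names are hypothetical, the short-position denominator follows Equation 4.3a as printed, and the drawdown is assumed to be measured from a running peak of the ROR curve to a later value, since Equation 4.4a does not state the ordering of t_x and t_y explicitly.

```python
def rate_of_return(initial_capital, final_capital):
    """ROR in percent (Equation 4.1)."""
    return (final_capital - initial_capital) / initial_capital * 100.0

def ror_per_day(ror, days_investing):
    """ROR per day invested (Equation 4.2)."""
    return ror / days_investing

def daily_profit(prev_close, close, prev_position):
    """Daily profit (Equation 4.3a); prev_position is the position held on day t-1."""
    if prev_position == "Long":
        return (close - prev_close) / prev_close
    if prev_position == "Short":
        return (prev_close - close) / close
    return 0.0  # Neutral: out of the market

def mean_daily_profit(profits):
    """Average of the daily profits (Equation 4.3b)."""
    return sum(profits) / len(profits)

def max_drawdown(ror_curve):
    """Largest drop from a running peak of the ROR curve to a later value.

    Assumption: the trough t_y comes after the peak t_x.
    """
    peak = ror_curve[0]
    worst = 0.0
    for value in ror_curve:
        peak = max(peak, value)
        worst = max(worst, peak - value)
    return worst

def risk_return_ratio(ror, mdd):
    """RRR (Equation 4.4b)."""
    return ror / mdd

# Hypothetical cumulative ROR curve (%), for illustration only.
ror_curve = [0.0, 5.0, 2.0, 8.0, 3.0, 10.0]
mdd = max_drawdown(ror_curve)
print(mdd, risk_return_ratio(ror_curve[-1], mdd))
```

In the hypothetical curve the largest peak-to-trough drop is the fall from 8% to 3%, so a final ROR of 10% is weighed against a drawdown of 5 percentage points.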

4.4 Case Study I - Performance of the Whole System

The first set of experiments was performed to analyse the quality of the investments made with each of the previously proposed fitness functions. The averaged results on the test data of each financial market, per fitness function, are presented in Table 4.4, together with the results of the Buy and Hold (B&H) strategy for comparison.

Analysing the results obtained, the following observations can be made:

– In the S&P500 index, the Risk Return Ratio (RRR) fitness function performs best and is the only one that allows the system to outperform the B&H strategy. All the other fitness functions achieved lower returns, lower returns per day invested and a lower risk return ratio, with riskier strategies (higher MDD) than the RRR fitness function. Among them, the Accuracy fitness function came closest to the Buy and Hold strategy in terms of returns (ROR, ROR/day and MDP), with a marginally lower MDD than the benchmark.

– In the Apple stock, none of the fitness functions allowed the trading system to outperform the Buy and Hold strategy. Even so, the RRR fitness function considerably outperformed the other proposed fitness functions on all the evaluation criteria (ROR, RRR, ROR/day and MDP), accomplishing higher returns with less risky investments (i.e., lower MDD and higher RRR).

– In the Amazon stock, the benchmark B&H strategy outperforms the implemented trading system whichever fitness function is used. The Amazon stock follows a long upward trend with almost no troughs, which makes it very difficult for the system to outperform the Buy and Hold strategy: the implemented trading system invests more actively, making more buy/sell transactions instead of buying on the first trading day and selling on the last. Although the uptrend is almost clean, the existing troughs are steep enough to mislead the system into going short, i.e., betting against the stock. The stock eventually ramps back up and continues to grow in value while the trading system is still betting on its fall, which explains the poor performance of the system on this market.

Table 4.4: Results of the B&H and the different fitness functions tested with the full system

Market      Metric         B&H       Accuracy   ROR/day    Mean Daily Profit   RRR
S&P500      Transactions   2         20         16         20                  24
            ROR (%)        30.492    26.349     23.098     25.300              35.026
            ROR/day (%)    0.064     0.056      0.049      0.054               0.074
            MDD (%)        9.983     9.958      10.960     9.101               7.229
            RRR (%)        3.054     2.646      2.107      2.780               4.845
            MDP (%)        0.058     0.051      0.047      0.049               0.065
Apple       Transactions   2         55         64         36                  70
            ROR (%)        87.334    15.122     20.764     8.568               25.067
            ROR/day (%)    0.185     0.032      0.042      0.018               0.052
            MDD (%)        13.298    20.928     13.610     21.716              12.783
            RRR (%)        6.567     0.723      1.526      0.395               1.961
            MDP (%)        0.140     0.036      0.042      0.025               0.053
Amazon      Transactions   2         43         31         34                  37
            ROR (%)        158.690   -34.280    -27.058    -30.837             -14.298
            ROR/day (%)    0.335     -0.072     -0.057     -0.063              -0.030
            MDD (%)        14.600    38.322     39.864     44.151              39.315
            RRR (%)        10.869    -0.898     -0.666     -0.626              -0.347
            MDP (%)        0.210     -0.075     -0.055     -0.069              -0.023
AT&T        Transactions   2         16         65         41                  84
            ROR (%)        4.058     13.214     11.047     -0.874              15.689
            ROR/day (%)    0.009     0.037      0.023      -0.003              0.032
            MDD (%)        19.582    14.241     25.482     17.189              11.354
            RRR (%)        0.207     0.928      0.434      -0.051              1.382
            MDP (%)        0.014     0.030      0.027      0.002               0.038
Coca-Cola   Transactions   2         41         107        66                  9
            ROR (%)        4.578     5.218      5.471      3.066               10.740
            ROR/day (%)    0.010     0.011      0.011      0.006               0.022
            MDD (%)        10.983    21.193     16.410     16.064              13.159
            RRR (%)        0.417     0.246      0.333      0.191               0.816
            MDP (%)        0.012     0.014      0.016      0.009               0.024

– In the AT&T stock, almost all the fitness functions allow the system to clearly outperform the B&H strategy. The exception is the MDP fitness function, for which the metrics that measure the return achieved on the investment (ROR, ROR/day and RRR) are all negative. Even so, its Mean Daily Profit is still positive, which may be useful to other strategies.

– In the Coca-Cola stock, only the Mean Daily Profit fitness function failed to achieve a higher ROR than the Buy and Hold strategy. Despite being more prone to riskier investments (higher MDD in all cases and, for Accuracy and ROR/day, a lower RRR), the strategies using the Accuracy, Rate of Return per day (ROR/day) and Risk Return Ratio (RRR) fitness functions achieved higher return rates than the benchmark. The Accuracy and ROR/day fitness functions are nevertheless outperformed by the RRR fitness function, which attained better results on all the tested criteria (ROR, ROR/day, RRR and MDP). The RRR is therefore the best performing fitness function in this market, combining high returns with less risky investments (i.e., lower MDD and higher RRR than the other fitness functions) and high daily profits (high MDP and ROR/day).
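As a quick consistency check on how the columns of Table 4.4 relate to each other, the short Python sketch below recomputes the RRR of the S&P500 block from its ROR and MDD values using Equation 4.4b. The numbers are copied from the table; the variable and function names are hypothetical. That the recomputed ratios match the reported ones to three decimals suggests, although the text does not state it, that the reported RRR is the ratio of the averaged ROR and the averaged MDD.

```python
# S&P500 block of Table 4.4: (ROR %, MDD %, reported RRR) per strategy.
sp500 = {
    "B&H": (30.492, 9.983, 3.054),
    "Accuracy": (26.349, 9.958, 2.646),
    "ROR/day": (23.098, 10.960, 2.107),
    "Mean Daily Profit": (25.300, 9.101, 2.780),
    "RRR": (35.026, 7.229, 4.845),
}

def recomputed_rrr(ror, mdd):
    """Risk Return Ratio as the ratio of ROR to Maximum Drawdown."""
    return ror / mdd

for name, (ror, mdd, reported) in sp500.items():
    value = recomputed_rrr(ror, mdd)
    print(f"{name:>18}: recomputed {value:.3f}, reported {reported:.3f}")
```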

Figure 4.1 presents the evolution of the returns obtained by the Buy and Hold strategy and the average returns obtained by the trading system with the different fitness functions in the S&P500 index during the testing period. For the Buy and Hold strategy, the ROR is initially negative because of a short downtrend in the index, which was quickly recovered, as shown by the overall positive evolution of the strategy’s ROR line.

Figure 4.1 shows that all the fitness functions follow the ROR behaviour of the B&H strategy very closely; in the last quarter of the testing period, however, the performance of the strategies obtained with the different fitness functions diverges noticeably. The strategy obtained with the Accuracy fitness function mimics the Buy and Hold strategy: both take advantage of the steep upward trend in the index’s value and both fall afterwards, when the index takes a big hit in its valuation, with the Accuracy strategy ending with a lower ROR than the Buy and Hold strategy. The system with the ROR/day, MDP and RRR fitness functions, on the other hand, misjudges the upward trend of the index, which causes a large fall in the ROR achieved. The strategy obtained with the RRR fitness function was nevertheless able to recognise the shift in direction of the index value and to avoid the considerable fall in prices, which enables it to outperform the Buy and Hold strategy by about 5 percentage points of ROR in the last period of the testing data.


Figure 4.1: Returns obtained by the system with the different fitness functions and the B&H in the S&P500 index. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

Figure 4.2 presents the evolution of the returns obtained by the Buy and Hold strategy and the average returns obtained by the implemented system with the different fitness functions in the AT&T stock during the testing period. The Buy and Hold returns show that, in the testing period, this market moves mostly sideways: upward trends in the returns are followed by falls in the strategy’s ROR, which the implemented trading system can exploit to gain an economic advantage over the B&H strategy.

Figure 4.2: Returns obtained by the system with the different fitness functions and the B&H in the AT&T stock. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

Figure 4.2 shows that the system with the ROR/day fitness function avoids the loss in returns suffered by the Buy and Hold strategy in the first quarter of the testing period: the B&H returns fall from about 12% to about -3%, while the system with this fitness function increases its returns to about 12%. In the last quarter of the testing period the returns of the ROR/day strategy suffer large falls, although its overall performance is still better than the benchmark’s. With the MDP fitness function, the ROR curve behaves almost opposite to the Buy and Hold strategy in the last half of the testing period, i.e., when the returns of the B&H strategy increase, the returns of the MDP strategy decrease. Since the increases in the stock’s valuation are larger than its drops, the MDP strategy ends up performing worse than the Buy and Hold strategy. The system with the Accuracy and RRR fitness functions, on the other hand, outperforms the benchmark during the second half of the sideways period, ending with an ROR of about 13% and 16%, respectively.

It should be noted, however, that with the RRR fitness function the system was able to stop the steep loss suffered by the remaining strategies from 7/10/2017 to 7/11/2017, changing its position in the face of the ongoing trend.

4.5 Case Study II - Influence of the Genetic Algorithm Module

Having established the overall performance of the developed trading system, this case study tests and analyses the importance of the Genetic Algorithm (GA) and its impact on the performance of the system.

The GA module is responsible for selecting the parameters used to compute the technical analysis indicators and for reducing the dimensionality of the training data. Without it, the system feeds all the features to the Random Forest (RAF) module without any pre-processing. Table 4.5 presents the results of the system without the GA module, with the B&H results as a benchmark.

Analysing the results obtained, the following observations can be made:

– Compared with the Buy and Hold strategy, the system without the GA module has worse returns (lower ROR) in almost all the markets; the exceptions are the AT&T and Coca-Cola stocks, where some fitness functions achieved more profitable strategies. The results of the full system (Table 4.4) are more satisfying, since, depending on the fitness function, it could beat the Buy and Hold strategy and achieve profitable investments in most of the tested financial markets.

– Compared with the full system, the system without the GA obtains worse results on most of the criteria when investing in the most significant markets, namely the S&P500 index and the Amazon stock, although both systems achieve negative results on the Amazon stock.

– When investing in the Apple stock, the system without the GA module performed considerably better than the system with the GA. This suggests that in a market such as Apple, which was more volatile than the S&P500 during the testing period, performing fewer transactions (i.e., holding the adopted positions for longer) could be more profitable. Despite not beating the benchmark strategy, the results of the system without the GA module were also much more consistent (i.e., they vary less between the different fitness functions).

Table 4.5: Results of the B&H and the different fitness functions tested without the GA

Market      Metric         B&H       Accuracy   ROR/day    Mean Daily Profit   RRR
S&P500      Transactions   2         21         17         19                  22
            ROR (%)        30.492    21.182     24.685     21.011              22.090
            ROR/day (%)    0.064     0.044      0.052      0.044               0.046
            MDD (%)        9.983     12.476     9.026      9.427               9.258
            RRR (%)        3.054     2.265      2.727      2.233               2.394
            MDP (%)        0.058     0.040      0.048      0.041               0.043
Apple       Transactions   2         36         14         15                  18
            ROR (%)        87.334    40.203     56.615     55.785              46.964
            ROR/day (%)    0.185     0.084      0.118      0.116               0.098
            MDD (%)        13.298    18.464     15.997     16.657              19.959
            RRR (%)        6.567     2.538      3.985      3.770               2.718
            MDP (%)        0.140     0.075      0.100      0.098               0.087
Amazon      Transactions   2         38         10         11                  11
            ROR (%)        158.690   -18.856    -51.329    -46.765             -50.630
            ROR/day (%)    0.335     -0.039     -0.112     -0.108              -0.107
            MDD (%)        14.600    38.295     52.838     49.286              52.938
            RRR (%)        10.869    -0.489     -0.967     -0.944              -0.855
            MDP (%)        0.210     -0.032     -0.136     -0.118              -0.132
AT&T        Transactions   2         20         2          10                  10
            ROR (%)        4.06      11.652     4.693      0.909               -0.873
            ROR/day (%)    0.01      0.033      0.010      0.002               -0.002
            MDD (%)        19.58     17.316     19.584     21.216              22.170
            RRR (%)        0.21      0.673      0.240      0.086               0.016
            MDP (%)        0.01      0.028      0.015      0.007               0.003
Coca-Cola   Transactions   2         51         33         46                  35
            ROR (%)        4.578     0.066      5.410      9.104               7.983
            ROR/day (%)    0.010     0.000      0.011      0.019               0.017
            MDD (%)        10.983    14.520     9.353      10.213              10.229
            RRR (%)        0.417     0.030      0.578      0.891               0.780
            MDP (%)        0.012     0.001      0.012      0.019               0.017

– In the AT&T and Coca-Cola stocks, where the full system performed better than the B&H strategy with every fitness function except the Mean of the Daily Profit (MDP), the system without the GA module generally performs worse, obtaining lower scores on the criteria that judge the returns obtained from the investment (i.e., ROR, ROR/day and RRR). The exception on both stocks is the Mean Daily Profit fitness function, for which the system without the GA exceeds the results of the system with the module.

From these results, one can conclude that the performance of the prediction system is considerably affected by the presence of the Genetic Algorithm, which leads to investment strategies that are more profitable overall throughout the experiments conducted. Comparing both sets of experiments, the returns achieved by the system with the GA module are higher overall, except in the Amazon market, where both systems were outperformed by the market itself, and in the Apple stock, where the system without the module is more profitable and more consistent, oscillating less between the different fitness functions. Furthermore, the GA module enables the system to reach solutions not only with higher returns but also with a better ratio between returns and days spent in the market, since, depending on the fitness function, it can produce higher ROR/day values in most of the markets. Although the results achieved with the GA module are better, its strategies were also more prone to riskier investments, as shown by the higher overall values of the MDD criterion. This may be because the trading system with the module makes more transactions and is more sensitive to subtle changes in the stock.
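One simple way to summarise this comparison is to contrast, for each market, the best ROR reached with and without the GA module. The Python sketch below does this with the ROR values copied from Tables 4.4 and 4.5; picking the best fitness function per market is a choice made for this illustration only, and the names used are hypothetical.

```python
FITNESS = ["Accuracy", "ROR/day", "Mean Daily Profit", "RRR"]

# ROR (%) per fitness function, in the order of FITNESS.
ror_with_ga = {  # Table 4.4
    "S&P500": [26.349, 23.098, 25.300, 35.026],
    "Apple": [15.122, 20.764, 8.568, 25.067],
    "Amazon": [-34.280, -27.058, -30.837, -14.298],
    "AT&T": [13.214, 11.047, -0.874, 15.689],
    "Coca-Cola": [5.218, 5.471, 3.066, 10.740],
}
ror_without_ga = {  # Table 4.5
    "S&P500": [21.182, 24.685, 21.011, 22.090],
    "Apple": [40.203, 56.615, 55.785, 46.964],
    "Amazon": [-18.856, -51.329, -46.765, -50.630],
    "AT&T": [11.652, 4.693, 0.909, -0.873],
    "Coca-Cola": [0.066, 5.410, 9.104, 7.983],
}

def best(rors):
    """Return (fitness function name, ROR) for the highest ROR in a row."""
    index = max(range(len(rors)), key=lambda i: rors[i])
    return FITNESS[index], rors[index]

ga_wins = []
for market in ror_with_ga:
    name_ga, best_ga = best(ror_with_ga[market])
    name_no, best_no = best(ror_without_ga[market])
    if best_ga > best_no:
        ga_wins.append(market)
    print(f"{market:>9}: with GA {best_ga:8.3f} ({name_ga}), "
          f"without GA {best_no:8.3f} ({name_no})")
print("Higher best-case ROR with the GA module in:", ga_wins)
```

Under this reading the GA module gives the higher best-case ROR in every market except Apple, which agrees with the observations listed above.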

Figures 4.3 and 4.4 present the evolution of the returns obtained by the Buy and Hold strategy and the average returns obtained by the trading system without the GA module, for all the fitness functions, in the S&P500 index and the AT&T stock, respectively, during the testing period presented previously. As mentioned above, in both markets the system without the GA module is outperformed by the system with the module (whose results are shown in Figures 4.1 and 4.2). In the AT&T stock, the system without the GA module is more profitable than the B&H strategy when using the Accuracy fitness function, but it still performs about 2 percentage points worse than the strategy obtained with the same fitness function and the GA module.


Figure 4.3: Returns obtained by the system without the GA module using the different fitness functions and the B&H in the S&P500 index. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

Figure 4.4: Returns obtained by the system without the GA module using the different fitness functions and the B&H in the AT&T stock. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

4.6 Case Study III - Influence of the Market Trend Feature

After establishing the system’s performance with and without the GA module, the performance of the trading system was assessed without the feature designed to capture the ongoing trend of the financial market, which is explained in detail in Section 3.4.1. To understand its importance for the system’s trading performance, the same tests were conducted so that the case studies can be compared. Table 4.6 presents the results of the trading system without the trend feature, together with the Buy and Hold results for comparison.


Table 4.6: Results of the B&H and the different fitness functions tested without the Trend Label module

Market      Metric         B&H       Accuracy   ROR/day    Mean Daily Profit   RRR
S&P500      Transactions   2         19         38         59                  12
            ROR (%)        30.492    19.197     21.522     17.176              15.377
            ROR/day (%)    0.064     0.041      0.045      0.037               0.032
            MDD (%)        9.983     8.939      11.037     9.149               11.029
            RRR (%)        3.054     2.028      1.950      1.877               1.394
            MDP (%)        0.058     0.038      0.043      0.035               0.032
Apple       Transactions   2         61         62         37                  54
            ROR (%)        87.334    -0.020     6.148      6.256               16.918
            ROR/day (%)    0.185     0.001      0.012      0.014               0.034
            MDD (%)        13.298    28.931     19.719     30.529              20.372
            RRR (%)        6.567     0.060      0.312      0.259               0.830
            MDP (%)        0.140     0.004      0.020      0.019               0.039
Amazon      Transactions   2         27         49         24                  36
            ROR (%)        158.690   -23.399    -0.877     -41.218             -25.114
            ROR/day (%)    0.335     -0.049     -0.002     -0.100              -0.052
            MDD (%)        14.600    39.484     39.234     46.338              42.125
            RRR (%)        10.869    -0.555     -0.022     -0.886              -0.582
            MDP (%)        0.210     -0.047     0.010      -0.098              -0.049
AT&T        Transactions   2         19         73         36                  58
            ROR (%)        4.058     3.345      -6.136     -12.088             2.076
            ROR/day (%)    0.009     -0.005     -0.013     -0.021              0.017
            MDD (%)        19.582    19.444     28.380     25.352              12.631
            RRR (%)        0.207     0.864      -0.142     -0.310              0.164
            MDP (%)        0.014     0.007      -0.009     -0.024              0.006
Coca-Cola   Transactions   2         104        90         68                  80
            ROR (%)        4.578     -4.796     -2.493     0.425               0.138
            ROR/day (%)    0.010     -0.010     -0.005     0.001               0.000
            MDD (%)        10.983    16.531     15.102     14.057              15.692
            RRR (%)        0.417     -0.279     -0.132     0.030               0.064
            MDP (%)        0.012     -0.008     -0.003     0.004               0.003


Analysing the results obtained, the following conclusions can be drawn:

– In the S&P500 index, the system that did not use the trend feature performed worse than the system that used it, since none of the strategies obtained with the different fitness functions was more profitable than the Buy and Hold strategy.

– In the Apple stock, the investing strategies of the system without the trend feature are considerably less rewarding, i.e., they achieve lower returns on investment, than the strategies of the system with the trend feature. They are also more prone to riskier investments, since the MDD is higher for all the evaluated fitness functions.

– In the Amazon stock, both systems performed poorly, achieving negative returns. The system that did not use the trend feature nevertheless obtained less negative returns with the Accuracy and ROR/day fitness functions.

– In the AT&T and Coca-Cola stocks, the system that used the trend feature performed considerably better than the system that did not, achieving higher results on every evaluation criterion that measures the strategies’ financial performance (ROR, ROR/day, RRR and MDP).

From these results, one can conclude that a feature portraying the ongoing market trend has an impact on the performance of the trading system, enabling it to achieve more profitable investment strategies throughout the tests performed. Comparing both sets of experiments, the returns achieved by the system with the trend feature are higher overall, except in the Amazon market, where both systems were outperformed by the market itself. Furthermore, the trend feature enables the system to reach solutions with higher returns on investment and a better ratio between the rate of return and the days spent with capital invested in the market, since the strategies that used the trend feature achieved higher ROR/day values.
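The effect of the trend feature can also be read as a count of how many fitness functions beat the Buy and Hold ROR in each market, with and without the feature. The Python sketch below computes that count from the ROR rows of Tables 4.4 and 4.6; the counting criterion and the names are assumptions made for this illustration.

```python
# B&H ROR (%) per market, and ROR (%) per fitness function in the order
# Accuracy, ROR/day, Mean Daily Profit, RRR.
bh_ror = {"S&P500": 30.492, "Apple": 87.334, "Amazon": 158.690,
          "AT&T": 4.058, "Coca-Cola": 4.578}
ror_with_trend = {  # Table 4.4
    "S&P500": [26.349, 23.098, 25.300, 35.026],
    "Apple": [15.122, 20.764, 8.568, 25.067],
    "Amazon": [-34.280, -27.058, -30.837, -14.298],
    "AT&T": [13.214, 11.047, -0.874, 15.689],
    "Coca-Cola": [5.218, 5.471, 3.066, 10.740],
}
ror_without_trend = {  # Table 4.6
    "S&P500": [19.197, 21.522, 17.176, 15.377],
    "Apple": [-0.020, 6.148, 6.256, 16.918],
    "Amazon": [-23.399, -0.877, -41.218, -25.114],
    "AT&T": [3.345, -6.136, -12.088, 2.076],
    "Coca-Cola": [-4.796, -2.493, 0.425, 0.138],
}

def count_beating_benchmark(ror_by_market):
    """Number of fitness functions whose ROR exceeds the B&H ROR, per market."""
    return {market: sum(ror > bh_ror[market] for ror in rors)
            for market, rors in ror_by_market.items()}

with_trend = count_beating_benchmark(ror_with_trend)
without_trend = count_beating_benchmark(ror_without_trend)
print("with trend feature:   ", with_trend)
print("without trend feature:", without_trend)
```

With the trend feature, seven of the twenty market and fitness-function combinations exceed the benchmark ROR; without it, none does.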

Figures 4.5 and 4.6 present the evolution of the returns obtained by the Buy and Hold strategy and the average returns obtained by the system without the trend feature in the S&P500 index and the AT&T stock, respectively, during the testing period. In both markets, the ROR of the strategies adopted by the system without the trend feature is worse than that of the B&H strategy for all the fitness functions. Hence, the performance of the system is hindered by the absence of the feature that portrays the ongoing trend of the studied financial market.


Figure 4.5: Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the S&P500 index. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

Figure 4.6: Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the AT&T stock. [Plot of ROR (%) against date, 27/04/16 to 27/02/18; series: B&H, Accuracy, ROR/day, Mean Daily Profit, Risk Return Ratio.]

4.7 Evaluation Conclusions

The experiments conducted to evaluate the overall performance of the implemented system lead to the following conclusions.

Firstly, both the GA module and the trend feature support the trading performance of the system, since its performance is hindered when either of them is removed from the original solution.

Secondly, the system struggles when the financial market follows a clear overall upward trend, as found in the Amazon and Apple stocks. As the results in Table 4.4 show, in these two markets the system was not able to capture the market’s tendencies and produced strategies that were unable to gain a financial advantage over the market.

Lastly, regarding the fitness functions developed, the system with the Risk Return Ratio (RRR) as fitness function can outperform the Buy and Hold strategy in most of the tested markets, the exceptions being the Apple and Amazon stocks. In the tested markets, the Risk Return Ratio fitness function achieves better and more consistent results overall, which enables the system to achieve the proposed objective of maximising the returns while minimising the risk of the investments made.


Chapter 5

Conclusions and Future Work

5.1 Summary

This thesis presents an ensemble system that combines a Genetic Algorithm (GA), used for dimensionality reduction and for optimising the parameters of the technical indicators, with a Random Forest (RAF) algorithm, with the goal of maximising the returns and daily profits while minimising the risk associated with the investments made and the number of days spent with capital invested in the studied financial markets.

At an early stage of the execution, the Genetic Algorithm works on the set of technical indicators selected by the user (as stated in Section 3.2), optimising their parameters and reducing the number of features in the financial data used by the following modules. The RAF then uses this transformed data to find relations between the features and to produce a trading signal that improves the performance of the system towards the proposed objectives.

To test the performance of the implemented system, tests were set up over five markets (the S&P500 index and four stocks from the index’s sectors, such as Apple and Amazon) with different characteristics and trends, in order to assess the robustness of the trading system. The results are promising. The next sections discuss the achievements drawn from the different testing scenarios and possible directions for future work that builds on this thesis.

5.2 Achievements

In this work, four different fitness functions are tested in order to assess their impact on the ability of the Random Forest algorithm to predict the future behaviour of the tested market, and thus to yield solutions with good performance. The results obtained using the different fitness functions are considerably different, which highlights the importance of choosing a fitness function that boosts the performance of the system on the proposed evaluation metrics.


From the results obtained, it can be concluded that by using the Risk Return Ratio (RRR) fitness function (which includes the computation of the investment’s associated risk, thus penalising riskier strategies) the implemented system was able to perform reasonably well in some of the tested markets (particularly on the S&P500 index, and on the Coca-Cola and AT&T stocks). Despite performing worse than the Buy and Hold strategy on the Apple stock, the system using the RRR fitness function was able to yield a strategy that produced a higher Rate of Return (ROR), and was therefore closer to the benchmark strategy, than the strategies obtained with the other fitness functions. Thus, the Risk Return Ratio fitness function allows the trading system to find robust solutions capable of being profitable on the majority of the financial markets tested.
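The comparison against the benchmark can be illustrated with a small sketch that computes the Rate of Return of a Buy and Hold position and of a long-only strategy over the same price series. The 0/1 position encoding and the absence of transaction costs are simplifying assumptions of this example, not a description of the evaluation carried out in this thesis.

```python
def rate_of_return(initial_capital, final_capital):
    """Rate of Return (ROR) in percent."""
    return (final_capital - initial_capital) / initial_capital * 100.0


def buy_and_hold_ror(prices):
    """ROR of buying at the first close and holding until the last one."""
    return rate_of_return(prices[0], prices[-1])


def strategy_ror(prices, positions):
    """ROR of a long-only strategy without transaction costs.

    positions[i] is 1 if capital is invested from day i to day i + 1 and
    0 otherwise (a hypothetical signal encoding used for illustration).
    """
    capital = 1.0
    for i in range(len(prices) - 1):
        if positions[i] == 1:
            capital *= prices[i + 1] / prices[i]
    return rate_of_return(1.0, capital)
```

For the prices 100, 110, 99 and 108.9, holding throughout returns 8.9%, while a strategy that stays out of the market during the drop from 110 to 99 returns 21%.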

The importance of the Genetic Algorithm (GA) module and of the Trend feature (described in Sections 3.4.2.1 and 3.4.1, respectively) in the overall performance of the system was also tested. The results gathered show that both the GA module and the Trend feature have a substantial impact on the performance of the system and on its ability to yield more profitable strategies.

One of the major conclusions that can be drawn from the experiments conducted is that the developed trading system cannot beat the Buy and Hold strategy on markets where there is a clear upward trend in the valuation of the stock, as can be observed on the Apple and Amazon stocks. For the former, the performance metrics are positive but below the benchmark results; for the latter, the results are not even positive, making it the market where the trading system performed the worst.

5.3 Future Work

As a follow-up to this work, several approaches can be explored to extend the capabilities and improve the performance of the system described in this thesis. Some of these approaches are presented next:

– In order to address the problem mentioned in the previous section, where it is stated that the system was unable to perform well on markets characterised by a clear upward trend, two new levels for describing the ongoing market trend could be added, enabling the system to also categorise the trend as a strong uptrend or a strong downtrend, which would enable the system to yield strategies that perform fewer transactions.

– More fitness functions could be tested, involving other financial concepts such as the volatility of the studied market, which could make the trading system less prone to invest in more volatile markets, where it could have a hard time gaining a financial advantage over the market.

– It would be interesting to add a new module that analyses the overall importance of, and the relations between, the features used, similar to what can be studied on a feature heat map. This would allow further feature selection, which could ultimately enhance the performance of the system in terms of both prediction and running time.
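As an illustration of the last proposal, the sketch below shows one possible way of turning the information of a feature heat map into a selection step: it computes pairwise Pearson correlations and keeps a feature only if it is not strongly correlated with a feature already kept. The greedy rule, the `threshold` value and the function names are assumptions of this example, not a design proposed by this thesis.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equally long series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        return 0.0  # a constant series carries no correlation information
    return cov / (dev_x * dev_y)


def drop_correlated(features, threshold=0.95):
    """Greedily keep features whose |correlation| with kept ones is below threshold.

    `features` maps a feature name to its list of values.
    """
    kept = []
    for name in features:
        if all(abs(pearson(features[name], features[other])) < threshold
               for other in kept):
            kept.append(name)
    return kept
```

Because the rule is greedy, the result depends on the order in which the features are listed; a ranking by importance before filtering would be a natural refinement.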


Appendix A

Technical Analysis

A.1 Technical Indicators Description

A.1.1 Trend Following

Parabolic Stop and Reversal (PSAR)

The PSAR calculations are performed separately, as seen in the series of Equations 2.3, depending on the trend that the market is facing. When the price is in an uptrend, the PSAR appears below the price signal and converges towards it, acting as a price floor and protecting the investor’s profits when in a long position. Conversely, when the price stops rising and starts experiencing a downtrend, the PSAR appears above the price signal and gravitates downwards, operating as a price ceiling and shielding the profits from a short position.

Using this technical indicator, a decision maker can easily set a stop-loss order, a type of order that limits an investor’s loss when the position held goes against the current market trend. Thus, when an investor is holding a long position and the price goes below the stop’s value, the PSAR signals that it is best to close the position by selling. Conversely, when the investor is short on a security and the price is rising sharply, it is best to buy the position back.
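The stop-loss rule above can be written as a small check. This is a sketch that assumes the PSAR value for the current day has already been computed; treating a close above the PSAR as the exit condition for a short position is an assumption made by symmetry with the long case.

```python
def psar_stop_hit(position, close, psar):
    """Return True when the PSAR stop suggests closing the position.

    position: "long" or "short"
    close:    current closing price
    psar:     current PSAR value (assumed to be computed elsewhere)
    """
    if position == "long":
        return close < psar   # price fell below the stop: sell
    if position == "short":
        return close > psar   # price rose above the stop: buy back
    raise ValueError("position must be 'long' or 'short'")
```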

Average Directional Index (ADX)

By paying attention to the three lines (introduced in Subsection 2.3.2.1) simultaneously, one can formulate a trading strategy. When the +DI line moves above the -DI line, a bullish trend is underway. Conversely, when the market is considered to be bearish, the -DI line is above the +DI line. In addition, ADX readings below 20 indicate that the major market trend is weakening, while readings above 40 indicate that the major trend is strong. In extreme cases, ADX values above 50 signal that an extremely strong trend is underway.

Linking the information conveyed by all the components, a trader can determine buy/sell moments. In order to gain some leverage over the market, a short position would be taken when the -DI line is above the +DI line and the major trend is down, and a long position when the +DI line is above the -DI line and the ADX line is moving up.
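The reading rules above can be summarised in a short sketch. It assumes the +DI, -DI and ADX values are already available; the label "moderate" for readings between 20 and 40 is not named in the text and is an assumption of this example.

```python
def trend_strength(adx):
    """Classify the strength of the major trend from an ADX reading."""
    if adx < 20:
        return "weak"
    if adx > 50:
        return "extremely strong"
    if adx > 40:
        return "strong"
    return "moderate"  # assumed label for the 20-40 band


def trend_direction(plus_di, minus_di):
    """Classify the direction of the trend from the two directional lines."""
    if plus_di > minus_di:
        return "bullish"
    if minus_di > plus_di:
        return "bearish"
    return "neutral"
```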

Moving Average Convergence/Divergence (MACD)

The MACD indicator can impart useful information to traders, thus it is crucial to properly interpret it.

There are four known techniques to decipher the information conveyed and, accordingly, apply different

trading strategies (Murphy, 1999):

1. Signal Line Crossover - when the MACD plot falls below the signal line, it indicates a bearish market: the asset price is about to face a downward trend, which suggests that the best decision is to sell. Conversely, when the MACD plot rises above the signal line, the trader may be in the presence of a bullish market and may witness upward momentum in the asset’s price, therefore the trader’s decision should be to invest in the asset.

2. Divergence - a divergence occurs when the indicator value does not follow the trend observed in the asset’s price. When a divergence arises, it signals the end of the current trend, whether that trend is upward or downward.

3. Dramatic Rise - a sudden rise in the MACD plot is observed when the price undergoes an intense fluctuation, which has more impact on the short-term moving average than on the long-term moving average, as stated previously. This dramatic rise signals that an asset is overbought and that sooner or later its price will return to normal levels.

4. Zero Crossover - this method provides evidence of a change in the trend’s direction. If the MACD value goes below the zero line, the short-term average is below the long-term average, which signals a downward movement. The opposite is also true: when the MACD plot goes above the zero line, the short-term average is above the long-term average, which is evidence that the market is facing upward momentum.

In order to gauge the price momentum, traders can take into account the MACD histogram, which gives a visual depiction of the divergence between the MACD value and its signal line, and whose computation can be found in Equation 2.5c. The histogram is positive if the MACD is above its 9-day EMA and negative otherwise. In a bullish market, the histogram grows as prices start to rise faster, and contracts as the bullish sentiment of investors starts to ease. The same principle applies in reverse: when the market is facing a bearish trend, the histogram tends to grow more negative as the price starts to drop faster, and gravitates towards the zero line as the bearish trend starts to slow down.
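The MACD line, its signal line, the histogram and the two crossover techniques can be sketched as follows. The 12- and 26-day defaults for the fast and slow EMAs are the conventional choice and an assumption here (only the 9-day signal EMA is mentioned in the text), as is seeding each EMA with the first value of the series.

```python
def ema(values, period):
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (period + 1)
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def macd(closes, fast=12, slow=26, signal=9):
    """Return (MACD line, signal line, histogram) for a list of closes."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    macd_line = [f - s for f, s in zip(fast_ema, slow_ema)]
    signal_line = ema(macd_line, signal)
    histogram = [m - s for m, s in zip(macd_line, signal_line)]
    return macd_line, signal_line, histogram


def crossovers(series, reference):
    """+1 where series crosses above reference, -1 where it crosses below."""
    events = [0]
    for i in range(1, len(series)):
        previous = series[i - 1] - reference[i - 1]
        current = series[i] - reference[i]
        if previous <= 0 < current:
            events.append(1)
        elif previous >= 0 > current:
            events.append(-1)
        else:
            events.append(0)
    return events
```

Calling `crossovers(macd_line, signal_line)` gives the signal line crossovers, while `crossovers(macd_line, [0.0] * len(macd_line))` gives the zero crossovers.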

Despite being one of the most reliable indicators of the market (Gorgulho et al., 2011), the MACD also has drawbacks, since it can generate false signals that induce the trader into making wrong trading decisions. For example, a false negative could arise when no bullish crossover occurred but the stock suddenly experienced an upward trend. To filter out false signals and confirm true ones, the MACD, which is a lagging indicator since it is based on two EMAs, can be paired with a leading indicator such as the Relative Strength Index (RSI), presented in Section 2.3.2.2.

A.1.2 Momentum Oscillators

Relative Strength Index (RSI)

The indicator’s values range between 0 and 100. If the RSI value rises to 70 or above, which is the threshold hinting that an asset is becoming overbought or overvalued, a trend reversal or a correction in the asset’s price should be expected. Conversely, an RSI value of 30 or below, which is the threshold hinting at an oversold or undervalued condition, may signal that an uptrend or a price correction to the upside is likely in the near future.

Looking for a divergence, which occurs when the asset’s price reaches a new high or low but the same variation is not observed in the RSI readings, gives a deeper insight into the trading decisions proposed by the RSI indicator, enabling the trader to adopt different strategies. When a bearish divergence is observed, which means that the price made a new high but the RSI value did not follow the uptrend, a sell signal is issued. On the other hand, when a bullish divergence occurs, which means that the price makes a new low and the RSI does not, a buy signal is issued.

Other solutions aiming to refine the indicator can be adopted. For instance, one may pair this indicator with other technical indicators and require them to agree, as a means of confirming that false buy/sell signals are not created by sudden fluctuations in the price.
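The threshold reading of the RSI can be illustrated with the sketch below, which computes the indicator with Wilder's smoothing and then labels each value. The smoothing variant, the 14-day default period and the value returned for a completely flat series are assumptions of this example and may differ from the computation used in this thesis.

```python
def _rsi_from_averages(avg_gain, avg_loss):
    """Turn average gain and loss into an RSI value between 0 and 100."""
    if avg_loss == 0:
        # No losses in the window; 50.0 for a flat series is an assumption.
        return 100.0 if avg_gain > 0 else 50.0
    relative_strength = avg_gain / avg_loss
    return 100.0 - 100.0 / (1.0 + relative_strength)


def rsi(closes, period=14):
    """Wilder-smoothed RSI; the first value corresponds to closes[period]."""
    if len(closes) <= period:
        raise ValueError("need more than `period` closing prices")
    gains, losses = [], []
    for previous, current in zip(closes, closes[1:]):
        change = current - previous
        gains.append(max(change, 0.0))
        losses.append(max(-change, 0.0))
    avg_gain = sum(gains[:period]) / period
    avg_loss = sum(losses[:period]) / period
    values = [_rsi_from_averages(avg_gain, avg_loss)]
    for gain, loss in zip(gains[period:], losses[period:]):
        avg_gain = (avg_gain * (period - 1) + gain) / period
        avg_loss = (avg_loss * (period - 1) + loss) / period
        values.append(_rsi_from_averages(avg_gain, avg_loss))
    return values


def rsi_condition(value, overbought=70.0, oversold=30.0):
    """Label an RSI reading using the 70/30 thresholds."""
    if value >= overbought:
        return "overbought"
    if value <= oversold:
        return "oversold"
    return "neutral"
```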

Stochastic Oscillator (STO)

To interpret the Stochastic Oscillator (STO) indicator correctly, the trader must look for a crossover between the %K and the %D plots (whose calculations can be found in Equations 2.7) that coincides with a divergence relative to the asset’s price in an extreme area, such as a cycle bottom, which is reached when the price of a security swings around the lowest value registered. Using the third indicator, the trader can get a confirmation of the signal issued, at or before a bottom, when the %D line turns around (Murphy, 1999).

Since the indicator’s range remains constant through time, two thresholds were defined, which can be revised to better fit the analytical needs of each security. To identify when a stock is overbought or oversold, the values 80 and 20 are traditionally used, respectively (Pinto et al., 2015). From an investor’s point of view, these boundaries, used along with the information described above, could lead to useful trading strategies:

– When the upper threshold is crossed by the indicator coming from below, then the long position

adopted by the trader should be increased.


– When the opposite occurs, which is the lower boundary being crossed by the plot from above, then

the short position should be increased.

– When stochastic %D crosses stochastic %K in an extreme area, with divergence from the actual asset’s price, then the position adopted should be liquidated.
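As a reference for the quantities used in the rules above, the sketch below computes %K over a rolling window and %D as a simple moving average of %K, together with the 80/20 labelling. The default periods, the use of a simple rather than an exponential average for %D, and the value returned for a flat window are assumptions of this example.

```python
def stochastic(highs, lows, closes, k_period=14, d_period=3):
    """Return (%K values, %D values) for the given price series."""
    k_values = []
    for i in range(k_period - 1, len(closes)):
        highest = max(highs[i - k_period + 1:i + 1])
        lowest = min(lows[i - k_period + 1:i + 1])
        if highest == lowest:
            k_values.append(50.0)  # flat window: mid-range value (assumption)
        else:
            k_values.append(100.0 * (closes[i] - lowest) / (highest - lowest))
    d_values = [sum(k_values[i - d_period + 1:i + 1]) / d_period
                for i in range(d_period - 1, len(k_values))]
    return k_values, d_values


def stochastic_condition(value, overbought=80.0, oversold=20.0):
    """Label a stochastic reading using the traditional 80/20 thresholds."""
    if value > overbought:
        return "overbought"
    if value < oversold:
        return "oversold"
    return "neutral"
```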

Williams %R (WILLR)

When analysing the values of the Williams %R (WILLR) indicator, the trader must watch for two standard levels that can dictate the asset’s market condition. Readings above -20 may hint that an asset is becoming overvalued or overbought, which may signal that a trend reversal or a correction in the asset’s price should be observed shortly, so the investor is advised to adopt a short position. Conversely, a %R value below the -80 level, which is the threshold hinting that an asset is undervalued or oversold, may signal that a bullish trend is imminent, so the investor should adopt a long position.
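The two levels can be applied as in the following sketch, which uses the standard definition of Williams %R over a rolling window. The 14-day default period and the value returned for a flat window are assumptions of this example.

```python
def williams_r(highs, lows, closes, period=14):
    """Williams %R, ranging from -100 (close at the low) to 0 (close at the high)."""
    values = []
    for i in range(period - 1, len(closes)):
        highest = max(highs[i - period + 1:i + 1])
        lowest = min(lows[i - period + 1:i + 1])
        if highest == lowest:
            values.append(-50.0)  # flat window: mid-range value (assumption)
        else:
            values.append(-100.0 * (highest - closes[i]) / (highest - lowest))
    return values


def williams_condition(value):
    """Label a %R reading using the -20 and -80 levels."""
    if value > -20.0:
        return "overbought"
    if value < -80.0:
        return "oversold"
    return "neutral"
```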

Commodity Channel Index (CCI)

Besides the scaling factor γ used (whose formula and impact on the indicator’s computation can be found in the series of Equations 2.11), the percentage of values that fall within the range between -100 and +100 also depends on the number of periods considered. Shorter periods make the indicator more sensitive to market oscillations, with a smaller percentage of values within the range. Conversely, the more periods used to compute the CCI, the higher the percentage of values that fall between -100 and +100. Therefore, a trader needs to find a compromise on the number of periods considered, so that some trade signals are still generated.

While between 70 and 80 percent of the CCI values fall within the range, the remaining 20 to 30 percent may result in buy/sell signals. As a result, this indicator can provide valuable information to help plan a trading strategy:

– Overbought/Oversold levels - using the aforementioned levels, one can assess the asset’s market condition, which is deemed to be overbought if the indicator’s value goes beyond the maximum established threshold and oversold if the indicator’s value falls below the minimum level. Hence, from oversold or undervalued levels, when the CCI value rises back above the -100 level, a long position should be adopted. From overvalued or overbought levels, a short position should be adopted when the indicator’s readings dip below the +100 level.

– Divergences - as with most other indicators, divergences can also be used with the CCI indicator to increase the robustness of the signals. A bullish divergence that occurs below -100 increases the robustness of the buy signal issued when the indicator’s value rises back above this level. Likewise, a bearish divergence that occurs above +100 increases the robustness of the sell signal issued when the indicator crosses back below that level.

– Trend line breaks - a trend line can be drawn by connecting the peaks or the troughs of the asset’s price. From oversold conditions, an upward crossover of the -100 level combined with a trend line breakout could signal that a bullish trend may emerge in the near future. From overbought conditions, a decline below +100 coupled with a trend line break might indicate that a downward trend in the market is forthcoming.
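The overbought/oversold rule listed above can be sketched as follows. The CCI is computed from the typical price, its simple moving average and its mean absolute deviation; the value 0.015 for the scaling factor (named `gamma` here) is the conventional choice and an assumption of this example, as is the 20-day default period.

```python
def cci(highs, lows, closes, period=20, gamma=0.015):
    """Commodity Channel Index over a rolling window of typical prices."""
    typical = [(h + l + c) / 3.0 for h, l, c in zip(highs, lows, closes)]
    values = []
    for i in range(period - 1, len(typical)):
        window = typical[i - period + 1:i + 1]
        average = sum(window) / period
        mean_deviation = sum(abs(x - average) for x in window) / period
        if mean_deviation == 0:
            values.append(0.0)  # flat window (assumption)
        else:
            values.append((typical[i] - average) / (gamma * mean_deviation))
    return values


def cci_signals(values, level=100.0):
    """'long' when the CCI rises back above -level, 'short' when it dips below +level."""
    signals = ["hold"]
    for previous, current in zip(values, values[1:]):
        if previous < -level <= current:
            signals.append("long")
        elif previous > level >= current:
            signals.append("short")
        else:
            signals.append("hold")
    return signals
```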

Advance/Decline Line (A/D)

Readings from the Advance/Decline Line (A/D) indicator, which can be computed using Equation 2.12, can lead to well-performing trading strategies. Using the convergence/divergence technique presented previously, an investor can assess how the market will react to general investor sentiment. From the indicator’s readings, four major situations can be distinguished:

1. Bullish Divergence - a bullish divergence occurs when the market’s price continues to move lower but the indicator value does not follow the trend, which could mean that the sellers are losing their conviction. Therefore, it should be better to take a long position.

2. Bearish Divergence - a bearish divergence occurs in the opposite case, where the market is moving upwards but the indicator’s readings are sloping downwards, which could mean that the market is losing the strength to keep moving upwards and that a reversal in direction may be about to happen. Consequently, it is advisable to take a short position given the latest market conditions.

3. Moves along the market - finally, there are still two different situations to consider, in which the indicator moves along with the price of the market. When the market is moving upwards and the indicator’s value is also moving upwards, the market is said to be healthy and prices have a greater chance of continuing to rise. Conversely, when the market price is trending lower and the indicator’s value is also sloping down, declining prices are likely to continue.
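The situations listed above can be told apart by comparing the direction of the price with the direction of the indicator over the same window, as in the sketch below. Measuring direction as the difference between the last value and the value `lookback` steps earlier, and treating a flat price or indicator as inconclusive, are assumptions of this example.

```python
def change(series, lookback):
    """Difference between the last value and the value `lookback` steps earlier."""
    return series[-1] - series[-1 - lookback]


def classify_divergence(prices, indicator, lookback=5):
    """Compare the direction of the price with the direction of the indicator."""
    price_change = change(prices, lookback)
    indicator_change = change(indicator, lookback)
    if price_change < 0 and indicator_change > 0:
        return "bullish divergence"
    if price_change > 0 and indicator_change < 0:
        return "bearish divergence"
    if price_change > 0 and indicator_change > 0:
        return "confirmed uptrend"
    if price_change < 0 and indicator_change < 0:
        return "confirmed downtrend"
    return "inconclusive"
```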

Percentage Price Oscillator (PPO)

Readings from the Percentage Price Oscillator (PPO) indicator, which tend to range between -10% and +10%, disclose the relation between the two EMAs (a relation described by Equation 2.13), which can help the investor determine the positions to adopt in the market. When the PPO value is positive, it indicates that the market is facing an uptrend and signals the investor to buy new positions. When the indicator value is negative, it indicates that the market is on a downtrend and the investor is better advised to adopt a short position. Moreover, since it is an oscillator, the PPO can also be used to assess the asset’s condition relative to the current health of the market: a PPO value that falls below or rises above the aforementioned range hints that the traded stock is being oversold or overbought, respectively.

One big advantage of the PPO is that it is a dimensionless quantity, i.e. it is expressed as a percentage rather than in the price units of the underlying stock or market, making the price level of the security almost of secondary importance, which is not the case for most other technical analysis tools.
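The dimensionless nature of the PPO can be checked with a short sketch: multiplying every price by a constant leaves the indicator unchanged. The 12- and 26-day default periods and the seeding of each EMA with the first close are assumptions of this example.

```python
def ema(values, period):
    """Exponential moving average seeded with the first value."""
    alpha = 2.0 / (period + 1)
    smoothed = [values[0]]
    for value in values[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


def ppo(closes, fast=12, slow=26):
    """Percentage Price Oscillator: gap between the EMAs as a % of the slow EMA."""
    fast_ema = ema(closes, fast)
    slow_ema = ema(closes, slow)
    return [100.0 * (f - s) / s for f, s in zip(fast_ema, slow_ema)]
```

For example, `ppo(closes)` and `ppo([10 * c for c in closes])` agree up to floating-point rounding, which is what allows PPO readings of differently priced securities to be compared.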

A.1.3 Volume Indicators

Money Flow Index (MFI)

As stated previously, in Subsection 2.3.2.3, when analysing the indicator’s readings, a trader can benefit from using two predefined levels to assess an asset’s overbought/oversold condition. When the indicator value goes beyond the maximum threshold, typically 80, the security is considered to be overbought/overvalued, therefore an investor should sell its positions. When the indicator value is below 20, which is the minimum threshold, the asset is considered to be oversold/undervalued, thus the prevailing signal is to buy, entering a long position.

Another way to analyse the indicator is to watch for divergences between the price action and the MFI. For instance, when the price makes a new rally high but the indicator’s high is lower than its previous high, this may indicate a weak advance, which is more likely to reverse. Nevertheless, this technique is mostly used to confirm signals that have already been issued.
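For reference, the sketch below computes the MFI from the typical price and the volume using the standard definition, so that the 80/20 levels can be applied to its output. The 14-day default period and the value returned when no money flow is recorded in the window are assumptions of this example.

```python
def money_flow_index(highs, lows, closes, volumes, period=14):
    """Money Flow Index: share of positive money flow in the window, in percent."""
    typical = [(h + l + c) / 3.0 for h, l, c in zip(highs, lows, closes)]
    values = []
    for i in range(period, len(typical)):
        positive = negative = 0.0
        for j in range(i - period + 1, i + 1):
            flow = typical[j] * volumes[j]
            if typical[j] > typical[j - 1]:
                positive += flow
            elif typical[j] < typical[j - 1]:
                negative += flow
        total = positive + negative
        # 100 * positive / total equals 100 - 100 / (1 + positive / negative).
        values.append(50.0 if total == 0 else 100.0 * positive / total)
    return values
```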

A.1.4 Volatility Indicators

Average True Range (ATR)

As stated in Subsection 2.3.2.4, the ATR can be used to validate the enthusiasm behind a sharp move in the market’s price. A bullish reversal accompanied by an increase in ATR shows strong buying pressure and reinforces the reversal, signalling an opportune moment to invest. Likewise, a bearish reversal coupled with an increase in ATR shows strong selling pressure and reinforces the support break, therefore this should be the best moment for the investor to close all open stock positions or adopt a short position.
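To check whether the ATR is increasing around a reversal, the indicator first has to be computed. The sketch below uses the standard true range and Wilder's smoothing; the smoothing variant and the 14-day default period are assumptions of this example and may differ from the computation used in this thesis.

```python
def true_range(high, low, previous_close):
    """Largest of the three ranges that define the true range."""
    return max(high - low,
               abs(high - previous_close),
               abs(low - previous_close))


def atr(highs, lows, closes, period=14):
    """Average True Range with Wilder's smoothing."""
    ranges = [true_range(highs[i], lows[i], closes[i - 1])
              for i in range(1, len(closes))]
    value = sum(ranges[:period]) / period
    values = [value]
    for current in ranges[period:]:
        value = (value * (period - 1) + current) / period
        values.append(value)
    return values
```

A rising sequence of ATR values around a price reversal would then be read as confirmation of the move.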

Bollinger Bands (BBANDS)

The use of Bollinger Bands varies widely among traders: some use them to identify overvalued or undervalued periods in a stock, while others use them as a volatility indicator. Either way, buy/sell signals will be issued when appropriate.

When looking for overbought/oversold levels, a trader must watch for the stock’s price closing near or beyond either of the bands. If it closes near or above the upper band, the stock is said to be overbought and the best investment strategy is to sell the shares owned or adopt a short position. On the other hand, when the closing price gravitates towards or even crosses the lower band, the stock is thought to be oversold, therefore it is advised to buy, adopting a long position.

However, since this indicator falls within the group of volatility indicators, investors tend to use BBANDS to measure the volatility of stocks. When the bands lie close together, the stock is undergoing a low volatility period, and when they are far apart, the stock is experiencing a period of high volatility. There is also the case in which the bands have only a slight slope and run almost parallel for an extended period of time: the price of the stock will then be found to oscillate between the two boundaries as if it were within a channel, which may also indicate that the stock is in a low volatility period.
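Both uses of the indicator rely on the same three lines, which can be sketched as follows. The bands are placed a number of standard deviations around a simple moving average; the 20-day period, the multiplier of 2 and the use of the population standard deviation are conventional choices and assumptions of this example.

```python
from statistics import mean, pstdev


def bollinger_bands(closes, period=20, width=2.0):
    """Return a list of (lower band, middle band, upper band) tuples."""
    bands = []
    for i in range(period - 1, len(closes)):
        window = closes[i - period + 1:i + 1]
        middle = mean(window)
        deviation = pstdev(window)
        bands.append((middle - width * deviation,
                      middle,
                      middle + width * deviation))
    return bands


def bandwidth(band):
    """Distance between the bands relative to the middle band."""
    lower, middle, upper = band
    return (upper - lower) / middle
```

A small `bandwidth` corresponds to bands that lie close together (low volatility), and a large one to bands that are far apart (high volatility).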

A.2 Technical Indicators Parameters Table

Table A.1: Parameters used in the computation of the different Technical Indicators

| Group | Technical Indicator | Parameters to Optimise |
|---|---|---|
| Trend Following | Simple Moving Average (SMA) | SMA period |
| Trend Following | Exponential Moving Average (EMA) | EMA period |
| Trend Following | Average Directional Index (ADX) | ADX period |
| Trend Following | Parabolic Stop and Reversal (PSAR) | No parameters to optimise |
| Trend Following | Moving Average Convergence/Divergence (MACD) | Fast EMA period; Slow EMA period; Signal Period |
| Momentum Oscillators | Relative Strength Index (RSI) | EMA period |
| Momentum Oscillators | Stochastic Oscillator (STO) | EMA K% period; EMA D% period |
| Momentum Oscillators | Williams %R (WILLR) | WILLR period |
| Momentum Oscillators | Rate of Change (ROC) | ROC period |
| Momentum Oscillators | Commodity Channel Index (CCI) | SMA period |
| Momentum Oscillators | Advance/Decline Line (A/D) | A/D period |
| Momentum Oscillators | Percentage Price Oscillator (PPO) | Fast EMA period; Slow EMA D% period |
| Momentum Oscillators | Momentum | Momentum period |
| Volume Indicators | Money Flow Index (MFI) | MFI period |
| Volume Indicators | On Balance Volume (OBV) | OBV period |
| Volatility Indicators | Average True Range (ATR) | EMA period |
| Volatility Indicators | Bollinger Bands (BBANDS) | SMA period |


Appendix B

Implementation

B.1 K-fold Cross Validation Scheme

[Diagram: the initial dataset is divided into 3 mutually exclusive subsets (green, red and blue). In each of three iterations the model is trained on two subsets and tested on the remaining one (trained on blue and red, tested on green; trained on blue and green, tested on red; trained on red and green, tested on blue), and the accuracy is computed over the subset used for testing.]

Figure B.1: Diagram which represents a 3-fold cross validation scheme
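The scheme of Figure B.1 can be sketched in a few lines. The example splits the sample indices into k contiguous, mutually exclusive folds and, for each fold, calls a user-supplied function that trains on the remaining folds and scores on the held-out one. The function names, the contiguous (unshuffled) split and the `train_and_score` callback are assumptions of this example, not the implementation used in this thesis.

```python
def k_fold_indices(n_samples, k=3):
    """Split range(n_samples) into k contiguous, mutually exclusive folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in sizes:
        folds.append(indices[start:start + size])
        start += size
    return folds


def cross_validate(n_samples, k, train_and_score):
    """Return one score per fold.

    train_and_score(train_indices, test_indices) is a hypothetical callback
    that fits a model on the training indices and returns its accuracy on
    the test indices.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for held_out, test_indices in enumerate(folds):
        train_indices = [index
                         for position, fold in enumerate(folds)
                         if position != held_out
                         for index in fold]
        scores.append(train_and_score(train_indices, test_indices))
    return scores
```

The per-fold accuracies are commonly averaged to obtain a single estimate of the model's performance.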


Appendix C

Evaluation Plots

C.1 Normal distribution graph

[Plot: bell-shaped curve with the left tail, the center point and the right tail labelled.]

Figure C.1: Bell shaped histogram of a normal distribution


C.2 Apple stocks candlesticks

Figure C.2: Candlestick chart for the AAPL stocks


Appendix D

Return Plots

D.1 Full System return plots

D.1.0.1 Apple stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.1: Returns obtained by the system using the different fitness functions and the B&H in the Apple stock


D.1.0.2 Amazon stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.2: Returns obtained by the system using the different fitness functions and the B&H in the Amazon stock

D.1.0.3 Coca-Cola stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.3: Returns obtained by the system using the different fitness functions and the B&H in the Coca-Cola stock


D.2 System without GA return plots

D.2.0.1 Apple stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.4: Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Apple stock

D.2.0.2 Amazon stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.5: Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Amazon stock


D.2.0.3 Coca-Cola stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.6: Returns obtained by the system without the GA module using the different fitness functions and the B&H in the Coca-Cola stock


D.3 System without Trend feature return plots

D.3.0.1 Apple stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.7: Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Apple stock

D.3.0.2 Amazon stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.8: Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Amazon stock


D.3.0.3 Coca-Cola stocks

[Plot: ROR (%) against Date, from 27/04/16 to 27/02/18, for the series B&H, Accuracy, ROR/day, Mean Daily Profit and Risk Return Ratio.]

Figure D.9: Returns obtained by the system without the Trend feature using the different fitness functions and the B&H in the Coca-Cola stock


Bibliography

J. Ali, R. Khan, N. Ahmad, and I. Maqsood. Random forests and decision trees. IJCSI International

Journal of Computer Science Issues, 9(5):1–7, 2012.

A. M. AlMana and M. S. Aksoy. An overview of inductive learning algorithms. International Journal of

Computer Applications, 88(4), 2014.

M. Ballings, D. V. den Poel, N. Hespeels, and R. Gryp. Evaluating multiple classifiers for stock price

direction prediction. Expert Systems with Applications, 42(20):7046–7056, 2015.

A. Booth, E. Gerding, and F. McGroarty. Automated trading with performance weighted random forests and seasonality. Expert Systems with Applications, 41(8):3651–3661, 2014.

A. Bosch, A. Zisserman, and X. Munoz. Image classification using random forests and ferns. In ICCV

2007. IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

Q. Cao, K. B. Leggio, and M. J. Schniederjans. A comparison between Fama and French’s model and artificial neural networks in predicting the Chinese stock market. Computers & Operations Research, 32(10):2499–2512, 2005.

C. Cheng, T. Chen, and L. Wei. A hybrid model based on rough sets theory and genetic algorithms for

stock price forecasting. Information Sciences, 180(9):1610–1629, 2010.

T. G. Dietterich. An experimental comparison of three methods for constructing ensembles of decision trees: Bagging, boosting, and randomization. Machine Learning, 40(2):139–157, 2000.

R. Edwards, J. Magee, and W. Bassetti. Technical Analysis of Stock Trends. CRC Press, 9th edition,

2007.

M. ElAlami. A filter model for feature subset selection based on genetic algorithm. Knowledge-Based

Systems, 22:356–362, 2009.

E. F. Fama. Efficient capital markets: A review of theory and empirical work. The Journal of Finance,

25(2):383–417, May 1970.

F. Fortin, F. De Rainville, M. Gardner, M. Parizeau, and C. Gagne. DEAP: Evolutionary algorithms made

easy. Journal of Machine Learning Research, 13:2171–2175, 2012.

A. Gorgulho, R. Neves, and N. Horta. Applying a GA kernel on optimizing technical analysis rules for stock


picking and portfolio composition. Expert Systems with Applications, 38(11):14072–14085, 2011.

A. Gosavi. Simulation-based optimization. Parametric Optimization Techniques and Reinforcement

Learning, 25, 2003.

A. Hirabayashi, C. Aranha, and H. Iba. Optimization of the trading rule in foreign exchange using genetic

algorithm. In Proceedings of the 11th Annual conference on Genetic and evolutionary computation,

pages 1529–1536. ACM, 2009.

C. Huang, D. Yang, and Y. Chuang. Application of wrapper approach and composite classifier to the

stock trend prediction. Expert Systems with Applications, 34(4):2870–2878, 2008.

K. Kim and I. Han. Genetic algorithms approach to feature discretization in artificial neural networks for

the prediction of stock price index. Expert Systems with Applications, 19(2):125–132, 2000.

T.-H. Kim and H. White. On more robust estimation of skewness and kurtosis. Finance Research Letters,

1(1):56–73, 2004.

O. Koksoy and T. Yalcinoz. Robust design using Pareto type optimization: A genetic algorithm with

arithmetic crossover. Computers & Industrial Engineering, 55(1):208–218, 2008.

S. Kotsiantis and P. Pintelas. Recent advances in clustering: A brief survey. WSEAS Transactions on

Information Science and Applications, 1(1):73–81, 2004.

S. B. Kotsiantis, I. Zaharakis, and P. Pintelas. Supervised machine learning: A review of classification

techniques. Emerging artificial intelligence applications in computer engineering, 160:3–24, 2007.

C. Krauss, X. A. Do, and N. Huck. Deep neural networks, gradient-boosted trees, random forests:

Statistical arbitrage on the S&P 500. European Journal of Operational Research, 259(2):689–702,

2017.

M. Kumar and M. Thenmozhi. Forecasting stock index movement: A comparison of support vector

machines and random forest. Proceedings of ninth Indian Institute of Capital Markets Conference,

2006.

Y. Kwon and B.-R. Moon. Evolutionary ensemble for stock prediction. In Genetic and Evolutionary

Computation, pages 1102–1113. Springer, 2004.

L. Lam. Classifier combinations: Implementations and theoretical issues. In International Workshop on

Multiple Classifier Systems, pages 77–86. Springer, 2000.

X. Li and C. W. Chan. Application of an enhanced decision tree learning approach for prediction of

petroleum production. Engineering Applications of Artificial Intelligence, 23(1):102–109, 2010.

W. Loh. Classification and regression trees. Wiley Interdisciplinary Reviews: Data Mining and Knowledge

Discovery, 1(1):14–23, 2011.

Y. Lyuu. Financial engineering and computation: principles, mathematics, algorithms, chapter Introduction,

page 1. Cambridge University Press, 2001.

M. Maragoudakis and D. Serpanos. Towards stock market data mining using enriched random forests


from textual resources and technical indicators. In IFIP International Conference on Artificial Intelligence

Applications and Innovations, pages 278–286. Springer, 2010.

M. Mitchell. An Introduction to Genetic Algorithms. MIT Press, 1998.

J. J. Murphy. Technical Analysis of the Financial Markets: A Comprehensive Guide to Trading Methods

and Applications. New York Institute of Finance Series. New York Institute of Finance, 1999.

B. B. Nair, V. Mohandas, and N. Sakthivel. A decision tree—rough set hybrid system for stock market

trend prediction. International Journal of Computer Applications, 6(9):1–6, 2010.

J. Patel, S. Shah, P. Thakkar, and K. Kotecha. Predicting stock market index using fusion of machine

learning techniques. Expert Systems with Applications, 42(4):2162–2172, 2015a.

J. Patel, S. Shah, P. Thakkar, and K. Kotecha. Predicting stock and stock price index movement using

trend deterministic data preparation and machine learning techniques. Expert Systems with Applications,

42(1):259–268, 2015b.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer,

R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay.

Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

J. M. Pinto, R. F. Neves, and N. Horta. Boosting trading strategies performance using VIX indicator

together with a dual-objective evolutionary computation optimizer. Expert Systems with Applications,

42(19):6699–6716, 2015.

S. Piramuthu. Evaluating feature selection methods for learning in data mining applications. European

Journal of Operational Research, 156(2):483–494, 2004.

A. M. Prasad, L. R. Iverson, and A. Liaw. Newer classification and regression tree techniques: bagging

and random forests for ecological prediction. Ecosystems, 9(2):181–199, 2006.

Y. Qi. Random forest for bioinformatics. In Ensemble machine learning, pages 307–323. Springer, 2012.

Q. Qin, Q.-G. Wang, J. Li, and S. S. Ge. Linear and nonlinear trading models with gradient boosted

random forests and application to Singapore stock market. Journal of Intelligent Learning Systems

and Applications, 5(01):1, 2013.

J. R. Quinlan. Induction of decision trees. Machine Learning, 1(1):81–106, 1986.

J. R. Quinlan. C4.5: Programs for Machine Learning. Elsevier, 2014.

M. Sebban and R. Nock. A hybrid filter/wrapper approach of feature selection using information theory.

Pattern Recognition, 35(4):835–846, 2002.

K. Shin and Y. Lee. A genetic algorithm application in bankruptcy prediction modeling. Expert Systems

with Applications, 23(3):321–328, 2002.

R. Sikora and S. Piramuthu. Framework for efficient feature selection in genetic algorithm based data

mining. European Journal of Operational Research, 180(2):723–737, 2007.


A. Suresh. A study on fundamental and technical analysis. International Journal of Marketing, Financial

Services & Management Research, 2(5):44–59, 2013.

A. C. Tan and D. Gilbert. Ensemble machine learning on gene expression data for cancer classification.

Applied Bioinformatics, 2003.

M. Thakur and D. Kumar. A hybrid financial trading support system using multi-category classifiers and

random forest. Applied Soft Computing, 67:337–349, 2018.

C. Tsai and Y. Hsiao. Combining multiple feature selection methods for stock prediction: Union,

intersection, and multi-intersection approaches. Decision Support Systems, 50(1):258–269, 2010.

G. Valentini and F. Masulli. Ensembles of learning machines. In Italian Workshop on Neural Nets, pages

3–20. Springer, 2002.

G. van Rossum and P. D. Team. Python 2.7.10 Language Reference. Samurai Media Limited, United

Kingdom, 2015.

M. Wallace. General Purpose Applications of AI, volume 160, pages 1–2. IOS Press, 2007.

Q. Wang, J. Li, Q. Qin, and S. S. Ge. Linear, adaptive and nonlinear trading models for Singapore

stock market with random forests. In Control and Automation (ICCA), 2011 9th IEEE International

Conference on, pages 726–731. IEEE, 2011.

Y. Xu, Z. Li, and L. Luo. A study on feature selection for trend prediction of stock trading price. In

Computational and Information Sciences (ICCIS), 2013 Fifth International Conference on,

pages 579–582. IEEE, 2013.

L. Yu, C. Ding, and S. Loscalzo. Stable feature selection via dense feature groups. In Proceedings

of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages

803–811. ACM, 2008.
