Classification and Clustering of Stocks, using Genetic Algorithms and Fundamental Analysis
David Bugalho de Moura
Thesis to obtain the Master of Science Degree in
Electrical and Computer Engineering
Supervisors: Prof. Rui Fuentecilla Maia Ferreira Neves
Prof. Nuno Cavaco Gomes Horta
Examination Committee
Chairperson: Prof. Horacio Claudio de Campos Neto
Supervisor: Prof. Rui Fuentecilla Maia Ferreira Neves
Members of the Committee: Prof. Joao Paulo Baptista de Carvalho
November 2016
Acknowledgments
I would like to thank my supervisor, Rui Neves, for all the advice, ideas and knowledge transmitted,
especially in the financial field, through this last year. I would also like to thank my family for the support
given at every moment of my life, especially when solutions to the problems encountered seemed so
distant. Lastly, I would like to thank Clara Paiva for the motivation she gave me to do this work, even when
my willpower failed me.
Abstract
Over the last two decades the ease of access to information has grown exponentially, making it easier to
analyze and use this data in every field of science, including computational finance. This work presents
an architecture, built from scratch, of a trading system that classifies stocks using two techniques, in
order to conclude which one is superior: one with user input parameters, and an unsupervised one
that uses a genetic algorithm to optimize cluster positions, with a constant number of clusters. A genetic
algorithm is also applied to optimize fundamental indicators that give buy and sell signals in each of the
groups obtained, in order to conclude whether stocks in the same group behave in a similar fashion. Results
show that the best-performing group obtained with user input parameters is superior to the
best-performing group obtained with the clustering algorithm. However, the clustering algorithm classified stocks
better, showing a larger performance increase than the user input method when combined with the genetic algorithm
that uses fundamental indicators. The proposed system was implemented from scratch and contains
optimization modules, processing modules and a trading simulator.
Keywords
Genetic Algorithms; Fundamental Analysis; Fundamental Indicators; Classification; Clustering; Stock
Market; S&P500
Summary
Over the last two decades, access to information has grown exponentially, making it easier to
use this data in every scientific field, including artificial intelligence applied
to finance. This work presents a system that classifies company stocks using two
techniques, in order to conclude which one is better: a supervised one, with parameters given by the
user, and an unsupervised one that uses a genetic algorithm to optimize clustering. A
genetic algorithm is also applied to optimize fundamental indicators with the goal
of giving buy and sell signals in each of the previously obtained groups, in order to conclude whether
the stocks in a group behave similarly. The results show that the groups whose
members are most similar to each other were obtained with the clustering algorithm. The trading
system based on fundamental indicators showed a significant improvement in the return of the
best-performing group of the clustering algorithm.
Keywords
Genetic Algorithms; Fundamental Analysis; Fundamental Indicators; Classification; Clustering; Stock
Market; S&P500
Contents
1 Introduction
1.1 Motivation and Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Document Structure
2 State of Art
2.1 Financial Indicators
2.1.1 Fundamental Indicators
2.1.2 Technical Indicators
2.2 Computational Intelligence Algorithms
2.2.1 Genetic Algorithms
2.2.2 Neural Networks
2.2.3 Fuzzy Systems
2.3 Clustering Algorithms
2.4 Portfolio Composition Problem
2.5 Investment Strategies
2.5.1 Value Investing
2.5.2 Growth Investing
2.5.3 Growth At Reasonable Price (GARP) Investing
2.5.4 Income Investing
2.6 Classification of Stocks
2.7 Data Set / Markets
2.8 Related Work Results
2.8.1 Evolutionary Computing
2.8.2 Other methods
3 Architecture
3.1 Module View of the Global Architecture
3.2 General System Dataflow
3.3 Modules
3.3.1 Download
3.3.2 Stock
3.3.3 Fundamental Analysis
3.3.4 Classifier
3.3.5 Clustering
3.3.6 GA
3.3.7 Investment Simulator
3.3.8 Portfolio
3.4 Implementation Details
3.4.1 Data
3.4.2 Data Processing
3.4.3 Configuration
4 System Validation
4.1 Validation Metrics
4.1.1 ROI
4.1.2 Drawdown
4.1.3 Sharpe Ratio
4.1.4 Success rate of trades
4.1.5 Average time in the market
4.1.6 Rate of positive quarters
4.2 Case Studies
4.2.1 Case Study I - User Input Classification
4.2.1.A Conclusion
4.2.2 Case Study II - Clustering
4.2.2.A Conclusion
4.2.3 Case Study III - Using GAs to optimize FI
4.2.3.A Conclusions
5 Conclusion
5.1 Conclusions
5.2 Achievements
5.3 System limitations
5.4 Future Work
A List of Stocks used
List of Figures
2.1 Macroeconomic Indicators
2.2 Agilent Technologies Fundamentals
2.3 S&P500 index Technical Indicators
2.4 Crossover
2.5 Mutation
2.6 DOP Memory Approaches
2.7 Dynamic Optimization Problems (DOP) Diversity Approaches
2.8 DOP Multi-Population Approaches
2.9 Artificial Neural Network
3.1 Modules view of the architecture
3.2 Modules Data Flow
3.3 Stock’s Raw Data
3.4 Stock module
3.5 FA module
3.6 Types’ Quadrants
3.7 Classifier Structure
3.8 Fundamental Indicators’ Chromosome Representation
3.9 Clustering Chromosome Representation
3.10 Investment Simulator, Portfolio and Stock Interaction
4.1 User input classification portfolios
4.2 User input portfolios metrics table
4.3 Types B&H - Yearly Returns
4.4 Investment by Type table
4.5 Clustering classification portfolios
4.6 Clustering portfolios metrics table
4.7 Cluster’s portfolios yearly returns
4.8 Cluster’s representation in the whole plane - 2012Q4
4.9 Cluster’s representation in the whole plane - 2013Q4
4.10 Cluster’s representation in the plane - 2012Q4 Zoomed
4.11 Cluster’s representation in the plane - 2013Q4 Zoomed
4.12 Investment by Type table
4.13 GA optimization portfolios
4.14 FA GA portfolios metrics table
4.15 FA GA portfolios’ yearly returns
List of Tables
4.1 Average fitness per quarter of the clustering algorithm
4.2 Best, worst and median run of each portfolio
List of Algorithms
2.1 Simple Genetic Algorithm (GA)
2.2 Roulette Wheel Selection
2.3 Automatic Clustering Genetic Algorithm (ACGA)
3.1 Used GA
Acronyms
GA Genetic Algorithm
ML Machine Learning
AI Artificial Intelligence
CI Computational Intelligence
DM Data Mining
FA Fundamental Analysis
TA Technical Analysis
FI Fundamental Indicators
TI Technical Indicators
GDP Gross Domestic Product
CPI Consumer Price Index
PPI Producer Price Index
DR Debt Ratio
ROE Return On Equity
NI Net Income
PM Profit Margin
PER Price Earnings Ratio
RG Revenue Growth
CSO Common Stock Outstanding
EPS Earnings Per Share
NIG Net Income Growth
PR Payout Ratio
CE Capital Expenditures
CFOAG Cash From Operating Activities Growth
MA Moving Average
MSCI Morgan Stanley Capital International
OECD Organisation for Economic Co-operation and Development
DJI Dow Jones Industrial
NASDAQ National Association of Securities Dealers Automated Quotations
OBV On Balance Volume
VAMA Volume Adjusted Moving Average
EC Evolutionary Computation
EA Evolutionary Algorithms
ANN Artificial Neural Networks
AGA Adaptive Genetic Algorithm
DOP Dynamic Optimization Problems
SOS Self-Organizing Scouts
KDD Knowledge Discovery from Data
FCM Fuzzy C-means
ACGA Automatic Clustering Genetic Algorithm
MO Multi-Objective
SO Single-Objective
GARP Growth At Reasonable Price
MOEA Multi-Objective Evolutionary Algorithm
SPEA2 Strength Pareto Evolutionary Algorithm 2
R-SPEA2 Robust Strength Pareto Evolutionary Algorithm 2
MOGP Multi-Objective Genetic Programming
MCDM Multiple-Criteria Decision-Making
GP Genetic Programming
RST Rough Set Theory
MEPA Minimize the Entropy Principle Approach
CPDA Cumulative Probability Distribution Approach
FCMAC Fuzzy Cerebral Model Articulation Controller
RL Reinforcement Learning
SMA Simple Moving Average
RSI Relative Strength Index
MACD Moving Average Convergence Divergence
1 Introduction
Contents
1.1 Motivation and Context
1.2 Problem Statement
1.3 Proposed Solution
1.4 Document Structure
This section introduces the subject of computational techniques applied to companies’
past financial data as a way to find trading rules and patterns in the stock market and achieve higher
returns than the S&P500 index.
1.1 Motivation and Context
Since the recent advance of computational technologies (especially with the ease of information
access since the early 2000s), several techniques have been used to try to extract trading rules from past
stock market data [1]. The stock market is made up of humans, which gives it the same
unpredictability that humans have and makes it hard to find patterns in it. Nevertheless, with the increase in
available information observed in the last couple of decades, this task seems more and more feasible.
Although there is a randomness intrinsic to the stock market (associated with all the variables
that are unknown to us), a large number of researchers (for examples, see the next
chapter) have reported results that lead us to conclude that it is possible to obtain returns on the stock
market using computational resources.
To predict future stock prices with Artificial Intelligence (AI) and Computational Intelligence (CI)
systems, one needs to analyze data by applying data mining and machine learning algorithms.
Although Machine Learning (ML) started attracting attention in the 1980s, it only flourished in the 1990s,
when the field shifted from rule-based methods¹ to data-driven methods² (keeping approaches ML had inherited
from AI but moving towards methods based on statistics and probability theory). Data analysis evolved in
almost the same way, especially Data Mining (DM), whose methods overlap with those of
ML. The two can be distinguished by one key aspect: ML focuses on prediction using known properties
(learned in the training phase), whereas DM focuses on the discovery of unknown properties of the data.
Although these algorithms are some decades old, they were applied to finance only later,
not because the technology was not ready, but because data was neither readily available nor
affordable. In the late 1990s and early 2000s, the spread of the Internet made access to information
easier, cheaper and faster than at any other point in history, which was a big step for the way problems were
approached and for the technologies used.
Several techniques are used to predict stock market quotes, the most popular being AI methods
that optimize the parameters of financial indicators. There are two types of financial indicators: fundamental
and technical. Other methods evaluate each stock of a certain type, where the
type is defined by the stock's sector or by any other chosen rule. Since the amount of available data is
increasing exponentially, probabilistic methods are gaining a lot of attention.
¹ Defines a sequence of steps to be taken, based on a Knowledge Base (which consists of facts and an inference engine).
² Describes the data to be matched (pattern matching) and the processing in a more abstract way.
1.2 Problem Statement
The problem is to construct a system that evaluates the stock market and accurately groups
stocks that show similar behavior. It should also choose the companies that show higher returns in the
market, and study which technique is better at doing so. The system should be able to perform in a similar
way when trading in real time. The implemented system is built from scratch with the objective
of being used by human traders. The architecture of the system should also be open to improvements.
The main objective when solving this problem is to obtain investment strategies that achieve better results
than the S&P500 index.
1.3 Proposed Solution
The implemented solution is a system made from scratch that classifies companies into groups based
on their financial statements, in two ways: a supervised way with parameters defined by the user and
an unsupervised way, applying a genetic algorithm to optimize clustering position, with a fixed number
of clusters. After both classifications, the application uses a genetic algorithm once more to optimize
indicators to buy stocks, comparing the results of the optimization done with the whole pool of stocks
with each classification given, to conclude on which classification classifies stocks better. The proposed
system was implemented in C++ with about 9000 lines of code and it was made to be used by third
parties that wish to improve its features.
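The unsupervised step described above (a genetic algorithm that searches for cluster positions with a fixed number of clusters) can be illustrated with a minimal sketch. It is written in Python for brevity, although the thesis system is in C++, and it is not the thesis implementation: the chromosome layout (a flat list of centroid coordinates), the fitness (negative within-cluster squared distance), the operators (elitism, one-point crossover, Gaussian mutation) and every name and number below are assumptions chosen only for illustration.

```python
import random

K = 2    # fixed number of clusters (assumed value)
DIM = 2  # each stock is a point in a 2-D indicator space (assumed)


def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def decode(chromosome):
    """Split the flat gene list into K centroids of DIM coordinates each."""
    return [chromosome[i * DIM:(i + 1) * DIM] for i in range(K)]


def fitness(chromosome, points):
    """Negative within-cluster squared distance: higher is better (assumed objective)."""
    centroids = decode(chromosome)
    total = 0.0
    for p in points:
        c = centroids[nearest(p, centroids)]
        total += sum((a - b) ** 2 for a, b in zip(p, c))
    return -total


def evolve(points, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    """Run a small GA and return the best chromosome and the best fitness per generation."""
    rng = random.Random(seed)
    pop = [[rng.uniform(0.0, 1.0) for _ in range(K * DIM)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda ch: fitness(ch, points), reverse=True)
        history.append(fitness(pop[0], points))
        next_pop = [pop[0][:]]  # elitism: the best individual survives unchanged
        while len(next_pop) < pop_size:
            a, b = rng.sample(pop[:pop_size // 2], 2)  # parents from the better half
            cut = rng.randrange(1, K * DIM)            # one-point crossover
            child = a[:cut] + b[cut:]
            child = [g + rng.gauss(0.0, 0.05) if rng.random() < mutation_rate else g
                     for g in child]                   # Gaussian mutation
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=lambda ch: fitness(ch, points))
    history.append(fitness(best, points))
    return best, history


# Made-up, normalized indicator pairs for six stocks.
points = [(0.10, 0.20), (0.15, 0.25), (0.20, 0.10),
          (0.80, 0.90), (0.85, 0.80), (0.90, 0.85)]
best, history = evolve(points)
labels = [nearest(p, decode(best)) for p in points]  # cluster index per stock
```

Because the best individual is always copied into the next generation, the best fitness in `history` never decreases. This is a property of the sketch, not a claim about the thesis results.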
1.4 Document Structure
The presented thesis is structured as follows:
• Chapter 2 shows theoretical concepts required to develop this project, including the theory behind
financial indicators, machine learning algorithms with special focus on evolutionary computing,
clustering algorithms and portfolio management. It also gives an overview of the results obtained
by some related works.
• Chapter 3 documents the proposed solution, with a detailed description of the architecture, meth-
ods and details of the application.
• Chapter 4 presents a validation of the system, showing parameters used and a detailed study of
the solution performance and robustness through case studies.
• Chapter 5 summarizes this work, concluding the achievements, limitations and proposing future
improvements.
2 State of Art
Contents
2.1 Financial Indicators
2.2 Computational Intelligence Algorithms
2.3 Clustering Algorithms
2.4 Portfolio Composition Problem
2.5 Investment Strategies
2.6 Classification of Stocks
2.7 Data Set / Markets
2.8 Related Work Results
2.1 Financial Indicators
Financial indicators analyze statistics and present that information in the form of ratios, which
support managers' decisions about future stock prices. These indicators give better results when used
alongside the right strategy [1].
There are several types of financial indicators, divided into Fundamental Indicators (FI) or Technical
Indicators (TI), which come from Fundamental Analysis (FA) and Technical Analysis (TA).
2.1.1 Fundamental Indicators
FA is used to assess the intrinsic value of a certain market, industry or company; FA applied to
companies is the most widely used. FA uses financial statements to determine whether a company is under or overvalued.
FI are constructed using FA, and this type of indicator does not take market trends into account, only
the intrinsic value, or in other words, the real raw value of the company. It does not take into account
people’s feelings about the company, nor stock quotes or trends; it only considers the state of
the company's finances [2]. Through the analysis of the factors that reflect (or influence) a
company’s productivity, profitability or competitive advantage, one can identify whether a stock is overvalued or
undervalued.
FA has been made famous by value investors (see section 2.5.1) like Benjamin Graham and Warren
Buffett (for their value investment strategies [3], [4]).
Fundamental analysis can generate trading rules that determine which stocks show signals of being
a good investment (by being financially stable), and which stocks show signs of not being financially
stable [5].
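As a minimal sketch of what such a trading rule could look like, the Python snippet below filters stocks using two ratios defined later in this section, the debt ratio and the return on equity. The rule, its thresholds, the function name and the sample data are all hypothetical and are not taken from the thesis or from [5].

```python
def is_financially_stable(debt_ratio, return_on_equity,
                          max_debt_ratio=0.5, min_roe=0.10):
    """Hypothetical rule: low leverage and adequate profitability."""
    return debt_ratio <= max_debt_ratio and return_on_equity >= min_roe


# Made-up (ticker, debt ratio, ROE) triples, for illustration only.
stocks = [("AAA", 0.30, 0.18), ("BBB", 0.75, 0.22), ("CCC", 0.40, 0.04)]

# Keep only the tickers that satisfy the rule.
buy_list = [ticker for ticker, dr, roe in stocks if is_financially_stable(dr, roe)]
```

In this made-up data set only the first stock satisfies both conditions; the second is too leveraged and the third is not profitable enough for the assumed thresholds.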
FA collects data from the financial statements of several years and analyzes the financial evolution of a
company. This helps managers predict the growth of a company. Other applications
of FA include the evaluation of data that is external to the company, for example Gross Domestic
Product (GDP) or currency value, to evaluate the potential that a market may or may not have. There are three
main types of FA [2]: Macroeconomic (analysis of macroeconomic factors such as GDP growth to study the
effect of the macroeconomic environment on the future profit of a company), Industry Analysis (analysis
of the status and prospects of an industry, to estimate the value of a company operating in it)
and Company Analysis (which analyzes the operational status of a company to evaluate its internal
value, usually by analyzing the company's financial reports).
• Macroeconomic
This type of analysis uses macroeconomic indicators to make assumptions about the type of market
we are investing in [6]. Examples are economic growth and price stability in the economy;
price stability¹ can be measured as the rate of change in inflation [7]. There are several
macroeconomic indicators (some can be found in: http://www.rbcpa.com/economic fundamentals.pdf).
The Consumer Price Index (CPI)² is one such indicator. CPI measures changes in consumer
prices and, in theory, determines to what extent life is getting more expensive for the
average consumer. Another important indicator that also measures inflation is the Producer Price
Index (PPI), which measures the rate of change in the prices received by domestic producers
for their output. When these prices increase substantially, it is likely that companies will eventually
pass the burden of the price increase on to consumers.
GDP is also an important indicator, and one of the most used, because it represents the total
output of a given economy. The direction in which GDP is evolving (up or down) may indicate an
expansion or contraction of the economy. When GDP is stable or declining, most companies
will not be able to increase their profits; however, if GDP growth is too high, it may mean trouble,
because it usually comes with rising inflation and may bring other negative side
effects. See figure 2.1 for how these indicators evolved in the United States
from 2006 until 2016.
Figure 2.1: Macroeconomic indicators in the United States of America: (a) GDP growth rate, (b) inflation rate and (c) Producer Price Index, from 2006 until 2016 (adapted from http://www.tradingeconomics.com/)
In figure 2.1 one can see that the inflation graph looks like the PPI one, and GDP growth follows a
similar fashion.
¹ http://www.eestipank.ee/en/monetary-policy/importance-price-stability
² http://www.investopedia.com/terms/c/consumerpriceindex.asp
• Industry
The analysis of the fundamental value of an industry or sector (number of potential clients, volume
of transactions in that industry, etc.) is used as an indicator to check whether the target market is good
for investment [2].
The number of potential customers in an industry can indicate what kind of market we
are targeting. Markets that rely on a small number of clients for a big part of their revenues
are usually not good markets to invest in, since the loss of one of those clients may cause a major loss of
revenue (for example, if a military supplier makes 100% of its sales to the government, a change in
defense policy may cause the company to go bankrupt).
Industry growth, just like macroeconomic growth, is another indicator of whether a market is good
or bad for investment. Before looking for companies that meet certain requirements, one can check the
growth potential of an industry to see whether a target market is promising. If a market has a
stable or declining number of clients, it will be harder for a company to grow in that market, since
it will need to take market share from other companies.
• Company
Using the information given in financial statements, one can calculate fundamental
ratios used to compare companies and decide which ones are the best to invest in. See figure 2.2
for the evolution of the fundamentals of Agilent Technologies. Some company FI are described as
follows (these appear in [8], [9], [10], [11], for example):
– Debt Ratio (DR)
DR is a ratio used to measure the level of debt of a company. Companies with a higher debt
ratio will have a larger amount of debt compared to their assets, leaving them more vulnerable
to an adverse economy, a reduction in their profits or an increase in their debt interests. In
most cases, a high DR can mean that a company is in a highly competitive market, with a
constant need for research and development, usually carried by external financing.
\[ \mathrm{DR} = \frac{\text{Total Debt}}{\text{Total Assets}} \tag{2.1} \]
– Return On Equity (ROE)
ROE measures the performance of the company's Net Income (NI) relative to the company's
equity (that is, profits relative to the equity level). A high ROE is obtained through operational
efficiency, efficient use of assets and financial leverage. This ratio allows one to select
companies that maximize the return on the investment made in them, since the higher this ratio
is, the higher the return on the money invested in the stock.
\[ \mathrm{ROE} = \frac{\mathrm{NI}}{\text{Total Equity}} \tag{2.2} \]
– Profit Margin (PM)
PM is a ratio that measures the cost of the business to generate profit, or, as the name says,
the margin (profit) the company has after paying all the operating, administrative and financial
costs, along with taxes. Although it is strange (and may be a bad signal) if this ratio varies a
lot (it may mean a decapitalization if it increases, or that the revenue is not making profits, if it
decreases) it is usually a good sign when this indicator is high.
\[ \mathrm{PM} = \frac{\mathrm{NI}}{\text{Revenue}} \tag{2.3} \]
– Price Earnings Ratio (PER)
PER is a ratio that compares a company's share price with its
per-share earnings. It is the inverse of the earnings yield (per-share earnings expressed as a fraction of
the share price). It is usually used to look for undervalued companies. When the PER is going up, it is
usually because investors are expecting higher growth in the future. However, this indicator has to be
compared with the PER of stocks in the same sector; used without that comparison, it may be
misleading.
\[ \mathrm{PER} = \frac{\text{Share Price}}{\mathrm{EPS}} \tag{2.4} \]
– Revenue Growth (RG)
RG is an indicator that shows the evolution of the business. It increases for two main reasons:
either the company is gaining market share from its competitors, or the company operates
in a growing market and is growing with it. It only reflects the growth of a company’s revenue
and not of its profits.
\[ \mathrm{RG} = \frac{\text{Revenue}_{\text{Current}} - \text{Revenue}_{\text{LastYear}}}{\text{Revenue}_{\text{LastYear}}} \tag{2.5} \]
– Common Stock Outstanding (CSO)
CSO is an indicator of the ownership of the company held by shareholders. When a company
issues shares there is share dilution, and when a company reduces its outstanding shares
there is an increase in the Earnings Per Share (EPS) (since the same earnings are divided among
fewer shares) and a decrease in the PER. This is a good indicator for finding companies
that have repurchased their shares (reduced the outstanding shares).
\[ \Delta\mathrm{CSO} = \frac{\mathrm{CSO}_{\text{Current}} - \mathrm{CSO}_{\text{LastYear}}}{\mathrm{CSO}_{\text{LastYear}}} \tag{2.6} \]
– Net Income Growth (NIG)
NIG is an indicator of the trend in the profits of a company, and it is used to check
that a good result obtained in a certain year is not just a product of the economic conjuncture or of
financial engineering. This indicator can be used to search for undervalued stocks, if the stock
price does not follow the same behavior as the net income trend.
\[ \Delta\mathrm{NI} = \frac{\mathrm{NI}_{\text{Current}} - \mathrm{NI}_{\text{LastYear}}}{\mathrm{NI}_{\text{LastYear}}} \tag{2.7} \]
– Payout Ratio (PR)
PR indicates the percentage of net income distributed by the investor as dividends. High PR
indicates a stable company that does not need to do a lot of investment to keep their business
running but at the same time that is inserted in a stable market where stock performance will
be smaller than those in a fast growing pace, since the part of earnings not paid to investors is
used to invest and create future earning growths. Investors seeking high incomes with limited
earnings growth choose high PR, and investors seeking for capital growth choose lower PR.
$$PR = \frac{DPS}{EPS} \tag{2.8}$$
– Capital Expenditures (CE)
CE, when increasing with greater momentum than NI, is an indicator that the company is
probably operating in a competitive market. This indicator is compared to NI to avoid
companies that show this type of behavior (that is, companies in competitive markets).
$$\Delta CE = \frac{CE_{Current} - CE_{LastYear}}{CE_{LastYear}} \tag{2.9}$$
– Cash From Operating Activities Growth (CFOAG)
CFOAG is a measure of how well a company generates cash through its operations
(its ability to turn the operating income reported in the income statement into collected
cash). It accounts for the cash that actually flows into the company, because a company may
have a high operating net income but be inefficient in collecting its cash profits.
$$CFOAG = \Delta CFOA = \frac{CFOA_{Current} - CFOA_{LastYear}}{CFOA_{LastYear}} \tag{2.10}$$
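The year-over-year indicators above (equations 2.5 to 2.7, 2.9 and 2.10) all share the same form, so they can be computed with one helper. The following is a minimal sketch under stated assumptions: the function names, dictionary keys and all figures are hypothetical and only illustrate the arithmetic.

```python
def growth(current, last_year):
    """Relative year-over-year change, the common form of equations 2.5-2.7, 2.9 and 2.10."""
    if last_year == 0:
        raise ValueError("last year's value must be non-zero")
    return (current - last_year) / last_year


def price_earnings_ratio(share_price, eps):
    """PER = share price / EPS (equation 2.4)."""
    return share_price / eps


def payout_ratio(dps, eps):
    """PR = DPS / EPS (equation 2.8)."""
    return dps / eps


# Hypothetical figures for one company in two consecutive years.
last_year = {"revenue": 800.0, "cso": 110.0, "net_income": 80.0}
current = {"revenue": 1000.0, "cso": 99.0, "net_income": 100.0}

# One growth indicator per reported item (RG, delta CSO, delta NI).
indicators = {name: growth(current[name], last_year[name]) for name in current}
indicators["per"] = price_earnings_ratio(share_price=30.0, eps=2.0)
indicators["pr"] = payout_ratio(dps=0.5, eps=2.0)
```

In this made-up example the negative CSO change would flag a share repurchase, while revenue and net income grow at the same rate.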
Figure 2.2: Financials of Agilent Technologies Inc from 2011 until 2015 (adapted from https://www.google.com/finance): (a) Balance Sheet, (b) Cash Flow Statement, (c) Income Statement. The balance sheet holds information about total debt, total assets and the DR. The cash flow statement holds information about cash from operating activities, cash from investing activities and cash from financing activities. The income statement shows the revenue, the net income, the profit margin, the operating income and the operating margin
2.1.2 Technical Indicators
TA [12] and FA use different approaches towards investment, since TA uses the movement of stock
prices [13] and the volume of transactions [14] as the main information to predict stock markets. TIs look
for patterns in past data and use those patterns to forecast market tendencies (see figure 2.3 to see
how some TIs follow the trends of the S&P500 index). TA generates trading rules by analyzing previous
patterns of technical indicators [2]. TIs can be grouped into eight main groups [15]; five of them are
described as follows (apart from these there are also other kinds of TI: flow of funds, sentiment and raw
data):
• Trend
Trend analysis uses price-based indicators to track the price trends of stocks (or other assets).
Figure 2.3: S&P500 index and 3 TI - SMA, RSI and MACD (adapted from https://www.google.com/finance)
Strategies that use this indicator assume that political and economic events usually change
market prices through a change in market trends instead of returning to the most rational point. The
most common trend indicators are Moving Averages (MAs) (see for example [16]).
• Momentum
Momentum analysis is also a price-based indicator but used to evaluate the velocity of price
change, and evaluate if a trend reversal is about to happen.
• Volatility
Volatility analysis investigates fluctuations of price ranges in stocks. It can be used to evaluate risk
and identify the levels of support and resistance. Stock prices are usually recognized to fluctuate
between the level of support (lower level) and resistance (higher level), but continue to fall/rise if
they break through that level. Volatility indicators include Average True Range and Bollinger Bands,
among others. Volatility can also be used to predict Macroeconomic Indicators.
According to [17], volatility is a good indicator of GDP growth, since GDP growth shrinks after spikes
in volatility. Markets also react to volatility, both inside and outside a crisis context, and regardless of the
market context (either bull or bear). An increase in volatility is usually associated with an increase
in inflation and unemployment rate, and during recessions, on average, volatility rises and interest
rates drop. When a random shock in volatility occurs, GDP reacts to it but reverts to its mean
quickly after (1 or 2 quarters). However, if volatility is created by economic policy uncertainty, the
reversion to the mean can take a lot longer (especially if the policy shock is unexpected).
One way of measuring volatility, according to [17], is by the quarterly and monthly variance of the
average daily Morgan Stanley Capital International (MSCI) country stock market index.
In an attempt to proxy monetary policies, one can control short term interest rates, with a given
lag (to proxy implementation and effectiveness). Also, one can assess the overall tax level of
a country by checking the ratio between Tax Revenue and Real GDP. Industry production will
decrease with an increase in tax rates.
Volatility affects growth much more than the other way around. Three possible measures of
Macroeconomic uncertainty are the Leading indicator index from Organisation for Economic Co-
operation and Development (OECD) (contains various macroeconomic indicators, one of which is
industrial production index), the Oil Price Volatility and economic policy volatility [17].
[18] shows that permanent shocks (being shock defined as a volatility measure) explain the bulk
of the variation of stock prices over short periods. The author also says that three big American
indexes (Dow Jones Industrial (DJI), National Association of Securities Dealers Automated Quota-
tions (NASDAQ) and S&P500) share a common trend and a common cycle relationship, therefore
shocks will affect all markets similarly.
• Volume
Volume based indicators reflect the amount of investment from buyers/sellers, which can also
predict stock price movements. Volume indicators include Volume change rate, On Balance Vol-
ume (OBV), among others.
A Volume Adjusted Moving Average (VAMA) is used in [14]. It is based on equivolume charting,
a technique that analyzes stock prices in relation to the amount of volume traded. In this type of
charting the stock price goes on the vertical axis, and the volume traded goes on the horizontal axis.
Short and wide boxes tend to occur at turning points (the stock price is having difficulty moving), and
tall, narrow boxes usually occur in stable markets (the stock price is moving easily).
• Cycle
Cycle analysis is a type of indicator that assumes periodic variation in stock prices. Long cycles
can take years and include several smaller cycles. Strategies that use this indicator analyze the
position of the stock price in the cycle.
[19] tries to find a correlation in the amount of business between countries, and the impact of
shocks in business cycles and GDP.
Dow’s theory [20] (one of the origins of the trend analysis) assumes there are three types of trends
in the stock market:
– Primary trend: Long term movement of prices (from a year to three years)
– Secondary trend: Short term deviations of prices from the underlying trend. It can be seen
as a correction from the primary trend (from three weeks to three months).
– Tertiary trend: A corrective movement from the secondary trend (less than three weeks).
A cycle is defined as an up trend, down trend and up trend again [21], taking only one of the Dow’s
theory trends into consideration. Longer cycles are constituted by several smaller ones.
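To make the trend, momentum and volatility groups above more concrete, the sketch below computes one simple indicator for each from a list of closing prices. It is only an illustration under assumptions: the function names, the choice of a simple moving average, a rate-of-change momentum and a rolling variance of returns, and the price series are all hypothetical and are not the indicators used in the cited works.

```python
from statistics import pvariance


def simple_moving_average(prices, window):
    """Trend: mean of the last `window` prices, one value per complete window."""
    if window <= 0 or window > len(prices):
        raise ValueError("window must be between 1 and len(prices)")
    return [sum(prices[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(prices))]


def rate_of_change(prices, lag):
    """Momentum: relative price change over `lag` periods."""
    return [(prices[i] - prices[i - lag]) / prices[i - lag]
            for i in range(lag, len(prices))]


def rolling_return_variance(prices, window):
    """Volatility: variance of one-period returns over a rolling window."""
    returns = [(b - a) / a for a, b in zip(prices, prices[1:])]
    return [pvariance(returns[i - window + 1:i + 1])
            for i in range(window - 1, len(returns))]


# Hypothetical closing prices.
prices = [10.0, 11.0, 12.0, 11.0, 13.0]
sma = simple_moving_average(prices, window=3)
roc = rate_of_change(prices, lag=2)
vol = rolling_return_variance(prices, window=2)
```

A trading rule would then compare such series against each other or against thresholds, which is the kind of parameter a GA can optimize.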
2.2 Computational Intelligence Algorithms
CI combines methods and tools to solve problems that would normally require human intelligence.
There are several known CI algorithms: artificial neural networks, fuzzy logic systems, evolutionary
algorithms, among many others. In all of them, success in solving a problem depends mostly on
how that problem is represented by the algorithm.
When it comes to algorithmic implementation in computational finance, a popular approach is Evolutionary
Computation (EC) to optimize rule discovery, because the population based system used by EC greatly
increases the number of searches in the solution search space (by doing parallel search), thus reducing
computational time.
EC is the subfield of AI on which this work focuses. Evolutionary Algorithms (EA) are
algorithms that optimize or learn tasks with the ability to evolve. EAs have three main characteristics:
they are population-based (the algorithm maintains a set of solutions to search the solution space in a
parallel way), fitness-oriented (the algorithm has a fitness function which measures the success of the
solution, and this is the main aspect that guarantees convergence) and variation-driven (solutions
undergo several variation operations, to cover more of the search space and to avoid local maxima) [22].
Most CI problems can be seen as a mapping of a domain space into a solution space, and usually
the number of possible solutions becomes so large that it is impossible to search all of them. EAs are
stochastic methods that use heuristics to find solutions, which means they do not guarantee the best
solution, but they significantly reduce cost and time [23].
The EAs used in this work are GAs, the most used kind of EA. They can be used either
as an optimization algorithm or to study adaptive systems. GAs simulate natural selection, where better
solutions are more prone to reproduce than worse solutions, each solution (individual) has a limited life
span, there is variation in the population and the ability to survive is positively correlated with the ability
to reproduce [24].
Apart from EC techniques, there are other popular approaches, such as Artificial Neural Networks
(ANN) and Fuzzy Systems.
2.2.1 Genetic Algorithms
GAs are the type of EA on which this work focuses. GAs are based on the theory of evolution
developed by Darwin, simulating the evolution of a species in a certain environment. A GA starts with a
population of individuals (chromosomes), where each one encodes a solution. As happens in the
evolutionary process of a species, these individuals reproduce in order to create offspring solutions better than the
parent solutions.
GAs have proven to be a useful optimization and search algorithm. Many problems in AI can be
defined as a search in a solution space (called search space) which contains every possible solution.
GAs search this space by comparing solutions and looking for the best one.
This heuristic allows the search of several solutions in parallel, converging to better ones. This
convergence is measured by a fitness function. Fitter solutions are privileged when selecting solutions
to "reproduce", attracting the whole population of solutions to somewhere near them in the search space
[22].
Usual implementations of GA individuals are arrays or trees of values (as in [25]), where each value
codifies a parameter to be optimized.
A set of genetic operators has to be defined for the GA. The way these genetic operators are
implemented determines the success of the algorithm. In a simple GA, the algorithm has to take four steps on
each iteration (generation) [26]:
• Selection After evaluating the fitness of each individual of the population, the first step is to select
individuals to reproduce. This selection is done randomly, taking into account the relative fitness
of individuals, such that the best solutions are more likely to be chosen.
• Reproduction In this step, offspring are created from the selected individuals. Both
recombination and mutation of values can be used for this.
• Evaluation The fitness of the new population is reevaluated.
• Replacement In the last step, recently created individuals replace individuals from the old
population.
The algorithm repeats until a stopping condition is reached: a maximum number
of generations, no change in the fittest individuals of the population for a predetermined number of
generations, or a specified elapsed time.
A simple GA pseudocode is given in algorithm 2.1.
Algorithm 2.1: Simple GA
t ← 0;
P(t) ← random;
Evaluation P(t);
while not Endcondition do
    Pp(t) ← Selection of parents from P(t);
    Pc(t) ← Crossover from Pp(t);
    Pm(t) ← Mutation of Pc(t);
    Evaluation Pm(t);
    P(t + 1) ← New Generation Creation from (P(t), Pm(t));
    t ← t + 1;
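The loop of the simple GA can be sketched in a few lines of Python. This is not the implementation used in this work: the binary chromosomes, the fitness-proportional selection, the single-point crossover, the bit-flip mutation, the one-individual elitism and every parameter value are assumptions chosen only to make the sketch runnable, and the fitness function (number of ones in the chromosome) is a toy problem.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Fitness-proportional selection; assumes non-negative fitness values."""
    total = sum(fitnesses)
    if total <= 0:
        return rng.choice(population)
    r = rng.random() * total  # r in [0, total)
    running = 0.0
    for individual, fit in zip(population, fitnesses):
        running += fit
        if running > r:
            return individual
    return population[-1]  # guard against floating point round-off


def simple_ga(fitness, n_genes=20, pop_size=30, generations=60,
              crossover_rate=0.9, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    # P(0) <- random
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        fitnesses = [fitness(ind) for ind in population]  # Evaluation
        # Elitism (assumed here): the best individual of P(t) survives unchanged.
        elite = population[fitnesses.index(max(fitnesses))]
        offspring = [elite[:]]
        while len(offspring) < pop_size:
            a = roulette_select(population, fitnesses, rng)  # Selection
            b = roulette_select(population, fitnesses, rng)
            if rng.random() < crossover_rate:  # single-point crossover
                point = rng.randint(1, n_genes - 1)
                child = a[:point] + b[point:]
            else:
                child = a[:]
            # Bit-flip mutation, applied gene by gene.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            offspring.append(child)
        population = offspring  # New generation
    return max(population, key=fitness)


# Toy problem: maximize the number of ones in the chromosome.
best = simple_ga(fitness=sum)
```

Because the elite individual is copied into each new generation, the best fitness in the population never decreases from one generation to the next.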
The basic operators of a GA are defined as follows:
• Selection
Selection is made by evaluating and ranking individuals, using the fitness function of the GA [27].
Selection has several ways of being implemented. This work will focus on two implementations:
Roulette Wheel Selection and Ranking Selection. These are described as follows:
– Roulette Wheel Selection The principle here is that of a linear search in a roulette wheel,
where the slots in the wheel are weighted in proportion to each individual’s fitness value. To
implement the roulette wheel one has to go through the following steps: first the total expected
value of individuals in the population is obtained (see equation 2.11), and afterwards the
algorithm can run (see algorithm 2.2)
$$T = \sum_{i=1}^{N} Fitness_i \tag{2.11}$$
then:
Algorithm 2.2: Roulette Wheel Selection
i = 0;
while i != N do
    choose random number r ∈ ]0, T];
    j = 0;
    Fit_Sum = 0;
    while Fit_Sum < r do
        Fit_Sum = Fit_Sum + Fitness_j;
        ++j;
    ++i;
– Ranking Selection In Ranking Selection individuals receive their fitness according to their ranking.
This results in slower convergence, but it avoids premature convergence and the risk of getting
trapped in a local maximum. A suggestion to do this is to select two individuals at random;
the one with the best ranking becomes the parent. Then, repeat this process to find the other
parent.
Figure 2.4: Single and double point crossover: (a) Single Point Crossover, (b) Multi-point Crossover
Figure 2.5: Boolean valued and integer valued mutation: (a) Boolean Valued Mutation, (b) Integer Valued Mutation. The integer valued mutation is not associated with any probabilistic distribution in this image; it is purely figurative
• Crossover
Crossover simulates the biological crossover, and mixes values of individuals (of the old popula-
tion) to generate an offspring (that will be an individual in the new population). The Crossover can
be made in a single point (2 segments are exchanged between individuals), or in multiple points
(more than 2 segments exchanged) as shown in figure 2.4.
• Mutation
Mutation is an operator used to increase diversity in solutions (being able to cover more of the
search space, and avoiding being stuck at a local maximum). Mutation perturbs a value in the
chromosome by adding noise with a certain probability distribution (a popular choice is Gaussian noise) in
real valued chromosomes, randomizing that value, interchanging values or flipping boolean values,
as shown in figure 2.5.
• Other paradigms
In [25] a tree structure is used to represent a portfolio in a GA, where the GA has to satisfy
some additional rules: for example, each branch of the tree must be a portfolio by itself, each node
represents the weight of that branch, and the leaves are the stocks. In this representation operations
in the GA are handled differently.
There are also other operators or techniques that can be applied to the GA in order to improve its
results, such as elitism, that propagates a percentage of the best individuals in a population into
the next generation.
Adaptive Genetic Algorithms (AGAs) are used for Dynamic Optimization Problems (DOP) (problems
where variables change over time). This kind of problem needs a solution that tracks the moving
optimum over time. To achieve this, one has to make some enhancements to the GA such that it adapts
to the new optimum over time. An AGA can be a GA whose parameters (such as population size,
mutation or crossover probability) change while the GA is running. According to [28], a DOP is
characterized as:
$$F = f(\vec{x}, \vec{\psi}, t) \tag{2.12}$$
Where $\vec{x}$ are the decision variables, $\vec{\psi}$ the parameters and t is time. The challenge here is to track
the moving solution without having to restart the algorithm. There are 5 main approaches to this:
– Memory: store useful information
– Diversity: handle convergence
– Multi-Population: co-operate between sub-populations
– Adaptive: adapt generators and parameters
– Prediction: forecast changes and take action
Details about these techniques are given as follows:
– Memory Approaches
Memory approaches are particularly useful for cyclic DOPs. This approach can be divided into
implicit memory approaches and explicit memory approaches. Implicit memory approaches
use redundant information. A way of implementing this in GAs is by using a pair of
chromosomes in each individual (Diploid GA) that encode the genotype of the individual, and a
dominance scheme that maps the genotype to the phenotype. Explicit memory approaches use
extra memory to store useful information about the population. The best solutions are saved in
memory, such that when a change occurs the memory solutions are used to track the new
optimum. If Direct Memory is used, only good solutions are stored in memory, and if
Associative Memory is used, good solutions and environmental information (context) are stored. In
this case, when a memory update occurs, a new pair $(A, \vec{D})$ (with $\vec{D}$ being the environmental
information) replaces another, and solutions are generated by sampling $\vec{D}_M$. An example
is shown in figure 2.6.
Figure 2.6: DOP Memory Approaches
Figure 2.7: Diversity Approaches: (a) Random Immigrants, (b) Memory-based Immigrants
– Diversity Approaches
Diversity Approaches will use diversity of individuals to cover more of the search space in
order to have a faster convergence when a change occurs. A way of achieving this is the
Random Immigrants approach (see figure 2.7). This approach inserts random individuals
each generation to maintain diversity, such that when a change occurs the random individu-
als will attract the population to the new optimum. A second approach is using Memory-based
Immigrants, where some points in the search space are stored into memory and re-evaluated
each generation. On each generation the best memory point is chosen and the immigrants
are generated by mutating this point with a certain probability, and then the population re-
places the worst individuals with these solutions.
– Multi-Population Approaches
Multi-Population approaches use several co-operating sub-populations to explore the search
space at the same time. One approach to this is the Shifting Balance, where a core population
explores the area of the present optimum while several colonies (sub-populations) explore the
rest of the search space. Whenever a change in the optimum occurs, the fittest individuals
of the colony searching the space of the current optimum will migrate to the core population,
attracting the core population to this search space. Another approach is the Self-Organizing
Scouts (SOS), where a core population explores the promising search space and is split into
child populations under certain conditions. Each child population explores a limited promising
area and is also split under certain conditions (see figure 2.8).
Figure 2.8: DOP Multi-Population Approaches: (a) Shifting Balance, (b) Self-organizing Scouts
– Adaptive Approaches
Adaptive approaches change the operators/parameters of the GA, usually after a change, to
pressure the population to dramatic changes for a certain period. Hyper-mutation, Hyper-
selection and Hyper-learning are 3 operators used to achieve this (augmenting mutation rate,
selection pressure and learning rate temporarily).
– Predictive Approaches
Prediction approaches analyze patterns in the DOP to forecast the next optimum, when the
next change will occur and which environment may appear. Kalman Filters and forecasting
are two examples of techniques used.
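Two of the DOP techniques described in this section, Random Immigrants and Hyper-mutation, are simple enough to sketch directly. The code below is a hedged illustration rather than the formulation from [28]: the function names, the binary chromosomes, the choice of replacing the worst individuals with the immigrants, and the parameters `ratio`, `hyper_rate` and `hyper_period` are assumptions.

```python
import random


def random_individual(n_genes, rng):
    """A random binary chromosome (representation assumed for illustration)."""
    return [rng.randint(0, 1) for _ in range(n_genes)]


def insert_random_immigrants(population, fitness, ratio, rng):
    """Diversity approach: replace the worst share of the population with random individuals."""
    n_immigrants = int(len(population) * ratio)
    if n_immigrants == 0:
        return list(population)
    ranked = sorted(population, key=fitness, reverse=True)
    survivors = ranked[:len(population) - n_immigrants]
    n_genes = len(population[0])
    immigrants = [random_individual(n_genes, rng) for _ in range(n_immigrants)]
    return survivors + immigrants


def mutation_rate(base_rate, hyper_rate, generations_since_change, hyper_period):
    """Adaptive approach: use a raised mutation rate for a while after a change."""
    if generations_since_change < hyper_period:
        return hyper_rate
    return base_rate


rng = random.Random(0)
population = [random_individual(8, rng) for _ in range(10)]
# Each generation, 20% of the population is replaced by random immigrants.
population_next = insert_random_immigrants(population, sum, 0.2, rng)
```

Both helpers would be called once per generation inside the GA loop; detecting that the environment has changed is a separate problem that this sketch does not address.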
2.2.2 Neural Networks
ANNs have attracted the attention of researchers due to their predictive power and flexibility. An ANN is a
biologically inspired computational model which consists of processing elements (neurons) and
connections between them with coefficients (weights). These connection weights are the "memory" of the
system [23]. This kind of system can be used for either supervised or unsupervised learning.
Usually, neurons are visualized as being arranged in layers, and typically neurons in the same layer
behave in the same manner. The arrangement of neurons into layers and the connection patterns
within and between layers is called the "net architecture". Figure 2.9 shows an example of a feedforward
network: a network in which the signals flow from the input units to the output units, in a forward
direction [29].
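The forward flow of signals in a feedforward network can be sketched as a chain of weighted sums. The following is a minimal illustration, not the architecture of any cited work: the sigmoid activation, the bias terms and the layer sizes are assumptions, and no training is shown.

```python
import math


def sigmoid(z):
    """Logistic activation (an assumed choice; other functions are common)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """weights[j][i] is the weight of the connection from input i to neuron j."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]


def feedforward(inputs, layers):
    """Propagate the inputs through each (weights, biases) layer in order."""
    signal = inputs
    for weights, biases in layers:
        signal = layer_forward(signal, weights, biases)
    return signal


# Hypothetical 2-3-1 network with all weights set to zero.
hidden = ([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]], [0.0, 0.0, 0.0])
output = ([[0.0, 0.0, 0.0]], [0.0])
prediction = feedforward([1.0, 2.0], [hidden, output])
```

With all weights at zero every neuron outputs 0.5 regardless of the input; learning consists of adjusting the weights so that the output becomes informative.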
ANN applied to computational finance are implemented in several ways (see [30], [31], [21], [32], [33]
or [34] for some examples).
Figure 2.9: Artificial Neural Network representation - $w_{ij}$ is the weight given to the connection between nodes i and j, and $w'_{jk}$ is the weight of the connection between nodes j and k. These weights are chosen according to a mathematical function, that will decide which neurons (nodes) will be used as path for the inputs.
2.2.3 Fuzzy Systems
In 1965, Lotfi Zadeh published a paper [35] formally developing multi-valued set theory, which later
came to be known as fuzzy logic. In that paper, the author showed how the indicator function $I_A$ of a non-fuzzy
subset A of X, described in equation 2.13, could be extended to the multivalued indicator function $\mu_A$
of a fuzzy subset of X, given by the membership function in equation 2.14.
$$I_A(x) = \begin{cases} 1, & x \in A \\ 0, & \text{otherwise} \end{cases} \tag{2.13}$$
In equation 2.13 0 represents non-membership and 1 represents membership.
$$\mu_A(x) : X \to [0, 1] \tag{2.14}$$
In equation 2.14 µA(x) is interpreted as the degree of membership of element x in fuzzy set A for each
x ∈ X [36].
If the universe is discrete, a membership function can be defined by a finite set in the following way:
$$A = \sum \mu_i / u_i \tag{2.15}$$
In equation 2.15 the symbol / separates the membership degrees µ(ui) from the elements of the universe
ui ∈ U [23].
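The difference between the crisp indicator function of equation 2.13 and a membership function as in equation 2.14 can be illustrated with a short sketch. The triangular shape, the function names and the "fair PER" example with its breakpoints are assumptions made for illustration; the source does not prescribe any particular membership function.

```python
def indicator(x, crisp_set):
    """Equation 2.13: 1 for membership in a non-fuzzy set, 0 otherwise."""
    return 1 if x in crisp_set else 0


def triangular_membership(x, low, peak, high):
    """An assumed membership function with values in [0, 1] (equation 2.14)."""
    if x <= low or x >= high:
        return 0.0
    if x == peak:
        return 1.0
    if x < peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


# Discrete fuzzy set in the spirit of equation 2.15: element -> membership degree.
# Hypothetical linguistic value "fair PER", peaking at a PER of 15.
universe = [5, 10, 15, 20, 25]
fair_per = {u: triangular_membership(u, 5, 15, 25) for u in universe}
```

A fuzzy rule of the type IF this THEN that would then fire with a strength given by such a membership degree instead of a hard true or false.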
Fuzzy rules applied to computational finance usually create linguistic rules of the type IF this THEN that
using technical indicators, which can be understood by a human trader. Fuzzy systems are usually used
with ANNs; examples are given in [37], [32].
2.3 Clustering Algorithms
Data mining is the process of exploring data from different perspectives to discover previously
unknown patterns, developing a model used to understand phenomena from the data and summarizing
it into useful information [38].
This analysis makes it possible to obtain correlations and to learn new features about the data set. Although the
term is relatively new, the technology is not, and it is used by large distribution companies (Walmart for
example) to relate customers’ buying patterns, being able to increase revenue using this information.
DM is considered a step in Knowledge Discovery from Data (KDD), a process which consists of
the iteration of the following steps [39]:
• Data cleaning Removes noise and inconsistent data
• Data integration Multiple data sources may be combined
• Data selection Relevant data for the analysis task is retrieved from the databases
• Data transformation Data is transformed or consolidated into appropriate forms for mining
• Data mining Intelligent methods are applied to extract patterns from data
• Pattern evaluation Identifies patterns that represent knowledge based on some interestingness
measures
• Knowledge presentation Visualization and knowledge representation techniques are used to
present the mined knowledge to the user.
There are several DM algorithms; for an explanation of some of them see [40]. This work uses a
clustering algorithm, a type of algorithm that can be used in data mining (although in this work it is
used as a classification algorithm), and so it focuses on detailing this technique.
Clustering is a data analysis tool which solves classification problems. It is applied when there is
no class to be predicted, but instead the instances can be divided into natural groups. Clustering
itself is not an algorithm, but a task, with several algorithms that can be used to find a solution. The
best algorithm to apply depends on the data and the desired results [41]. It is an unsupervised technique,
since it does not use preclassified data. Instead the algorithm discovers similarities (in the requested
attributes) between objects of the set, grouping them in the same cluster. The identified groups may be
exclusive (an instance belongs to only one group), overlapping (an instance belongs to several groups),
or probabilistic (an instance belongs to each group with a certain probability) [42].
The objective of exclusive clustering is to group a set into smaller subsets, such that the degree of
association is strong between members of the same cluster and weak between members of different clusters.
There are many clustering algorithms (see [43]). Some clustering algorithms use metrics to measure
intra-cluster or inter-cluster distance. In [44] a GA is used to optimize clustering, using the
Calinski-Harabasz index as fitness function, obtaining better results than classical clustering methods such as
K-means and Fuzzy C-means (FCM) (FCM is a fuzzy clustering method, in which an object can
belong to more than one cluster; for more information see [45] and [46]).
The algorithm from [44], called ACGA, uses cluster points as chromosomes for the GA, together with
activation values for those points. It is described in algorithm 2.3.
Algorithm 2.3: ACGA
t ← 0;
P(t) ← random;
A(t) ← random;
while not Endcondition do
    Pp(t) and Ap(t) ← Selection of parents from P(t) and A(t);
    Pc(t) and Ac(t) ← Crossover from Pp(t) and Ap(t);
    Pm(t) and Am(t) ← Mutation of Pc(t) and Ac(t);
    Check for bigger A(t) values;
    Choose clusters corresponding to bigger A(t) values;
    Compute Calinski-Harabasz index to attribute fitness values;
    P(t + 1) ← New Generation Creation from (P(t), Pm(t)) and (A(t), Am(t));
    t ← t + 1;
In algorithm 2.3, P(t) is the population (each individual is an array of points in the space that can be
chosen as a cluster center) and A(t) are the activation values (each individual is an array of values
∈ [0, 1]). Each individual in A(t) corresponds to an individual in P(t) (chromosomes are of the same
size).
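The fitness evaluation step of the ACGA can be sketched as follows. This is an illustration under assumptions, not the code of [44]: it assumes that the k candidate centers with the largest activation values are kept, that each data point is assigned to its nearest active center, and that the Calinski-Harabasz index is computed from the centroids of the resulting groups. All function names and the data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def active_centers(candidate_centers, activations, k):
    """Keep the k candidate centers with the largest activation values."""
    order = sorted(range(len(activations)), key=lambda i: activations[i], reverse=True)
    return [candidate_centers[i] for i in sorted(order[:k])]


def assign(data, centers):
    """Group each point with its nearest center; empty groups are dropped."""
    clusters = [[] for _ in centers]
    for point in data:
        nearest = min(range(len(centers)),
                      key=lambda i: squared_distance(point, centers[i]))
        clusters[nearest].append(point)
    return [cluster for cluster in clusters if cluster]


def calinski_harabasz(clusters):
    """Ratio of between-cluster to within-cluster dispersion (higher is better)."""
    n = sum(len(cluster) for cluster in clusters)
    k = len(clusters)
    if k < 2 or n <= k:
        return 0.0
    overall = mean_point([p for cluster in clusters for p in cluster])
    between = sum(len(c) * squared_distance(mean_point(c), overall) for c in clusters)
    within = sum(squared_distance(p, mean_point(c)) for c in clusters for p in c)
    if within == 0:
        return float("inf")
    return (between / (k - 1)) / (within / (n - k))


def acga_fitness(candidate_centers, activations, data, k):
    centers = active_centers(candidate_centers, activations, k)
    return calinski_harabasz(assign(data, centers))


# Hypothetical data with two obvious groups, and one chromosome of three candidates.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
candidates = [(0, 0), (10, 10), (0, 1)]
good = acga_fitness(candidates, [0.9, 0.8, 0.1], data, k=2)
bad = acga_fitness(candidates, [0.9, 0.1, 0.8], data, k=2)
```

The first activation vector selects one center in each natural group and scores much higher than the second, which activates two centers in the same group.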
2.4 Portfolio Composition Problem
The original portfolio composition optimization problem, formulated by Markowitz [47], is described
in equations 2.16.
$$\text{Max expected return} = \sum_{i=1}^{M} u_i x_i \tag{2.16a}$$
$$\text{Min expected risk} = \sum_{i=1}^{M} \sum_{j=1}^{M} o_{ij} x_i x_j \tag{2.16b}$$
$$\text{s.t.} \quad \sum_{i=1}^{M} x_i = 1 \tag{2.16c}$$
In equations 2.16 ui is the expected return of asset i, xi is the investment portion on the asset, and oij
is the covariance between asset i and j.
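The two objectives and the constraint of equations 2.16 translate directly into code. The sketch below only evaluates a given portfolio; it does not optimize it. The function names and the two-asset figures are hypothetical.

```python
def expected_return(weights, returns):
    """Equation 2.16a: sum of u_i * x_i."""
    return sum(x * u for x, u in zip(weights, returns))


def expected_risk(weights, covariance):
    """Equation 2.16b: double sum of o_ij * x_i * x_j."""
    m = len(weights)
    return sum(covariance[i][j] * weights[i] * weights[j]
               for i in range(m) for j in range(m))


def is_fully_invested(weights, tol=1e-9):
    """Equation 2.16c: the investment portions must add up to 1."""
    return abs(sum(weights) - 1.0) <= tol


# Hypothetical two-asset portfolio.
weights = [0.5, 0.5]
returns = [0.10, 0.06]
covariance = [[0.04, 0.01],
              [0.01, 0.02]]

portfolio_return = expected_return(weights, returns)
portfolio_risk = expected_risk(weights, covariance)
```

For these figures the expected return is 0.08 and the expected risk is 0.02, which is lower than the variance of the first asset on its own (0.04) and shows the effect of diversification.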
There are two main approaches to this problem. One is Single-Objective (SO) optimization, and the
other Multi-Objective (MO) optimization.
In the case of SO, a single criterion is optimized, an optimum is either its maximum or its minimum,
and a solution dominates another if it is above (for maximums) or below (for minimums) the other solution.
In a MO there are several criteria to be optimized, and the exact mutual influences between objectives
can become complicated, and are not always obvious. This approach uses Pareto optimality, which
defines the frontier of solutions that can be reached by trading off conflicting objectives [48]. According
to [49] and [50], a MO algorithm is said to be robust if solutions remain as close as possible to the Pareto
Front, the rankings are the same in training and validation, solutions are diverse (uniformly distributed
along the Pareto Front) and non-dominated solutions remain non-dominated in both training and test. In [50], there
are also techniques used to improve robustness, for example Mating Restriction (restricting mating to
occur only between dominated and non-dominated individuals).
Fitness can be described as proximity to the Pareto front, and solution’s diversity can be described as
the distribution of solutions in the Pareto Front. The MO approach has shown great results in optimizing
the portfolio composition problem and in ranking stocks [51]. Examples of works using MO are [52] [11]
[53].
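Pareto dominance for the two objectives of the portfolio problem (maximize expected return, minimize expected risk) can be sketched as below. The tuple layout, the function names and the candidate portfolios are assumptions made for illustration.

```python
def dominates(a, b):
    """a and b are (expected_return, risk) pairs.

    a dominates b if it is no worse in both objectives and strictly
    better in at least one (higher return or lower risk).
    """
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better


def pareto_front(solutions):
    """Solutions that are not dominated by any other solution."""
    return [s for s in solutions
            if not any(dominates(other, s) for other in solutions)]


# Hypothetical (expected return, risk) pairs of four candidate portfolios.
candidates = [(0.08, 0.02), (0.10, 0.04), (0.07, 0.03), (0.10, 0.05)]
front = pareto_front(candidates)
```

Here the third candidate is dominated by the first and the fourth by the second, so the front contains the two remaining portfolios, neither of which is better than the other in both objectives.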
2.5 Investment Strategies
Strategies vary a lot, ranging from Value Investing strategies to Growth Investing strategies, or even
a mixture of both. One can choose between several strategies when investing, some of which are
described below.
2.5.1 Value Investing
As referred above, FIs were largely used by investors such as Warren Buffett, who made value investing
famous. He would look for companies with a high intrinsic value, and/or companies that had some sort
of competitive advantage, and buy them, as said in [4]. This generated huge profits for these investors,
because the market (which is self regulated) eventually realized the value of the stocks, and since the
companies had a competitive advantage in the market they operated in, the stocks never had a big
breakdown in recession times, and continued to grow further. This type of strategy uses FIs above all
other indicators, since they give the best mechanism to evaluate the financial strength of a company, even
if its quote is currently falling. FIs do not take into account any kind of trend, and so all one can assume
(using only FIs) is that a company is good or bad, according to its financial statements (comparing the
intrinsic value with the market value), and wait for the market to eventually realize the company’s value.
2.5.2 Growth Investing
TI are mostly used for strategies that take into account tendencies such as growth investing (this
investment strategy is the most commonly used in computational finance). This means that instead of
measuring the intrinsic value of an asset at given times, it measures how the market reacts to it, either
by simply analyzing the stock price trend, or going into a more complex analysis of relating the trend of a
stock price to its volatility. Studying past tendencies of certain TIs can help to predict the tendency, using
that information to buy, sell or hold a certain stock. Many times this is not done by analyzing only a single
TI, but several of them, and by drawing conclusions on how a market will evolve given that information.
2.5.3 GARP Investing
GARP takes the best of value and growth investing, by looking for companies with a good intrinsic
value and good growth prospects. One of the biggest supporters of this kind of investing is Peter Lynch
(see [54]). He segmented the types of existing markets, and searched for cycles and other types of indicators
(mainly, but not only, macroeconomic indicators) that indicate the type of market in which the investor
is inserted. By doing so, he was able to adapt a better strategy to that kind of market, and use more
relevant indicators for the type of stocks he was looking for. GARP investing avoids companies with huge
growth, since those companies have a higher risk associated with them. It also avoids companies that
have a good intrinsic value but do not grow.
2.5.4 Income Investing
Income investors would rather have a fixed income than risk investing in stocks that show
volatility. So, income investors prefer bonds to stocks, and in stock picking they pick stocks that have
high DPS values, so they can have a fixed income from their dividends.
2.6 Classification of Stocks
Several investors classify stocks according to the company’s FA or TA ([55], [56], for example). This
classification allows them to group stocks with similar features, making the study of the behavior of those
stocks easier. If stocks are well classified and inserted in the right cluster, it will be easier to evaluate whether
a company is going to grow by evaluating its behavior within the group. Having this information,
the investor is able to create customized investment strategies for each group of stocks (a better adapted
investment strategy, since behaviors will be similar). Lynch and Rothschild did this in their book [54].
Using FA they evaluated stocks according to their growth rate, capitalization and economic behavior,
and classified companies into six major types, creating specific investment strategies for each of these
types. The main characteristic of each type is explained as follows, according to the book:
• Slow Grower
These companies are usually large and aging companies, that are expected to grow only slightly
faster than the Gross National Product. Normally slow growers start out as fast growers and
eventually stop growing as fast, either because they have grown as much as they can, or because
the industry they are inserted in slows down its growth. Every fast growing industry eventually
slows down and becomes a slow growing industry. Usually slow growers pay a generous and
regular dividend (because these companies usually cannot use that money to expand the business).
• Stalwart
These are usually multi-billion dollar companies that are not exactly agile climbers but grow faster
than slow growers. These companies have around 10 to 12 percent annual growth in
earnings. They can give a sizable profit if bought and sold at the right time, and also offer good
protection during recessions, since they are so big that they will not go bankrupt, and soon enough
after the recession their value will be restored.
• Fast Grower
These are small, aggressive enterprises that grow at a rate of 20 to 25 percent a year. A fast
growing company does not necessarily have to belong to a fast growing industry. All it needs is
the room to expand in a slow growing industry. Usually these upstart enterprises learn to succeed
in one place, and then replicate their winning formula over and over. These kinds of stocks are
usually risky, especially in younger companies that tend to be overzealous and underfinanced, and
underfinanced companies do not fare well during recessions. Also, Wall Street does not
look kindly on fast growers that run out of stamina and turn into slow growers. Once a fast grower
grows too big, it has trouble growing further.
• Cyclicals
These are stocks whose sales and profits rise and fall in a regular if not completely predictable
fashion. In a cyclical industry business expands and contracts, then expands and contracts again.
When coming out of a recession and into a vigorous economy, the cyclicals flourish, and their stock
prices tend to rise much faster than stalwart prices. However, during a recession the cyclicals
suffer, and so do shareholders. Buying a cyclical in the wrong part of the cycle can make one lose
a lot of money, and so, timing is everything when buying a cyclical.
• Turnaround
These are companies that have no growth at all. Sometimes turnarounds are poorly managed
cyclicals that go so far down in a cycle that people think they will never come back up. Neverthe-
less, turnarounds are companies that can make up lost ground quickly, and the best thing about
investing in successful turnarounds is that of all the categories of stocks they are the least related
to the general market. Failed turnarounds are dragged into bankruptcy, making it a very risky type
of stock, that varies between a major success and a major failure.
• Asset Plays
These are companies that own one or more valuable assets that Wall Street has overlooked (and
so has not valued the stock accordingly). These assets can be as simple as a pile of cash or the
subscribers of a cable TV provider, for example, but usually they are real estate assets. These are
companies whose assets may be worth more now (or may become more valuable) than the value given to
the company itself. When the market realizes this value (or when the assets grow in value), the
stock price grows accordingly.
2.7 Data Set / Markets
Most work in computational finance has been done on well-known and stable markets (from
strong European or American economies). This work also focuses on this kind of market
(specifically on the S&P500 index), not only because most work has been done with similar
markets, but also because other kinds of markets have different relationships between indicators and,
although easier to predict (because they are less efficient), are usually more unstable. According
to [57], funds located in the US that invest in emerging markets underperformed funds physically located
in the emerging markets (one of the reasons being that some of these markets have less stable financial
policies, and the market is strongly affected by policy changes). The author also shows that geographically
focused funds outperform the ones that invest globally. These factors led me to choose a well-known and
well-studied market, with easy access to information.
2.8 Related Work Results
In this section, the results and data of some previous works are analyzed and compared
in order to identify the best solutions.
2.8.1 Evolutionary Computing
In [8] a hybrid approach to portfolio composition is used, with both fundamental and technical
indicators. The paper uses a Multi-Objective Evolutionary Algorithm (MOEA) with two objectives
(return and variance of returns), computes the Pareto front and tries to find solutions near it with the
technical indicators.
In [13] an approach to technical rule optimization using a GA is proposed. In this approach each
individual is an asset classifier equation that takes into account the values of the technical indicators
applied to the available price data.
In [49] a change to the Strength Pareto Evolutionary Algorithm 2 (SPEA2) is proposed
(becoming the Robust Strength Pareto Evolutionary Algorithm 2 (R-SPEA2)) in order to make it more
robust. In [50] the robustness of Multi-Objective Genetic Programming (MOGP) is also studied, and it is
concluded that mating restriction is a promising technique that can be used as needed: mating of similar
parents will converge solutions, and mating of dissimilar parents will promote diversity.
In [25] a tree structure representation of the GA is proposed for portfolio optimization, instead of the
more common array approach. In a tree structure GA each terminal node holds an asset and each
non-terminal node holds the weight of the subtree, and each subtree is also considered a portfolio.
In [16] GAs are used to search for the optimal period lengths, adjustment frequencies
and adjustment volumes of moving averages, to predict changes in the price of crude oil for investment
in the crude oil futures market.
In [58] a Multiple-Criteria Decision-Making (MCDM) method is applied to portfolio optimization, dividing
the criteria of return and risk into several measurements.
[59] uses standard Genetic Programming (GP) optimization, with a function set comprising a common
set of arithmetic operators and a terminal set comprising a collection of technical indicators and constants.
The objective is to consistently outperform the B&H strategy, based on the work of [60].
[61] uses GAs to decide trading strategies in two stages: elimination of unacceptable
stocks and stock trading construction.
2.8.2 Other methods
In [62] a correlation matrix between the most significant indicators and future prices is applied,
alongside data discretization using a Cumulative Probability Distribution Approach (CPDA) and a
Minimize the Entropy Principle Approach (MEPA). Afterwards, Rough Set Theory (RST) is used to obtain
linguistic rules and a GA is used to refine them.
In [14] a mix of strategies is used, applying ANN, fuzzy logic and a GA in an approach that uses VAMA
as its indicator. The system consists of three phases: the ANN system with VAMA as baseline,
the definition of fuzzy rules from the ANN outputs, and the refinement of those rules by a GA.
In [32] a Fuzzy Cerebral Model Articulation Controller (FCMAC) approach to the foreign exchange (Forex)
market is proposed. This approach divides data into sets and uses local learning (focusing on useful local
information from observed data).
In [30] a Modular Neural Network is used with a sliding window, the error back propagation method and
supplementary learning as a way of retraining while avoiding overfitting.
In [21] an Adaptive Network Fuzzy Inference System is used, supplemented with Reinforcement
Learning (RL). The RL process uses reward/punishment feedback according to the environment state.
The author uses Momentum and MA indicators to discover cycles in the data, and invests using that
information.
Work | Date | Approach | Financial Data | Application | Benchmark | Validation Period | Returns
[8] | 2015 | MOGA | FI & TI | Portfolio Composition | S&P500 | 2013-2014 | 28.3%
[13] | 2011 | GA | TI | Portfolio Composition | DJI | 2003-2009 | Better than B&H and Random Walk
[49] & [50] | 2008 | R-SPEA2 | Raw Data | Portfolio Composition | FTSE100 | May 2004 - December 2005 | Better results than the SPEA2 approach
[58] | 2004 | EC | Markowitz Model | Fund Management Composition | S&P Database | N/A | GA got better results than 2PLS, SA and TS
[16] | 2015 | GA | TI | Crude Oil Future Market | Crude Oil Market | 1987-2013 | Better than B&H
[62] | 2010 | MEPA & CPDA & RST & GA | TI | Portfolio Composition | TAIEX | 2000-2005 | Better results than B&H, GA or RST alone
[14] | 2009 | ANN & FL & GA | TI | Portfolio Composition | S&P500 | 1997-2002 | Better results than B&H, NN, NF or GA alone
[32] | 2007 | RST & Fuzzy Rules | TI | Forex Exchange | USD vs other main currencies | 2004-2006 | 33.95%
[59] | 2009 | GP | TI | Portfolio Composition | S&P500 | 1990-2002 | Every model outperformed B&H
[61] | 2014 | GA | TI | Portfolio Composition | TAIEX | 118 days (end date 14 June 2014) | Outperforms B&H
[30] | 1990 | ANN | Raw Data | Portfolio Composition | TOPIX | January 1987 - September 1989 | Outperforms B&H
[21] | 2011 | ANFIS | TI | Portfolio Composition | IBM & Walmart & Citigroup & Wyeth & General Motors | 24th August 1994 - 30th August 2006 | 240.32%, better than S&P500 and DJI
3 Architecture
Contents
3.1 Module View of the Global Architecture
3.2 General System Dataflow
3.3 Modules
3.4 Implementation Details
This chapter describes the system architecture. First an overview of
the proposed solution is given, followed by a module-by-module description of the most important modules
and, lastly, an explanation of some other implementation details.
The proposed solution was implemented from scratch in C++. The implementation has
22 C++ classes and about 9000 lines of code.
3.1 Module View of the Global Architecture
The overview of the module architecture of the proposed solution is presented in this section (see
figure 3.1).
There are nine main modules in the system, each one with a specific functionality. Their specific
function is described as follows:
• Download This module is used to fetch information about companies (the stocks' raw data) and
their financial statements (fetched with the algorithm from [63]) and to store that
information for later use by the Stock and Fundamental Analysis modules.
• Fundamental Analysis (FA) This module is used as a data analysis module. It processes in-
formation of the financial statements of the company, computes all the FI needed and does the
growth analysis of the variables used.
• Stock The Stock module holds information about the stock’s raw data (the quotes) and information
about the classification of the stock, given by the Classifier and the Clustering modules in each
quarter.
• Classifier The Classifier module is used by the Stock module. It uses a configuration file to define
thresholds, and uses those thresholds to classify the stock in a given quarter, according to the
information given by the Fundamental Analysis module.
• Clustering This module is also used by the Stock module, and its objective is to assign a cluster
to a stock in a given quarter, using the Clustering Genetic Algorithm submodule, according to the
information given by the Fundamental Analysis module. It uses an unsupervised data mining technique,
inspired by the ACGA algorithm from [44], that uses a GA to optimize cluster positioning.
The number of clusters is fixed (given by the user as input) and the algorithm simply optimizes
the location of each one. The number of clusters used in this work was 5, in order
to match the number of types created by the Classifier module.
• Genetic Algorithm (GA) This module contains two submodules: the Clustering Genetic Algorithm
submodule, used by the Clustering module to optimize cluster positions (with a fixed number
of clusters), and the Fundamental Analysis Genetic Algorithm submodule, used by the Investment
Simulator module to optimize the Fundamental Indicators' weights that give buy and sell signals.
• Investment Simulator This is the module responsible for the kind of portfolio that is created. It is
used by the Investor module, and uses both the Stock and Fundamental Analysis Genetic Algorithm
modules. It creates the Portfolio module and is responsible for giving it the buy and sell signals,
acting as the bridge between the Investor module and the Portfolio module.
• Portfolio The Portfolio module is used by the Investment Simulator module, and uses the Stock
module to retrieve quote information. It saves the state of the portfolio: the stocks
that are currently in the portfolio, when they were bought, at which price they were bought, and
the current return of the portfolio. It is also responsible for getting the necessary stock information
from the Stock module to simulate buying and selling.
• Investor This is the main module of the system: it receives user input, coordinates data flow and
calls every other module as necessary. It is responsible for: using the Download module to fetch
information, creating the Fundamental Analysis and Stock modules (one for each company) using
the information from the Download module, receiving user input and distributing it to the
modules that use it, and giving information to the Investment Simulator module for the portfolio
construction.
Figure 3.1: Modules view of the architecture - The UML schematic shows which modules are used by each module. The full arrows in the GA modules represent inheritance, broken line arrows mean usage without instantiation, and full line arrows mean association
3.2 General System Dataflow
The overview of the general dataflow of the proposed solution is presented in this section (see figure
3.2).
Figure 3.2: Modules Data Flow - User input and the system output is represented with a box, every module has itsname and functionality, and the arrows represent data flow
A step by step description of the data flow of the algorithm is given as follows:
• The system starts by receiving system parameters as inputs. In this work this is done with a
configuration file for general configuration (GA parameters), a second file with parameters for the
Stock Classifier (with the thresholds used) and a third file with the list of stock tickers used in this
work.
• The second thing the system does is create the Download module to fetch stock information.
This includes downloading quotes from the Yahoo site (constructing the URL and using GNU's wget
to retrieve the CSV files with the stocks' information), and also downloading, sorting and rewriting the
financial statements (Balance Sheets, Income Statements and Cash Flow Statements) from the
Securities and Exchange Commission website¹ (using the algorithm from [63]).
Fetching from EDGAR has several steps. To better understand these algorithms see [63]. Since
this part was not developed during this work, it will not be detailed.
• Afterwards, the Investor module creates both the FA and the Stock module (for each company),
with the information fetched, and attributes the FA module to the Stock one, so the FA data from a
company can be accessed by the Stock module of that company.
• The Investor creates the Classifier module and gives it the user input with the thresholds that
each stock uses to get a classification per quarter. The Classifier receives the Stock's FA data,
iterates over it, and gives back the classifications to the stock (containing size, growth, health and
type classifications) for each quarter.
• The Investor then creates the Clustering module, giving it the configuration used by the Clustering
GA (number of clusters and GA parameters). The Clustering module receives the Stock's FA and
creates and uses the Clustering GA submodule to assign a cluster for each quarter.
• With all the information processed, the Investment Simulator module is created and given access
to the Stocks and the GA parameters for the FA GA submodule. It creates the Portfolio modules,
and gives them information about the stocks they have to buy and sell each quarter. There are five
portfolios that use exclusively the classification given to the stock and another five that use exclusively
the cluster of the stock. The number of clusters was fixed at 5,
since there were 5 types of stocks created by the classifier using user input.
Another ten portfolios are created in the same manner as the ones explained above,
with the difference that, after checking each stock to see if it is of the type the Investment Simulator
wants and creating a pool of stocks with a certain type/cluster, the Investment Simulator creates and uses
the FA GA submodule for that pool of stocks. A further portfolio is constructed that uses the FA GA to
optimize the indicators' weights using the whole dataset.
• The Portfolio module receives the buy and sell signals. It first buys the stocks: it accesses the
stocks and retrieves and holds information about their quotes (transaction
costs are applied in this step) and the date on which each stock is bought (which, from the point of view of
the Portfolio, is the current date, transmitted by the Investment Simulator).
¹ https://www.sec.gov/edgar/searchedgar/companysearch.html
• Afterwards the Portfolio module accesses the stocks it holds to retrieve information about the cur-
rent quote of each one, and updates the return of the portfolio. Lastly it sells the stocks indicated by
the Investment Simulator with the sell signals, or in the case of classification/cluster portfolios, the
stocks that are not indicated in the buy signal. In the last quarter of every portfolio, the Investment
Simulator gives information to the Portfolio module to sell all the stocks.
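The selling rule for classification/cluster portfolios described above (sell whatever is held but absent from the buy signal) can be illustrated with two set differences. This is only a minimal sketch: the names `Rebalance` and `rebalance` are hypothetical, and the assumption that stocks already held are kept rather than bought again is not stated in the text.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>

// Hypothetical result type: what a classification/cluster portfolio
// would trade in one quarter.
struct Rebalance {
    std::set<std::string> toBuy;   // in the buy signal but not yet held
    std::set<std::string> toSell;  // held but absent from the buy signal
};

// Compares the tickers currently held with the tickers in the buy signal.
// Assumption: stocks that are already held and still signalled are kept.
Rebalance rebalance(const std::set<std::string>& held,
                    const std::set<std::string>& buySignal) {
    Rebalance r;
    std::set_difference(buySignal.begin(), buySignal.end(),
                        held.begin(), held.end(),
                        std::inserter(r.toBuy, r.toBuy.end()));
    std::set_difference(held.begin(), held.end(),
                        buySignal.begin(), buySignal.end(),
                        std::inserter(r.toSell, r.toSell.end()));
    // Sanity checks: each result is a subset of the set it was taken from.
    assert(r.toBuy.size() <= buySignal.size());
    assert(r.toSell.size() <= held.size());
    return r;
}
```

In the last quarter, passing an empty buy signal makes `toSell` equal to the whole set of held stocks, which matches the rule that every portfolio sells all its stocks at the end.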
3.3 Modules
There are nine main modules in the application. The modules are connected as
shown in figure 3.2. This section explains their functionality (leaving out the Investor module,
whose functionality was already explained in sections 3.1 and 3.2).
3.3.1 Download
The Download module has two main functionalities: one is to download the stocks' raw data
and the other is to download the companies' financial information.
The companies' financial information is fetched with the algorithm from [63], as explained before.
The stocks' raw data is fetched using GNU's wget (for more information check https://www.gnu.org/software/wget/)
to download CSV files from the Yahoo Finance URL API for historical information. It first constructs the
URL to be used, which is of the form:
MAIN?&TICK&SM&SD&SY&EM&ED&EY&P&IG
Where
• MAIN: http://ichart.finance.yahoo.com/table.csv - the main URL of the API
• TICK: s=TICKER - (TICKER is the stock ticker)
• SM: a=STARTMONTH - (STARTMONTH is the start month to be downloaded)
• SD: b=STARTDAY - (STARTDAY is the start day to be downloaded)
• SY: c=STARTYEAR - (STARTYEAR is the start year to be downloaded)
• EM: d=ENDMONTH - (ENDMONTH is the end month to be downloaded)
• ED: e=ENDDAY - (ENDDAY is the end day to be downloaded)
• EY: f=ENDYEAR - (ENDYEAR is the end year to be downloaded)
• P: g=PERIODICITY - (PERIODICITY is the periodicity to be downloaded; it can be daily with "d", weekly
with "w" and monthly with "m")
• IG: ignore=.csv - (the suffix with which the URL ends)
To fetch the information daily (this can be used to update the data used in the application, making it
available to use in real time) the URL used is:
http://finance.yahoo.com/d/quotes.csv?s=TICKS&f=FLAGS
Where
• TICKS: the tickers of the stocks wanted. Several tickers can be joined with the + sign.
• FLAGS: the flags that specify the information required from the API (for more information check
http://www.jarloo.com/yahoo_finance/).
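As an illustration of the two URL forms described above, the following is a minimal sketch of how they could be assembled before being passed to wget. The names `Date`, `buildHistoricalUrl` and `buildDailyUrl` are hypothetical and are not taken from the implementation; the text does not say how months are numbered, so the values are passed through unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper type: a calendar date as used by the URL parameters.
struct Date {
    int month;
    int day;
    int year;
};

// Builds the historical-quotes URL in the form
// MAIN?&TICK&SM&SD&SY&EM&ED&EY&P&IG described in the text.
// periodicity is 'd' (daily), 'w' (weekly) or 'm' (monthly).
std::string buildHistoricalUrl(const std::string& ticker, const Date& start,
                               const Date& end, char periodicity) {
    assert(periodicity == 'd' || periodicity == 'w' || periodicity == 'm');
    const std::string mainUrl = "http://ichart.finance.yahoo.com/table.csv";
    return mainUrl + "?" +
           "&s=" + ticker +
           "&a=" + std::to_string(start.month) +
           "&b=" + std::to_string(start.day) +
           "&c=" + std::to_string(start.year) +
           "&d=" + std::to_string(end.month) +
           "&e=" + std::to_string(end.day) +
           "&f=" + std::to_string(end.year) +
           "&g=" + std::string(1, periodicity) +
           "&ignore=.csv";
}

// Builds the daily-quotes URL; several tickers are joined with '+'.
std::string buildDailyUrl(const std::vector<std::string>& tickers,
                          const std::string& flags) {
    assert(!tickers.empty());
    std::string joined;
    for (std::size_t i = 0; i < tickers.size(); ++i) {
        if (i > 0) joined += "+";
        joined += tickers[i];
    }
    return "http://finance.yahoo.com/d/quotes.csv?s=" + joined + "&f=" + flags;
}
```

The resulting string would then be handed to wget as its URL argument; how the implementation invokes wget is not detailed in the text.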
This will give the necessary information to create the Stock raw data. The information will be down-
loaded in the format described in figure 3.3, and saved in the Stock module.
Figure 3.3: Structure of the stock’s raw data downloaded from the Yahoo API
3.3.2 Stock
The Stock module is where the company's data is saved. This includes the stock's raw data (quotes, adjusted
quotes, the highest and lowest quote of the day, the volume of transactions on that day and the
date of the quotes) and the Fundamental Analysis of the company.
The module also holds information about the classification assigned by the Classifier, and the cluster
given by the Clustering module. This module can be thought of as the database from which the information
to simulate investment is retrieved (see figure 3.4).
3.3.3 Fundamental Analysis
The Fundamental Analysis module is built from the public information about the companies. This
data is obtained from three spreadsheets (Balance Sheet, Income Statement and Cash Flow
Statement), each one with information organized by quarter. For a better understanding of the structure of
the module see figure 3.5.
For further detail on how data is processed see section 3.4.1.
Figure 3.4: Representation of the Stock module
After all the information is retrieved, this module uses this information to create several indicators
and performs evaluations on the growth of the variables obtained by the sheets. This growth analysis is
made annually (compares the same quarter in different years).
Figure 3.5: Representation of the FA module
This is also the module where Fundamental Indicators are computed to later be optimized by the
GA module and used by the Investment Simulator module to give buy and sell signals. Each indicator
is modified if needed for the objective to be a maximization of the indicator. The indicators used are
described as follows:
• Debt
Assuming that no company in the S&P500 index has more total debt than total assets, the DR will
always lie in [0, 1]; however, the objective is to minimize this indicator. To change this minimization
objective into a maximization one (as required so that every indicator's objective is maximization),
instead of calculating the percentage of debt, the indicator calculates the percentage of
the company that is not in debt, computing $\frac{Total\ Assets}{Total\ Assets} - DR$. The Debt indicator used in this work will
then be:
$$Debt_{indicator} = \frac{Total\ Assets}{Total\ Assets} - DR = 1 - \frac{Total\ Debt}{Total\ Assets} \tag{3.1}$$
• PR
The Payout Ratio, as described in Chapter 2, is the percentage of net income distributed to the
investors as dividends. Since financially healthier companies have a bigger PR, the objective will
be to maximize it, so the PR indicator used in this work will be:
$$PR = \frac{DPS}{EPS} \tag{3.2}$$
• ROE
The Return on Equity is described in Chapter 2, and in this work, this indicator will be used as it is,
and the objective will simply be to maximize it. The indicator used is:
$$ROE = \frac{Net\ Income}{Total\ Equity} \tag{3.3}$$
• PM
The objective in this work will be to maximize the indicator as it is explained in Chapter 2:
$$PM = \frac{Net\ Income}{Revenue} \tag{3.4}$$
• RG
Although Revenue Growth (explained in Chapter 2) is already used in the Classification and Clustering
of stocks, it is also used as an indicator to choose the stocks that grow the most inside each group. The
indicator used is:
$$RG = \Delta Revenue = \frac{Revenue_{Actual} - Revenue_{LastYear}}{Revenue_{LastYear}} \tag{3.5}$$
• NIG
NI growth will be used as an indicator, as explained in Chapter 2. The indicator used is:
$$NIG = \Delta NI = \frac{NetIncome_{Actual} - NetIncome_{LastYear}}{NetIncome_{LastYear}} \tag{3.6}$$
• ∆ RG
Sometimes the growth of a company does not affect the growth of the stock quote as much as the
prospect of growth does. Since the market is sensitive to changes in predicted returns, if a company
grows more (or less) than it is expected to, market confidence in that stock may change. This
indicator reflects the growth momentum of revenue, assuming that a bigger momentum will create
more confidence in a company, and so it is represented by:
$$\Delta RG = \Delta\Delta Revenue = \frac{RG_{Actual} - RG_{LastYear}}{RG_{LastYear}} \tag{3.7}$$
Since it compares the momentum of annual growth, this indicator is only available after the
second year of analysis.
• ∆ NIG
This indicator has the same role as ∆RG, and is used to measure changes in
NI growth. It indicates whether the NI growth of a company is slowing down or not. The
indicator used is:
$$\Delta NIG = \Delta\Delta NI = \frac{NIG_{Actual} - NIG_{LastYear}}{NIG_{LastYear}} \tag{3.8}$$
• CFOA
This indicator (as explained in Chapter 2) will be used to choose companies that create cash flow
income from operating activities. The indicator used is:
$$\Delta CFOA = \frac{CFOA_{Actual} - CFOA_{LastYear}}{CFOA_{LastYear}} \tag{3.9}$$
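Most of the indicators above share the same year-over-year form, so they can be computed with a single helper. The following is a minimal sketch with hypothetical function names; it implements the formulas literally, and the handling of zero or negative denominators, which the text does not describe, is left out.

```cpp
#include <cassert>
#include <cmath>

// Relative year-over-year change: (actual - lastYear) / lastYear.
// This is the common form of RG, NIG and the CFOA growth (equations 3.5,
// 3.6 and 3.9) and, applied to growth rates themselves, of the momentum
// indicators (equations 3.7 and 3.8).
double relativeChange(double actual, double lastYear) {
    assert(lastYear != 0.0);  // assumption: such cases are filtered out earlier
    return (actual - lastYear) / lastYear;
}

// Debt indicator (equation 3.1): share of the company that is not in debt.
double debtIndicator(double totalDebt, double totalAssets) {
    assert(totalAssets > 0.0);
    return 1.0 - totalDebt / totalAssets;
}

// Revenue growth momentum (equation 3.7), computed from the revenue of the
// same quarter in three consecutive years.
double revenueGrowthMomentum(double revenueNow, double revenueLastYear,
                             double revenueTwoYearsAgo) {
    const double rgActual = relativeChange(revenueNow, revenueLastYear);
    const double rgLastYear = relativeChange(revenueLastYear, revenueTwoYearsAgo);
    return relativeChange(rgActual, rgLastYear);
}
```

For example, revenues of 100, 110 and 132 in consecutive years give growth rates of 10% and 20%, so the momentum is (0.2 − 0.1)/0.1 = 1. Three years of data are needed, which is why the momentum indicators only become available after the second year of analysis.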
3.3.4 Classifier
This module classifies the stocks for each quarter, based on the approach used by Peter
Lynch in [54], explained in section 2.6.
Although the book is old, and the economy changes at a fast pace, its underlying theories still
make sense (and can be applied) in the present, with adaptations. The book was written at the end of the
1980s, when the economic context was completely different, and because of this the reference parameters
used by the author are no longer suitable. The update of these parameters, made both for economic context
and for ease of implementation, is explained in the following paragraphs.
Although the author identifies six types of stocks, only five types are used in this work (not all of them
directly taken from the book), mainly because they are the ones that can be identified by directly
analyzing companies' accounts. Cyclicals and Asset Plays are the two types of stocks from the book
not used in this work. Cyclicals need a more careful analysis, looking for patterns in the quotes and
in the accounts, in order to determine the cycle parameters and the part of the cycle a company is in.
Asset Plays are mainly based on the evaluation of the company's assets, which requires careful asset
examination that cannot be done from spreadsheets.
There is a new type of stock used in this work, which was introduced to cover all the Size × Growth
space. This type represents the small stocks with normal and good growths (see the following para-
graphs to understand these classifications, and see figure 3.6 to better understand the structure) and
are given the name Potential Stocks.
Figure 3.6: Types' Quadrants - Growth axis (Very Bad, Bad, Normal, Good, Very Good) against Size axis (Small, Medium, Big), divided into the Fast Grower, Stalwart, Slow Grower, Potential and Turn Around types
The Classifier is the module that determines to which of the five types a stock belongs, and
also classifies the company's financial health, all evaluated through the company's FA. The
classification is given quarterly, so a company may be of one type in a quarter and change its classification
in the next one. The necessary information for the classification made by this module is given by user
input; however, one could implement a fuzzy system instead of a human input classifier, in order to make
the system more sophisticated, and possibly to get better returns.
The type classification takes into account only the size and the growth of the company (see figure
3.6), and not the financial health, which was implemented for possible human analysis.
• Size
The size of a company is classified by giving classifications to assets (using thresholds given as user
input), and then averaging the classifications. Each asset is classified by
averaging its value over the last year (for further detail on how data is processed see section
3.4.1) and comparing that value with two thresholds. These thresholds indicate whether the asset is
small, medium or big. After all the assets are classified, the company's size is classified by
averaging all those classifications and rounding the result (e.g. if 3 assets are taken into account, 2 of
them classified as big and the third one as medium, the company is classified as big).
The classification using thresholds is given as follows:
– Classification = 1 → Value < TH_Low
– Classification = 2 → TH_Low ≤ Value < TH_High
– Classification = 3 → TH_High ≤ Value
Afterwards the Classification is averaged:
$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.10}$$
In equation 3.10, N is the total number of assets to be classified, and Classification_i is the
classification given to asset i. Lastly, the size is classified accordingly:
– Small → Total Classification ∈ ]0, 1.5[
– Medium → Total Classification ∈ [1.5, 2.5[
– Big → Total Classification ∈ [2.5, 3]
In this work the only size indicator used is the last year's Total Assets average, and the thresholds
are:
– TH_Low = 5B²
– TH_High = 10B
• Growth
The Growth of the company is classified in a similar way (by classifying the growth of user input
variables and then averaging the classifications), with some differences. The final growth
classification has five possibilities (very bad, bad, normal, good, very good),
unlike the size classification, which has only three. Each variable is given one out of 10 possible
classifications and, as in the size classification, the growth classification is the average of
the classifications of all the variables given as user input.
² 5B and 10B represent 5 and 10 billion dollars, respectively
The growth is measured yearly (between the same quarter of different years). For further detail on
how this is done see section 3.4.1.
The procedure is similar to the one used to classify the size:
– Classification = 1→ Indicator < TH1
– Classification = 2→ TH1 ≤ Indicator < TH2
– Classification = 3→ TH2 ≤ Indicator < TH3
– Classification = 4→ TH3 ≤ Indicator < TH4
– Classification = 5→ TH4 ≤ Indicator < TH5
– Classification = 6→ TH5 ≤ Indicator < TH6
– Classification = 7→ TH6 ≤ Indicator < TH7
– Classification = 8→ TH7 ≤ Indicator < TH8
– Classification = 9→ TH8 ≤ Indicator < TH9
– Classification = 10→ TH9 ≤ Indicator
Afterwards the Classification is averaged:
$$TotalClassification = \frac{\sum_{i=1}^{N} Classification_i}{N} \tag{3.11}$$
In equation 3.11, N is the total number of variables to be classified, and Classification_i is the
classification given to variable i. Lastly, the growth is classified accordingly:
– Very Bad→ Total Classification ∈]0, 2]
– Bad→ Total Classification ∈]2, 4]
– Normal→ Total Classification ∈]4, 6]
– Good→ Total Classification ∈]6, 8]
– Very Good→ Total Classification ∈]8, 10]
In this work the only growth indicator used is the revenue yearly growth, and the thresholds are:
– TH1 = −0.2
– TH2 = −0.1
– TH3 = −0.05
– TH4 = −0.02
– TH5 = 0
– TH6 = 0.02
– TH7 = 0.05
– TH8 = 0.1
– TH9 = 0.2
• Health
The Health evaluation of a company takes into account the amount of debt the company has. It is
classified in a similar way to the size (by taking an annual average of the parameters and comparing
them to two thresholds). The financial health of a company does not affect its type (there
may be several companies of the same type with different financial health); it is simply indicative,
for human analysis.
In this work the only financial Health indicator used is the last year DR indicator average, and the
thresholds are:
– THLow = 0.3
– THHigh = 0.7
Figure 3.7: Classifier Structure - (a) Health, (b) Size, (c) Growth
• Type
The type of the stock is, as said before, obtained from the evaluation of the size and growth of a
company, based on the approach presented by Peter Lynch in [54]. After a stock has its growth
and size evaluated, the type is determined by combinations between them.
The 5 classifications given in this work are (see figure 3.6):
– Slow Grower - These are the companies that are considered Medium in terms of size and
had a Normal or Good growth classification
– Stalwart - These are the companies that are considered Big in terms of size and had a Normal,
Good or Very Good growth classification
– Fast Grower - These are the companies that are considered Small or Medium in terms of size
and had a Very Good growth classification
– Potential - These are the companies that are considered Small in terms of size and had a
Normal or Good growth classification
– Turn Around - These are all the companies that had a Bad or Very Bad growth classification,
independently of their size.
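The combination rules listed above can be written as a small lookup. This is a sketch under the assumption that size takes the labels Small, Medium and Big and growth takes the five labels of the growth classifier; the function name `stock_type` is hypothetical.

```python
SIZES = ("Small", "Medium", "Big")
GROWTHS = ("Very Bad", "Bad", "Normal", "Good", "Very Good")

def stock_type(size, growth):
    """Combine size and growth labels into one of the five stock types."""
    if size not in SIZES or growth not in GROWTHS:
        raise ValueError("unknown size or growth label")
    # Bad or Very Bad growth is a Turn Around regardless of size.
    if growth in ("Bad", "Very Bad"):
        return "Turn Around"
    # Very Good growth: Fast Grower for Small/Medium, Stalwart for Big.
    if growth == "Very Good":
        return "Stalwart" if size == "Big" else "Fast Grower"
    # Remaining case: Normal or Good growth, decided by size alone.
    return {"Small": "Potential", "Medium": "Slow Grower", "Big": "Stalwart"}[size]
```

Every (size, growth) pair falls into exactly one type, which is a quick way to check that the five rules leave no combination unclassified.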
3.3.5 Clustering
The clustering in this work is not treated as a typical clustering problem; instead it is used for
classification, given a fixed number of clusters. This module creates five clusters each quarter, and
associates each stock with the nearest cluster in the Growth × Size space. It uses the GA module to
optimize the clusters' positions, given at least one year of training. The number of clusters (5) was
chosen to match the number of types created by the user input classifier (also five), to check whether
there was any resemblance between the two classification methods. After the GA module outputs
the clusters' locations in the search space, the clustering module associates each company with a cluster
by minimizing the Euclidean distance between company and clusters (it chooses the cluster with minimum
distance). Since the two axes have very different units (size comes in billions, and growth comes
as a fraction representing the percentage), values are scaled to help the algorithm converge. This is
done such that 1B$^3 in assets represents a distance equivalent to 1% in growth, both representing a unit
distance from the origin. These scaled units are given in equation 3.12.
$$\text{Size} = \frac{\text{Total Assets}\,[\text{Million \$}]}{1000} \tag{3.12a}$$
$$\text{Growth} = \Delta\text{Revenue} \times 100 \tag{3.12b}$$
^3 B stands for the American billion, 1000 million in European units.
The growth measure is the annual revenue growth (for more details on how data is processed check
section 3.4.1), and the size measure is the average of the Total Assets over the last year.
After trying different scales, this one was the most successful in clustering the same type of
stocks, since the companies' data is so sparse (the difference in size between companies is very large compared
to the difference in growth). The most intuitive normalization would be to normalize both values over the
maximum value of that quarter; however, this would make the density of points near the origin
too high for the algorithm to converge properly.
There will be 5 clusters (the same amount as the types in the Classifier module), enumerated from A
to E, and they will move each quarter (the GA will recompute the best locations for clusters each quarter
passed). So, to maintain consistency, the first cluster (cluster A) will always be the one nearest to the
origin of the plane (Origin = Coordinates (0, 0)), B the second nearest, and so on. This way we can
check in a coherent way if a stock changed its cluster.
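The scaling of equation 3.12, the nearest-cluster assignment and the A to E naming by distance to the origin can be sketched as below. The function names are hypothetical and the centroids are assumed to be already expressed in the scaled Size × Growth units.

```python
import math

def scale(total_assets_musd, revenue_growth):
    """Equation 3.12: size in billions of dollars, growth in percent."""
    return (total_assets_musd / 1000.0, revenue_growth * 100.0)

def label_clusters(centroids):
    """Name clusters A, B, C, ... by increasing distance from the origin (0, 0)."""
    ordered = sorted(centroids, key=lambda c: math.hypot(c[0], c[1]))
    return {chr(ord("A") + i): c for i, c in enumerate(ordered)}

def nearest_cluster(point, labelled):
    """Return the label of the centroid at minimum Euclidean distance."""
    return min(labelled, key=lambda name: math.dist(point, labelled[name]))
```

Re-labelling after every quarterly optimization keeps the letters comparable over time, so a stock that moves from A to B has actually moved away from the origin.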
3.3.6 GA
This module has two functionalities (divided into two submodules). One defines the weights given
to fundamental indicators, used with the Investment Simulator module in order to give buy/sell signals.
The other is used with the Clustering module to optimize the location of the clusters in the plane. The way
each works is described as follows:
• Fundamental Analysis Genetic Algorithm (FA GA)
This is where all the training phases used for different portfolios occur. Since the Portfolios use
Fundamental Indicators, and FI requires a more long term analysis than Technical Indicators, each
time unit is considered a quarter. A generation is an iteration over the last 4 quarters of the GA.
See figure 3.8 to see a structure of a chromosome.
Figure 3.8: Fundamental Indicators’ Chromosome Representation
A pseudocode of the GA used is described:
Algorithm 3.1: Used GA
    g ← 0 ;
    t ← current Quarter − 4 ;
    P(g) ← random ;
    while g != number of generations do
        while t != current Quarter do
            Investment Simulation from P(t) ;
            fitness ← Returns from P(t) Simulation ;
        Pp(g) ← Selection of parents from P(t) ;
        Pc(g) ← Crossover from Pp(t) ;
        Pm(g) ← Mutation of Pc(t) ;
        P(g + 1) ← New Generation Creation from (P(t), Pm(t)) ;
        g ← g + 1 ;
In algorithm 3.1, t indexes trimesters (quarters) and g indexes generations.
At the beginning of the algorithm the population is generated randomly. Each FI weight is initialized
as in equation 3.13a, the buy signal value is initiated as in equation 3.13b, and the sell signal value
initiated as in equation 3.13c.
$$r \in [0, 1] \tag{3.13a}$$
$$b = \sum_{i=1}^{N} r_i a_i \tag{3.13b}$$
$$s = \frac{b}{2} \tag{3.13c}$$
In equations 3.13 N is the number of indicators used, r and a are different random numbers, b
is the buy signal value and s is the sell signal value. Even though b can take values in [0, N ],
finding a random number in this interval will not simulate randomness of indicators and weights.
To construct a truly random b one has to create N random numbers r to simulate weights, N
different random numbers a to simulate indicators’ values, and apply the equation 3.13b. The s
value is calculated as in equation 3.13c to guarantee that it is smaller than the b value, and that it
has a substantial percentage difference from the b value.
A buy signal is given if the sum of the weights times the value of the fundamental indicators is
above b, and a sell signal is given if this sum is below s, as described in equations 3.14. Signal is
the type of signal given to the Portfolio module to simulate buy or sell, vi is the value of indicator
index i and wi is the weight of indicator index i. The average of the top 5 individuals of the algorithm
is used at the end to define the values used by the Investment Simulator.
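$$\text{Signal} = \begin{cases} \text{BUY} \rightarrow \sum_{i=1}^{N} v_i \times w_i > b \\ \text{SELL} \rightarrow \sum_{i=1}^{N} v_i \times w_i < s \end{cases} \tag{3.14a}$$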
The fitness function will be the ROI of each individual, when simulating investment.
$$\text{Fitness} = \text{ROI} = \frac{\text{Return} - \text{Initial Investment}}{\text{Initial Investment}} \tag{3.15}$$
The iterations start by simulating investment with the Fundamental Indicators’ weights and the
buy/sell values of the chromosomes. The returns from the simulations with each individual will be
the fitness of that individual.
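The initialization of equations 3.13 and the signal rule of equation 3.14 can be sketched as follows. The dictionary layout, the function names and the "HOLD" result returned when the weighted sum lies between s and b are assumptions made for illustration; the text only defines the BUY and SELL conditions.

```python
import random

def init_chromosome(n_indicators, rng=random):
    """Random weights r_i, buy threshold b = sum(r_i * a_i), sell threshold s = b / 2."""
    weights = [rng.random() for _ in range(n_indicators)]    # r_i, equation 3.13a
    simulated = [rng.random() for _ in range(n_indicators)]  # a_i, simulated indicator values
    buy = sum(r * a for r, a in zip(weights, simulated))      # equation 3.13b
    sell = buy / 2.0                                          # equation 3.13c
    return {"weights": weights, "buy": buy, "sell": sell}

def signal(chromosome, indicator_values):
    """Equation 3.14: compare the weighted indicator sum against b and s."""
    score = sum(v * w for v, w in zip(indicator_values, chromosome["weights"]))
    if score > chromosome["buy"]:
        return "BUY"
    if score < chromosome["sell"]:
        return "SELL"
    return "HOLD"  # assumption: no signal is emitted between s and b
```

Drawing b from simulated weights and indicator values, rather than uniformly from [0, N], gives the threshold the same distribution as the weighted sums it will later be compared against.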
• Clustering GA
This is where the training and validation of the cluster position optimization algorithm occur,
inspired by the Automatic Clustering Genetic Algorithm from [44]. The GA module receives as
input the size of the chromosome and the number of desired clusters (a fixed value) and runs the
GA to find the best locations for the cluster centroids; the output is the centroid locations in the
Size×Growth plane. It is relevant to note that
It will use the stocks’ FA to calculate their positioning in the plane, and use the Calinski-Harabasz
index as fitness function. See figure 3.9 to see the structure of a chromosome.
The GA, apart from the usual GA parameters (such as population size, number of generations,
etc.), uses two user inputs:
– Number of possible cluster positions
– Number of solutions (or clusters) created
The chromosomes are constructed with cluster points and have an auxiliary structure called acti-
vation values, in equal numbers, since the activation values will determine if a certain cluster point
is going to be used or not. The number of cluster points and activation values is the possible
number of cluster positions given by the user. Cluster points and activation values are such that:
– Cluster Point is a tuple (size, growth), where size ∈ R+ and growth ∈ R
– Activation value is a number n ∈ [0, 1]
Figure 3.9: Clustering Chromosome Representation
A stock position in the plane Size × Growth is normalized before computing centroids and dis-
tances, and described in equation 3.16 (see section 3.3.5 for an explanation on why this normal-
ization was used).
$$\text{Size} = \frac{\text{Total Assets}_{\text{Last Year Average}}}{1000} \tag{3.16a}$$
$$\text{Growth} = \Delta\text{Revenue}_{\text{Last Year Average}} \times 100 \tag{3.16b}$$
At first, the maximum size of all stocks in the first year (the minimum training period is 1 year) is
obtained, and used as a reference (maxsize). The maximum growth is also computed and used
as reference (maxgrowth). Afterwards chromosomes are randomly initiated, by assigning random
values to Size and Growth, in order to create the N different possible points. This is done as in
equation 3.17.
$$\text{Size}_{random} = \frac{max_{size}}{2} \times r_1 \tag{3.17a}$$
$$\text{Growth}_{random} = \frac{max_{growth}}{2} \times r_2 \tag{3.17b}$$
In equation 3.17 r1 ∈ [0, 1] and r2 ∈ [0, 1] are distinct random numbers and maxsize, maxgrowth
are the maximum size and growth measured in the first year.
Activation values, that are an auxiliary structure whose only purpose is to evaluate which clusters
have more members (the GA does not apply to the activation values) are computed after this
initialization, measuring the percentage of stocks that belong to each cluster.
First, stocks are assigned to clusters. To measure the distance between stock positions and cluster
positions, the distance function used is the Euclidean distance, or L2 norm, as in equation 3.18.
$$\|A - B\| = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2} \tag{3.18}$$
That applied to this specific problem comes in the form of equation 3.19.
$$\text{Dist} = \|\text{Stock} - \text{Cluster}\| = \sqrt{(\text{Stock}_{size} - \text{Cluster}_{size})^2 + (\text{Stock}_{growth} - \text{Cluster}_{growth})^2} \tag{3.19}$$
The activation values are obtained as in equation 3.20.
$$\text{Activation}_i = \frac{\text{Number of Stocks} \in C_i}{\text{Total Number of Stocks}} \tag{3.20}$$
In equation 3.20 i is the index of the solution, and Ci is the cluster of index i.
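Equations 3.19 and 3.20 can be combined in a short sketch: each stock is assigned to its nearest candidate cluster point, and the activation value of a point is the fraction of stocks assigned to it. The function name and the tuple representation of points are hypothetical.

```python
import math
from collections import Counter

def activation_values(stocks, cluster_points):
    """Fraction of stocks whose nearest cluster point is each candidate (equation 3.20)."""
    if not stocks:
        raise ValueError("at least one stock is required")
    nearest = Counter(
        # Index of the cluster point at minimum Euclidean distance (equation 3.19).
        min(range(len(cluster_points)), key=lambda i: math.dist(s, cluster_points[i]))
        for s in stocks
    )
    return [nearest[i] / len(stocks) for i in range(len(cluster_points))]
```

By construction the activation values of one chromosome sum to 1, and a candidate point that attracts no stock gets an activation of 0.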
Up to the number of solutions decided by the user, the clusters with the largest activation values are
chosen as the solutions of that chromosome. To measure the fitness of the chromosome after this
process, one first has to find the centroid of the whole data set. Afterwards stocks are assigned to
the solution clusters (each to the nearest cluster) and then the Calinski-Harabasz metric (sometimes
called the variance ratio criterion) is applied (as in equation 3.21). The larger the result, the
fitter the individual is. This metric takes into account intra-cluster similarity and inter-cluster
dissimilarity, yielding higher values when clusters have high intra-cluster similarity and high inter-cluster
dissimilarity.
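$$CH = \frac{SS_B}{SS_W} \times \frac{N - K}{K - 1} \tag{3.21a}$$
$$SS_B = \sum_{j=1}^{K} n_j \|C_j - C\|^2 \tag{3.21b}$$
$$SS_W = \sum_{j=1}^{K} \sum_{i \in I_j} \|X_{ij} - C_j\|^2 \tag{3.21c}$$
$$C = \left( \frac{\sum \text{Stocks}_{size}}{N}, \frac{\sum \text{Stocks}_{growth}}{N} \right) \tag{3.21d}$$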
In equations 3.21, SS_B is the between-cluster variance, SS_W is the within-cluster variance, N is
the total number of stocks, K is the number of clusters (the number of solution clusters), n_j is the
number of data points belonging to cluster index j, C is the centroid of the dataset, C_j is the position
of cluster index j, I_j is the set of data points belonging to cluster j and X_ij is data point index i
belonging to cluster index j.
Although the Calinski-Harabasz index is usually used as a metric to choose the number of clusters
created by a clustering algorithm, in this work the index is simply used as a fitness function to optimize
the position of the clusters.
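The fitness of equation 3.21 can be sketched as below. The sketch follows the equations literally, so C_j is the cluster position proposed by the GA rather than the mean of the assigned points; the function name and the tuple representation are hypothetical.

```python
import math

def calinski_harabasz(points, centroids):
    """Calinski-Harabasz index (equation 3.21) for points assigned to their nearest centroid."""
    n, k = len(points), len(centroids)
    if k < 2 or n <= k:
        raise ValueError("need 2 <= K < N")
    # Centroid C of the whole data set (equation 3.21d).
    c = (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)
    # Assign every point to the nearest cluster position.
    members = [[] for _ in centroids]
    for p in points:
        j = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        members[j].append(p)
    ssb = sum(len(m) * math.dist(cj, c) ** 2 for m, cj in zip(members, centroids))
    ssw = sum(math.dist(p, cj) ** 2 for m, cj in zip(members, centroids) for p in m)
    if ssw == 0:
        raise ValueError("within-cluster variance is zero")
    return (ssb / ssw) * (n - k) / (k - 1)
```

With four points at (0, 0), (0, 2), (10, 0), (10, 2) and cluster positions (0, 1) and (10, 1), SS_B = 100 and SS_W = 4, giving CH = 25 × 2 = 50; moving the positions to (5, 0) and (5, 2) swaps the two sums and the index drops to 0.08.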
• Operators
Two types of selection are implemented: a roulette wheel and a ranking selection (as explained
in 2.2.1). Both use the fitness values shifted by the fitness of the worst individual, to
avoid negative fitnesses.
The type of crossover done is a single point crossover, that takes a random integer value between
1 and Numberweights − 1 (in a vector starting at 0) and exchanges the values of the weights in the
indexes from that random point on (for example, if there are 5 weights, and the random value is 3,
the weights 3, 4 and 5 are exchanged). The same crossover is applied in both the clustering and
the FA GA.
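A single point crossover of this kind can be sketched as follows, using 0-based list indices. The function name is hypothetical and the chromosome is assumed to be a flat list (of weights or of cluster points).

```python
import random

def single_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length parents from a random cut point onwards."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents must have the same length, at least 2")
    # Cut point drawn between 1 and length - 1, both ends included.
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b, point
```

Because the cut point is never 0 and never the full length, each child always inherits at least one value from each parent.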
The mutation implemented (also inspired by [44]) uses a ψ value to determine the maximum value
of the perturbation. This ψ is a percentage δ of the value that will be mutated. After finding this
value, the mutation value α is computed as a random number in [0, ψ].
The mutation can be written mathematically as:
ψ = δ × v (3.22)
α = random ∈ [0, ψ] (3.23)
value = value± α (3.24)
Here value (written v in equation 3.22) is the value receiving the mutation, α is the perturbation applied
to the value, ψ is the maximum value of the perturbation and δ is the percentage of perturbation chosen.
To choose whether the mutation is a sum or a subtraction, a "coin flip" is used: a random number r ∈ [0, 1]
is generated, and if the number is below 0.5 the mutation is a subtraction, otherwise it is a
sum.
In this work, values of δ = 0.5 for indicator weights and δ = 0.2 for cluster positions are used. This
way, a mutation can perturb a weight by at most 50% of its value, and a cluster position by at most 20%
(since 5 clusters are used, 100% of the Size × Growth plane / 5 = 20%).
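Equations 3.22 to 3.24 can be sketched as below. The function name is hypothetical, and taking the absolute value of v when computing ψ is an assumption added so that negative growth coordinates are perturbed by a magnitude in the same way as positive ones.

```python
import random

def mutate(value, delta, rng=random):
    """Perturb value by a random amount of at most delta * |value| (equations 3.22 to 3.24)."""
    psi = delta * abs(value)          # maximum perturbation, equation 3.22
    alpha = rng.uniform(0.0, psi)     # actual perturbation, equation 3.23
    # Coin flip: below 0.5 subtract, otherwise add (equation 3.24).
    if rng.random() < 0.5:
        return value - alpha
    return value + alpha
```

With δ = 0.2, a cluster coordinate of 10 can only move within [8, 12] in a single mutation.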
3.3.7 Investment Simulator
The Investment Simulator is the coordinator between the Investor, the Genetic Algorithm and the
Portfolio modules. This is the module that determines which model will run. There are 3 types of
models implemented:
• Whole data set models
• Classification based models
• Cluster based models
The Investment Simulator is the module responsible for coordinating the resources needed for the
modules to run (GA usage, access to pools of stocks, and parameter definition). It is also
responsible for giving the buy signals to the portfolios, and for defining the weights given to
each indicator (computed as an average of the top 5 individuals of the GA population). Figure 3.10
shows how the Investment Simulator and the Portfolio module interact.
3.3.8 Portfolio
The portfolio is the output of the system. It receives as input access to the Stock module, the buy
signals of the stocks, and the current trimester. It uses this information to save information about the state
of the investment for a certain strategy. It contains the current date (the date that is being evaluated),
the companies with open positions in the portfolio, the date and quote at which each of them was bought
and the value of the quotes at present time. It is the module responsible for simulating the buying and
selling of stocks (looking for the quotes at a given date, saving that information).
Figure 3.10: Investment Simulator, Portfolio and Stock Interaction
The portfolios using classifications or clusters use only the stocks of a certain type/cluster. Since
the type/cluster of the stocks may change each quarter, the pool of stocks available for transactions is
dynamic (changes each quarter). There are 2 types of Portfolios that use the whole dataset. One is the
normal Buy&Hold, used for comparison with other portfolios, and the other is the FA GA portfolio using
all of the Data Set.
• Buy & Hold
There is only one B&H portfolio in this work, using the whole data set, used as a baseline for the other
portfolios (alongside the S&P500 index). Using only the S&P500 index as a control may not
be enough, since every portfolio uses a dataset that is not that of the index (but a subset of
the index), and so a B&H of the dataset used is useful for comparison. In this B&H the whole
dataset is bought at the beginning of the time duration of the portfolios and kept until the end of
their duration.
• Classification and Clustering portfolios
There are 10 portfolios using a classification technique in this work. Five are portfolios using
user input classification, and five use the clustering algorithm.
Since the pool of stocks is dynamic (each stock may change type/cluster each quarter), the portfolios
are built in such a way that if a stock is of the required type/cluster in a quarter, it is added to
the portfolio in that quarter (it is bought), and when it stops being of that type/cluster it is taken out of
the portfolio (it is sold). The evolution is tracked after buying, but before selling. This way, if a stock
enters the portfolio in a quarter it will not change the growth of the portfolio (since the evolution is
checked after the buy, and for the same day the stock will have the same quote, the difference will
be only the transaction cost). However, if that same stock changes its type in the next quarter and
has to leave the portfolio, the evolution of that quarter will be accounted for (since evolution is tracked
before selling the stock).
This makes it possible to monitor how the stocks of each type or cluster evolve, and to conclude
which type or cluster has superior results.
• FA GA portfolios
These are the Portfolios that use the GA module to train and use Fundamental Indicators to give
the buy and sell signals over a pool of stocks. This pool of stocks may be the whole data set, or
stocks from one of the 5 classification/clusters portfolios. There are 11 portfolios like this, one with
the whole data set, 5 using the stocks classification and 5 using stocks clusters.
The Portfolio starts after the training period of the GA, and the buying signals are given by FI
weights made from the average of the weights of the top 5 (5% in this work) chromosomes of
the training. At each iteration, after the portfolio computed its solution for the quarter, the GA
module retrains, with a sliding window, so the information used in the next quarter is updated with
information about the current.
When the portfolio uses classifications or clusters, the pool of stocks is dynamic (changes each
quarter). Although these portfolios buy only stocks of a certain type/cluster, they keep them in
the portfolio until a sell signal is given. This way the GA is able to optimize parameters for a
single type/cluster of stocks but will not sell them prematurely.
3.4 Implementation Details
3.4.1 Data
Two types of data are obtained from information available on the Internet: the quotes
of a certain company for a given period, and the companies' financials (Balance
Sheets, Income Statements and Cash Flow Statements). The stock quotes come in tuples as in figure
3.3. The quotes used in this work are closing quotes.
Although the sheets used come with a lot of data (Balance Sheet with 19 variables of data, Income
Statement with 22 variables and Cash Flow Statement with 6 variables), the only information used from
each sheet is the following:
• Balance Sheet:
– Total Equity
– Total Assets
– Total Liabilities
• Income Statement:
– Revenue
– Net Income
– Dividends per Share
– Diluted Normalized EPS
• Cash Flow Statement:
– Cash from Operating Activities
When this data is being processed, there are several things to take into account. The first one is
data integrity. Some companies had missing rows or columns in their information. Companies missing
crucial information for the system were taken out of the dataset.
3.4.2 Data Processing
The evaluation of assets and growth in the clustering and classification methods averages the variables
being evaluated, to smooth abrupt changes in these variables, and to take into account more than
one quarter of information. For example, revenue growth is measured between trimesters of different
years: if the growth is measured between every quarter of 2012 and the same quarter of 2011, the value used
as the growth measure is the average of the growth of all the quarters of 2012. This average does not
take zero values into account, since these correspond to missing information (for example, in 2010 there
is no way of measuring growth, since there is no information about 2009). This is also valid for size
evaluations.
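The averaging rule described above can be sketched in a few lines. The function name is hypothetical, and returning 0 when every quarter is missing is an assumption that keeps the "zero means missing" convention of the text.

```python
def average_ignoring_missing(values):
    """Average the quarterly values, skipping zeros, which mark missing information."""
    known = [v for v in values if v != 0]
    if not known:
        return 0.0  # assumption: nothing known, so the result is itself marked as missing
    return sum(known) / len(known)
```

For instance, quarterly growths of 0.10, 0 (missing) and 0.30 average to 0.20 rather than 0.133.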
3.4.3 Configuration
The parameters given in the configuration file include the following:
• GA Parameters
The parameters used by the GA are:
– Population size - number of individuals (chromosomes) used in the GA.
– Number of generations - number of iterations over the same period
– Mutation rate - percentage of total weights or cluster points that receive a mutation
– Elitism rate - percentage of the population with best fitnesses that is copied to the next gen-
eration
– Immigration rate - percentage of the population with worst fitnesses that is replaced by random
immigrants
– Training period - number of quarters used only for training
– Validation period - number of quarters used for validation of the models
– Chromosome size - number of parameters (weights or cluster points) to be optimized by each
individual
The GA parameters are the ones that describe how the GA functions. They may be tuned
freely, and bring slight changes to the output. The balance sought is between a system
that has a meaningful training and one that adapts well to changes (that avoids overfitting).
The used parameters are:
– Population size - 100. This value was chosen since related works chose similar sizes. No
consideration was given to the problem size or features.
– Number of generations - 50 for FA GA and 200 for clustering location optimization. 50 generations
for the FA GA were chosen so the models could be computed in a viable time period
(values of 35 and 100 generations were also tested, and 50 was the largest value that could
be computed in viable time for the different tests made). 200 generations were chosen for the
clustering location optimization: since this algorithm had to run fewer times, the execution time
could be longer (values of 100 and 150 generations were also tested). Also, the clustering
position optimization algorithm did not change much in the last generations.
– Mutation rate - 5%. This value was chosen having as reference the related works. A value of
3% was also tested, however 5% seemed more appropriate after selecting an elitism rate of
40%.
– Elitism rate - 40%. This value was chosen to guarantee that the global fitness of the population
would not go down.
– Random immigration rate - 20%. This value was chosen to give flexibility to the system and
help the system adapt to changes in the optimization problem.
– Training period - the minimum training period for the clustering GA is of 1 year, and the
minimum training period for the FA GA using the classification or clustering portfolios is 2
years (1 to obtain the classification and 1 to train the FA GA)
– Validation period - the maximum validation period is from the end of the training period until
the last quarter with information available (5 years for the clustering GA and 4 years for the
FA GA)
– Chromosome size - 11 for the FA GA (equals the number of indicators) and 75 for the clus-
tering position optimization (number of possible cluster points). The size of the clustering
chromosome was chosen to be 75 because of the execution time associated with it. Sizes
of 50 and 100 were also tested, but 75 was the biggest size whose execution time would be
viable.
• Transaction Costs
Although none of the portfolios solves the portfolio composition problem, and there is no allocation
of budget, transaction costs are taken into account in every trade, when the stock is bought and
when it is sold. These transactions have a cost of 0,3% (so when it is bought every stock costs
0,3% more, and when it is sold it is worth 0,3% less).
4 System Validation
Contents
4.1 Validation Metrics
4.2 Case Studies
In this section the performance of the implemented system is tested.
First the metrics used to evaluate the models are described, followed by the case studies on the proposed
solution. The quarterly returns presented in the tables are measured as the returns of positions from
the first day of the quarter measured until the first day of the next quarter. This means that the first quarter's
returns are measured from the first day of quarter 1 until the first day of quarter 2, and quarter 4 returns
are measured from the first day of quarter 4 until the first day of quarter 1 of the next year.
Every solution is compared with the S&P500 index and the B&H of the dataset in the specific time
period. The B&H of the dataset is constructed by buying all the stocks of the dataset (the 272 stocks listed
in appendix A) in the first day of the investment period, and tracking its returns during the investment
period.
There is no comparison with related studies because, when this work was proposed, I was unable to
find either works that used the same validation period or works that traded only quarterly, and
so this work uses the B&H and the S&P500 index (especially the latter) as comparison for the obtained
results.
The creation of each portfolio is independent of each other, and so, although the algorithm was run
sequentially without any parallelization, parallelizing the algorithm to create each portfolio at the same
time would not be hard if the necessary alterations were made.
The execution time of the clusters’ positions optimization algorithm is of about 5 hours and it takes 7
hours to create all of the portfolios once, with an i7 microprocessor.
4.1 Validation Metrics
In order to test the performance of the proposed solution, the metrics used in this work are the
following:
• ROI
• Drawdown
• Sharpe Ratio
• Success rate of trades
• Average time in the market
• Average return per trade
• Rate of positive quarters
• Average return per quarter
These metrics are used to evaluate every type of portfolio, and the performance of each portfolio is
compared to the S&P500 index and to the B&H of the dataset over the same period.
4.1.1 ROI
The Return on Investment is used to measure the amount of return, given in percentage, that a
certain investment had. In this work, this is given exclusively by the percentage difference in the stocks’
quotes, and is mathematically represented as in equation 4.1.
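$$ROI_n[\%] = \frac{\text{Return}_n - \text{InitialInvestment}_n}{\text{InitialInvestment}_n} \tag{4.1a}$$
$$\text{InitialInvestment} = P_{n,t} \tag{4.1b}$$
$$\text{Return} = P_{n,\tau} \tag{4.1c}$$
$$\text{TotalROI}[\%] = \frac{\sum_{n=1}^{NUM} ROI_n}{NUM} \tag{4.1d}$$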
In equation 4.1d, NUM is the total number of companies in the data set being used, P is the quote of
stock index n, t is the period at which stock index n was bought and τ is the period at which stock index
n was sold or had its return evaluated (one can see the ROI of an investment without selling the stock).
The number of shares is not accounted for because this work considers only the evolution of
stocks' quotes, and does not solve the portfolio problem.
To include transaction costs in this work, the used ROI formula will suffer a slight change, as de-
scribed in equations 4.2a and 4.2b.
InitialInvestmenttx = (Pn,t + (Pn,t × ψ)) (4.2a)
Returntx = (Pn,τ − (Pn,τ × ψ)) (4.2b)
In equations 4.2a and 4.2b, ψ is the transaction cost. This work uses ψ = 0, 003 (transaction costs of
0,3%).
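Equations 4.1 and 4.2 translate directly into code. The sketch below assumes each trade is given as a (buy quote, sell quote) pair; the function names are hypothetical.

```python
def roi_with_costs(buy_quote, sell_quote, cost=0.003):
    """ROI of one stock with transaction costs (equations 4.1a, 4.2a and 4.2b)."""
    initial = buy_quote * (1.0 + cost)   # the stock costs `cost` more when bought
    ret = sell_quote * (1.0 - cost)      # and is worth `cost` less when sold
    return (ret - initial) / initial

def total_roi(trades, cost=0.003):
    """Average ROI over all (buy_quote, sell_quote) pairs (equation 4.1d)."""
    if not trades:
        raise ValueError("at least one trade is required")
    return sum(roi_with_costs(b, s, cost) for b, s in trades) / len(trades)
```

Setting `cost=0.0` recovers the plain ROI of equation 4.1a, so the same function covers both cases.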
Since all the portfolios studied in this work are long term investments, and the buy and selling dates
will be at the beginning of each quarter, the total ROI of a portfolio at a given time (compounded over
the time studied) will be given by:
$$\text{TotalROI}(t)[\%]_{tx} = \left( \prod_{t=1}^{T} \left(1 + ROI(t)_{tx}\right) \right) - 1 \tag{4.3}$$
In equation 4.3, T is the total number of quarters that a portfolio has and t is the current quarter being
evaluated. ROI(1)tx, the ROI of buying and selling stocks without any change in the quotes (equivalent
to buying and selling on the first day), would be 0 without transaction costs; with transaction
costs, however, it is negative, as derived in equation 4.4.
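$$ROI_{tx}[\%] = \frac{\text{Return}_{tx} - \text{InitialInvestment}_{tx}}{\text{InitialInvestment}_{tx}} = \frac{\text{Return}(1 - \psi) - \text{InitialInvestment}(1 + \psi)}{\text{InitialInvestment}(1 + \psi)} = \frac{-2 \times \text{InitialInvestment} \times \psi}{\text{InitialInvestment}(1 + \psi)} = \frac{-2 \times \psi}{1 + \psi} \tag{4.4}$$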
Applying the 0,3% transaction cost used in this work (ψ = 0,003) to equation 4.4, the result is
−0,006/1,003 ≈ −0,00598205. This is the value of ROI(1)tx, and the fraction of invested money that goes
to transaction costs in that quarter.
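The compounding of equation 4.3 and the round-trip cost of equation 4.4 can be evaluated numerically with the sketch below; the function names are hypothetical.

```python
def compounded_roi(quarterly_rois):
    """Total ROI compounded over the quarters (equation 4.3)."""
    total = 1.0
    for roi in quarterly_rois:
        total *= 1.0 + roi
    return total - 1.0

def round_trip_roi(cost):
    """ROI of buying and selling at an unchanged quote (equation 4.4)."""
    return -2.0 * cost / (1.0 + cost)
```

Compounding matters even for small numbers: a quarter of +10% followed by one of −10% leaves the portfolio at −1%, not at zero.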
4.1.2 Drawdown
This metric evaluates the biggest peak-to-trough decline during a specific period of investment. It is
quoted as the percentage between the peak and the subsequent trough. Investors can use this metric
as a way to measure a portfolio volatility.
Drawdown = min(0, ROIi)→ i ∈ [0, Q] (4.5)
In equation 4.5 Q is the number of quarters in a portfolio, ROIi is the ROI in quarter i, and i is the
number of quarters passed since the beginning of the portfolio.
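Equation 4.5 can be read literally as the most negative quarterly ROI, or 0 when no quarter is negative. The sketch below implements that literal reading, which is an interpretation of the formula and not a peak-to-trough computation on the cumulative curve; the function name is hypothetical.

```python
def drawdown(quarterly_rois):
    """Most negative quarterly ROI, or 0.0 if every quarter is non-negative (equation 4.5)."""
    return min([0.0, *quarterly_rois])
```

A portfolio with quarterly ROIs of 5%, −12% and 3% therefore has a drawdown of −12%.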
4.1.3 Sharpe Ratio
The Sharpe Ratio is a popular ratio, used to measure the risk associated with the return of a
portfolio. The excess return of the portfolio over the risk free rate of return is standardized by the
standard deviation of the portfolio. The higher this ratio is, the better.
The risk free rate of return is a theoretical concept that represents the return an investment
without risk would have. This kind of investment does not exist, since there is always some risk
associated with investing; however, United States Treasury Bills are usually used as the reference for
the risk free rate, since they are considered the least risky investment worldwide.
$$\text{SharpeRatio} = \frac{\text{Portfolio Return} - \text{Risk free rate}}{\delta} \tag{4.6}$$
In equation 4.6 δ is the standard deviation of the portfolio.
This work uses a yearly risk free rate of 2% for 2012 and 2015, and 2,5% for 2013 and 2014.
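Equation 4.6 can be sketched as below. The text does not say over which returns the standard deviation is taken or whether it is a population or a sample estimate, so using the population standard deviation of the periodic returns is an assumption; the function name is hypothetical.

```python
import statistics

def sharpe_ratio(portfolio_return, risk_free_rate, period_returns):
    """Excess return over the risk free rate, divided by the standard deviation (equation 4.6)."""
    # Assumption: population standard deviation of the portfolio's periodic returns.
    deviation = statistics.pstdev(period_returns)
    if deviation == 0:
        raise ValueError("standard deviation is zero, the ratio is undefined")
    return (portfolio_return - risk_free_rate) / deviation
```

For a return of 10%, a risk free rate of 2% and periodic returns of 0% and 20% (standard deviation 0.1), the ratio is 0.8.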
4.1.4 Success rate of trades
A trade is considered as a buy and subsequent sell of a stock (including the selling of stocks at the
end of the period of the portfolios, in the fourth quarter of 2015). The Success rate of trades will be used
to measure the percentage of trades that obtained a positive return, and is described in equation 4.7.
$$\text{Success Rate of Trades}[\%] = \frac{\text{number of trades with positive return}}{\text{number of total trades}} \tag{4.7}$$
4.1.5 Average time in the market
This metric evaluates the average time each investment was in the market, making it possible to
conclude how long the portfolio maintains its investments. It is measured in number of quarters, and
is given by:
$$\text{Average time in the market} = \frac{\sum_{t=1}^{T} M_t}{T} \tag{4.8}$$
In 4.8 T is the total number of trades done in a portfolio and M is the time spent by trade t in the market.
4.1.6 Rate of positive quarters
This metric evaluates if the portfolio is able to maintain a positive return in each quarter over the
evaluation period and can be used as a way to measure risk.
$$\text{Rate of positive quarters} = \frac{\#\text{Positive quarters}}{\#\text{Quarters}} \tag{4.9}$$
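The three counting metrics of equations 4.7 to 4.9 can be sketched together. The function names are hypothetical; each takes a plain list of per-trade or per-quarter values.

```python
def success_rate(trade_returns):
    """Fraction of trades with a strictly positive return (equation 4.7)."""
    return sum(1 for r in trade_returns if r > 0) / len(trade_returns)

def average_time_in_market(quarters_held):
    """Average number of quarters each trade stayed open (equation 4.8)."""
    return sum(quarters_held) / len(quarters_held)

def rate_of_positive_quarters(quarterly_returns):
    """Fraction of quarters with a strictly positive return (equation 4.9)."""
    return sum(1 for r in quarterly_returns if r > 0) / len(quarterly_returns)
```

A trade or quarter with a return of exactly zero is counted as not positive in this sketch.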
4.2 Case Studies
In this section, the case studies of the models used are presented. The dataset used for the models
consists of 272 companies of the S&P500 index. The application was tested with data obtained
from the Yahoo Finance API (as explained in Chapter 3) using the closing quotes of stocks. Three constraints
are present in all of the case studies. They are the following:
• Only Long positions: The portfolios created allow only long positions. Since
this work measures the amount of time a position is open in quarters, and it uses only
Fundamental Analysis, short positions are not contemplated. Short positions would require a
deeper technical analysis, since they are riskier than long positions.
• No dividends: This work uses the stocks' closing quotes, with no adjustment to account for
dividends (payments distributed to shareholders).
• Transaction Costs: This work assumes transaction costs of 0,3% of the stock quote
value, in every buy or sell.
Results will be compared with the B&H of the dataset in order to check whether the proposed solution is
better than simply doing a B&H on a pool of stocks. Results will also be compared with the S&P500 index^1
to draw conclusions about how good a result really is.
Every portfolio will have its start in the first quarter of 2012, and end at the fourth quarter of 2015.
Even though the first two case studies could start in 2011, this decision was made to ease the compari-
son between every portfolio, including the ones in case study 3, that could only start at the beginning of
2012.
The 3 case studies are concerned only with the evolution of returns in the portfolios during the stipulated
time period, and not with the portfolio composition problem.
Although a ranking selection was implemented in this work, the performance of this selection scheme
was worse than the performance of the roulette wheel selection, and so the presented results are
constructed using only the roulette wheel selection.
4.2.1 Case Study I - User Input Classification
This case study presents the results of classifying stocks using user input data.
The strategy is to build a portfolio containing only one type of stock. This portfolio buys a stock in the quarter that stock is classified as belonging to the type, and sells it in the quarter it stops belonging to that type.
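The buy and sell rule above can be sketched in a few lines. The sketch assumes that the classification of one stock is available as one boolean per quarter, and that a position still open in the last quarter is closed there; both the function name and the example data are hypothetical.

```python
def trades_from_membership(in_group):
    """Turn a per-quarter membership flag into (buy_quarter, sell_quarter) pairs.

    `in_group[q]` is True when the stock is classified in the group in quarter q.
    A position still open at the end is closed in the last quarter (assumption).
    """
    trades = []
    buy_quarter = None
    for quarter, member in enumerate(in_group):
        if member and buy_quarter is None:
            buy_quarter = quarter                  # stock enters the group: buy
        elif not member and buy_quarter is not None:
            trades.append((buy_quarter, quarter))  # stock leaves the group: sell
            buy_quarter = None
    if buy_quarter is not None:
        trades.append((buy_quarter, len(in_group) - 1))
    return trades


membership = [False, True, True, False, False, True, True, True]
print(trades_from_membership(membership))
```

For the example membership the stock is bought in quarter 1, sold in quarter 3, bought again in quarter 5 and held until the end of the period.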
Since this classification is fixed, the values presented are the result of a single run (every other run
would have exactly the same results, since data did not change, and this is a deterministic classification).
We can see in figure 4.1 that the Fast Growers type of stocks has the best returns overall, with a 4 year period ROI of 79,32%, above the S&P500 returns of 62,49% and the dataset B&H of 54,25%. In this time period we can also check that the Turnaround type of stocks has the worst result overall, with a ROI of 30,13%. The Turnarounds and Slow Growers types were the only ones that underperformed both the B&H and the S&P500. Potentials and Stalwarts got a result similar to that of the B&H.
This is an expected result, since Turnarounds are all the companies that got a bad or very bad classification at growth. It was also to be expected that the type of stocks with the best results was the Fast Growers, since it contains the stocks with the best revenue growth. The Stalwarts' return was a surprise, as one would expect them to perform better.
¹ Data obtained from http://performance.morningstar.com/Performance/index-c/performance-return.action?t=SPX
Type's Portfolios (accumulated returns per quarter)
Quarter    SG        SW        FG        POT       TA        B&H       S&P500
2012 Q1    7,91%     8,85%     14,07%    10,73%    6,11%     10,25%    11,99%
2012 Q2    3,43%     5,72%     6,67%     4,34%     1,52%     5,35%     8,31%
2012 Q3    10,52%    10,58%    15,88%    8,27%     6,40%     11,73%    14,54%
2012 Q4    14,04%    14,68%    25,37%    13,94%    4,65%     15,50%    13,39%
2013 Q1    24,27%    24,83%    28,56%    21,30%    14,36%    26,05%    24,76%
2013 Q2    26,46%    28,15%    29,92%    20,37%    20,35%    29,50%    27,70%
2013 Q3    32,62%    35,00%    35,68%    29,24%    29,90%    37,89%    33,69%
2013 Q4    37,55%    44,87%    46,95%    38,72%    40,10%    48,41%    46,96%
2014 Q1    43,75%    51,04%    56,97%    42,89%    45,02%    55,24%    48,87%
2014 Q2    45,16%    55,74%    61,02%    44,70%    47,30%    59,29%    55,85%
2014 Q3    41,50%    52,32%    53,34%    41,67%    43,27%    54,97%    56,80%
2014 Q4    51,82%    63,62%    65,29%    52,06%    45,95%    65,68%    63,68%
2015 Q1    54,16%    66,37%    78,10%    60,15%    45,32%    68,94%    64,40%
2015 Q2    51,47%    62,57%    72,37%    57,31%    40,83%    64,51%    64,03%
2015 Q3    45,49%    52,60%    68,14%    53,34%    27,82%    52,44%    52,64%
2015 Q4    44,09%    54,33%    79,32%    55,73%    30,13%    54,25%    62,49%
Figure 4.1: User input classification portfolios' accumulated returns. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
It is shown in figure 4.2 that Stalwarts was the type with the highest rate of trade success, and Turnarounds the type with the lowest, by a large margin. Stalwarts was also the type with the biggest gain in a single trade. Surprisingly, the biggest gain of the Fast Growers type was the smallest of all types by a large margin; nevertheless, its biggest loss was also the smallest. The type with the biggest loss in a trade was the Turnaround type. Fast Growers and Stalwarts both obtained more positive quarters than any other type, and Turnarounds got the worst result once more, with only 62,5% of positive quarters. Fast Growers got the biggest average return per quarter, more than twice the Turnarounds average.
One of the surprising things here is that the Fast Growers portfolio got the best Sharpe ratio of all portfolios even though it was the one with the biggest return, meaning that it was not only the portfolio with the best returns but also the one carrying the least risk relative to those returns. The Fast Growers type did not have a big drawdown (compared with the other types), although the smallest drawdown was that of the Potentials type. The biggest drawdown was that of the Turnaround type.
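The two risk metrics discussed above can be sketched with their common textbook definitions. This is only an illustration: the risk-free rate, the period used and the exact drawdown definition behind the values of figure 4.2 may differ from the ones assumed here, and the function names and sample returns are hypothetical.

```python
from statistics import mean, stdev


def sharpe_ratio(period_returns, risk_free_rate=0.0):
    """Mean excess return divided by the sample standard deviation of the returns."""
    excess = [r - risk_free_rate for r in period_returns]
    return mean(excess) / stdev(excess)


def max_drawdown(period_returns):
    """Largest peak-to-trough fall of the compounded portfolio value, as a fraction of the peak."""
    value = peak = 1.0
    worst = 0.0
    for r in period_returns:
        value *= 1.0 + r
        peak = max(peak, value)
        worst = max(worst, (peak - value) / peak)
    return worst


# Hypothetical quarterly returns, for illustration only.
quarterly = [0.05, -0.02, 0.03, -0.06, 0.04]
print(f"Sharpe (per quarter): {sharpe_ratio(quarterly):.2f}")
print(f"Max drawdown: {max_drawdown(quarterly):.2%}")
```

A portfolio can therefore combine a high return with a high Sharpe ratio when its quarterly returns vary little around their mean, which is the situation described for the Fast Growers type.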
In figure 4.3 one can see that Fast Growers and Potentials were the only types (including the B&H and the S&P500 index) that got positive results every year. It is also visible that Turnarounds got not only the biggest yearly return in the time period (in 2013) but also the biggest yearly loss (in 2015), showing the high volatility of that type.

Metric                                  SG         SW         FG         POT        TA
Number of Trades                        132        243        72         85         203
Rate of Trade Success                   65,91%     78,19%     73,61%     72,94%     59,11%
Biggest Trade Return                    213,04%    253,67%    151,01%    226,94%    221,01%
Biggest Trade Loss                      -65,39%    -74,54%    -34,50%    -41,83%    -80,57%
Average Time on Market (in Quarters)    4,95       8,51       3,93       5,94       4,09
Average Return Per Trade                24,46%     36,03%     18,67%     26,76%     10,83%
Rate of Positive Quarters               68,75%     75,00%     75,00%     68,75%     62,50%
Average Return Per Quarter              2,39%      2,84%      3,87%      2,90%      1,78%
4 Year Return                           44,09%     54,33%     79,32%     55,73%     30,13%
Sharpe Ratio                            0,83       0,87       2,15       1,41       0,36
Drawdown                                10,07%     13,77%     9,96%      6,81%      18,13%
Figure 4.2: User input portfolios metrics table. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
Figure 4.4 presents the quarterly, accumulated and yearly results of the portfolios. The worst quarter of each portfolio is marked in red (this does not mean the drawdown occurs in the same quarter, since negative quarters have a bigger impact on portfolios that have bigger accumulated returns).
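The relation between quarterly, accumulated and yearly returns can be sketched as follows, assuming that returns are compounded from one quarter to the next. The function names are hypothetical; the sample values are the first four S&P500 quarterly returns listed in figure 4.4.

```python
def accumulate(period_returns):
    """Compound a list of per-period returns into the running accumulated return."""
    accumulated = []
    value = 1.0
    for r in period_returns:
        value *= 1.0 + r
        accumulated.append(value - 1.0)
    return accumulated


def yearly_from_quarterly(quarterly_returns):
    """Compound each consecutive block of four quarters into one yearly return."""
    yearly = []
    for start in range(0, len(quarterly_returns), 4):
        block = quarterly_returns[start:start + 4]
        yearly.append(accumulate(block)[-1])
    return yearly


# First four S&P500 quarterly returns as listed in figure 4.4 (2012 Q1 to Q4).
sp500_2012 = [0.1200, -0.0329, 0.0576, -0.0101]
print([f"{r:.2%}" for r in accumulate(sp500_2012)])
```

Under this assumption the compounded values agree, up to rounding, with the accumulated S&P500 returns of figure 4.1. The sketch also shows why the same negative quarter removes more percentage points from a portfolio that has already accumulated a larger return.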
4.2.1.A Conclusion
The results of Fast Growers (the best type) and Turnarounds (the worst type) were unsurprising. The same cannot be said about the other types, because before obtaining these results there was no way of knowing what kind of result to expect.
It needs to be mentioned that these results were obtained with thresholds that were hardwired from the start. This means that the user who defines these thresholds has access to information about all the data from the past, and can make a decision about the thresholds using that information. Unless the user is a financial expert (or at least knows the subject), the robustness of this classification (when done in real time with the market) is uncertain, since it depends on how well the user understands the market.
Yearly Returns
          2012      2013      2014      2015
SG        14,04%    20,61%    10,38%    -5,10%
SW        14,68%    26,33%    12,94%    -5,68%
FG        25,37%    17,22%    12,48%    8,49%
POT       13,94%    21,75%    9,62%     2,41%
TA        4,65%     33,88%    4,18%     -10,84%
B&H       15,51%    28,40%    11,65%    -6,68%
S&P500    13,41%    29,60%    11,39%    -0,73%
Figure 4.3: Types B&H - Yearly Returns. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
Quarterly Returns
Quarter    SG        SW        FG        POT       TA        B&H       S&P500
2012 Q1    7,91%     8,85%     14,07%    10,73%    6,11%     10,25%    12,00%
2012 Q2    -4,15%    -2,87%    -6,48%    -5,77%    -4,32%    -4,44%    -3,29%
2012 Q3    6,86%     4,59%     8,63%     3,76%     4,80%     6,06%     5,76%
2012 Q4    3,19%     3,71%     8,18%     5,24%     -1,65%    3,37%     -1,01%
2013 Q1    8,97%     8,86%     2,55%     6,45%     9,28%     9,13%     10,03%
2013 Q2    1,76%     2,66%     1,06%     -0,77%    5,24%     2,74%     2,36%
2013 Q3    4,87%     5,34%     4,44%     7,37%     7,93%     6,47%     4,69%
2013 Q4    3,71%     7,32%     8,31%     7,34%     7,85%     7,63%     9,92%
2014 Q1    4,51%     4,26%     6,81%     3,01%     3,51%     4,60%     1,30%
2014 Q2    0,98%     3,11%     2,58%     1,27%     1,58%     2,61%     4,69%
2014 Q3    -2,52%    -2,20%    -4,77%    -2,10%    -2,74%    -2,71%    0,61%
2014 Q4    7,29%     7,41%     7,80%     7,34%     1,87%     6,91%     4,39%
2015 Q1    1,54%     1,69%     7,75%     5,32%     -0,44%    1,97%     0,44%
2015 Q2    -1,75%    -2,29%    -3,22%    -1,77%    -3,09%    -2,62%    -0,23%
2015 Q3    -3,95%    -6,13%    -2,45%    -2,53%    -9,24%    -7,34%    -6,94%
2015 Q4    -0,97%    1,13%     6,65%     1,56%     1,81%     1,19%     6,45%

Accumulated Returns: same values as listed in figure 4.1.

Yearly Returns
          2012      2013      2014      2015
SG        14,04%    20,61%    10,38%    -5,10%
SW        14,68%    26,33%    12,94%    -5,68%
FG        25,37%    17,22%    12,48%    8,49%
POT       13,94%    21,75%    9,62%     2,41%
TA        4,65%     33,88%    4,18%     -10,84%
B&H       15,39%    28,50%    11,63%    -6,90%
S&P500    13,41%    29,60%    11,39%    -0,73%
Figure 4.4: Investment by Type table. The key is the following: SG - Slow Growers; SW - Stalwarts; FG - Fast Growers; POT - Potentials; TA - Turnarounds
4.2.2 Case Study II - Clustering
This case study presents the results of clustering stocks by their revenue growth and size (total assets), in the plane Size×Growth, with the Clustering GA algorithm described in section 3.3.6.
The algorithm was run 10 times, and the solution that achieved the median value of the averages of all quarters' fitnesses was used, as shown in table 4.1.
Table 4.1: Average fitness per quarter of the clustering algorithm
                     Best     Worst     Median
Calinski-Harabasz    806,4    665,37    767,8
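The fitness values in table 4.1 are Calinski-Harabasz scores. A minimal sketch of the standard definition of this index is given below (between-cluster dispersion over within-cluster dispersion, each divided by its degrees of freedom); how the score is embedded in the Clustering GA is described in section 3.3.6 and is not reproduced here. The function name and the sample points are hypothetical.

```python
def calinski_harabasz(points, labels):
    """Ratio of between-cluster to within-cluster dispersion, each scaled by its degrees of freedom."""
    n = len(points)
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    k = len(clusters)
    if k < 2 or n <= k:
        raise ValueError("need at least 2 clusters and more points than clusters")

    def centroid(pts):
        return tuple(sum(coord) / len(pts) for coord in zip(*pts))

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    overall = centroid(points)
    between = 0.0
    within = 0.0
    for members in clusters.values():
        center = centroid(members)
        between += len(members) * sq_dist(center, overall)
        within += sum(sq_dist(p, center) for p in members)
    # Raises ZeroDivisionError if every point coincides with its cluster centroid.
    return (between / (k - 1)) / (within / (n - k))


# Two tight, well separated groups give a high score.
sample = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
print(calinski_harabasz(sample, [0, 0, 1, 1]))
```

Higher values indicate clusters that are compact and far from each other, which is why the index can be maximized as a fitness function.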
The strategy used in this case study is similar to the one from Case Study I: a portfolio is made with each cluster of stocks. Just as in Case Study I, each portfolio buys a stock in the quarter that stock is classified as belonging to the cluster, and sells it in the quarter it stops belonging to that cluster.
Since clusters are ordered, cluster A is always the cluster nearest to the origin (point (0,0)) and cluster E is always the cluster furthest away from the origin. Also, remember that the normalization applied to the plane Size×Growth is such that, in terms of distances, 1.000.000.000$ in size = 1% in revenue growth.
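The ordering of the clusters can be sketched as follows, assuming Euclidean distance in the normalized plane where 1.000.000.000$ of size counts as much as 1% of revenue growth. The function name and the cluster centers in the example are hypothetical and are not results of this work.

```python
import math

BILLION = 1_000_000_000  # 1.000.000.000$ in size counts as 1 unit, like 1% of growth


def label_clusters(centers):
    """Name cluster centers A, B, C, ... by increasing distance from the origin.

    `centers` holds (total_assets_in_dollars, revenue_growth_in_percent) pairs.
    """
    def distance(center):
        size_dollars, growth_percent = center
        return math.hypot(size_dollars / BILLION, growth_percent)

    ordered = sorted(centers, key=distance)
    return {chr(ord("A") + i): center for i, center in enumerate(ordered)}


# Hypothetical cluster centers, not taken from the thesis results.
centers = [(400e9, 5.0), (20e9, 3.0), (900e9, 2.0), (60e9, 25.0), (150e9, 8.0)]
for name, center in label_clusters(centers).items():
    print(name, center)
```

Because the Size axis spans a much wider range than the Growth axis, the distance to the origin of the outer clusters is dominated by size in this example.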
We can see in figure 4.5 that cluster E has the best returns overall, with a 4 year period ROI of 77,34%, above the S&P500 returns of 62,49% and the dataset B&H of 54,54%. One should also notice that if the period was 3 years, at the beginning of 2015 cluster E had returns of 100,83%. In this time period we can also check that clusters C and D have the worst results overall, with a ROI of 45,02% and 34,37% respectively. All of the clusters except clusters B and E underperformed the B&H. The fact that cluster E got the best results was expected, since the further away from the origin, the bigger the company, and possibly the bigger the revenue growth (although this was not the case in cluster E, which got its classification exclusively from the value in the size axis).
Figure 4.6 shows that cluster E has very few trades in this period, and a significantly high time on the market (compared with other clusters), meaning that companies in cluster E do not transition much to other clusters. In cluster B, however, there is a high number of trades (relative to clusters C, D and E) and the average time in the market is low, meaning there are more companies that change from cluster B into other clusters (and vice versa). Cluster A has almost the same number of trades as cluster B, but almost double the time in the market, meaning that the cluster is simply bigger than B (there is not as much exchange of clusters in companies from cluster A as there is in companies from cluster B).
The rate of trade success is high in every cluster, with 3 clusters above an 80% rate of trade success and cluster E at almost 90%. Cluster A has the biggest gain in a trade, and cluster D has the smallest biggest gain in a trade. Cluster D also has the biggest loss in a trade, while cluster E has the smallest biggest loss.

Cluster's Portfolios (accumulated returns per quarter)
Quarter    A         B         C         D         E          B&H       S&P500
2012 Q1    9,13%     11,64%    10,06%    9,46%     28,17%     10,14%    11,99%
2012 Q2    3,96%     7,31%     4,45%     15,49%    21,00%     5,25%     8,31%
2012 Q3    10,33%    12,49%    7,59%     22,17%    36,22%     11,77%    14,54%
2012 Q4    14,36%    17,81%    8,75%     19,05%    49,04%     15,51%    13,39%
2013 Q1    24,87%    24,72%    19,24%    25,65%    59,20%     26,10%    24,76%
2013 Q2    27,79%    23,45%    26,03%    30,77%    70,12%     29,49%    27,70%
2013 Q3    36,86%    28,92%    30,88%    32,35%    78,24%     37,84%    33,69%
2013 Q4    46,58%    38,72%    42,32%    43,80%    94,57%     48,32%    46,96%
2014 Q1    53,08%    47,30%    48,97%    44,59%    91,90%     55,03%    48,87%
2014 Q2    56,58%    54,08%    54,63%    40,37%    100,64%    59,04%    55,85%
2014 Q3    52,68%    47,79%    51,91%    35,78%    97,29%     54,74%    56,80%
2014 Q4    63,90%    53,25%    59,03%    43,74%    104,26%    65,60%    63,68%
2015 Q1    66,93%    61,41%    57,79%    46,18%    96,95%     68,90%    64,40%
2015 Q2    60,47%    60,19%    59,05%    47,82%    101,68%    64,51%    64,03%
2015 Q3    51,00%    55,58%    43,94%    32,55%    77,85%     52,77%    52,64%
2015 Q4    51,87%    56,06%    45,02%    34,37%    77,34%     54,54%    62,49%
Figure 4.5: Clustering classification portfolios' accumulated returns. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
Cluster E also has the greatest drawdown by far, with the returns of this cluster falling from 101,68% in the second quarter of 2015 to 77,34% in the fourth quarter of 2015, as seen in figure 4.5. This explains the low value of the Sharpe ratio. Cluster B is the least risky portfolio, having the lowest drawdown and the highest Sharpe ratio.
Figure 4.7 shows a yearly decrease in the returns of cluster E. One can also see that cluster B was the only portfolio to have positive returns every year.
Figures 4.8 and 4.9 show how companies are divided into clusters. It is remarkable how sparse the data is along the Size axis. Cluster A, as said before, is the biggest cluster, containing most of the companies, and cluster E has only 4 companies in the fourth quarter of 2012 and 3 companies in the fourth quarter of 2013. This is due to a shift to the right of the center of the cluster, which made the leftmost company in cluster E move into cluster D.
Figures 4.10 and 4.11 show a zoom of the first 3 clusters. The main difference between cluster A and cluster B is in the Growth axis, which explains why cluster B got its results. The fact that cluster B got the best Sharpe ratio and the smallest drawdown can now be related to the revenue growth. One can also see that not only did the cluster center move, but the points belonging to cluster B also seem to be in different locations. This happens because revenue growth varies more than the total assets size, and this explains why cluster B had the lowest average time on the market.

Metric                                  Cluster A    Cluster B    Cluster C    Cluster D    Cluster E
Number of Trades                        104          107          65           31           9
Rate of Trade Success                   84,62%       74,77%       73,85%       80,65%       88,89%
Biggest Gain                            202,85%      128,38%      167,25%      107,34%      150,20%
Biggest Loss                            -48,88%      -50,11%      -57,45%      -84,16%      -36,59%
Average Time on Market (in Quarters)    6,64         3,55         8,80         6,23         7,33
Average Return Per Trade                46,58%       14,75%       29,69%       23,36%       66,46%
Rate of Positive Quarters               75,00%       68,75%       75,00%       75,00%       62,50%
Average Return Per Quarter              2,76%        2,91%        2,48%        1,99%        4,00%
4 Year Return                           51,87%       56,06%       45,02%       34,37%       77,34%
Sharpe Ratio                            0,76         1,50         0,60         0,51         0,66
Drawdown                                15,93%       6,29%        15,11%       15,27%       23,83%
Figure 4.6: Clustering portfolios metrics table. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
4.2.2.A Conclusion
One can conclude that the biggest companies of the dataset got huge returns in bull markets and huge losses in bear markets, having small Sharpe ratios and huge drawdowns. Big companies outside this small set of huge companies have the worst returns of all (the ones from cluster D). It can also be concluded that revenue growth can be chosen as an indicator to pick stocks with relatively good Sharpe ratios.
With the scale and number of clusters used there was a better grouping in size than in growth. Due to the amount and sparsity of data, more clusters would be necessary to get a better clustering in the growth axis (especially at high values of size).
This classification is more reliable than the one made with user input, since it was made and optimized automatically, being less prone to human error and, because of that, probably more robust if used in real time with the market (its behavior is going to change less than that of the user input classification).
Yearly Returns
          2012      2013      2014      2015
A         14,36%    28,17%    11,81%    -7,34%
B         17,81%    17,75%    10,47%    1,83%
C         8,75%     30,87%    11,75%    -8,81%
D         19,05%    20,79%    -0,04%    -6,52%
E         49,04%    30,54%    4,98%     -13,18%
B&H       15,51%    28,40%    11,65%    -6,68%
S&P500    13,41%    29,60%    11,39%    -0,73%
Figure 4.7: Cluster's portfolios yearly returns. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
Figure 4.8: Cluster's representation in the whole plane in the fourth quarter of 2012. Cluster centers have the key "Cluster X" where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster. (Scatter plot "Clusters 2012 Q4"; horizontal axis: Size [1.000.000.000$], from 0 to 1000; vertical axis: Growth [%], from -80 to 100.)
Figure 4.9: Cluster's representation in the plane in the fourth quarter of 2013. Cluster centers have the key "Cluster X" where X is the name of the cluster (A, B, C, D or E), and the points belonging to a certain cluster have only the letter of that cluster. (Scatter plot "Clusters 2013 Q4"; horizontal axis: Size [1.000.000.000$], from 0 to 1000; vertical axis: Growth [%], from -60 to 100.)
Figure 4.10: Cluster's representation in the plane in the fourth quarter of 2012, zoomed in on the first three clusters. Cluster centers have the key "Cluster X" where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster. (Scatter plot "Clusters 2012 Q4"; horizontal axis: Size [1.000.000.000$], from 0 to 180; vertical axis: Growth [%], from -80 to 100.)
Figure 4.11: Cluster's representation in the plane in the fourth quarter of 2013, zoomed in on the first three clusters. Cluster centers have the key "Cluster X" where X is the name of the cluster (A, B, C), and the points belonging to a certain cluster have only the letter of that cluster. (Scatter plot "Clusters 2013 Q4"; horizontal axis: Size [1.000.000.000$], from 0 to 200; vertical axis: Growth [%], from -60 to 100.)
Quarterly Returns
Quarter    A         B         C         D          E          B&H       S&P500
2012 Q1    9,13%     11,64%    10,06%    9,46%      28,17%     10,14%    12,00%
2012 Q2    -4,74%    -3,88%    -5,10%    5,51%      -5,59%     -4,45%    -3,29%
2012 Q3    6,13%     4,82%     3,01%     5,78%      12,58%     6,20%     5,76%
2012 Q4    3,65%     4,73%     1,08%     -2,56%     9,41%      3,35%     -1,01%
2013 Q1    9,19%     5,87%     9,64%     5,55%      6,81%      9,17%     10,03%
2013 Q2    2,33%     -1,02%    5,69%     4,08%      6,86%      2,69%     2,36%
2013 Q3    7,10%     4,43%     3,85%     1,20%      4,77%      6,45%     4,69%
2013 Q4    7,10%     7,60%     8,74%     8,65%      9,16%      7,60%     9,92%
2014 Q1    4,43%     6,18%     4,67%     0,55%      -1,37%     4,52%     1,30%
2014 Q2    2,29%     4,60%     3,80%     -2,92%     4,55%      2,59%     4,69%
2014 Q3    -2,50%    -4,08%    -1,76%    -3,27%     -1,67%     -2,70%    0,61%
2014 Q4    7,35%     3,70%     4,69%     5,86%      3,53%      7,02%     4,39%
2015 Q1    1,85%     5,32%     -0,78%    1,70%      -3,58%     1,99%     0,44%
2015 Q2    -3,87%    -0,75%    0,80%     1,12%      2,40%      -2,60%    -0,23%
2015 Q3    -5,90%    -2,88%    -9,50%    -10,33%    -11,82%    -7,14%    -6,94%
2015 Q4    0,58%     0,30%     0,75%     1,37%      -0,29%     1,16%     6,45%

Accumulated Returns: same values as listed in figure 4.5.
Yearly Returns: same values as listed in figure 4.7.
Figure 4.12: Investment by Type table. The key is the following: A - Cluster A; B - Cluster B; C - Cluster C; D - Cluster D; E - Cluster E.
4.2.3 Case Study III - Using GAs to optimize FI
This case study presents the results of using the FA GA described in section 3.3.6 to optimize the weights given to Fundamental Indicators. It studies the results of applying GAs to give buy and sell signals on each of the groups of case studies 1 and 2, and on the whole dataset.
Since the GA uses the last year of data for training, these portfolios have a training period of 2 years (2010 and 2011). The first year of training is used to define the type/cluster of the stocks, and the second year is used to train the GA.
The algorithm was run 10 times for each model. In the models that used the clustering algorithm, the clustering algorithm was run twice, so for each portfolio that uses clustering, 5 runs used the first run of the clustering algorithm and the other 5 used the second run. Results were compared between them. Table 4.2 shows the best, worst and median run of each portfolio.
Table 4.2: Best, worst and median run of each portfolio. For the type portfolios: SG - Slow Growers, SW - Stalwarts, FG - Fast Growers, POT - Potentials, TA - Turnarounds. The keys A, B, C, D and E refer to the clusters.
          Best Run    Worst Run    Median Run
GA        43,41%      36,05%       41,30%
SG-GA     62,78%      22,72%       46,03%
SW-GA     55,33%      39,21%       48,67%
FG-GA     72,11%      27,18%       41,27%
POT-GA    83,76%      30,54%       43,71%
TA-GA     37,39%      10,65%       21,43%
A-GA      49,26%      27,96%       42,09%
B-GA      63,64%      37,72%       51,51%
C-GA      51,52%      32,33%       42,13%
D-GA      43,50%      18,64%       36,93%
E-GA      112,33%     81,55%       104,68%
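With an even number of runs, the median does not necessarily correspond to one actual run. The sketch below shows one possible way of selecting a median run, assuming the lower of the two middle values is taken; this choice, the function name and the sample returns are assumptions, since the text does not state how the median run of the 10 runs was picked.

```python
from statistics import median_low


def pick_median_run(run_returns):
    """Return (index, value) of the run whose result is the median of all runs.

    Assumption: with an even number of runs the lower of the two middle
    values is taken, so that the result is always an actual run.
    """
    target = median_low(run_returns)
    return run_returns.index(target), target


# Hypothetical 4-year returns of 10 runs of one portfolio.
runs = [0.41, 0.36, 0.43, 0.39, 0.42, 0.40, 0.38, 0.37, 0.44, 0.35]
print(pick_median_run(runs))
```

Reporting a run that actually happened, instead of an average of two runs, keeps the metrics and the quarterly curves of the reported portfolio consistent with each other.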
Only one portfolio using user input classification (type Slow Growers) and two portfolios using clustering (clusters D and E) showed improvements. Among the remaining portfolios, the ones optimizing weights based on the clusters of the stocks got results closer to those of the previous case studies. The results presented show how the GA improved a cluster type portfolio (cluster E) and a classification type portfolio (Slow Growers), relative to the original portfolios of case studies 1 and 2.
Figure 4.13 shows the results of the median runs for cluster E and the Slow Growers type. These were the only groups (apart from cluster D) that had an increase in performance when compared to the results obtained in case studies 1 and 2. One can notice that, especially in the portfolio E-GA, there was an increase in performance of 27,34 percentage points (from 77,34% to 104,68%), most of it in the year 2015 (see figure 4.15). The GA limited the losses of cluster E in 2015 and, consequently, increased the Sharpe ratio.
One can also see in figure 4.14 that, since companies were only sold with the signal given by the GA, the E-GA portfolio achieved a 100% rate of trade success and increased the rate of positive quarters to 87,50%.
Type's Portfolios (accumulated returns per quarter)
Quarter    SG-GA     SG        E-GA       E          GA
2012 Q1    6,65%     7,91%     29,50%     28,17%     7,67%
2012 Q2    4,50%     3,43%     30,50%     21,00%     3,89%
2012 Q3    12,73%    10,52%    41,65%     36,22%     7,64%
2012 Q4    16,63%    14,04%    49,71%     49,04%     10,12%
2013 Q1    23,62%    24,27%    61,30%     59,20%     19,05%
2013 Q2    24,44%    26,46%    72,99%     70,12%     20,35%
2013 Q3    31,57%    32,62%    74,92%     78,24%     27,93%
2013 Q4    39,78%    37,55%    89,28%     94,57%     36,93%
2014 Q1    45,98%    43,75%    94,96%     91,90%     42,23%
2014 Q2    46,81%    45,16%    105,01%    100,64%    46,18%
2014 Q3    44,12%    41,50%    93,62%     97,29%     42,40%
2014 Q4    54,93%    51,82%    105,34%    104,26%    51,94%
2015 Q1    62,32%    54,16%    118,52%    96,95%     53,70%
2015 Q2    56,63%    51,47%    125,49%    101,68%    50,07%
2015 Q3    45,87%    45,49%    103,61%    77,85%     36,15%
2015 Q4    46,03%    44,09%    104,68%    77,34%     41,30%
Figure 4.13: GA optimization portfolios. The key "SG" represents the portfolio of type Slow Growers and the key "E" represents the portfolio of cluster E, as constructed in case studies 1 and 2. The suffix "-GA" represents the portfolios using GA to optimize FI weights. The key "GA" represents the GA algorithm running over the whole dataset.
Figure 4.14 also shows that the SG-GA portfolio increased the returns by only 1,94 percentage points compared with the SG portfolio from case study 1. It substantially reduced the number of trades when compared with portfolio SG (by 78,03%, from 132 trades in portfolio SG to 29 trades in portfolio SG-GA). The rate of positive quarters increased (from 68,75% to 75%), however the drawdown also increased (from 10,07% to 16,45%). It is visible in figure 4.13 that the SG-GA portfolio rose more than the SG portfolio up to the first quarter of 2015, but fell right afterwards to the point that it almost crossed below the SG portfolio, and in figure 4.15 we can see how small the difference between the two portfolios is in terms of yearly returns.
4.2.3.A Conclusions
One can conclude from this case study that Fundamental Indicators optimization has better results when applied to small sets of companies (as was the case for clusters E and D).
Also, the GA used in this work greatly reduced the number of trades done in all portfolios, reducing potential losses, but also potential gains. In the case of the E-GA portfolio, the major difference from the cluster E portfolio presented in case study 2 was the reduction of losses in 2015.

Metric                                  GA         SG-GA      E-GA
Number of Trades                        368        29         8
Rate of Trade Success                   72,83%     68,97%     100,00%
Biggest Trade Return                    276,67%    264,07%    117,26%
Biggest Trade Loss                      -69,44%    -35,46%    -
Average Time on Market (in Quarters)    5,49       3,79       3,50
Average Return Per Trade                24,02%     29,16%     45,94%
Rate of Positive Quarters               75,00%     75,00%     87,50%
Average Return Per Quarter              2,29%      2,49%      4,87%
4 Year Return                           41,30%     46,03%     104,68%
Sharpe Ratio                            0,67       0,84       0,98
Drawdown                                17,55%     16,45%     21,89%
Figure 4.14: GA portfolios metrics table. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.

One can also conclude that clustering classification is better than the user input one, since it groups companies with more similarities than those of the user input classification, which allows the FA GA algorithm to optimize better trading rules.
Yearly Returns
         2012      2013      2014      2015
SG-GA    16,63%    19,85%    10,84%    -5,74%
SG       14,04%    20,61%    10,38%    -5,10%
E-GA     49,71%    26,43%    8,49%     -0,32%
E        49,04%    30,54%    4,98%     -13,18%
GA       10,12%    24,35%    10,96%    -7,00%
Figure 4.15: FA GA portfolios' yearly returns. "GA" is the whole dataset using the FA GA algorithm, and SG-GA and E-GA are respectively the Slow Growers type and the cluster E using the FA GA algorithm.
5 Conclusion
Contents
5.1 Conclusions
5.2 Achievements
5.3 System limitations
5.4 Future Work
This work describes the implementation of a system that classifies stocks in the plane Size×Growth with two methods, both using the same metrics of Fundamental Analysis, and that uses Fundamental Indicators for trading simulation. Various Artificial Intelligence methods and financial strategies were analyzed and several related works were reviewed. The implementation of the system was described, showing the methods used, the decisions made and the results of the evaluation of the system. This chapter presents the conclusions drawn from the results obtained, the system's achievements and limitations, and possible future work.
5.1 Conclusions
The results obtained in this work allow the conclusion that genetic algorithms are suitable for large amounts of computation and for the analysis of large amounts of data, having successfully classified stocks according to their Fundamental Analysis. The classification methods used were successful, both having achieved returns above the S&P500 index in the expected groups. The genetic algorithm for weight optimization of fundamental indicators improved two cluster based portfolios and one classification based portfolio. It also showed great success when applied to the cluster with the biggest returns, improving the returns by 27,34 percentage points. To have better results in this optimization a bigger number of clusters would be necessary, so that the algorithm could optimize Fundamental Indicators weights in smaller sets of stocks. These results support the conclusion that automatic classification using genetic algorithms is a better way of classifying stocks than using human input, since it is not prone to human error and does a more careful analysis of the data, grouping companies with more similar behaviors.
5.2 Achievements
The main achievements in this work were the following:
• The implementation from scratch of an architecture of a system containing data processing, genetic algorithms and a trading simulator.
• A classification method using Fundamental Analysis and a Genetic Algorithm to optimize the clusters' positions, with a fixed number of clusters.
• The comparison between a user input classification method and an automatic classification method.
• A long term investment strategy based on the implemented classification methods.
• The exclusive use of Fundamental Analysis and Fundamental Indicators with Genetic Algorithms.
5.3 System limitations
• The main limitation of this work is that the system only trades quarterly, ignoring the period in between, and therefore does not take into account the best time to buy or sell.
• Another limitation is the use of only 272 of the 500 stocks of the S&P500 index, due to data availability. All 500 companies of the index should be included, and a dynamic tracking of the companies that leave or enter the index should be added to this work.
5.4 Future Work
• Try different numbers of clusters, and optimize the number of clusters created to maximize the Calinski-Harabasz index.
• Use Technical Analysis to give the buy and sell signals of a portfolio. If Technical Analysis buy/sell signals are trained within a specific group, they should be more effective than when trained with the whole dataset.
• Use macroeconomic data and cycle studies in order to know which group has more potential within a certain economic context.
• Tackle the Markowitz portfolio composition problem, in a way that creates a portfolio that optimizes the number of positions open in each group.
• Implement more sophisticated Adaptive Genetic Algorithm techniques to allow the system to adapt more quickly and obtain better results, both in trading and in classification.
Bibliography
[1] G. S. Atsalakis and K. P. Valavanis, “Surveying stock market forecasting techniques – part ii: Soft
computing methods,” Expert Systems with Applications, vol. 36, no. 3, Part 2, pp. 5932 – 5941,
2009. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417408004417
[2] S. B. Achelis, Technical Analysis from A to Z. McGraw Hill New York, 2001.
[3] B. Greenwald, J. Kahn, P. Sonkin, and M. van Biema, Value Investing: From Graham
to Buffett and Beyond, ser. Wiley finance. John Wiley & Sons, 2004. [Online]. Available:
https://books.google.pt/books?id=gvCzlskpZxoC
[4] M. Buffett and D. Clark, Warren Buffett and the Interpretation of Financial Statements: The Search
for the Company with a Durable Competitive Advantage. Scribner, 2008. [Online]. Available:
https://books.google.pt/books?id=7iqO6rGdrAYC
[5] A. Islam, H. Zaman, and R. Ahmed, “Automated fundamental analysis for stock ranking and growth
prediction,” in Computers and Information Technology, 2009. ICCIT’09. 12th International Conference
on. IEEE, 2009, pp. 145–150.
[6] S.-S. Chen, “Predicting the bear stock market: Macroeconomic variables as leading indicators,”
Journal of Banking & Finance, vol. 33, no. 2, pp. 211–223, 2009.
[7] M. Blejer, “Central banks and price stability: Is a single objective enough?” Journal of Applied
Economics, vol. 1, no. 1, pp. 105–122, 1998.
[8] A. Silva, R. Neves, and N. Horta, “A hybrid approach to portfolio composition based on fundamental
and technical indicators,” Expert Systems with Applications, vol. 42, no. 4, pp. 2036 – 2048, 2015.
[Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417414006113
[9] D. Chandwani and M. S. Saluja, “Stock direction forecasting techniques: an empirical study combining
machine learning system with market indicators in the indian context,” International Journal
of Computer Applications, vol. 92, no. 11, 2014.
[10] W. Wu and J. Xu, “Fundamental analysis of stock price by artificial neural networks model based
on rough set theory,” World Journal of Modelling and Simulation, vol. 1, no. 2, pp. 36–44, 2006.
[11] A. Silva, R. Neves, and N. Horta, “Portfolio optimization using fundamental indicators based on
multi-objective ea,” in 2014 IEEE Conference on Computational Intelligence for Financial Engineering
& Economics (CIFEr). IEEE, 2014, pp. 158–165.
[12] R. D. Edwards, J. Magee, and W. C. Bassetti, Technical analysis of stock trends. CRC Press,
2007.
[13] A. Gorgulho, R. Neves, and N. Horta, “Applying a GA kernel on optimizing technical analysis
rules for stock picking and portfolio composition,” Expert Systems with Applications, vol. 38, no. 11,
pp. 14 072 – 14 085, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0957417411007433
[14] T. Chavarnakul and D. Enke, “A hybrid stock trading system for intelligent technical analysis-
based equivolume charting,” Neurocomputing, vol. 72, no. 16–18, pp. 3517 – 3528, 2009,
Financial Engineering; Computational and Ambient Intelligence (IWANN 2007). [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0925231209001878
[15] Y. Hu, K. Liu, X. Zhang, L. Su, E. Ngai, and M. Liu, “Application of evolutionary computation
for rule discovery in stock algorithmic trading: A literature review,” Applied Soft Computing,
vol. 36, pp. 534 – 551, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S156849461500438X
[16] L. Wang, H. An, X. Liu, and X. Huang, “Selecting dynamic moving average trading rules in the
crude oil futures market using a genetic approach,” Applied Energy, vol. 162, pp. 1608 – 1618,
2016. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0306261915010685
[17] N. T. Vu, “Stock market volatility and international business cycle dynamics: Evidence from
OECD economies,” Journal of International Money and Finance, vol. 50, pp. 1 – 15, 2015.
[Online]. Available: http://www.sciencedirect.com/science/article/pii/S0261560614001338
[18] P. K. Narayan and K. S. Thuraisamy, “Common trends and common cycles in stock
markets,” Economic Modelling, vol. 35, pp. 472 – 476, 2013. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0264999313003179
[19] B. A. Blonigen, J. Piger, and N. Sly, “Comovement in GDP trends and cycles among trading
partners,” Journal of International Economics, vol. 94, no. 2, pp. 239 – 247, 2014. [Online].
Available: http://www.sciencedirect.com/science/article/pii/S0022199614000919
[20] S. J. Brown, W. N. Goetzmann, and A. Kumar, “The dow theory: William peter hamilton’s track
record reconsidered,” The Journal of finance, vol. 53, no. 4, pp. 1311–1333, 1998.
[21] Z. Tan, C. Quek, and P. Y. Cheng, “Stock trading with cycles: A financial application of ANFIS
and reinforcement learning,” Expert Systems with Applications, vol. 38, no. 5, pp. 4741 – 4755,
2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S095741741000905X
[22] X. Yu and M. Gen, Introduction to evolutionary algorithms. Springer Science & Business Media,
2010.
[23] N. K. Kasabov, Foundations of neural networks, fuzzy systems, and knowledge engineering.
Marcel Alencar, 1996.
[24] D. Simon, Evolutionary optimization algorithms. John Wiley & Sons, 2013.
[25] C. C. Aranha and H. Iba, “A tree-based ga representation for the portfolio optimization problem,” in
Proceedings of the 10th annual conference on Genetic and evolutionary computation. ACM, 2008,
pp. 873–880.
[26] S. Sivanandam and S. Deepa, Introduction to genetic algorithms. Springer Science & Business
Media, 2007.
[27] B. L. Miller and D. E. Goldberg, “Genetic algorithms, tournament selection, and the effects of noise,”
Complex systems, vol. 9, no. 3, pp. 193–212, 1995.
[28] S. Yang, “Evolutionary computation for dynamic optimization problems,” in Proceedings of the Companion
Publication of the 2015 on Genetic and Evolutionary Computation Conference. ACM, 2015,
pp. 629–649.
[29] L. Fausett, Fundamentals of neural networks: architectures, algorithms, and applications.
Prentice-Hall, Inc., 1994.
[30] T. Kimoto, K. Asakawa, M. Yoda, and M. Takeoka, “Stock market prediction system with modular
neural networks,” in Neural Networks, 1990., 1990 IJCNN International Joint Conference on, June
1990, pp. 1–6 vol.1.
[31] A. Senanayake, “Automated neural-ware system for stock market prediction,” in Cybernetics and
Intelligent Systems, 2004 IEEE Conference on, vol. 2. IEEE, 2004, pp. 1166–1171.
[32] S. Yao, M. Pasquier, and C. Quek, “A foreign exchange portfolio management mechanism based
on fuzzy neural networks.” in IEEE Congress on Evolutionary Computation. IEEE, 2007, pp.
2576–2583. [Online]. Available: http://dblp.uni-trier.de/db/conf/cec/cec2007.html#YaoPQ07
[33] M. Lam, “Neural network techniques for financial performance prediction: integrating fundamental
and technical analysis,” Decision support systems, vol. 37, no. 4, pp. 567–581, 2004.
[34] W. Leigh, R. Purvis, and J. M. Ragusa, “Forecasting the nyse composite index with technical analysis,
pattern recognizer, neural network, and genetic algorithm: a case study in romantic decision
support,” Decision support systems, vol. 32, no. 4, pp. 361–377, 2002.
[35] L. A. Zadeh, “Fuzzy sets,” Information and control, vol. 8, no. 3, pp. 338–353, 1965.
[36] R. Fuller, “Neural fuzzy systems,” 1995.
[37] R. Kuo, L. Lee, and C. Lee, “Integration of artificial neural networks and fuzzy delphi for stock
market forecasting,” in Systems, Man, and Cybernetics, 1996., IEEE International Conference on,
vol. 2. IEEE, 1996, pp. 1073–1078.
[38] O. Maimon and L. Rokach, Data mining and knowledge discovery handbook. Springer, 2005,
vol. 2.
[39] J. Han, J. Pei, and M. Kamber, Data mining: concepts and techniques. Elsevier, 2011.
[40] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, P. S.
Yu et al., “Top 10 algorithms in data mining,” Knowledge and information systems, vol. 14, no. 1,
pp. 1–37, 2008.
[41] E. Hajizadeh, H. D. Ardakani, and J. Shahrabi, “Application of data mining techniques in stock
markets: A survey,” Journal of Economics and International Finance, vol. 2, no. 7, p. 109, 2010.
[42] I. H. Witten and E. Frank, Data Mining: Practical machine learning tools and techniques. Morgan
Kaufmann, 2005.
[43] P. Berkhin, “A survey of clustering data mining techniques,” in Grouping multidimensional data.
Springer, 2006, pp. 25–71.
[44] C. Raposo, C. H. Antunes, and J. P. Barreto, Automatic Clustering Using a Genetic Algorithm
with New Solution Encoding and Operators. Cham: Springer International Publishing, 2014, pp.
92–103. [Online]. Available: http://dx.doi.org/10.1007/978-3-319-09129-7_7
[45] J. C. Dunn, “A fuzzy relative of the isodata process and its use in detecting compact well-separated
clusters,” 1973.
[46] J. C. Bezdek, Pattern recognition with fuzzy objective function algorithms. Springer Science &
Business Media, 2013.
[47] H. Markowitz, “Portfolio selection,” The journal of finance, vol. 7, no. 1, pp. 77–91, 1952.
[48] T. Weise, “Global optimization algorithms-theory and application,” Self-Published, pp. 25–26, 2009.
[49] G. Hassan, “Multiobjective robustness for portfolio optimization in volatile environments,” in Proc.
GECCO ’08. ACM, 2008, pp. 1507–1514.
[50] G. Hassan and C. D. Clack, “Robustness of multiple objective gp stock-picking in unstable
financial markets: Real-world applications track,” in Proceedings of the 11th Annual Conference on
Genetic and Evolutionary Computation, ser. GECCO ’09. New York, NY, USA: ACM, 2009, pp.
1513–1520. [Online]. Available: http://doi.acm.org/10.1145/1569901.1570104
[51] Y. L. Becker, H. Fox, and P. Fei, “An empirical study of multi-objective algorithms for stock ranking,”
in Genetic Programming Theory and Practice V. Springer, 2008, pp. 239–259.
[52] P. Skolpadungket, K. Dahal, and N. Harnpornchai, “Portfolio optimization using multi-objective genetic
algorithms,” in Evolutionary Computation, 2007. CEC 2007. IEEE Congress on, Sept 2007,
pp. 516–523.
[53] D. Lohpetch and D. Corne, “Multiobjective algorithms for financial trading: Multiobjective out-trades
single-objective,” in 2011 IEEE Congress of Evolutionary Computation (CEC). IEEE, 2011, pp.
192–199.
[54] P. Lynch and J. Rothchild, One Up On Wall Street: How To Use What You Already Know To
Make Money In. Simon & Schuster, 2012. [Online]. Available: https://books.google.pt/books?id=TYOdIrFJ2SkC
[55] R. Peachavanish, “Stock selection and trading based on cluster analysis of trend and momentum
indicators,” in Proceedings of the International MultiConference of Engineers and Computer Scientists,
vol. 1, 2016.
[56] L. A. Teixeira and A. L. I. De Oliveira, “A method for automatic stock trading combining technical
analysis and nearest neighbor classification,” Expert systems with applications, vol. 37, no. 10, pp.
6885–6890, 2010.
[57] H. Park, “Emerging market hedge funds in the united states,” Emerging Markets Review,
vol. 22, pp. 25 – 42, 2015. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1566014114000788
[58] M. Ehrgott, K. Klamroth, and C. Schwehm, “An MCDM approach to portfolio optimization,”
European Journal of Operational Research, vol. 155, no. 3, pp. 752 – 770, 2004, Traffic and
Transportation Systems Analysis. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0377221702008810
[59] D. Lohpetch and D. Corne, “Discovering effective technical trading rules with genetic programming:
Towards robustly outperforming buy-and-hold,” in Nature & Biologically Inspired Computing, 2009.
NaBIC 2009. World Congress on. IEEE, 2009, pp. 439–444.
[60] F. Allen and R. Karjalainen, “Using genetic algorithms to find technical trading rules,” Journal of
financial Economics, vol. 51, no. 2, pp. 245–271, 1999.
[61] M. Radeerom, “Automatic trading system based on genetic algorithm and technical analysis for
stock index,” International Journal of Information Processing and Management, vol. 5, no. 4, p. 124,
2014.
[62] C.-H. Cheng, T.-L. Chen, and L.-Y. Wei, “A hybrid model based on rough sets theory and genetic
algorithms for stock price forecasting,” Information Sciences, vol. 180, no. 9, pp. 1610 – 1629,
2010. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S0020025510000289
[63] A. Fernandes, An Evolutionary Computing Approach to Financial Portfolio Management Based on
Growth Stocks & Sector/Industry Distribution. Instituto Superior Tecnico, May 2016.
A List of Stocks used
This work used a subset of 272 out of 500 companies of the S&P500 index. The stock tickers of these
companies are the following:
• MMM
• ABT
• ADBE
• AES
• AET
• AFL
• A
• APD
• AKAM
• AA
• ALL
• MO
• AMZN
• AEE
• AEP
• AIG
• AMP
• ABC
• AMGN
• APH
• ADI
• AON
• AAPL
• AMAT
• ADM
• AIZ
• T
• ADSK
• ADP
• AZO
• AVB
• AVY
• BHI
• BLL
• BCR
• BAX
• BDX
• BBBY
• BBY
• BIIB
• HRB
• BA
• BSX
• CHRW
• CPB
• COF
• CAH
• CBG
• CBS
• CELG
• CNP
• CTL
• CF
• SCHW
• CVX
• CTAS
• CSCO
• CLX
• COH
• CTSH
• CL
• CMA
• CAG
• COP
• ED
• STZ
• GLW
• COST
• CSX
• CMI
• CVS
• DHI
• DRI
• DE
• XRAY
• DOV
• DOW
• DTE
• DUK
• DNB
• ETFC
• EMN
• ETN
• EBAY
• ECL
• EA
• EMR
• EQT
• EQR
• EL
• ES
• EXC
• EXPE
• EXPD
• XOM
• FDX
• FITB
• FSLR
• FLIR
• FLS
• FLR
• FMC
• FTI
• F
• BEN
• FCX
• GPS
• GIS
• GPC
• GILD
• GS
• GT
• GWW
• HAL
• HAR
• HAS
• HCP
• HP
• HON
• HRL
• HPQ
• HUM
• HBAN
• ITW
• IR
• INTC
• IBM
• IPG
• IFF
• INTU
• ISRG
• IVZ
• JCI
• JNPR
• K
• KEY
• KMB
• KLAC
• LH
• LM
• LEG
• LEN
• LLY
• LLTC
• L
• MTB
• MRO
• MAR
• MAS
• MAT
• MKC
• MCD
• MCK
• MJN
• MDT
• MRK
• MET
• MCHP
• MSFT
• TAP
• MDLZ
• MON
• MYL
• NDAQ
• NOV
• NTAP
• NWL
• NEM
• NKE
• NI
• JWN
• NSC
• NOC
• NRG
• NVDA
• ORLY
• OXY
• OMC
• OKE
• ORCL
• OI
• PCAR
• PH
• PDCO
• PAYX
• PEP
• PKI
• PCG
• PM
• PBI
• PNC
• PPG
• PPL
• PX
• PCLN
• PFG
• PG
• PGR
• PEG
• PHM
• PWR
• QCOM
• DGX
• RRC
• RTN
• RHT
• RSG
• RAI
• RHI
• ROK
• COL
• ROP
• R
• CRM
• SLB
• SNI
• SRE
• SHW
• SPG
• SJM
• SNA
• SO
• SWN
• STJ
• SWK
• SPLS
• SBUX
• STT
• SRCL
• SYK
• STI
• SYY
• TROW
• TDC
• TSO
• TXN
• TXT
• HSY
• TRV
• TMO
• TIF
• TWX
• TJX
• TMK
• TSS
• TSN
• UNH
• UPS
• UTX
• UNM
• URBN
• VFC
• VAR
• VTR
• VRSN
• VZ
• VNO
• VMC
• WMT
• DIS
• WM
• ANTM
• WU
• WHR
• WFM
• WEC
• WYN
• WYNN
• XEL
• XRX
• XL
• YHOO