Trading decision trees ( Elaborated by Mohamed DHAOUI )
Tunisia Polytechnic School
Mini-Project Data Analysis
How to Use a Decision Tree in trading
ELABORATED BY:
•MOHAMED DHAOUI
SUPERVISED BY:
•MR WAJDI TEKAYA
Plan
• Method used: Decision Tree
• The Database construction
• R code & interpretations
Method used: Decision Tree
• A visual representation of choices, consequences, probabilities, and opportunities
• A way of breaking complicated situations down into easier-to-understand scenarios
• A simple representation for classifying examples
• Its goal: to create a model that predicts the value of a target based on several input variables
Example: How it works
Predict if John will play tennis

Day   Outlook    Humidity   Wind     Play
D1    Sunny      High       Weak     No
D2    Sunny      High       Strong   No
D3    Overcast   High       Weak     Yes
D4    Rain       High       Weak     Yes
D5    Rain       Normal     Weak     Yes
D6    Rain       Normal     Strong   No
D7    Overcast   Normal     Strong   Yes
D8    Sunny      High       Weak     No
D9    Sunny      Normal     Weak     Yes
D10   Rain       Normal     Weak     Yes
D11   Sunny      Normal     Strong   Yes
D12   Overcast   High       Strong   Yes
D13   Overcast   Normal     Weak     Yes
D14   Rain       High       Strong   No

Training data: 9 Yes / 5 No
New data:
D15   Rain       High       Weak     ?
Example: How it works (first split, on Outlook)
All days: 9 Yes / 5 No
• Outlook = Sunny: 2 Yes / 3 No, split further (D1, D2, D8, D9, D11)
• Outlook = Overcast: 4 Yes / 0 No, pure subset (D3, D7, D12, D13)
• Outlook = Rain: 3 Yes / 2 No, split further (D4, D5, D6, D10, D14)
Example: How it works (final tree)
Outlook
• Sunny: split on Humidity
    High: 0 Yes / 3 No, pure subset (D1, D2, D8), predict NO
    Normal: 2 Yes / 0 No, pure subset (D9, D11), predict YES
• Overcast: 4 Yes / 0 No, pure subset (D3, D7, D12, D13), predict YES
• Rain: split on Wind
    Weak: 3 Yes / 0 No, pure subset (D4, D5, D10), predict YES
    Strong: 0 Yes / 2 No, pure subset (D6, D14), predict NO
New data: D15 Rain High Weak is predicted YES (Outlook = Rain, then Wind = Weak)
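The finished tree can be read as a chain of nested tests. The sketch below hand-codes it in R; the function name classify_play is an illustrative assumption and does not appear in the slides.

```r
# Hand-coded version of the tennis tree (classify_play is a hypothetical helper)
classify_play <- function(outlook, humidity, wind) {
  if (outlook == "Overcast") {
    "Yes"                                     # pure subset: 4 Yes / 0 No
  } else if (outlook == "Sunny") {
    if (humidity == "High") "No" else "Yes"   # Sunny days are split on Humidity
  } else {
    if (wind == "Weak") "Yes" else "No"       # Rain days are split on Wind
  }
}

# The new day D15: Rain, High, Weak
d15 <- classify_play("Rain", "High", "Weak")
print(d15)
```

Humidity is never consulted for D15, because the Rain branch only tests Wind.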
ID3 algorithm
Split(node, {examples})
1/ A: the best attribute for splitting the {examples}
2/ Make A the decision attribute for this node
3/ For each value of A, create a child node
4/ For each child node and its subset of examples:
     if the subset is pure: stop
     else: Split(child_node, {subset})
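As a rough illustration, the recursion above can be sketched in base R on the tennis table. "Best attribute" is taken here to mean highest information gain, the criterion the following slides define; the helper names entropy, gain and id3 are assumptions made for this sketch.

```r
# The 14 training days from the tennis example
tennis <- data.frame(
  Outlook  = c("Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
               "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"),
  Humidity = c("High","High","High","High","Normal","Normal","Normal",
               "High","Normal","Normal","Normal","High","Normal","High"),
  Wind     = c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
               "Weak","Weak","Weak","Strong","Strong","Weak","Strong"),
  Play     = c("No","No","Yes","Yes","Yes","No","Yes",
               "No","Yes","Yes","Yes","Yes","Yes","No"),
  stringsAsFactors = FALSE
)

# Entropy of a vector of class labels
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of splitting data frame df on column attr
gain <- function(df, attr, target) {
  parts <- split(df[[target]], df[[attr]])
  entropy(df[[target]]) -
    sum(sapply(parts, function(s) length(s) / nrow(df) * entropy(s)))
}

# Recursive split: a leaf is a class label, an inner node is a list
id3 <- function(df, attrs, target = "Play") {
  y <- df[[target]]
  # Stop when the subset is pure (or no attribute is left): majority class
  if (length(unique(y)) == 1 || length(attrs) == 0) {
    return(names(which.max(table(y))))
  }
  gains <- sapply(attrs, function(a) gain(df, a, target))
  best <- attrs[which.max(gains)]
  children <- lapply(split(df, df[[best]]), id3,
                     attrs = setdiff(attrs, best), target = target)
  list(attribute = best, children = children)
}

tree <- id3(tennis, c("Outlook", "Humidity", "Wind"))
str(tree)
```

On this table the sketch reproduces the tree from the example: Outlook at the root, Humidity under Sunny, and Wind under Rain.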
How do we select the best attribute?
Split on Outlook (9 Yes / 5 No):
  Sunny: 2 Yes / 3 No    Overcast: 4 Yes / 0 No    Rain: 3 Yes / 2 No
Split on Wind (9 Yes / 5 No):
  Weak: 6 Yes / 2 No    Strong: 3 Yes / 3 No
Which one is better?
Entropy
• S is a sample of training examples
• p+ is the proportion of positive examples in S
• p- is the proportion of negative examples in S
• Entropy measures the impurity of S
• Entropy(S) = H(S) = - p+ * log2(p+) - p- * log2(p-)
• H(S) = 0 if the sample is pure (all + or all -); H(S) = 1 if p+ = p- = 0.5
• Impure set (3 Yes / 3 No):
  H(S) = - (3/6) * log2(3/6) - (3/6) * log2(3/6) = 1
• Pure set (4 Yes / 0 No):
  H(S) = - (4/4) * log2(4/4) - (0/4) * log2(0/4) = 0, using the convention 0 * log2(0) = 0
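The two worked cases can be checked with a few lines of base R. This is a minimal sketch; entropy_counts is a hypothetical helper that takes class counts rather than proportions, and it drops empty classes to apply the convention 0 * log2(0) = 0.

```r
# Entropy from class counts, e.g. c(yes, no); entropy_counts is a hypothetical helper
entropy_counts <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]              # convention: 0 * log2(0) = 0
  -sum(p * log2(p))
}

h_impure <- entropy_counts(c(3, 3))   # 3 Yes / 3 No
h_pure   <- entropy_counts(c(4, 0))   # 4 Yes / 0 No
h_all    <- entropy_counts(c(9, 5))   # the full tennis sample

print(round(c(impure = h_impure, pure = h_pure, all = h_all), 3))
```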
Information gain
• Gain(S, A) = H(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) * H(Sv)
  where Sv is the subset of S for which attribute A takes the value v

Example: splitting on Wind (9 Yes / 5 No) gives Weak (6 Yes / 2 No) and Strong (3 Yes / 3 No)
H(S) = - (9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.94
H(Sweak) = - (6/8) * log2(6/8) - (2/8) * log2(2/8) = 0.81
H(Sstrong) = - (3/6) * log2(3/6) - (3/6) * log2(3/6) = 1
Gain(S, Wind) = H(S) - (8/14) * H(Sweak) - (6/14) * H(Sstrong) = 0.048

Attribute   Gain(S, A)
Outlook     0.25
Humidity    0.15
Wind        0.048

Outlook has the highest gain, so it is chosen as the first split.
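The gain table can be recomputed from the Yes/No counts of each split. In this sketch the helper names entropy_counts and gain_counts are assumptions, and the Humidity counts (High: 3 Yes / 4 No, Normal: 6 Yes / 1 No) are read off the training table rather than quoted from a slide.

```r
# Entropy from class counts (same convention: 0 * log2(0) = 0)
entropy_counts <- function(counts) {
  p <- counts / sum(counts)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# counts: one row per attribute value, columns = (Yes, No)
gain_counts <- function(counts) {
  parent   <- entropy_counts(colSums(counts))     # H(S)
  children <- apply(counts, 1, entropy_counts)    # H(Sv) for each value v
  weights  <- rowSums(counts) / sum(counts)       # |Sv| / |S|
  parent - sum(weights * children)
}

outlook  <- rbind(Sunny = c(2, 3), Overcast = c(4, 0), Rain = c(3, 2))
humidity <- rbind(High = c(3, 4), Normal = c(6, 1))
wind     <- rbind(Weak = c(6, 2), Strong = c(3, 3))

gains <- sapply(list(Outlook = outlook, Humidity = humidity, Wind = wind),
                gain_counts)
print(round(gains, 3))
```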
Advantages and disadvantages
Advantages:
• Simple to understand and interpret
• Allow the addition of new possible scenarios
• Help determine worst, best and expected values for different scenarios
Disadvantages:
o For data with categorical variables that have different numbers of levels, information gain is biased in favor of the attributes with more levels.
o ID3 is a greedy algorithm: it makes the locally optimal choice at each stage, which in general does not produce a globally optimal tree.
o Calculations can get very complex, particularly if many values are uncertain and/or many outcomes are linked.
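The bias toward attributes with many levels is easy to see on the tennis data. If the Day identifier (14 distinct values) were offered as an attribute, every subset would hold a single example and be pure, so its gain would equal the full entropy H(S). A small sketch, with helper names chosen for illustration:

```r
play <- c("No","No","Yes","Yes","Yes","No","Yes",
          "No","Yes","Yes","Yes","Yes","Yes","No")
wind <- c("Weak","Strong","Weak","Weak","Weak","Strong","Strong",
          "Weak","Weak","Weak","Strong","Strong","Weak","Strong")
day  <- paste0("D", 1:14)   # one distinct level per example

entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain of attribute vector x with respect to labels y
gain <- function(x, y) {
  entropy(y) - sum(sapply(split(y, x),
                          function(s) length(s) / length(y) * entropy(s)))
}

gain_day  <- gain(day, play)    # maximal gain, yet useless for new days
gain_wind <- gain(wind, play)
print(round(c(Day = gain_day, Wind = gain_wind), 3))
```

Day wins on information gain although it cannot generalize to any unseen day, which is the bias described above.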
The Database construction
The database is a series of daily OHLCV (Open, High, Low, Close, Volume) records for Bank of America's stock (ticker BAC), retrieved from Yahoo Finance.
O  Opening price: the price at which a security first trades upon the opening of an exchange on a given trading day.
H  Today's high: the highest price at which a stock traded during the course of the day. It is typically higher than the closing or opening price.
L  Today's low: the lowest price at which a stock trades over the course of a trading day. It is typically lower than the opening or closing price.
C  Closing price: the final price at which a security is traded on a given trading day.
V  Volume: the number of shares or contracts traded in a security or an entire market on a given trading day. It counts the shares that change hands from sellers to buyers and serves as a measure of activity.
Analysing stock exchange data requires the calculation of some specific indicators:
• Relative Strength Index - RSI
• Exponential Moving Average - EMA
• Moving Average Convergence Divergence - MACD
• Smart money index - SMI
Relative Strength Index - RSI
A technical momentum indicator that compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions of an asset. It is calculated using the following formula:
RSI = 100 - 100 / (1 + RS)
where RS = average of x days' up closes / average of x days' down closes (here x = 3).
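To make the formula concrete, here is a minimal base-R sketch that applies it literally with simple averages over the last x price changes. The function name rsi_simple and the price series are made up for illustration. TTR's RSI() smooths gains and losses with a Wilder-style exponential average by default, so its values will generally differ from this simplified version.

```r
# RSI with plain averages over the last x changes (rsi_simple is a hypothetical helper)
rsi_simple <- function(close, x = 3) {
  change <- diff(close)
  up     <- pmax(change, 0)     # size of up moves, 0 on down days
  down   <- pmax(-change, 0)    # size of down moves, 0 on up days
  out    <- rep(NA_real_, length(close))
  if (length(change) < x) return(out)
  for (i in seq(x, length(change))) {
    idx <- (i - x + 1):i
    rs  <- mean(up[idx]) / mean(down[idx])   # Inf when the window has no down move
    out[i + 1] <- 100 - 100 / (1 + rs)
  }
  out
}

prices <- c(10, 11, 10.5, 11.5, 12, 11)   # made-up closing prices
rsi <- rsi_simple(prices, x = 3)
print(rsi)
```

The first x values are NA because a full window of x price changes is needed. A window with no down move gives RS = Inf and RSI = 100; a completely flat window gives NaN.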
An asset is deemed to be overbought once the RSI approaches the 70 level, meaning that it may be getting overvalued and is a good candidate for a pullback. Likewise, if the RSI approaches 30, it is an indication that the asset may be getting oversold and therefore likely to become undervalued.
Exponential Moving Average - EMA
A type of moving average that is similar to a simple moving average, except that more weight is given to the latest data. The exponential moving average is also known as "exponentially weighted moving average". The 12- and 26-day EMAs are the most popular short-term averages.
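As an illustration of "more weight is given to the latest data", the sketch below computes an EMA recursively with the commonly used smoothing factor alpha = 2 / (n + 1). Both that factor and the choice to seed the average with the first observation are assumptions of this sketch; library implementations such as TTR's EMA() may seed differently, so early values can differ.

```r
# Recursive EMA (ema_simple is a hypothetical helper; seeding choice is an assumption)
ema_simple <- function(x, n) {
  alpha <- 2 / (n + 1)               # weight of the newest observation
  out <- numeric(length(x))
  out[1] <- x[1]                     # seed with the first observation
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

ema3 <- ema_simple(c(10, 11, 12, 13), n = 3)   # alpha = 0.5
print(ema3)
```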
Moving Average Convergence Divergence - MACD
A trend-following momentum indicator that shows the relationship between two moving averages of prices. The MACD is calculated by subtracting the 26-day exponential moving average (EMA) from the 12-day EMA. A nine-day EMA of the MACD, called the "signal line", is then plotted on top of the MACD, functioning as a trigger for buy and sell signals.
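The definition translates directly into code: a fast EMA minus a slow EMA, then an EMA of that difference. The sketch below uses a simple recursive EMA seeded with the first observation (an assumption), and the names ema_simple and macd_simple are hypothetical. TTR's MACD() reports the difference as a percentage by default, so its numbers are on a different scale.

```r
# Simple recursive EMA with alpha = 2 / (n + 1), seeded with the first value
ema_simple <- function(x, n) {
  alpha <- 2 / (n + 1)
  out <- numeric(length(x))
  out[1] <- x[1]
  for (i in seq_along(x)[-1]) {
    out[i] <- alpha * x[i] + (1 - alpha) * out[i - 1]
  }
  out
}

# MACD line = fast EMA - slow EMA; signal line = EMA of the MACD line
macd_simple <- function(x, fast = 12, slow = 26, signal = 9) {
  macd <- ema_simple(x, fast) - ema_simple(x, slow)
  data.frame(macd = macd, signal = ema_simple(macd, signal))
}

rising <- macd_simple(1:60)        # steadily rising made-up prices
flat   <- macd_simple(rep(5, 40))  # constant prices
print(tail(rising, 3))
```

On a steadily rising series the fast EMA stays above the slow one, so the MACD line is positive; on a flat series both lines are zero.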
Smart money index - SMI
Also called the smart money flow index, this is a technical analysis indicator of investors' sentiment, based on intra-day price patterns. The main idea is that the majority of traders (emotional, news-driven) overreact at the beginning of the trading day because of the overnight news and economic data. There is also a lot of buying on market orders and short covering at the opening.
The basic formula for SMI is:
Today's SMI reading = yesterday's SMI - opening gain or loss + last hour change
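Because each reading builds on the previous one, the formula is a running sum. The sketch below applies it to three invented days; the starting level of 1000 and all daily changes are made-up numbers, used only to show the arithmetic.

```r
# Made-up daily moves in index points (not real market data)
open_change      <- c(5, -3, 2)    # gain (+) or loss (-) during the opening period
last_hour_change <- c(2, 4, -1)    # change during the last trading hour
smi_start        <- 1000           # assumed SMI reading before the first day

# Today's SMI = yesterday's SMI - opening gain or loss + last hour change
smi <- smi_start + cumsum(last_hour_change - open_change)
print(smi)
```

Day 1 gives 1000 - 5 + 2 = 997, day 2 gives 997 + 3 + 4 = 1004, and day 3 gives 1004 - 2 - 1 = 1001.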
If the SMI rises sharply when the market falls, smart money is buying, and the market is expected to revert to an uptrend soon.
The opposite also holds: a rapidly falling SMI during a bullish market means that smart money is selling and that the market is expected to revert to a downtrend soon.
R code & interpretations
Libraries:
quantmod    Fetches the data from Yahoo Finance (it also loads TTR, which provides the indicator functions).
rpart       Contains the algorithms for building decision trees.
rpart.plot  Helps visualize the decision tree.
Getting the data:
library(quantmod)
startDate <- as.Date("2012-01-01")
endDate <- as.Date("2014-01-01")
getSymbols("BAC", src = "yahoo", from = startDate, to = endDate)
Gets the Open, High, Low, Close and Volume data from startDate to endDate and stores it in the object BAC
Calculating the indicators
RSI3 <- RSI(Op(BAC), n = 3) # 3-period Relative Strength Index of the open price
EMA5 <- EMA(Op(BAC), n = 5) # 5-period Exponential Moving Average
EMAcross <- Op(BAC) - EMA5 # Difference between the open price and the 5-period EMA
MACDsignal <- MACD(Op(BAC), nFast = 12, nSlow = 26, nSig = 9)[, 2] # Signal line of the MACD (column 2)
SMI <- SMI(Op(BAC), n = 13, nSlow = 25, nFast = 2, nSig = 9)[, 1] # TTR's SMI() is the Stochastic Momentum Index, not the smart money index
PriceChange <- Cl(BAC) - Op(BAC) # Close minus open for the same day
Constructing the database
Class <- ifelse(PriceChange > 0, "UP", "DOWN") # Binary class: did the day close above its open?
DataSet <- data.frame(RSI3, EMAcross, MACDsignal, SMI, Class) # Create our data set
colnames(DataSet) <- c("RSI3", "EMAcross", "MACDsignal", "Stochastic", "Class") # Name the columns
DataSet <- DataSet[-c(1:33), ] # Drop the first 33 rows, where the indicators are still NA
TrainingSet <- DataSet[1:312, ] # Use the first 2/3 of the data to build the tree
TestSet <- DataSet[313:469, ] # Use the last 1/3 of the data as testing data
The decision tree
library(rpart)
library(rpart.plot)
DecisionTree <- rpart(Class ~ RSI3 + EMAcross + MACDsignal + Stochastic, data = TrainingSet, cp = .001)
• Predict the Class attribute
• Use the indicators RSI3, EMAcross, MACDsignal, Stochastic
• Build the tree from the data in TrainingSet
• cp is the complexity parameter: the minimum relative improvement in fit that a split must bring to be kept (note that rpart measures split quality with the Gini index by default rather than information gain)
prp(DecisionTree, type = 2, extra = 8)
Plot the decision tree
The first decision tree
[Figure: the unpruned decision tree, with 15 splits using the 4 indicators]
Pruning the tree
printcp(DecisionTree) # Shows the cp value and the cross-validated error (xerror) for each tree size
The row with the minimum xerror gives the best cp to use: cp = 0.0272109
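Rather than reading the value off the printed table, the cp with the lowest cross-validated error can be picked from the fitted object's cptable. The sketch below is only an illustration: it uses the kyphosis data set that ships with rpart as a stand-in for the BAC training set, and the variable names are assumptions.

```r
library(rpart)

set.seed(42)   # the cross-validation folds are random, so fix the seed
# Stand-in data: kyphosis ships with rpart (the slides use the BAC TrainingSet)
fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis,
             method = "class", cp = 0.001)

cp_table <- fit$cptable                                     # same table printcp() shows
best_row <- which.min(cp_table[, "xerror"])                 # row with minimum xerror
best_cp  <- cp_table[best_row, "CP"]

pruned <- prune(fit, cp = best_cp)                          # prune back to that size
print(best_cp)
```

The pruned tree can never have more nodes than the tree it was cut from; how many it keeps depends on the data and on the random folds.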
The pruned decision tree
PrunedDecisionTree <- prune(DecisionTree, cp = 0.0272109)
Set the parameter cp to the value that gives the minimum cross-validated error
prp(PrunedDecisionTree, type = 2, extra = 8)
Plot the pruned decision tree
Validating the tree
table(predict(PrunedDecisionTree, TestSet, type = "class"), TestSet[, 5], dnn = list("predicted", "actual"))
81 correct predictions out of 157: about 52% accuracy
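Accuracy is the share of the confusion table that lies on its diagonal. The sketch below shows the calculation on six invented predictions (they are not results from the BAC model) and then re-derives the percentage from the counts reported above.

```r
# Invented labels, only to show the calculation (not results from the BAC model)
lv <- c("DOWN", "UP")
predicted <- factor(c("UP", "DOWN", "UP", "UP", "DOWN", "DOWN"), levels = lv)
actual    <- factor(c("UP", "UP",   "UP", "DOWN", "DOWN", "DOWN"), levels = lv)

confusion <- table(predicted, actual)
accuracy  <- sum(diag(confusion)) / sum(confusion)   # correct / total
print(confusion)
print(accuracy)

# The figure reported for the pruned tree: 81 correct out of 157
reported <- round(100 * 81 / 157, 1)
print(reported)
```

An accuracy close to 50% on a two-class problem is roughly what a coin flip would achieve, so the tree shows little predictive edge on this test set.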
Thank you for your attention