Learning to Trade via Direct Reinforcement
John Moody
International Computer Science Institute, Berkeley
&
J E Moody & Company LLC, Portland
[email protected]@JEMoody.Com
Global Derivatives Trading & Risk Management
Paris, May 2008
What is Reinforcement Learning?
RL Considers:
• A Goal-Directed “Learning” Agent
• interacting with an Uncertain Environment
• that attempts to maximize Reward / Utility
RL is an Active Paradigm:
• Agent “Learns” by “Trial & Error” Discovery
• Actions result in Reinforcement
RL Paradigms:
• Value Function Learning (Dynamic Programming)
• Direct Reinforcement (Adaptive Control)
I. Why Direct Reinforcement?
Direct Reinforcement Learning:
Finds predictive structure in financial data
Integrates Forecasting w/ Decision Making
Balances Risk vs. Reward
Incorporates Transaction Costs
Discover Trading Strategies!
Optimizing Trades based on Forecasts
Indirect Approach:
• Two sets of parameters
• Forecast error is not Utility
• Forecaster ignores transaction costs
• Information bottleneck
Learning to Trade via Direct Reinforcement
Trader Properties:
• One set of parameters
• A single utility function
• U includes transaction costs
• Direct mapping from inputs to actions
Direct RL Trader (USD/GBP): Return_A = 15%, SR_A = 2.3, DDR_A = 3.3
II. Direct Reinforcement: Algorithms & Illustrations
Algorithms:
• Recurrent Reinforcement Learning (RRL)
• Stochastic Direct Reinforcement (SDR)
Illustrations:
• Sensitivity to Transaction Costs
• Risk-Averse Reinforcement
Learning to Trade via Direct Reinforcement
DR Trader:
• Recurrent policy (Trading signals, Portfolio weights): $F_t = F(\theta_t; F_{t-1}, I_t)$
• Takes action, Receives reward (Trading Return w/ Transaction Costs): $R_t(F_{t-1}, F_t; S_t, \delta)$
• Causal performance function (Generally path-dependent): $U_t(R_t, R_{t-1}, \ldots, R_1)$
• Learn policy $F(\theta_t; F_{t-1}, I_t)$ by varying $\theta_t$
GOAL: Maximize performance $U_T$ or marginal performance $D_t \equiv \Delta U_t = U_t - U_{t-1}$
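Recurrent Reinforcement Learning (RRL) (Moody & Wu 1997)
Deterministic gradient (batch):
$$\frac{dU_T(\theta)}{d\theta} = \sum_{t=1}^{T} \frac{dU_T}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta} \right\}$$
with recursion:
$$\frac{dF_t}{d\theta} = \frac{\partial F_t}{\partial \theta} + \frac{dF_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta}$$
Stochastic gradient (on-line):
$$\frac{dU_t(\theta_t)}{d\theta_t} \approx \frac{dU_t}{dR_t} \left\{ \frac{dR_t}{dF_t}\frac{dF_t}{d\theta_t} + \frac{dR_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}} \right\}$$
stochastic recursion:
$$\frac{dF_t}{d\theta_t} \approx \frac{\partial F_t}{\partial \theta_t} + \frac{dF_t}{dF_{t-1}}\frac{dF_{t-1}}{d\theta_{t-1}}$$
Stochastic parameter update (on-line):
$$\Delta\theta_t = \rho \, \frac{dU_t(\theta_t)}{d\theta_t}$$
Constant $\rho$: adaptive learning. Declining $\rho$: stochastic approx.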
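The on-line RRL update can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes a single-layer tanh trader with continuous position F_t in (-1, 1), inputs made of the last `m` returns, the reward R_t = F_{t-1} r_t - delta |F_t - F_{t-1}|, and the simplest utility U_t = R_t (so dU_t/dR_t = 1). The function name `rrl_online` and all parameter values are hypothetical.

```python
import numpy as np

def rrl_online(returns, m=3, delta=0.001, rho=0.1, seed=0):
    """On-line RRL sketch: theta = [w_1..w_m, u, b], F_t = tanh(theta . [x_t, F_{t-1}, 1])."""
    r = np.asarray(returns, dtype=float)
    rng = np.random.default_rng(seed)
    theta = 0.01 * rng.standard_normal(m + 2)
    F_prev = 0.0
    dF_prev = np.zeros_like(theta)               # dF_{t-1}/dtheta
    rewards = []
    for t in range(m - 1, len(r)):
        x = r[t - m + 1:t + 1]                   # I_t: the m most recent returns, ending at r_t
        z = np.concatenate([x, [F_prev, 1.0]])   # features multiplying theta
        F = np.tanh(theta @ z)
        # Stochastic recursion: dF_t/dtheta ~ partial F_t/partial theta + (dF_t/dF_{t-1}) dF_{t-1}/dtheta
        dF = (1.0 - F ** 2) * (z + theta[m] * dF_prev)
        s = np.sign(F - F_prev)
        R = F_prev * r[t] - delta * abs(F - F_prev)
        dR_dF = -delta * s                       # dR_t/dF_t
        dR_dFprev = r[t] + delta * s             # dR_t/dF_{t-1}
        grad = dR_dF * dF + dR_dFprev * dF_prev  # assumes dU_t/dR_t = 1
        theta = theta + rho * grad               # parameter update with constant rho
        F_prev, dF_prev = F, dF
        rewards.append(R)
    return theta, np.array(rewards)

# Synthetic white-noise returns, used only to exercise the code.
rng = np.random.default_rng(1)
r = 0.01 * rng.standard_normal(500)
theta, rewards = rrl_online(r)
```

Replacing the assumed dU_t/dR_t = 1 with the derivative of a risk-adjusted measure changes only the line that forms `grad`.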
Structure of Traders
• Single Asset
  - Price series $z_t$
  - Return series $r_t = z_t - z_{t-1}$
• Traders
  - Discrete position size $F_t \in \{-1, 0, 1\}$
  - Recurrent policy $F_t = F(\theta_t; F_{t-1}, I_t)$
• Observations: $I_t = \{z_t, z_{t-1}, z_{t-2}, \ldots; y_t, y_{t-1}, y_{t-2}, \ldots\}$
  – Full system State is not known
• Simple Trading Returns and Profit:
  $R_t = F_{t-1} r_t - \delta \, |F_t - F_{t-1}|$
  $P_T = \sum_{t=1}^{T} R_t$
• Transaction Costs: represented by $\delta$.
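A small worked example of the return and profit definitions, as a sketch. The function name `trading_returns`, the price path, the positions, and the cost rate are made up for illustration.

```python
import numpy as np

def trading_returns(prices, positions, delta):
    """Per-period returns R_t = F_{t-1} * r_t - delta * |F_t - F_{t-1}|.

    prices    : z_0 .. z_T (length T + 1)
    positions : F_0 .. F_T (length T + 1), each in {-1, 0, 1}
    delta     : transaction cost per unit change of position
    """
    z = np.asarray(prices, dtype=float)
    F = np.asarray(positions, dtype=float)
    r = np.diff(z)                   # r_t = z_t - z_{t-1}, t = 1..T
    turnover = np.abs(np.diff(F))    # |F_t - F_{t-1}|
    return F[:-1] * r - delta * turnover

prices = [100.0, 101.0, 100.5, 102.0]   # hypothetical price series
positions = [0, 1, 1, -1]               # flat, long, long, short
R = trading_returns(prices, positions, delta=0.1)
P_T = R.sum()                           # cumulative profit
```

Here the reversal from long to short in the last period changes the position by two units, so it pays twice the cost of the initial entry.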
Risk-Averse Reinforcement: Financial Performance Measures
Performance Functions:
• Path independent: $U_t = U(W_t)$ (Standard Utility Functions)
• Path dependent: $U_t = U(R_t, R_{t-1}, \ldots, W_0)$
Performance Ratios:
• Sharpe Ratio: $\dfrac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)}$
• Downside Deviation Ratio: $\dfrac{\text{Average}(R_t)}{\text{Downside Deviation}(R_t)}$
For Learning:
• Per-Period Returns: $R_t$
• Marginal Performance: $D_t \equiv \Delta U_t = U_t - U_{t-1}$, e.g. Differential Sharpe Ratio.
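The two performance ratios can be computed directly from a return series. The sketch below makes several assumptions not stated on the slide: no risk-free rate is subtracted, the standard deviation is the population form, and downside deviation is the root mean square of the negative returns (target return of zero). The sample returns are hypothetical.

```python
import numpy as np

def sharpe_ratio(returns):
    """Average(R_t) / Standard Deviation(R_t), population standard deviation."""
    R = np.asarray(returns, dtype=float)
    return R.mean() / R.std()

def downside_deviation_ratio(returns):
    """Average(R_t) / Downside Deviation(R_t), penalizing only negative returns."""
    R = np.asarray(returns, dtype=float)
    dd = np.sqrt(np.mean(np.minimum(R, 0.0) ** 2))
    return R.mean() / dd

R = np.array([0.02, -0.01, 0.03, -0.02, 0.01])   # hypothetical per-period returns
sr = sharpe_ratio(R)
ddr = downside_deviation_ratio(R)
```

Because the downside deviation ignores positive returns, it is never larger than the root mean square of the full series, which is why the DDR rewards upside variability that the Sharpe ratio penalizes.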
Long / Short Trader Simulation: Sensitivity to Transaction Costs
• Learns from scratch and on-line
• Moving average Sharpe Ratio with $\eta = 0.01$
Trader Simulation: Transaction Costs vs. Performance
100 Runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Sharpe Ratio; Trading Frequency]
Minimizing Downside Risk: Artificial Price Series w/ Heavy Tails
Comparison of Risk-Averse Traders: Underwater Curves
Comparison of Risk-Averse Traders: Draw-Downs
III. Direct Reinforcement vs. Dynamic Programming
Algorithms:
• Value Function Method (Q-Learning)
• Direct Reinforcement Learning (RRL)
Illustration:
• Asset Allocation: S&P 500 & T-Bills
• RRL vs. Q-Learning
RL Paradigms Compared
Value Function Learning
• Origins: Dynamic Programming
• Learn “optimal” Q-Function
  Q: state × action → value
• Solve Bellman’s Equation
Action: $a = \arg\max_b \hat{Q}(x, b, \theta)$ (“Indirect”)
Direct Reinforcement
• Origins: Adaptive Control
• Learn “good” Policy P
  P: observations → p(action)
• Optimize “Policy Gradient”
Action: $a = \hat{P}(\text{obs}, \theta)$ (“Direct”)
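The difference in how the two paradigms produce an action can be shown in a few lines. This is only a sketch of the action-selection step, not of either learning algorithm; the value estimates, the policy output, and the function names are hypothetical.

```python
import numpy as np

def act_value_based(q_values):
    """Indirect: pick the action b whose estimated value Q(x, b) is largest."""
    return int(np.argmax(q_values))

def act_direct(action_probs, rng):
    """Direct: draw an action from the policy's distribution p(action | obs)."""
    return int(rng.choice(len(action_probs), p=action_probs))

q_hat = np.array([0.1, 0.7, 0.2])   # hypothetical value estimates for one state
p_hat = np.array([0.2, 0.5, 0.3])   # hypothetical policy output for one observation
rng = np.random.default_rng(0)
a_indirect = act_value_based(q_hat)
a_direct = act_direct(p_hat, rng)
```

In the value-based case the policy is implied by the learned Q-function; in the direct case the learned object is the policy itself.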
S&P-500 / T-Bill Asset Allocation: Maximizing the Differential Sharpe Ratio
S&P-500: Opening Up the Black Box
85 series: Learned relationships are nonstationary over time
Closing Remarks
• Direct Reinforcement Learning:
  – Discovers Trading Opportunities in Markets
  – Integrates Forecasting w/ Trading
  – Maximizes Risk-Adjusted Returns
  – Optimizes Trading w/ Transaction Costs
• Direct Reinforcement Offers Advantages Over:
  – Trading based on Forecasts (Supervised Learning)
  – Dynamic Programming RL (Value Function Methods)
• Illustrations:
  – Controlled Simulations
  – FX Currency Trader
  – Asset Allocation: S&P 500 vs. Cash
Selected References:
[1] John Moody and Lizhong Wu. Optimization of trading systems and portfolios. Decision Technologies for Financial Engineering, 1997.
[2] John Moody, Lizhong Wu, Yuansong Liao, and Matthew Saffell. Performance functions and reinforcement learning for trading systems and portfolios. Journal of Forecasting, 17:441-470, 1998.
[3] Jonathan Baxter and Peter L. Bartlett. Direct gradient-based reinforcement learning: Gradient estimation algorithms. 2001.
[4] John Moody and Matthew Saffell. Learning to trade via direct reinforcement. IEEE Transactions on Neural Networks, 12(4):875-889, July 2001.
[5] Carl Gold. FX Trading via Recurrent Reinforcement Learning. Proceedings of IEEE CIFEr Conference, Hong Kong, 2003.
[6] John Moody, Y. Liu, M. Saffell and K.J. Youn. Stochastic Direct Reinforcement: Application to Simple Games with Recurrence. In Artificial Multiagent Learning, Sean Luke et al. eds, AAAI Press, 2004.
Supplemental Slides
• Differential Sharpe Ratio
• Portfolio Optimization
• Stochastic Direct Reinforcement (SDR)
Maximizing the Sharpe Ratio
Sharpe Ratio:
$$S_T = \frac{\text{Average}(R_t)}{\text{Standard Deviation}(R_t)}$$
Exponential Moving Average Sharpe Ratio:
$$S(t) = \frac{A_t}{K_\eta \, (B_t - A_t^2)^{1/2}}$$
with time scale $\eta^{-1}$ and
$$A_t = A_{t-1} + \eta \, (R_t - A_{t-1}), \qquad B_t = B_{t-1} + \eta \, (R_t^2 - B_{t-1}), \qquad K_\eta = \left( \frac{1 - \eta/2}{1 - \eta} \right)^{1/2}$$
Motivation:
• EMA Sharpe ratio emphasizes recent patterns;
• can be updated incrementally.
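The incremental update is straightforward to code. The sketch below follows the recursions for A_t and B_t and the normalizing factor K_eta; the zero initial values and the guard that returns 0.0 when the variance estimate is not positive are assumptions added here, and the function name `ema_sharpe` is hypothetical.

```python
import math

def ema_sharpe(returns, eta, A0=0.0, B0=0.0):
    """Return the sequence S(t) of exponential-moving-average Sharpe ratios."""
    K = math.sqrt((1.0 - eta / 2.0) / (1.0 - eta))   # normalizing factor K_eta
    A, B = A0, B0
    out = []
    for R in returns:
        A += eta * (R - A)          # EMA of returns
        B += eta * (R * R - B)      # EMA of squared returns
        var = B - A * A
        # Guard (an assumption of this sketch): report 0.0 if variance is not positive.
        out.append(A / (K * math.sqrt(var)) if var > 0 else 0.0)
    return out

S = ema_sharpe([0.02, -0.01, 0.03], eta=0.01)   # hypothetical returns
```

Each step costs O(1) time and memory, since only A and B are carried forward.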
Differential Sharpe Ratio for Adaptive Optimization
Expand $S(t)$ to first order in $\eta$:
$$S(t) \approx S(t-1) + \eta \left. \frac{dS(t)}{d\eta} \right|_{\eta=0} + O(\eta^2).$$
Define Differential Sharpe Ratio as:
$$D(t) \equiv \frac{dS(t)}{d\eta} = \frac{B_{t-1} \, \Delta A_t - \frac{1}{2} A_{t-1} \, \Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
where
$$\Delta A_t = R_t - A_{t-1}, \qquad \Delta B_t = R_t^2 - B_{t-1}$$
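The closed form for D(t) can be checked numerically against a finite difference in eta. The sketch below assumes the normalizing factor K_eta is treated as a constant (it is dropped from the ratio), which is what the closed form corresponds to; the numerical values of A_{t-1}, B_{t-1}, and R_t are hypothetical.

```python
import math

def differential_sharpe(R, A_prev, B_prev):
    """D(t) = (B_{t-1} dA - 0.5 A_{t-1} dB) / (B_{t-1} - A_{t-1}^2)^{3/2}."""
    dA = R - A_prev
    dB = R * R - B_prev
    return (B_prev * dA - 0.5 * A_prev * dB) / (B_prev - A_prev ** 2) ** 1.5

def sharpe_after_update(R, A_prev, B_prev, eta):
    """Un-normalized EMA Sharpe ratio after one update with adaptation rate eta."""
    A = A_prev + eta * (R - A_prev)
    B = B_prev + eta * (R * R - B_prev)
    return A / math.sqrt(B - A * A)

A_prev, B_prev, R = 0.01, 0.0005, 0.03      # hypothetical state and new return
D = differential_sharpe(R, A_prev, B_prev)

# Finite-difference estimate of dS/d(eta) at eta = 0.
eta = 1e-6
fd = (sharpe_after_update(R, A_prev, B_prev, eta)
      - sharpe_after_update(R, A_prev, B_prev, 0.0)) / eta
```

With these numbers the closed form and the finite difference should agree closely, which is a quick sanity check before using D(t) as a learning signal.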
Learning with the Differential SR
Evaluate “Marginal Utility” Gradient:
$$\frac{dD(t)}{dR_t} = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}$$
Motivation for DSR:
• isolates contribution of $R_t$ to $U_t$ (“marginal utility”);
• provides interpretability;
• adapts to changing market conditions;
• facilitates efficient on-line learning (stochastic optimization).
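The marginal utility gradient follows from differentiating the Differential Sharpe Ratio with respect to $R_t$ while holding $A_{t-1}$ and $B_{t-1}$ fixed. A short derivation, written out as a sketch using the definitions $\Delta A_t = R_t - A_{t-1}$ and $\Delta B_t = R_t^2 - B_{t-1}$:

```latex
\begin{align*}
D(t) &= \frac{B_{t-1}\,\Delta A_t - \tfrac{1}{2} A_{t-1}\,\Delta B_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}, \\
\frac{d\,\Delta A_t}{dR_t} &= 1, \qquad \frac{d\,\Delta B_t}{dR_t} = 2 R_t, \\
\frac{dD(t)}{dR_t} &= \frac{B_{t-1}\cdot 1 - \tfrac{1}{2} A_{t-1}\cdot 2 R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}
                   = \frac{B_{t-1} - A_{t-1} R_t}{(B_{t-1} - A_{t-1}^2)^{3/2}}.
\end{align*}
```

The gradient vanishes at $R_t = B_{t-1}/A_{t-1}$. When $A_{t-1} > 0$, returns larger than this value lower $D(t)$, because their contribution to the variance estimate outweighs their contribution to the mean.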
Trader Simulation: Transaction Costs vs. Performance
100 runs; Costs = 0.2%, 0.5%, and 1.0%
[Figure panels: Trading Frequency; Cumulative Profit; Sharpe Ratio]
Portfolio Optimization (3 Securities)
Stochastic Direct Reinforcement: Probabilistic Policies
Learning to Trade
• Single Asset
  - Price series $z_t$
  - Return series $r_t = z_t - z_{t-1}$
• Trader
  - Discrete position size $a_t \in \{-1, 0, 1\}$
  - Recurrent policy $a_t = P(\theta_t; a_{t-1}, I_t)$
• Observations: $I_t = \{r_t, r_{t-1}, r_{t-2}, \ldots; i_t, i_{t-1}, i_{t-2}, \ldots\}$
  – Full system State is not known
• Simple Trading Returns and Profit:
  $R_t = a_{t-1} r_t - \delta \, |a_t - a_{t-1}|$
  $P_T = \sum_{t=1}^{T} R_t$
• Transaction cost rate $\delta$.
Why does Reinforcement need Recurrence?
Consider a learning agent with stochastic policy function
$$P(a_t;\; o_{t-1}, o_{t-2}, \ldots;\; a_{t-1}, a_{t-2}, \ldots)$$
whose inputs include recent observations o and actions a.
Why should past actions (recurrence) be included?
Examples:
• Games (observations o are opponent’s actions): Model opponent’s responses o to previous actions a
• Trading financial markets: Minimize transaction costs, market impact
In General:
Recurrence enables discovery of better policies that capture an agent’s impact on the world!
Stochastic Direct Reinforcement (SDR): Maximize Performance
Expected total performance of a sequence of T actions
$$U_T = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} p(a_t \mid H_{t-1}) \, p(H_{t-1})$$
Maximize performance via direct gradient ascent: $\dfrac{dU_t}{d\theta_t}$
Must evaluate total policy gradient
$$\frac{d}{d\theta} p(a_t) = \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1}) \, p(H_{t-1}) \right\}$$
for a policy represented by $P(a_t; \theta; H_{t-1}^{(n,m)})$ with $H_{t-1}^{(n,m)} = (O_{t-1}^{(n)}, A_{t-1}^{(m)})$
Stochastic Direct Reinforcement (SDR): Maximize Performance
The goal of SDR is to maximize expected total performance of a sequence of T actions
$$U_T = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} p(a_t \mid H_{t-1}) \, p(H_{t-1})$$
via direct gradient ascent
$$\frac{dU_T}{d\theta} = \sum_{t=1}^{T} \sum_{a_t} u_t(a_t) \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1}) \, p(H_{t-1}) \right\}$$
Must evaluate
$$\frac{d}{d\theta} p(a_t) = \sum_{H_{t-1}} \frac{d}{d\theta} \left\{ p(a_t \mid H_{t-1}) \, p(H_{t-1}) \right\}$$
for a policy represented by $P(a_t; \theta; O_{t-1}^{(n)}, A_{t-1}^{(m)})$
Notation: The complete history is denoted $H_t = (O_t, A_t)$. $H_t^{(n,m)} = (O_t^{(n)}, A_t^{(m)})$ is a partial history of length (n,m).
Stochastic Direct Reinforcement: First Order Recurrent Policy Gradient
For first order recurrence (m=1), conditional action probability is given by the policy:
$$p(a_t \mid a_{t-1}) = P(a_t; \theta; O_{t-1}^{(n)}, a_{t-1})$$
The probabilities of current actions depend upon the probabilities of prior actions:
$$p(a_t) = \sum_{a_{t-1}} p(a_t \mid a_{t-1}) \, p(a_{t-1})$$
The total (recurrent) policy gradient is computed as:
$$\frac{dp(a_t)}{d\theta} = \sum_{a_{t-1}} \left\{ \frac{\partial p(a_t \mid a_{t-1})}{\partial \theta} \, p(a_{t-1}) + p(a_t \mid a_{t-1}) \, \frac{dp(a_{t-1})}{d\theta} \right\}$$
with partial (naïve) policy gradient:
$$\frac{\partial p(a_t \mid a_{t-1})}{\partial \theta} = \frac{\partial}{\partial \theta} P(a_t; \theta; O_{t-1}^{(n)}, a_{t-1})$$
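The first-order recursion for p(a_t) and its total gradient can be propagated forward in time and checked against finite differences. The sketch below is illustrative only: it assumes two actions, a logistic policy with parameters theta = (bias, weight on previous action, weight on a single observation), and a fixed initial action distribution that does not depend on theta. None of these choices come from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cond_probs(theta, obs):
    """C[j, i] = p(a_t = i | a_{t-1} = j); dC[j, i, k] is its partial derivative wrt theta_k."""
    C = np.zeros((2, 2))
    dC = np.zeros((2, 2, 3))
    for j in (0, 1):
        z = np.array([1.0, float(j), obs])   # features: bias, previous action, observation
        p1 = sigmoid(theta @ z)
        C[j] = [1.0 - p1, p1]
        dp1 = p1 * (1.0 - p1) * z            # partial (naive) policy gradient
        dC[j, 0] = -dp1
        dC[j, 1] = dp1
    return C, dC

def action_probs_and_grad(theta, observations, p0):
    """Propagate p(a_t) and the total gradient dp(a_t)/dtheta forward through time."""
    p = np.asarray(p0, dtype=float)
    dp = np.zeros((2, 3))                    # p(a_0) assumed independent of theta
    for obs in observations:
        C, dC = cond_probs(theta, obs)
        p_new = p @ C                        # sum_j p(a_t | j) p(j)
        # sum_j { partial p(a_t | j) * p(j) + p(a_t | j) * dp(j)/dtheta }
        dp_new = np.einsum("j,jik->ik", p, dC) + np.einsum("ji,jk->ik", C, dp)
        p, dp = p_new, dp_new
    return p, dp

theta = np.array([0.2, -0.5, 1.0])           # hypothetical parameters
obs_seq = [0.3, -0.1, 0.4]                   # hypothetical observations
p, dp = action_probs_and_grad(theta, obs_seq, p0=[0.5, 0.5])

# Central finite differences as an independent check of the recursion.
eps = 1e-6
fd = np.zeros_like(dp)
for k in range(3):
    step = np.zeros(3)
    step[k] = eps
    p_plus, _ = action_probs_and_grad(theta + step, obs_seq, [0.5, 0.5])
    p_minus, _ = action_probs_and_grad(theta - step, obs_seq, [0.5, 0.5])
    fd[:, k] = (p_plus - p_minus) / (2 * eps)
```

Dropping the second `einsum` term gives the naive gradient, which ignores how theta shaped the distribution over the previous action.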
SDR Trader Simulation w/ Transaction Costs
Trading Frequency vs. Transaction Costs
[Figure panels: Recurrent SDR; Non-Recurrent]
Sharpe Ratio vs. Transaction Costs
[Figure panels: Recurrent SDR; Non-Recurrent]