Confidential and Proprietary - © Cerebellum Capital, 2009-2011
Predictive Analytics
in the Land of the Vampire Squid
Dr. David Andre
CEO
Cerebellum Capital
dandre at cerebellumcapital.com
Predictive Analytics World, 2011 San Francisco
Overview
• Perils and opportunities of Wall Street
• Seven ways to get it very wrong
• Our approach
• Takeaways
AI has come a long way (Watson):
Computers tend to win in the end when the contests are heavily
influenced by speed, memory, and probability calculations.
So…Is The Finance Domain Worth the Effort?
• Asset Management is a $20 Trillion - $50 Trillion Domain,
depending on how you count.
• Machines are taking over trading at a fantastic rate.
• Many excellent algorithms exist for high-frequency trading,
many using predictive analytics in a relatively simple way.
• Yet there is a widespread belief that computers can’t
design meaningful new strategies – this belief is oddly
ubiquitous in the field of quantitative finance.
• Furthermore, Finance is nearly a perfect test-bed for AI.
Finance seems deep & rich for AI
• Huge datasets in both time & companies
• Huge amounts of data about companies are flooding the web
• Easily measured metrics for success
• Huge rewards if you get it right
• Nearly every aspect of AI can be useful:
– Machine Learning (prediction & estimation)
– Planning (trade planning)
– Optimization (Portfolio optimization)
– Knowledge Engineering
– Text & Speech understanding
Wait – aren’t the markets efficient? Maybe, unless:
• You have better info or can use the
info to make better predictions
• You can reach the right answer
faster or you can trade faster
• Markets or investors allow a bookie
to be a middle man, making both
sides happy and taking a cut
• You’re first to the inefficiency!
Four ways being smart helps funds win
• Deducing (sometimes using data available only to a few)
who’s buying/selling what/when allows high-frequency
front-running.
• Predicting who will buy/sell based on recent price moves
is technical trading.
• Predicting relationships between assets that are semi-
constant yields statistical arbitrage and pairs trades.
• Better estimating a company’s true value yields
longer-term profits.
Seven traps of using predictive analytics on Wall St.
1. Lemons & Butter
2. Know thyself (or how to overfit without really trying)
3. Broker is not a noun, it’s an adjective
4. The test set must be IID and drawn from the same
distribution…
5. What, my reality isn’t the world’s?!?
6. The big kahuna
7. Time moves forward one moment at a time
Know thyself
• Most Predictive Analytics researchers know well to test
out-of-sample
• However, you have to test the whole system out-of-
sample (including the humans) to really get a fair test
• A canonical way to get this wrong is to run the whole
process more than once (even if it includes cross-validation
and out-of-sample testing).
• Another is to throw away ideas that don’t work.
Great way to fool yourself
1. Train/learn in sample.
2. Test out of sample.
3. Does this out-of-sample result look good? YES: keep the most recently added input stream. NO: discard the most recently added input stream.
4. Do the out-of-sample results always look good? YES: done. NO: add a new input stream and continue from step 1.
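A small simulation can make this trap concrete. The sketch below is a simplified stand-in for the loop on this slide, not the speaker's code: it tries many candidate input streams that are pure noise, keeps the one that scores best on a reused "out-of-sample" set, and then scores that same pick on untouched data. The function names, the correlation score, and all sizes are illustrative assumptions.

```python
import random
import statistics

def correlation(xs, ys):
    """Pearson correlation of two equal-length lists."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def best_of_many_noise_streams(n_streams=200, n_test=50, n_fresh=50, seed=0):
    """Pick the noise stream that looks best on a reused test set.

    Returns (score on the reused test set, score on untouched data).
    """
    rng = random.Random(seed)
    # The target is pure noise, so no stream has real predictive power.
    test_target = [rng.gauss(0, 1) for _ in range(n_test)]
    fresh_target = [rng.gauss(0, 1) for _ in range(n_fresh)]
    best = None
    for _ in range(n_streams):
        # Each stream is independent noise in both periods (an assumption).
        test_stream = [rng.gauss(0, 1) for _ in range(n_test)]
        fresh_stream = [rng.gauss(0, 1) for _ in range(n_fresh)]
        score = abs(correlation(test_stream, test_target))
        if best is None or score > best[0]:
            best = (score, abs(correlation(fresh_stream, fresh_target)))
    return best

reused_score, fresh_score = best_of_many_noise_streams()
print(f"score on the reused test set: {reused_score:.2f}")
print(f"score on untouched data:      {fresh_score:.2f}")
```

Because the test set was consulted once per candidate, the winning score mostly measures how many candidates were tried, which is why the whole selection loop needs its own out-of-sample check.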
Broker?
• Simple trading algorithms offered by most brokers are
easily manipulated by the HFT community.
• If you aren’t the shark, you are getting eaten.
• The rules are set up to favor the bankers.
• Brokers all have different symbology.
• Trades do get messed up, and usually not in your favor.
• The big brokers won’t take you if you’re not big, and it’s
hard to get big without the advantages they offer.
The same distribution…
• For the guarantees of theoretical ML to hold, the test set
must be drawn from the same distribution as the training set.
• This is seldom less true than in finance.
• Sarbanes-Oxley?
• New scrutiny and regulation on shorting?
• Context matters, e.g. tightness of credit:
• Other funds/traders are figuring things out so alpha
changes over time in a very complex way…
What? I can’t trade for free?
• Most academic or online sources present alpha without
taking into account real costs
• Costs to borrow money and obtain leverage
• Minimum commission costs
• Slippage is real, especially in non-S&P 100 names
• Costs to borrow shares for shorting can be very high
(and some assets can’t be borrowed at all)
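As a rough illustration of how these costs eat into a paper result, the sketch below subtracts them from a gross annual return. The cost model, the function names, and every number are hypothetical assumptions chosen for the arithmetic, not figures from the talk.

```python
def commission(shares, per_share_fee, minimum_fee):
    """Commission for one order, subject to a per-order minimum."""
    return max(minimum_fee, shares * per_share_fee)

def net_annual_return(gross_return, turnover, commission_rate, slippage_rate,
                      short_fraction, borrow_rate, leverage, financing_rate):
    """Annual return as a fraction of capital after simple trading frictions.

    turnover: value traded per year as a multiple of capital.
    short_fraction: short exposure as a multiple of capital.
    leverage: gross exposure as a multiple of capital.
    """
    trading_cost = turnover * (commission_rate + slippage_rate)
    borrow_cost = short_fraction * borrow_rate            # stock-loan fees
    financing_cost = max(leverage - 1.0, 0.0) * financing_rate
    return gross_return - trading_cost - borrow_cost - financing_cost

# Hypothetical strategy: 12% gross, trading 20x capital per year.
net = net_annual_return(gross_return=0.12, turnover=20,
                        commission_rate=0.0005, slippage_rate=0.002,
                        short_fraction=0.5, borrow_rate=0.03,
                        leverage=2.0, financing_rate=0.02)
print(f"net return: {net:.3f}")
print(f"small order commission: {commission(100, 0.005, 1.0):.2f}")
```

With these made-up inputs, trading costs take 5 points, stock borrow 1.5 points, and financing 2 points, leaving 3.5% of the original 12%.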
[Figure: two panels, “Without realistic trading constraints” and “With realistic trading constraints”]
The Big Kahuna
• You may not be as clever as you think, and lots of other
hedge funds might be trading on very similar names.
• If they have to get out of their positions (say, due to a
liquidity crisis), your positions can get pounded.
• If you can manage to hang in there, it can be worth it – but
you have to be willing to accept 20 or 30% down!
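The "20 or 30% down" figure is a drawdown: the fall from a running peak in account value. A minimal sketch of how it might be measured is below; the equity values are made up for illustration.

```python
def max_drawdown(equity_curve):
    """Largest peak-to-trough decline, as a fraction of the peak."""
    peak = equity_curve[0]
    worst = 0.0
    for value in equity_curve:
        peak = max(peak, value)                    # highest value seen so far
        worst = max(worst, (peak - value) / peak)  # decline from that peak
    return worst

# Hypothetical account values over time.
equity = [100, 110, 121, 95, 88, 104, 130]
print(f"max drawdown: {max_drawdown(equity):.1%}")
```

Here the worst decline runs from 121 down to 88, about 27%, even though the account ends at a new high.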
Time moves forward one moment at a time
• It is remarkably easy to time-travel.
• And, despite there being “so much data”, there is no way
to get it faster – it comes only one minute a minute, one
day a day.
• Time travel is letting any information from the future get
into the simulated past.
• This can include even things like processor speed, so it’s
very difficult to police.
Some examples of time-travel
• Survivor bias – can’t throw out Enron or Lehman.
• Using a modern computer? They didn’t have that in 1995.
• Using a cleaned dataset? When did it get cleaned?
• Stock universes? Are you selecting stocks based on
inclusion in the S&P 500 today and looking at their past?
• Throwing out models that used to look great
• Trading as if you could know the current price right now.
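The last example is easy to reproduce in a toy backtest. The sketch below, using made-up noise returns and a hypothetical `backtest` helper, trades the sign of a return: with `lag=0` it peeks at the very return it is trading, and with `lag=1` it uses only the previous period.

```python
import random

def backtest(returns, lag):
    """Trade the sign of a past return; lag=0 peeks at the return being traded."""
    pnl = 0.0
    for t in range(1, len(returns)):
        signal = 1.0 if returns[t - lag] > 0 else -1.0
        pnl += signal * returns[t]
    return pnl

rng = random.Random(1)
returns = [rng.gauss(0, 0.01) for _ in range(2500)]  # pure noise, no real edge

peeking = backtest(returns, lag=0)  # time travel: knows the price "now, now"
honest = backtest(returns, lag=1)   # only uses information already available
print(f"peeking P&L: {peeking:.2f}")
print(f"honest P&L:  {honest:.2f}")
```

The peeking version collects the absolute value of every return, so it looks spectacular on data with no signal at all, while the lagged version has nothing to exploit.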
Our approach
• Focus on the long tail and automate discovery, not just
tuning
• Search over strategy space
• Hybrid of humans and computers
• Time safety
• Bogosity detectors.
A Decomposition of the Challenge
Raw data streams → features (hundreds) → predictors (hundreds) → strategies (hundreds) → allocation → trade
• Feature 1 of Data Stream 1 → Predictor 1 → Strategy 1
• …
• Feature K of Data Stream T → Predictor K → Strategy K
• …
• Feature Y of Data Stream Z → Predictor N → Strategy W
• Allocate to the best strategies so as to maximize returns, minimize risk, and keep portfolio balanced within risk constraints
• Trade
Long Tail
[Figure: long-tail chart with two labeled regions, “Cost-effective for most Quant funds” and “The other 99% of the market inefficiencies!”]
Getting the cost to launch 1 additional strategy/leg toward zero
• Learn/discover the strategies
automatically
• Humans find useful data sets
• Advanced program search &
evaluation finds features and
combines them
Search in program space…
• We focus on looking for diverse strategies, not just simple
variants of a human-derived strategy.
• Essentially, we’re automating the process of science with
respect to financial data.
• Good input is key to this: garbage in, garbage out.
Process Is What Matters
[Diagram: grid of world-class and weak human chess players against world-class and weak computer chess players; three cells are marked X and one is marked !]
(2005 “Freestyle Tournament” on Playchess.com: the winner was 2 weak humans + 3 weak laptops + innovative process)
Roll Forward Cross Validation
[Diagram: a timeline divided into successive windows. In each window: learn from past examples, pick best predictors and strategies to use, then run an out-of-sample test on the period that follows. The window then rolls forward in time and the cycle repeats.]
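One way to sketch roll-forward splitting is a generator that always places the test block strictly after the training block and then slides both forward. The function name, the fixed-size sliding window, and the step size are assumptions for illustration; the talk does not say whether the training window slides or expands.

```python
def roll_forward_splits(n_samples, train_size, test_size):
    """Yield (train_indices, test_indices); every test index follows every train index."""
    start = 0
    while start + train_size + test_size <= n_samples:
        train = range(start, start + train_size)
        test = range(start + train_size, start + train_size + test_size)
        yield train, test
        start += test_size  # roll the window forward by one test block

for train, test in roll_forward_splits(n_samples=10, train_size=4, test_size=2):
    print("learn on", list(train), "then test on", list(test))
```

Each test block is used once and is never seen by the model that is evaluated on it, which keeps the order of time intact across the whole evaluation.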
Time safety in the programming language
• Our strategy language and software infrastructure were
built with time safety first.
• All strategies are coded in the same framework, with a
“self-aware” representation so that the system can reason
about the strategies.
• The code can’t “see ahead” in the data structures, so time
safety is guaranteed.
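The idea of a data structure that cannot "see ahead" can be sketched with a small wrapper that tracks a simulated clock. This is a hypothetical illustration, not Cerebellum's strategy language: `TimeSafeSeries`, `LookAheadError`, and the method names are all invented for the example.

```python
class LookAheadError(Exception):
    """Raised when code asks for data from the simulated future."""

class TimeSafeSeries:
    """Wraps a list of observations and refuses to reveal anything after `now`."""

    def __init__(self, values):
        self._values = list(values)
        self._now = -1  # nothing has been observed yet

    def advance(self):
        """Move the simulated clock forward by one observation."""
        if self._now + 1 >= len(self._values):
            raise IndexError("no more data")
        self._now += 1

    def __getitem__(self, t):
        if not isinstance(t, int) or t < 0:
            raise IndexError("use a non-negative integer time index")
        if t > self._now:
            raise LookAheadError(f"time {t} is after now ({self._now})")
        return self._values[t]

    def history(self):
        """Everything observed so far, oldest first."""
        return self._values[: self._now + 1]

prices = TimeSafeSeries([10.0, 10.5, 10.2])
prices.advance()
print(prices[0])
try:
    prices[1]
except LookAheadError as err:
    print("blocked:", err)
```

Putting the check in the data structure means a strategy cannot leak future information by accident, whatever the strategy code itself does.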
Detecting Bogosity
[Figure: candidate strategy vs. SPY]
• 12% of random strategies with the same structure beat the candidate strategy in returns, 29% in Sharpe!
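A comparison like this can be sketched as a randomization test. The version below assumes that "same structure" means the same set of position sizes assigned to random dates, which is only one possible reading; the function names, the per-period Sharpe definition, and the noise data are illustrative.

```python
import random
import statistics

def total_return(positions, returns):
    """Sum of per-period P&L for a list of positions."""
    return sum(p * r for p, r in zip(positions, returns))

def sharpe(positions, returns):
    """Per-period Sharpe ratio (mean P&L over its standard deviation)."""
    pnl = [p * r for p, r in zip(positions, returns)]
    spread = statistics.pstdev(pnl)
    return statistics.fmean(pnl) / spread if spread > 0 else 0.0

def fraction_beating(positions, returns, metric, n_random=1000, seed=0):
    """Share of shuffled-position strategies that score above the candidate."""
    rng = random.Random(seed)
    candidate_score = metric(positions, returns)
    beaten = 0
    for _ in range(n_random):
        shuffled = positions[:]  # same positions, random timing
        rng.shuffle(shuffled)
        if metric(shuffled, returns) > candidate_score:
            beaten += 1
    return beaten / n_random

rng = random.Random(7)
returns = [rng.gauss(0, 0.01) for _ in range(250)]
candidate = [rng.choice([-1.0, 1.0]) for _ in returns]  # a strategy with no skill
print("beaten in returns:", fraction_beating(candidate, returns, total_return))
print("beaten in Sharpe: ", fraction_beating(candidate, returns, sharpe))
```

If a sizeable share of random variants beats the candidate, as on the slide, its backtest is hard to distinguish from luck.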
Some Strategies that ran in February
Takeaways
1. Be rigorous about avoiding time travel and fooling yourself
2. Learn/discover the right features
3. Use randomization to find the right complexity of model
4. The hybrid person/machine solution works best
5. Look where everyone else isn’t looking