Python Machine Learning Step-by-Step: Modeling Financial Time Series Data · 2017-10-08 · Python...
Python Machine Learning Step-by-Step: Modeling Financial Time Series Data
Reece Heineke
Director of Big Data, Credibly
February 27, 2017
What is Machine Learning?
Data Preparation: Overview, Python Toolbox, Trade Ideas to Data, Conclusion
Exploratory Data Analysis: Overview, Scatter Plot, Principal Component Analysis (PCA), Conclusion
Fitting Models: Overview, Models and Pipelines, Learning Curves, Interpretability, Conclusion
A Fitted Model
What is Machine Learning?
1. Machine learning is a subfield of computer science that provides computers with the ability to learn without being explicitly programmed.
2. There are two sides to every machine learning problem:
2.1 The learning
2.2 The model produced from the learning
Data Preparation: Overview
- Review the Python software stack
- Motivate the problem
- Discuss some issues specific to time series modeling
Python Toolbox
Source: Scientific Python by Eueung Mulyana
Trump2Cash
Source: Trump2Cash GitHub Project
Input: Trump criticizes Toyota on Twitter
Output: Toyota stock opens lower
Source: Toyota Stock on Yahoo Finance's Interactive Chart
WSJ Analysis of Trump Tweets
Source: WSJ analysis by Akane Otani and Shane Shifflett
IPython: A Data Scientist’s Best Friend
Jupyter Notebook
Data Preparation: Conclusion
We now have an illustrative data set to work with
- Data set has 10 numeric dimensions: 9 inputs, 1 output
- Data set is large (~400 MB compressed)
Exploratory Data Analysis: Overview
- Covariance and Correlation Matrices
- Scatter plots
- Principal Component Analysis (PCA)
- Kernel PCA
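As a first pass at the covariance and correlation matrices listed above, here is a minimal sketch. The data is a synthetic stand-in (an assumption, not the talk's data set) with 9 inputs and 1 output in column 9, mirroring the shape described earlier.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the data set: 9 inputs, 1 output (column 9).
X = rng.normal(size=(500, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=500)
data = np.column_stack([X, y])

# rowvar=False treats each column as a variable and each row as an observation.
cov = np.cov(data, rowvar=False)        # 10 x 10 covariance matrix
corr = np.corrcoef(data, rowvar=False)  # 10 x 10 correlation matrix

# Correlation of each input with the output (last column).
print(np.round(corr[:-1, -1], 2))
```

Correlation only captures linear association, so a weak entry here does not rule out a nonlinear relationship; that is one reason to follow up with scatter plots.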
Scatter Plot: What can we say about the data?
scikit-learn Algorithm Cheat-Sheet: Just looking
Source: scikit-learn Cheat-Sheet
Principal Component Analysis (PCA)
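A minimal sketch of PCA with scikit-learn, assuming synthetic 9-dimensional inputs in place of the talk's data. Standardizing first is a common choice, not something the slide confirms.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))  # synthetic stand-in for the 9 inputs

# Put every input on the same scale so no single dimension dominates.
X_std = StandardScaler().fit_transform(X)

# Project onto the two directions of largest variance.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

# Fraction of total variance captured by each retained component.
print(pca.explained_variance_ratio_)
```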
Kernel PCA with Radial Basis Function (RBF)
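A minimal sketch of kernel PCA with an RBF kernel, again on synthetic inputs. The value `gamma=0.1` is an arbitrary illustrative choice (an assumption); in practice it would need tuning.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))  # synthetic stand-in for the 9 inputs
X_std = StandardScaler().fit_transform(X)

# The RBF kernel lets the projection follow nonlinear structure in the inputs.
# gamma=0.1 is a placeholder value, not taken from the talk.
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=0.1)
X_kpca = kpca.fit_transform(X_std)

print(X_kpca.shape)
```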
Exploratory Data Analysis: Conclusion
- Nonlinear relationships appear in the dimension pairs (0, 9), (2, 9), and (6, 9)
- All other dimension pairs look largely random
Fitting Models: Overview
- scikit-learn's models and pipelines
- Illustrative learning curves
scikit-learn Revisited
Source: scikit-learn Cheat-Sheet
scikit-learn Pipeline
Source: Python Machine Learning by Sebastian Raschka
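A minimal sketch of a scikit-learn pipeline that chains scaling, PCA, and a linear regression. The step choices and the synthetic data are assumptions for illustration, not the talk's actual configuration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=200)

# Each step's output feeds the next; the last step is the estimator.
pipe = make_pipeline(StandardScaler(), PCA(n_components=5), LinearRegression())
pipe.fit(X, y)

pred = pipe.predict(X)
print(list(pipe.named_steps))
```

Bundling the preprocessing with the model means the same transformations are refit on each training split during validation, which avoids leaking information from held-out data.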
Holdout Method
Source: Python Machine Learning by Sebastian Raschka
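A minimal sketch of the holdout method on synthetic data. Passing `shuffle=False` keeps the rows in their original order so the test set is the most recent block; that is an assumption suited to time-ordered data, not a detail confirmed by the slide.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

# Hold out the last 25% of rows; no shuffling, so training never sees later rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, shuffle=False
)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
test_r2 = model.score(X_test, y_test)  # R^2 on the held-out block
print(round(test_r2, 2))
```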
Cross-Validation
Source: Python Machine Learning by Sebastian Raschka
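A minimal sketch of cross-validation on synthetic data. Rather than plain k-fold, it uses scikit-learn's `TimeSeriesSplit`, a forward-chaining splitter in which every training fold precedes its test fold. That substitution is an assumption motivated by the time series setting, not necessarily what the slide depicts.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import TimeSeriesSplit, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = X[:, 0] - 0.5 * X[:, 2] + 0.1 * rng.normal(size=300)

# Five splits; each one trains on an initial block and tests on the block after it.
cv = TimeSeriesSplit(n_splits=5)
scores = cross_val_score(Ridge(alpha=1.0), X, y, cv=cv, scoring="r2")

# Check that no training index ever comes after a test index.
ordered = all(train.max() < test.min() for train, test in cv.split(X))

print(np.round(scores, 3), ordered)
```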
Learning Curves: What do they tell us?
Source: Python Machine Learning by Sebastian Raschka
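A minimal sketch of computing a learning curve with scikit-learn's `learning_curve`, here for an SVR with an RBF kernel on synthetic data. The training sizes and the data are assumptions for illustration.

```python
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

# With 5-fold CV on 300 rows, at most 240 rows are available for training.
sizes, train_scores, val_scores = learning_curve(
    SVR(kernel="rbf"),
    X,
    y,
    train_sizes=[50, 100, 150, 200, 240],
    cv=5,
    scoring="r2",
)

# Average over folds: one training score and one validation score per size.
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"{int(n):4d} samples: train R^2 = {tr:.2f}, validation R^2 = {va:.2f}")
```

A large, persistent gap between the training and validation scores suggests overfitting; two low scores that converge suggest the model is too simple for the data.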
Poor fit: Linear Regression even with (K)PCA
Good fits: SVR (RBF) and Decision Tree Learning Curves
Classic Overfitting: Random Forest Regressor
Decision Trees: Easy to understand
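One way to see why a decision tree is easy to read is to print its rules as text. This is a minimal sketch on synthetic data where the output depends only on input 0; the feature names `x0` through `x8` are hypothetical labels.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))
# The output is a step function of input 0, so one split explains it.
y = np.where(X[:, 0] > 0, 1.0, -1.0)

tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)

# Render the fitted tree as nested if/else rules.
rules = export_text(tree, feature_names=[f"x{i}" for i in range(9)])
print(rules)
```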
Fitting Models: Conclusion
- Support Vector Regression (SVR) with a Radial Basis Function (RBF) kernel has higher accuracy
- Decision Tree is easier to understand
- Choice involves our own priors on the underlying structure
Second Half of Machine Learning: A Persistent Model
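A minimal sketch of persisting a fitted model so it can be reused without retraining. It uses Python's standard `pickle` module on a synthetic example; the talk does not say which persistence mechanism it uses, so this choice is an assumption.

```python
import pickle

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)

model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Serialize the fitted model to bytes; these could be written to a file.
blob = pickle.dumps(model)

# Later, or in another process: restore the model and predict with it.
restored = pickle.loads(blob)
same = np.allclose(model.predict(X), restored.predict(X))
print(same)
```

Pickled models should only be loaded from trusted sources, and ideally with the same library version that produced them.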
Jupyter Notebook
Thanks for listening: Q&A
https://github.com/rheineke/time_series_modeling