Predictive Analytics of Digital Marketing and Sales Pipeline
What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
-
Upload
work-bench -
Category
Data & Analytics
-
view
7.509 -
download
1
Transcript of What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline
What We Learned Building an R-Python Hybrid Analytics Pipeline
Niels Bantilan, Pegged Software
NY R Conference April 8th 2016
Core Activities
● Build, evaluate, refine, and deploy predictive models
● Work with Engineering to ingest, validate, and store data
● Work with Product Management to develop data-driven feature sets
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?
Anchor Yourself to Problem Statements / Use Cases
1. Define Problem statement
2. Scope out solution space and trade-offs
3. Make decision, justify it, document it
4. Implement chosen solution
5. Evaluate working solution against problem statement
6. Rinse and repeat
Problem-solving Heuristic
Data Science Stack
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
Dependency ManagementWhy Pip + Pyenv?
1. Easily sync Python package dependencies
2. Easily manage multiple Python versions
3. Create and manage virtual environments
Why Packrat? From RStudio
1. Isolated: separate system environment and repo environment
2. Portable: easily sync dependencies across data science team
3. Reproducible: easily add/remove/upgrade/downgrade as needed.
Dependency Management
Packrat Internals
datascience_repo├─ project_folder_a├─ project_folder_b├─ datascience_repo.Rproj...├─ .Rprofile # points R to packrat└─ packrat ├─ init.R # initialize script ├─ packrat.lock # package deps ├─ packrat.opts # options config ├─ lib # repo private library └─ src # repo source files
Understanding packrat
PackratFormat: 1.4PackratVersion: 0.4.6.1RVersion: 3.2.3Repos:CRAN=https://cran.rstudio.com/...
Package: ggplot2Source: CRANVersion: 2.0.0Hash: 5befb1e7a9c7d0692d6c35fa02a29dbfRequires: MASS, digest, gtable, plyr, reshape2, scales
datascience_repo├─ project_folder_a├─ project_folder_b├─ datascience_repo.Rproj...├─ .Rprofile └─ packrat ├─ init.R ├─ packrat.lock # package deps ├─ packrat.opts ├─ lib └─ src
packrat.lock: package version and deps
Packrat Internals
auto.snapshot: TRUEuse.cache: FALSEprint.banner.on.startup: autovcs.ignore.lib: TRUEvcs.ignore.src: TRUEload.external.packages.on.startup: TRUEquiet.package.installation: TRUEsnapshot.recommended.packages: FALSE
packrat.opts: project-specific configuration
Packrat Internals
datascience_repo├─ project_folder_a├─ project_folder_b├─ datascience_repo.Rproj...├─ .Rprofile └─ packrat ├─ init.R ├─ packrat.lock ├─ packrat.opts # options config ├─ lib └─ src
● Initialize packrat with packrat::init()
● Toggle packrat in R session with packrat::on() / off()
● Save current state of project with packrat::snapshot()
● Reconstitute your project with packrat::restore()
● Removing unused libraries with packrat::clean()
Packrat Workflow
Problem: Unable to find source packages when restoring
Happens when there is a new version of a package on an R package repository like CRAN
Packrat Issues
> packrat::restore()Installing knitr (1.11) ...FAILEDError in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePkgs, : Couldn't find source for version 1.11 of knitr (1.10.5 is current)
Solution 1: Use R’s Installation Procedure
Packrat Issues
> install.packages(<package_name>)> packrat::snapshot()
Solution 2: Manually Download Source File
$ wget -P repo/packrat/src <package_source_url> > packrat::restore()
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
# model_builder.Rcmdargs <- commandArgs(trailingOnly = TRUE)data_filepath <- cmdargs[1]model_type <- cmdargs[2]formula <- cmdargs[3]
build.model <- function(data_filepath, model_type, formula) {df <- read.data(data_filepath)model <- train.model(df, model_type, formula)model
}
Call R from Python: Example
# model_pipeline.pyimport subprocesssubprocess.call([‘path/to/R/executable’,
'path/to/model_builder.R’, data_filepath, model_type, formula])
Why subprocess?
1. Python for control flow, data manipulation, IO handling
2. R for model build and evaluation computations
3. main.R script (model_builder.R) as the entry point into R layer
4. No need for tight Python-R integration
Call R from Python
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
Tolerance to Change
Are we confident that a modification to the codebase will not silently introduce new bugs?
Automated Testing
Working Effectively with Legacy Code - Michael Feathers
1. Identify change points
2. Break dependencies
3. Write tests
4. Make changes
5. Refactor
Automated Testing
Need R PythonMaintainable codebase Git Git
Sync package dependencies Packrat Pip, Pyenv
Call R from Python - subprocess
Automated Testing Testthat Nose
Reproducible pipeline Makefile Makefile
Make is a language-agnostic utility for *nix● Enables reproducible workflow
● Serves as lightweight documentation for repo
# makefilebuild-model: python model_pipeline.py \ -i ‘model_input’ \ -m_type ‘glm’ \ -formula ‘y ~ x1 + x2’ \
# command-line$ make build-model
Build Management: Make
$ python model_pipeline.py \-i input_fp \-m_type ‘glm’ \-formula ‘y ~ x1 + x2’ \
VS
By adopting the above practices, we:
1. Can maintain the codebase more easily
2. Reduce cognitive load and context switching
3. Improve code quality and correctness
4. Facilitate knowledge transfer among team members
5. Encourage reproducible workflows
Big Wins
Necessary Time Investment
1. The learning curve
2. Breaking old habits
3. Create fixes for issues that come with chosen solutions
Costs
How might we build a predictive analytics pipeline that is
reproducible, maintainable, and statistically rigorous?