What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline


Transcript of What We Learned Building an R-Python Hybrid Predictive Analytics Pipeline


Niels Bantilan, Pegged Software

NY R Conference April 8th 2016

Pegged Software’s Mission: Help healthcare organizations recruit better

Core Activities

● Build, evaluate, refine, and deploy predictive models

● Work with Engineering to ingest, validate, and store data

● Work with Product Management to develop data-driven feature sets

How might we build a predictive analytics pipeline that is reproducible, maintainable, and statistically rigorous?

Problem-solving Heuristic: Anchor Yourself to Problem Statements / Use Cases

1. Define the problem statement
2. Scope out solution space and trade-offs
3. Make decision, justify it, document it
4. Implement chosen solution
5. Evaluate working solution against problem statement
6. Rinse and repeat

R-Python Pipeline

Read Data → Preprocess → Build Model → Evaluate → Deploy

Data Science Stack

Need                      | R        | Python
--------------------------|----------|-----------
Maintainable codebase     | Git      | Git
Sync package dependencies | Packrat  | Pip, Pyenv
Call R from Python        | -        | subprocess
Automated testing         | testthat | Nose
Reproducible pipeline     | Makefile | Makefile


Why Git? Because Version Control

● Code quality

● Incremental Knowledge Transfer

● Sanity check


Dependency Management: Why Pip + Pyenv?

1. Easily sync Python package dependencies

2. Easily manage multiple Python versions

3. Create and manage virtual environments

Why Packrat? (from RStudio)

1. Isolated: separate system environment and repo environment

2. Portable: easily sync dependencies across data science team

3. Reproducible: easily add/remove/upgrade/downgrade as needed.


Packrat Internals

datascience_repo
├─ project_folder_a
├─ project_folder_b
├─ datascience_repo.Rproj
├─ ...
├─ .Rprofile          # points R to packrat
└─ packrat
    ├─ init.R         # initialize script
    ├─ packrat.lock   # package deps
    ├─ packrat.opts   # options config
    ├─ lib            # repo private library
    └─ src            # repo source files

Understanding packrat

packrat.lock: package version and deps

PackratFormat: 1.4
PackratVersion: 0.4.6.1
RVersion: 3.2.3
Repos: CRAN=https://cran.rstudio.com/
...

Package: ggplot2
Source: CRAN
Version: 2.0.0
Hash: 5befb1e7a9c7d0692d6c35fa02a29dbf
Requires: MASS, digest, gtable, plyr, reshape2, scales


packrat.opts: project-specific configuration

auto.snapshot: TRUE
use.cache: FALSE
print.banner.on.startup: auto
vcs.ignore.lib: TRUE
vcs.ignore.src: TRUE
load.external.packages.on.startup: TRUE
quiet.package.installation: TRUE
snapshot.recommended.packages: FALSE



Packrat Workflow

● Initialize packrat with packrat::init()

● Toggle packrat in an R session with packrat::on() / packrat::off()

● Save the current state of the project with packrat::snapshot()

● Reconstitute your project with packrat::restore()

● Remove unused libraries with packrat::clean()
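Taken together, a typical session from the R console might look like the following. This is a minimal sketch, not from the slides; the package name (ggplot2) is just an illustration.

# from an R session at the root of the data science repo
install.packages("packrat")
packrat::init()               # set up packrat/ and a private library for this repo

install.packages("ggplot2")   # installs into the repo's private library
packrat::snapshot()           # record exact versions in packrat/packrat.lock

# later, on a teammate's machine or a fresh clone
packrat::restore()            # reinstall the recorded versions from packrat/src or CRAN
packrat::clean()              # drop libraries the project no longer uses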

Packrat Issues

Problem: Unable to find source packages when restoring. This happens when there is a new version of a package on an R package repository like CRAN.

> packrat::restore()
Installing knitr (1.11) ... FAILED
Error in getSourceForPkgRecord(pkgRecord, srcDir(project), availablePkgs, :
  Couldn't find source for version 1.11 of knitr (1.10.5 is current)

Solution 1: Use R’s Installation Procedure


> install.packages(<package_name>)
> packrat::snapshot()

Solution 2: Manually Download Source File

$ wget -P repo/packrat/src <package_source_url>

> packrat::restore()


Call R from Python: Data Pipeline

Read Data → Preprocess → Build Model → Evaluate → Deploy

Call R from Python: Example

# model_builder.R
cmdargs <- commandArgs(trailingOnly = TRUE)
data_filepath <- cmdargs[1]
model_type <- cmdargs[2]
formula <- cmdargs[3]

build.model <- function(data_filepath, model_type, formula) {
  df <- read.data(data_filepath)                 # read.data: project helper for loading data
  model <- train.model(df, model_type, formula)  # train.model: project helper for fitting the model
  model
}

build.model(data_filepath, model_type, formula)

# model_pipeline.py
import subprocess

# data_filepath, model_type, formula are defined earlier in the pipeline
subprocess.call([
    'path/to/R/executable',    # e.g. Rscript
    'path/to/model_builder.R',
    data_filepath, model_type, formula,
])

Why subprocess?

1. Python for control flow, data manipulation, IO handling

2. R for model build and evaluation computations

3. main.R script (model_builder.R) as the entry point into the R layer

4. No need for tight Python-R integration

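The slides don't show how the fitted model gets back to the Python layer. One minimal way to close that loop (an assumption on my part, not something stated in the talk) is to pass an output path as a fourth argument and have the R entry point serialize its result there:

# model_builder.R -- hypothetical variant that persists its output
cmdargs <- commandArgs(trailingOnly = TRUE)
data_filepath <- cmdargs[1]
model_type <- cmdargs[2]
formula <- cmdargs[3]
output_filepath <- cmdargs[4]                  # hypothetical extra argument

df <- read.csv(data_filepath)                  # stand-in for the project's read.data() helper
model <- glm(as.formula(formula), data = df)   # stand-in for train.model(); assumes model_type is "glm"
saveRDS(model, output_filepath)                # downstream steps can load the fitted model from this .rds file

The Python side would then simply append output_filepath to the argument list it passes to subprocess.call().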


Automated Testing: Tolerance to Change

Are we confident that a modification to the codebase will not silently introduce new bugs?

From Working Effectively with Legacy Code by Michael Feathers:

1. Identify change points

2. Break dependencies

3. Write tests

4. Make changes

5. Refactor

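The slides discuss testing only at this conceptual level. As a rough illustration, a unit test for the R layer using testthat might look like the sketch below; preprocess.data() is a hypothetical helper, defined inline only to keep the example self-contained (in the real repo it would be sourced from the project code).

# test_preprocess.R -- run with testthat::test_file("test_preprocess.R")
library(testthat)

# hypothetical helper under test: drop rows with a missing outcome
preprocess.data <- function(df) {
  df[!is.na(df$y), ]
}

test_that("preprocess.data drops rows with missing outcomes", {
  df <- data.frame(y = c(1, NA, 3), x1 = c(0.5, 0.2, 0.9))
  result <- preprocess.data(df)
  expect_equal(nrow(result), 2)
  expect_false(any(is.na(result$y)))
})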


Build Management: Make

Make is a language-agnostic build utility for *nix systems.

● Enables a reproducible workflow

● Serves as lightweight documentation for the repo

# makefile
build-model:
	python model_pipeline.py \
		-i 'model_input' \
		-m_type 'glm' \
		-formula 'y ~ x1 + x2'

# command-line
$ make build-model

vs. typing the full command each time:

$ python model_pipeline.py \
    -i input_fp \
    -m_type 'glm' \
    -formula 'y ~ x1 + x2'

Big Wins

By adopting the above practices, we:

1. Can maintain the codebase more easily
2. Reduce cognitive load and context switching
3. Improve code quality and correctness
4. Facilitate knowledge transfer among team members
5. Encourage reproducible workflows

Costs: A Necessary Time Investment

1. The learning curve
2. Breaking old habits
3. Creating fixes for issues that come with the chosen solutions

How might we build a predictive analytics pipeline that is reproducible, maintainable, and statistically rigorous?