Quick and Dirty: Scaling Out Predictive Models Using Revolution Analytics on Teradata
Rapid Productionalization of Predictive Models
In-database Modeling with Revolution Analytics on Teradata
Skylar Lyon
Accenture Analytics
Copyright © 2014 Accenture. All rights reserved.
• 7 years of experience with a focus on big data and predictive analytics, using discrete choice modeling, random forest classification, ensemble modeling, and clustering
• Technology experience includes Hadoop, Accumulo, PostgreSQL, QGIS, JBoss, Tomcat, R, GeoMesa, and more
• Worked from Army installations across the United States and traveled twice to Baghdad to deploy solutions downrange
Skylar Lyon
Accenture Analytics
Introduction
• New Customer Analytics team for Silicon Valley Internet eCommerce giant
• Data scientists developing predictive models
• Deferred focus on productionalization
• Joined as Big Data Infrastructure and Analytics Lead
Project background and my involvement
How we got here
• 50+ independent variables, including categorical variables encoded as indicators
• Train on a small sample (many thousands) – not a problem in and of itself
• Score across the entire corpus (many hundreds of millions) – slightly more challenging
Binomial logistic regression
Colleague's CRAN R model
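For context, the kind of base CRAN R model described above can be sketched as follows. This is an illustrative sketch only: the column names, sample size, and sampling step are hypothetical, not taken from the deck.

```r
# Illustrative sketch: train a binomial logistic regression on a small
# sample, then score with it. All names here are hypothetical.
set.seed(42)
n <- 5000
training.data <- data.frame(
  y      = rbinom(n, 1, 0.3),
  spend  = rnorm(n, 100, 25),
  region = factor(sample(c("east", "west", "south"), n, replace = TRUE))
)
# factor() columns are expanded into indicator variables automatically
fit  <- glm(y ~ spend + region, data = training.data, family = "binomial")
fits <- predict(fit, newdata = training.data, type = "response")
```

This is trivial at thousands of rows; the bottleneck described in the deck appears when predict() must run over hundreds of millions of rows outside the database.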
We moved compute to data
We optimized the current productionalization process
(Before / after process diagram)
Reduced 5+ hour process to 40 seconds
5+ hours to 40 seconds: the recommendation is that this now become the de facto productionalization process
Benchmarking our optimized process
(Benchmark chart: scoring time in minutes vs. number of rows)
Before:
  trainit <- glm(as.formula(specs[[i]]), data = training.data, family='binomial', maxit=iters)
  fits <- predict(trainit, newdata=test.data, type='response')

After:
  trainit <- rxGlm(as.formula(specs[[i]]), data = training.data, family='binomial', maxIterations=iters)
  fits <- rxPredict(trainit, newdata=test.data, type='response')
Recode CRAN R to Rx R
Optimization process
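The recoded rx functions only move compute to the data once the compute context points at Teradata. A rough sketch of that step, assuming Revolution R Enterprise's RevoScaleR package; the connection string, share directories, and table name below are placeholders, not values from the deck:

```r
# Rough sketch -- assumes RevoScaleR (Revolution R Enterprise) is installed.
# Connection string, paths, and table name are hypothetical placeholders.
library(RevoScaleR)

tdConn <- "DRIVER=Teradata;DBCNAME=appliance;UID=user;PWD=pwd"

# Shift execution in-database: subsequent rx* calls run on the Teradata nodes
rxSetComputeContext(RxInTeradata(connectionString = tdConn,
                                 shareDir        = "/tmp/revoShare",
                                 remoteShareDir  = "/tmp/revoShare",
                                 wait            = TRUE))

# Reference the full corpus as an in-database data source (no extract needed)
corpus <- RxTeradata(connectionString = tdConn, table = "customer_corpus")

trainit <- rxGlm(as.formula(specs[[i]]), data = corpus,
                 family = 'binomial', maxIterations = iters)
fits <- rxPredict(trainit, data = corpus)
```

With the compute context set this way, both training and scoring execute on the appliance, which is what removes the multi-hour data-movement and single-node scoring cost.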
• Train in-database on a much larger set – reduces the need to sample
• Nearly “native” R language – decreases deployment time
• Hadoop support – score across multiple data warehouses
Technology is increasing data science team’s options and opportunities
Additional benefits to new process
• Technical Considerations
Table of Contents
Appendix
• Teradata environment – 4-node 1700-series appliance server
• Revolution R Enterprise – version 7.1, running R 3.0.2
Environment setup
Technical considerations