Post on 08-Oct-2020
Build your own super-computer with cloudyr and AWS
with(aws.ec2,
assign("furrr",
future + purrr))
Laurens Geffert
@JanLauGe
https://janlauge.github.io
laurensgeffert@gmail.com
Outline
• Introduction
  • How I came to use and love R
  • How the tidyverse makes everything better
• Motivation
  • The prevalence of embarrassingly parallel problems in applied data science
  • Scaling up with open-source solutions
• Demo
  • The base-R single-threaded approach
  • The parallelized cloud approach
Audience Survey
• Who uses the tidyverse?
• Who uses AWS?
• Who has heard of the future package?
During my PhD
Species distribution models
• X = Remote sensing data
• Y = Species occurrence data
In the “real world”
Audience lookalike models
• X = Web event data
• Y = Panel data
The BaseR way
library(glmnet)

X <- 'my predictors'
Y <- 'variables to predict'
results <- list()

# loop over the response vectors
# to fit one model each
for (i in seq_along(Y)) {
  y <- Y[[i]]
  model <- cv.glmnet(X, y)
  results[[i]] <- model
}
The Tidyverse way
library(purrr)
library(glmnet)

X <- 'my predictors'
Y <- 'variables to predict'

# map over all elements in Y
map(Y, ~ cv.glmnet(X, .x))
The furrr way
library(furrr)
library(glmnet)

X <- 'my predictors'
Y <- 'variables to predict'

# map over all elements in Y, in parallel
# (multicore forks the R process; not available on Windows)
plan(multicore)
future_map(Y, ~ cv.glmnet(X, .x))
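The slides use placeholder strings for X and Y. A minimal runnable sketch with simulated data might look like the following. The matrix size, the three binary responses, the choice of `plan(multisession)`, and the worker count are all assumptions made for illustration; they are not taken from the talk.

```r
library(glmnet)
library(furrr)  # also attaches the future package

set.seed(42)

# Simulated stand-ins (assumption): 200 observations, 10 predictors,
# and three binary response vectors, each driven by one predictor.
n <- 200
p <- 10
X <- matrix(rnorm(n * p), nrow = n, ncol = p)
Y <- list(
  a = rbinom(n, 1, plogis(X[, 1])),
  b = rbinom(n, 1, plogis(X[, 2])),
  c = rbinom(n, 1, plogis(X[, 3]))
)

# multisession starts background R sessions and works on every platform
plan(multisession, workers = 2)

# fit one cross-validated model per response, in parallel;
# seed = TRUE gives each task a proper parallel-safe random number stream
results <- future_map(
  Y,
  ~ cv.glmnet(X, .x, family = "binomial"),
  .options = furrr_options(seed = TRUE, packages = "glmnet")
)

# switch back to sequential processing, which shuts the workers down
plan(sequential)

names(results)
```

Swapping `future_map()` for `purrr::map()` (and dropping `.options`) gives the sequential version, which is what makes furrr a near drop-in replacement.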
Setup
[Diagram: one local machine and four workers; labels: furrr, and aws.ec2 once per worker]
Required
• Active AWS account
• Amazon Machine Image (AMI) with
  • R
  • ssh
  • remoter
  • tidyverse
  • future
  • furrr
• Working ssh key pair
  • On local machine
  • On AMI (public AND private!)
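With those pieces in place, connecting the local session to running instances could be sketched roughly as below. This is an illustration under assumptions, not the code from the demo: `connect_workers()` is a hypothetical helper, the `ubuntu` user name depends on the AMI, and the instances are assumed to be already running with known public IP addresses.

```r
# Hypothetical helper (not part of any package): open an ssh-based
# cluster on the given instances and register it as the future plan.
connect_workers <- function(public_ips, key_file, user = "ubuntu") {
  cl <- parallelly::makeClusterPSOCK(
    workers = public_ips,
    user = user,
    # point ssh at the private key and skip the interactive host-key prompt
    rshopts = c(
      "-o", "StrictHostKeyChecking=no",
      "-o", "IdentitiesOnly=yes",
      "-i", key_file
    )
  )
  # every future_map() call after this runs on the remote workers
  future::plan(future::cluster, workers = cl)
  invisible(cl)
}

# Usage (not run; the IP addresses and key path are placeholders):
# cl <- connect_workers(c("203.0.113.10", "203.0.113.11"), "~/.ssh/my-key.pem")
# results <- furrr::future_map(Y, ~ glmnet::cv.glmnet(X, .x))
# parallel::stopCluster(cl)
```

Returning the cluster object lets the caller stop the workers explicitly once the job is done; the EC2 instances themselves still need to be terminated separately.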
Furrr::ther reading
• cloudyR, ssh by rOpenSci, remoter by Drew Schmidt, and last but not least furrr by Davis Vaughan.
• https://davisvaughan.github.io/furrr/articles/advanced-furrr-remote-connections.html
• https://github.com/JanLauGe/ds-personal-projects/tree/master/ds-computing-cluster
• https://janlauge.github.io/
• Call to action: Support the cloudyr project