Parallel computing with R

R course, Master 2 Statistics and Econometrics, TSE
Formation OMP

Thibault LAURENT
thibault.laurent@tse-fr.eu
Toulouse School of Economics, CNRS

October 24, 2019

1 General information

2 Master program

3 How to code a program in parallel?

4 Random Forest algorithm

5 Balancing the distribution of tasks

6 Reproducing the results: choice of the seed

1 General information

Packages and software versions

> # "parallel" ships with base R, so only the other packages need to be installed
> install.packages(c("snow", "snowFT", "VGAM"),
                   dependencies = TRUE)
> require("parallel")
> require("snow")
> require("snowFT")
> require("VGAM")
> sessionInfo()

R version 3.6.1 (2019-07-05)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=fr_FR.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=fr_FR.UTF-8
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=fr_FR.UTF-8
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] splines   stats4    parallel  stats     graphics  grDevices
[7] utils     datasets  methods   base

other attached packages:
[1] VGAM_1.1-1     snowFT_1.6-0   rlecuyer_0.3-4 snow_0.4-3

loaded via a namespace (and not attached):
[1] compiler_3.6.1 tools_3.6.1

Principle of parallel computing

Initialize a vector of N seeds:

initialize.rng(...)

Repeat a function N times, with a different seed for each simulation:

for (iteration in 1:N) {
  result[iteration] <- myfunc(...)
}

Summarize the N results obtained:

process(result, ...)
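
The three steps above are pseudocode. A minimal runnable sketch in plain R could look as follows; the vector `seeds`, the body of `myfunc()` and the use of `sd()` as the summary are illustrative assumptions standing in for `initialize.rng()`, `myfunc(...)` and `process()`.

```r
# Step 1: one seed per simulation (hypothetical choice: the integers 1..N)
N <- 5
seeds <- seq_len(N)

# Step 2: the function to repeat; here it averages a Gaussian sample
myfunc <- function(seed) {
  set.seed(seed)
  mean(rnorm(100))
}

result <- numeric(N)
for (iteration in seq_len(N)) {
  result[iteration] <- myfunc(seeds[iteration])
}

# Step 3: summarize the N results (hypothetical summary: their standard deviation)
summary_stat <- sd(result)
```

Because each iteration depends only on its own seed, the iterations are independent of each other, which is what makes the loop a candidate for parallel execution.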

Sequential to parallel

Sequential: execute iteration 1, then iteration 2, and so on, one after the other.

Parallel: split the N iterations into groups whose size equals the number of available cores, and run the iterations of each group at the same time, one per core.

Example of algorithm: Random forest

Fit N regression or classification trees, one on each of the N samples.

How many cores are available on my machine?

What is the difference between a CPU and a core?

> library("parallel")
> detectCores(logical = FALSE) # number of physical cores (not honoured on every platform)
[1] 40
> detectCores() # number of logical cores
[1] 40
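
As a hedged illustration of how this count might be used, the sketch below derives a number of workers from `detectCores()`. Leaving one core free for the rest of the system is a common convention assumed here, not a rule stated in the course, and the name `n_cores` is hypothetical.

```r
library(parallel)

# detectCores() can return NA when the count cannot be determined
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Assumption: keep one core free, but always use at least one worker
P <- max(1L, n_cores - 1L)
```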

2 Master program

What is a master program?

Definition: the master program specifies how the tasks are divided among the cores and summarizes the parallelized results.

Example: in the previous algorithm, we have N processes to replicate. Suppose that N = 100 and that 4 cores are available. The master program must indicate how to divide the tasks across the cores. For example:

Core 1 executes iterations 1, 5, 9, ..., 93, 97
Core 2 executes iterations 2, 6, 10, ..., 94, 98
Core 3 executes iterations 3, 7, 11, ..., 95, 99
Core 4 executes iterations 4, 8, 12, ..., 96, 100

Once the tasks have been computed, the master program has to summarize the results obtained.
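
The division of tasks in this example can be reproduced with a few lines of base R. This is only an illustration of the round-robin assignment described above; the names `core_of` and `tasks` are hypothetical.

```r
N <- 100  # number of iterations
P <- 4    # number of cores

# Iteration i goes to core 1, 2, 3, 4, 1, 2, ... in turn
core_of <- rep_len(seq_len(P), N)

# List with one element per core, holding the iterations it executes
tasks <- split(seq_len(N), core_of)

head(tasks[["1"]], 3)  # first iterations handled by core 1
```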

Summary of a master program

3 How to code a program in parallel?

Example

We consider the function myfun(), which computes the mean of a sample of size r simulated from a normal distribution with parameters mean and sd.

> myfun <- function(r, mean = 0, sd = 1) {
    mean(rnorm(r, mean = mean, sd = sd))
  }

We would like to call this function 100 times, for the following values of r:

25 times with r = 10
25 times with r = 1000
25 times with r = 100000
25 times with r = 10000000

with mean = 5 and sd = 10. The objective is to compare the standard deviation of the results across the values of r.
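
For reference, the mean of r independent draws with standard deviation sd has standard deviation sd / sqrt(r), so the results should become less dispersed as r grows. The sketch below computes these theoretical values and checks the first one by simulation; the names `r_grid`, `sd_theory` and `sd_empirical`, the seed and the 2000 replicates are illustrative choices, not part of the course.

```r
r_grid <- c(10, 1000, 100000, 10000000)

# Theoretical standard deviation of the sample mean when sd = 10
sd_theory <- 10 / sqrt(r_grid)

# Empirical check for r = 10 only (larger r would take longer to simulate)
set.seed(123)
means_r10 <- replicate(2000, mean(rnorm(10, mean = 5, sd = 10)))
sd_empirical <- sd(means_r10)

sd_theory
sd_empirical
```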

Non-parallel case

In the non-parallel case we can use the function sapply(), whose 1st argument holds the values of r that are passed to the function myfun():

> r_values <- rep(c(10, 1000, 100000, 10000000), each = 25)
> system.time(
    res_non_par <- sapply(r_values, FUN = myfun,
                          mean = 5, sd = 10) # arguments passed on to myfun
  )

Then we compute the standard deviation of the results for each value of r:

> tapply(res_non_par, r_values, sd)

Parallel case

In the parallel case, we first define the number of cores:

> P <- 4
> cl <- makeCluster(P) # allocate 4 cores

We then use the function clusterApply(), where the 1st argument gives the cluster to use, the 2nd argument the values to iterate over, and the argument fun the function to run on the different cores:

> system.time(
    res_par <- clusterApply(cl, r_values, fun = myfun, # evaluate myfun on r_values
                            mean = 5, sd = 10)         # arguments passed on to myfun
  )

Finally, the cores must be released:

> stopCluster(cl)

Then we compute the standard deviation of the results for each value of r:

> tapply(unlist(res_par), r_values, sd)
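
If the parallel call fails, stopCluster() is never reached and the worker sessions stay open. One possible safeguard, sketched here with a smaller vector `r_small` (a hypothetical name) and 2 workers so that it runs quickly, is to wrap the call in tryCatch() with a finally clause.

```r
library(parallel)

myfun <- function(r, mean = 0, sd = 1) {
  mean(rnorm(r, mean = mean, sd = sd))
}

# Smaller workload than in the slides, for illustration only
r_small <- rep(c(10, 1000), each = 5)

cl <- makeCluster(2)
res <- tryCatch(
  clusterApply(cl, r_small, fun = myfun, mean = 5, sd = 10),
  finally = stopCluster(cl)  # runs whether or not clusterApply() succeeds
)

tapply(unlist(res), r_small, sd)
```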

Computational time with respect to the number of cores

Remark 1: the computational time decreases as the number of cores increases, but not linearly.

Remark 2: in this example, using 10 of the 40 available cores is enough.

Remark 3: in parallel computing, the computational time depends both on the efficiency of the algorithm and on the quantity of information that flows between the master program and the cores.
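
A curve of computational time against the number of cores could be obtained along the following lines. This is a sketch under stated assumptions, not the code used for the course: the helper `time_for()` is hypothetical, the workload is much smaller than in the slides, and only 1 and 2 cores are tried.

```r
library(parallel)

myfun <- function(r, mean = 0, sd = 1) {
  mean(rnorm(r, mean = mean, sd = sd))
}
r_values <- rep(c(10, 1000, 100000), each = 5)

# Elapsed time of the parallel call on P cores (cluster start-up not included)
time_for <- function(P) {
  cl <- makeCluster(P)
  on.exit(stopCluster(cl))
  system.time(
    clusterApply(cl, r_values, fun = myfun, mean = 5, sd = 10)
  )[["elapsed"]]
}

elapsed <- sapply(1:2, time_for)
elapsed
```

With tasks this small, the communication between the master program and the cores can dominate, so adding cores will not necessarily reduce the elapsed time.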

How to call packages across the cores?

Parallel computing on B cores ⇐⇒ opening B new R sessions: packages need to be loaded in each of them.

Solution 1: use library() or require() inside the function to parallelize

> myfun_pareto <- function(r, scale = 1, shape = 10) {
    library("VGAM")
    mean(rpareto(r, scale = scale, shape = shape))
  }

Solution 2: use the packagename:: syntax, which avoids attaching all the functions of the package

> myfun_pareto <- function(r, scale = 1, shape = 10) {
    mean(VGAM::rpareto(r, scale = scale, shape = shape))
  }

Application:

> r_values <- rep(c(10, 1000, 100000), each = 25)
> cl <- makeCluster(P)
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto,
                          scale = 1, shape = 10)
> stopCluster(cl)

How to call packages across the cores? (2)

Solution 3: indicate in the master program the code that has to be evaluated on each core, with the function clusterEvalQ()

> myfun_pareto <- function(r, scale = 1, shape = 10) {
    mean(rpareto(r, scale = scale, shape = shape))
  }
> cl <- makeCluster(P)
> clusterEvalQ(cl, library("VGAM"))
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto,
                          scale = 1, shape = 10)
> stopCluster(cl)
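
To check that clusterEvalQ() really attaches a package in every worker session, one can ask each worker for its search path. The sketch below uses the splines package only because it ships with R; the name `attached` is hypothetical.

```r
library(parallel)

cl <- makeCluster(2)

# Attach the package in each worker session
clusterEvalQ(cl, library("splines"))

# Ask each worker whether the package is on its search path
attached <- unlist(clusterEvalQ(cl, "package:splines" %in% search()))

stopCluster(cl)
attached
```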

How to share R objects across the cores?

Here the function myfun_pareto() uses two objects, scale and shape, which are not defined inside the function.

> myfun_pareto <- function(r) {
    mean(rpareto(r, scale = scale, shape = shape))
  }

As with packages, the objects scale and shape can be defined on each core with the function clusterEvalQ()

> cl <- makeCluster(P)
> clusterEvalQ(cl, {
    library("VGAM")
    scale <- 1
    shape <- 10
  })
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto)
> stopCluster(cl)

How to share R objects across the cores? (2)

The objects can also be defined in the master program and then exported to the different cores with the function clusterExport():

> scale <- 1
> shape <- 10
> cl <- makeCluster(P)
> clusterExport(cl, c("scale", "shape"))
> clusterEvalQ(cl, library("VGAM"))
> res_par <- clusterApply(cl, r_values, fun = myfun_pareto)
> stopCluster(cl)
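
The same export mechanism can be tried without VGAM. In the sketch below, the objects `mu` and `sigma` and the function `myfun_global()` are hypothetical stand-ins for `scale`, `shape` and `myfun_pareto()`; the argument `envir = environment()` is an assumption made so that the export also works when the code is not run at top level.

```r
library(parallel)

# Objects defined in the master program
mu <- 5
sigma <- 10

# Function relying on objects that are not among its arguments
myfun_global <- function(r) {
  mean(rnorm(r, mean = mu, sd = sigma))
}

cl <- makeCluster(2)
clusterExport(cl, c("mu", "sigma"), envir = environment())

# Each worker now sees its own copy of the exported objects
seen <- clusterEvalQ(cl, c(mu, sigma))

res <- clusterApply(cl, c(10, 100, 1000), fun = myfun_global)
stopCluster(cl)
```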

lapply(), sapply(), apply(), mapply()

There exist parallelized versions of the lapply(), sapply(), apply() and mapply() functions:

parLapply(),
parSapply(),
parApply(),
clusterMap().

Non-parallel:

> res_non_par <- sapply(r_values, FUN = myfun,
                        mean = 5, sd = 10)

Parallel:

> P <- 4 # set the number of cores
> cl <- makeCluster(P) # allocate 4 cores - start of the computation
> system.time(
    res_par <- parSapply(cl, r_values, FUN = myfun,
                         mean = 5, sd = 10)
  )
> stopCluster(cl)
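
The sketch below contrasts three of these functions on a small input: parLapply() returns a list, parSapply() simplifies to a vector, and clusterMap() iterates over several arguments at once, as mapply() does. The vector `r_small` and the varying values of `mean` are illustrative assumptions.

```r
library(parallel)

myfun <- function(r, mean = 0, sd = 1) {
  mean(rnorm(r, mean = mean, sd = sd))
}
r_small <- c(10, 100, 1000)

cl <- makeCluster(2)

# Parallel lapply(): one list element per value of r
res_list <- parLapply(cl, r_small, myfun, mean = 5, sd = 10)

# Parallel sapply(): the results are simplified to a numeric vector
res_vec <- parSapply(cl, r_small, FUN = myfun, mean = 5, sd = 10)

# Parallel mapply(): r and mean vary together, sd is fixed through MoreArgs
res_map <- clusterMap(cl, myfun, r = r_small, mean = c(0, 5, 10),
                      MoreArgs = list(sd = 10))

stopCluster(cl)
```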

Other packages

See: https://rawgit.com/PPgp/useR2017public/master/tutorial.html

snowFT: useful for handling the random seed,

foreach and doParallel: based on the for syntax.

> require("doParallel")
> P <- 4 # set the number of cores
> registerDoParallel(cores = P) # allocate the requested number of cores
> getDoParWorkers() # check that all the cores are actually used
> system.time(
    res_par_foreach <- foreach(i = r_values) %dopar%
      myfun(i, mean = 5, sd = 10)
  )
> # go back to the initial number of cores
> registerDoParallel(cores = 1)
> # display the results
> unlist(res_par_foreach)

doMPI: based on MPI.

4 Random Forest algorithm

RF algorithm

Input:

test_sets: test set
train_sets: training set
B: number of replications
m: number of variables kept for splitting a node

Algorithm: repeat B times:

1 Draw the training sample from train_sets
2 Build a CART model in which each split is chosen by minimizing the CART cost function over m variables chosen among the p initial variables. We denote the resulting tree by $h(\cdot, \theta_k)$

Output: the estimate $h(x) = \frac{1}{B} \sum_{k=1}^{B} h(x, \theta_k)$ on test_sets

Application

We consider the iris data; the objective is to predict the variable Species (a factor with 3 levels) from the numeric variables (Sepal.Length, Sepal.Width, Petal.Length and Petal.Width). First, we build the training and the test sets. We choose to sample 25 observations for the test set:

> set.seed(1)
> id_pred <- sample(1:150, 25, replace = FALSE)
> test_sets <- iris[id_pred, ]
> train_sets <- iris[-id_pred, ]

We choose m = 2 variables at each iteration. The possible models are:

> list_model <- list(
    f1 = formula(Species ~ Sepal.Length + Sepal.Width),
    f2 = formula(Species ~ Sepal.Length + Petal.Length),
    f3 = formula(Species ~ Sepal.Length + Petal.Width),
    f4 = formula(Species ~ Sepal.Width + Petal.Length),
    f5 = formula(Species ~ Sepal.Width + Petal.Width),
    f6 = formula(Species ~ Petal.Length + Petal.Width)
  )
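
The six formulas can also be generated rather than typed by hand, which scales better when p or m changes. This is an alternative sketch, not the construction used in the course; the names `vars`, `pairs` and `list_model_auto` are hypothetical.

```r
vars <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")

# All subsets of m = 2 variables among the p = 4 available
pairs <- combn(vars, 2, simplify = FALSE)

# One formula "Species ~ var1 + var2" per pair
list_model_auto <- lapply(pairs, function(v) reformulate(v, response = "Species"))
names(list_model_auto) <- paste0("f", seq_along(pairs))

list_model_auto$f1
```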

Basic function to replicate

We define the function to be replicated B times:

> class_tree <- function(k) {
    set.seed(k)
    # bootstrap sample of the training observations
    train_sets_bootstrap <- train_sets[sample(1:125, replace = TRUE), ]
    # sample one of the candidate models (pairs of variables)
    res_rf <- rpart(sample(list_model, 1)[[1]],
                    data = train_sets_bootstrap)
    # predicted class (index of the most probable level) on the test set
    max.col(predict(res_rf, newdata = test_sets))
  }

Applying it once:

> require("rpart")
> class_tree(1)

Replications

We run the function 100 times using parallel computing:

> cl <- makeCluster(P)
> clusterExport(cl, c("test_sets", "train_sets", "list_model"))
> clusterEvalQ(cl, library("rpart"))
> res_par <- clusterApply(cl, 1:100, fun = class_tree)
> stopCluster(cl)

Summarize the results:

> # one column per replicate, one row per test observation
> tab_res_par <- sapply(res_par, function(x) x)
> # for each test observation, number of votes for classes 1, 2 and 3
> prop_res_par <- apply(tab_res_par, 1, function(x)
    c(sum(x == 1), sum(x == 2), sum(x == 3)))
> prop_res_par

Predictions

For each observation, the prediction is the class predicted most often across the B replicates:

> pred_vecteur <- apply(prop_res_par, 2, which.max)

Confusion matrix:

> (tab_res <- table(pred_vecteur, test_sets$Species))

Proportion of correct predictions:

> sum(diag(tab_res)) / sum(tab_res)
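
The voting and scoring steps can be followed on a toy example. The matrix `tab`, the labels in `classes` and the vector `truth` below are made-up values for illustration, not results from the iris application; converting the predictions to a factor with all the levels keeps the confusion matrix square even if one class is never predicted.

```r
# Hypothetical predictions: 4 test observations (rows) x 5 trees (columns)
tab <- matrix(c(1, 1, 2, 1, 1,
                2, 2, 2, 3, 2,
                3, 3, 1, 3, 3,
                1, 2, 2, 2, 3), nrow = 4, byrow = TRUE)

# Number of votes for each of the 3 classes: a 3 x 4 matrix
votes <- apply(tab, 1, tabulate, nbins = 3)

# Majority vote for each observation
pred <- apply(votes, 2, which.max)

# Hypothetical true classes, stored as a factor with all 3 levels
classes <- c("a", "b", "c")
truth <- factor(c("a", "b", "c", "c"), levels = classes)
pred_f <- factor(classes[pred], levels = classes)

conf_mat <- table(pred_f, truth)
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
```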

5 Balancing the distribution of tasks

Problem

When the tasks are split between cores, some tasks may take more time than others.

Example: the function rnmean() returns the mean of a Gaussian sample of size r, where r is the input argument.

> rnmean <- function(r, mean = 0, sd = 1) {
    mean(rnorm(r, mean = mean, sd = sd))
  }

We create heterogeneous values of r so that the workload is unbalanced across tasks:

> N <- 40
> set.seed(50)
> r.seq <- sample(ceiling(exp(seq(7, 14, length.out = 50))), N)

Application on 4 cores

If we parallelize the tasks across 4 cores with respect to the vector r.seq, the tasks are divided as follows:

core 1 handles the following values of r.seq: 903730 (1st position), 679133 (5th position), 12439 (9th position), etc.

core 2 handles the following values of r.seq: 4576 (2nd position), 1460 (6th position), 44995 (10th position), etc.

core 3 handles the following values of r.seq: 79676 (3rd position), 2981 (7th position), 19095 (11th position), etc.

core 4 handles the following values of r.seq: 1202605 (4th position), 9348 (8th position), 332464 (12th position), etc.
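
The assignment listed above can be computed directly, along with a rough measure of each core's workload. The sketch assumes the grid exp(seq(7, 14, ...)), which is the one consistent with the values listed above; the names `core_of`, `r_by_core` and `workload` are hypothetical, and the values drawn by sample() may differ across R versions.

```r
P <- 4
N <- 40
set.seed(50)
r.seq <- sample(ceiling(exp(seq(7, 14, length.out = 50))), N)

# Position i of r.seq goes to core 1, 2, 3, 4, 1, 2, ... in turn
core_of <- rep_len(seq_len(P), N)
r_by_core <- split(r.seq, core_of)

# Total number of random draws requested from each core
workload <- sapply(r_by_core, sum)
workload
```

Unequal totals in `workload` indicate that some cores have much more to simulate than others.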

Computational time per core

> library("snow")
> cl <- makeCluster(P)
> ctime <- snow.time(clusterApply(cl, r.seq, fun = rnmean))
> plot(ctime, title = "Usage with clusterApply")
> stopCluster(cl)

[Figure: "Usage with clusterApply". x-axis: Elapsed Time (0.0 to 1.0); y-axis: Node (0 to 4).]

Computational time per core (2)

With the clusterApplyLB() function (load balancing), each core starts its next task as soon as it has finished the previous one, independently of the other cores.

> cl <- makeCluster(P)
> ctimeLB <- snow.time(clusterApplyLB(cl, r.seq, fun = rnmean))
> plot(ctimeLB, title = "Usage with clusterApplyLB")
> stopCluster(cl)

[Figure: "Usage with clusterApplyLB". x-axis: Elapsed Time (0.0 to 0.6); y-axis: Node (0 to 4).]
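
The effect of load balancing can be observed with the parallel package alone, using tasks whose duration is controlled by Sys.sleep(). The function `slow_task()` and the vector `durations` are made-up illustrations, not taken from the course.

```r
library(parallel)

# A task that simply waits for s seconds and returns s
slow_task <- function(s) {
  Sys.sleep(s)
  s
}

# Unbalanced durations: long, short, short, long
durations <- c(0.3, 0.05, 0.05, 0.3)

cl <- makeCluster(2)
t_static <- system.time(
  res_static <- clusterApply(cl, durations, fun = slow_task)
)[["elapsed"]]
t_lb <- system.time(
  res_lb <- clusterApplyLB(cl, durations, fun = slow_task)
)[["elapsed"]]
stopCluster(cl)

c(static = t_static, load_balanced = t_lb)
```

Both calls return the same results in the same order. With these durations the load-balanced version is typically faster, because the core that receives a short task does not wait for the other one before starting its next task.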

6 Reproducing the results: choice of the seed

Problem

How can we obtain the same results when working on different computers and at different dates?

Example: in the following code, each iteration generates a sample which cannot be reproduced:

> cl <- makeCluster(P)
> res_par <- parSapply(cl, 1:100, function(x) mean(rnorm(100)))
> stopCluster(cl)

Solution 1: fix a seed inside the function

> rnmean <- function(x, r, mean = 0, sd = 1) {
    set.seed(x)
    mean(rnorm(r, mean = mean, sd = sd))
  }
> cl <- makeCluster(P)
> res_par <- parSapply(cl, 1:100, rnmean, r = 100,
                       mean = 0, sd = 1)
> stopCluster(cl)
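
Another option, not covered in the slides, is the function clusterSetRNGStream() from the parallel package, which gives each worker its own random number stream derived from a single seed. The sketch below assumes 2 workers and a hypothetical helper `run_once()`; the results are reproducible only if the number of workers and the way tasks are split stay the same.

```r
library(parallel)

run_once <- function(iseed) {
  cl <- makeCluster(2)
  on.exit(stopCluster(cl))
  # One independent random number stream per worker, derived from iseed
  clusterSetRNGStream(cl, iseed = iseed)
  parSapply(cl, 1:20, function(x) mean(rnorm(100)))
}

a <- run_once(1)
b <- run_once(1)
identical(a, b)
```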

Package snowFT (1)

Solution 2: use the function performParallel() from the package snowFT, which derives the seeds of the replicates from a single seed given in the master program

> rnmean <- function(r, mean = 0, sd = 1) {
    mean(rnorm(r, mean = mean, sd = sd))
  }
> library("snowFT")
> seed <- 1
> r_values <- rep(c(10, 1000, 100000, 10000000), each = 10)
> res <- performParallel(P, r_values, fun = rnmean,
                         seed = seed)
> unlist(res)

Package snowFT (2)

When using the package snowFT, the packages to load and the objects to export to the cores are given in the arguments initexpr and export, respectively. For example:

> myfun_pareto <- function(r) {
    mean(rpareto(r, scale = scale, shape = shape))
  }
> seed <- 1
> scale <- 1
> shape <- 10
> r_values <- rep(c(10, 1000, 100000, 10000000), each = 10)
> res <- performParallel(P, r_values, fun = myfun_pareto,
                         seed = seed,
                         initexpr = require("VGAM"),
                         export = c("scale", "shape"))
> unlist(res)