R on BioHPC
RStudio, Parallel R and BioconductoR
Updated for 2016-04-19

Today we’ll be looking at…

Why R?
• The dominant statistics environment in academia
• Large number of packages to do a lot of different analyses
• Excellent uptake in Bioinformatics – specialist packages
• (Relatively) easy to accomplish complex stats work
• Very active development right now: R Foundation, R Consortium, Revolution Analytics, RStudio, Microsoft…

Why not R?
• Quirky language – painful for e.g. Python programmers
• Generally thought to be quite slow – except for optimized linear algebra
• Complex ‘old-fashioned’ documentation
• Parallelization packages can be complex / outdated
… but it’s getting better quickly.

Exciting Recent Developments in R

RStudio – An IDE for R, on the web
http://rstudio.biohpc.swmed.edu
BioHPC optimized R, access to cluster storage, persistent sessions

When to use RStudio
• Development work with small datasets
• Creating R Markdown documents
• Working with Shiny for dataset visualizations
• Any small, short-running data analysis tasks
Large datasets, very long running jobs, parallel code?
Must use R on the cluster…

Using R on the cluster / clients
module load R/3.2.1-intel
Latest version, optimized, same as used by rstudio.biohpc.swmed.edu
Use ‘R’ for command line R, or run scripts with ‘Rscript’

RStudio in a GUI Session
Start a webGUI Session
$ module load R/3.2.1-intel
$ module load rstudio
$ rstudio
Standard 20 hr limit
Whole node to yourself

Installing Packages
We have a set of common packages pre-installed in the R module.
You can install your own into your home directory (~/R)
install.packages(c("microbenchmark", "data.table"))
Some packages need additional libraries and won’t compile successfully. Ask us to install them for you ([email protected]).
This is for packages from CRAN. BioconductoR packages install differently. See later!
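Before installing, it can help to see which packages are already available and where R will look for them. The following is a minimal sketch, not BioHPC's own tooling: the helper name `missing_packages` is hypothetical, and the install call is left commented out so nothing is downloaded by accident.

```r
# Hypothetical helper: report which of the requested packages are not yet
# installed in any library on the current library path.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("microbenchmark", "data.table")
to_install <- missing_packages(wanted)

# Show where R looks for packages; a personal library, if one exists,
# is normally listed first.
print(.libPaths())

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # Uncomment to install from CRAN into your personal library:
  # install.packages(to_install)
} else {
  message("All requested packages are already installed.")
}
```

Packages that ship with R, such as `parallel` and `compiler`, never appear in the missing list.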

Our R is faster than standard downloads
Compiled using Intel compiler and Intel Math Kernel Library
| Task | Standard R | BioHPC R | Speedup |
|---|---|---|---|
| Matrix Multiplication | 139.15 | 1.80 | 77x |
| Cholesky Decomposition | 19.53 | 0.32 | 61x |
| SVD | 45.66 | 1.95 | 23x |
| PCA | 201.30 | 6.25 | 32x |
| LDA | 135.37 | 17.60 | 7x |
This is on a cluster node – speedup is less on clients with fewer CPU cores
For your own Mac or PC see http://www.revolutionanalytics.com/revolution-r-open
mkl_test.R

Benchmarking functions in R (and compiling them)
Compiling a function that is called often can increase speed.
The microbenchmark package allows you to benchmark functions.

    library(compiler)
    f <- function(n, x) for (i in 1:n) x <- (1 + sin(x))^(cos(x))
    g <- cmpfun(f)

    library(microbenchmark)
    compare <- microbenchmark(f(1000, 1), g(1000, 1), times = 1000)

    library(ggplot2)
    autoplot(compare)

functions.R

For speed – always vectorize!
Function compilation improved the median time only slightly (< 2x).
Using the vector form was much faster: a 54x speedup!
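
    # Loop form: one call to stdnorm() per element
    distnorm <- function() {
      x <- seq(-5, 5, 0.01)
      y <- rep(NA, length(x))
      for (i in 1:length(x)) {
        y[i] <- stdnorm(x[i])
      }
      return(list(x = x, y = y))
    }

    # Vector form: a single call to stdnorm() on the whole vector
    vdistnorm <- function() {
      x <- seq(-5, 5, 0.01)
      y <- stdnorm(x)
      return(list(x = x, y = y))
    }
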
functions.R
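The slide does not show how `stdnorm` is defined. The sketch below assumes it is the standard normal density, which may differ from the definition in functions.R, and checks that the loop and vector forms agree before timing them with base R only. The names `loop_form` and `vector_form` are illustrative.

```r
# Assumption: stdnorm() is the standard normal density. The slide does not
# show its definition, so this one is illustrative only.
stdnorm <- function(x) exp(-x^2 / 2) / sqrt(2 * pi)

x <- seq(-5, 5, 0.01)

# Loop form: one scalar call per element
loop_form <- function(x) {
  y <- numeric(length(x))
  for (i in seq_along(x)) y[i] <- stdnorm(x[i])
  y
}

# Vector form: a single call on the whole vector
vector_form <- function(x) stdnorm(x)

# Both forms give the same values
stopifnot(isTRUE(all.equal(loop_form(x), vector_form(x))))

# Rough timing with base R; repeated so the times are measurable
system.time(for (k in 1:200) loop_form(x))
system.time(for (k in 1:200) vector_form(x))
```

The exact ratio depends on the machine and R version, so no timings are quoted here.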

Our Example Application

    # Define a function that performs a random walk with a
    # specified bias that decays
    rw2d <- function(n, mu, sigma) {
      steps <- matrix(, nrow = n, ncol = 2)
      for (i in 1:n) {
        steps[i, 1] <- rnorm(1, mean = mu, sd = sigma)
        steps[i, 2] <- rnorm(1, mean = mu, sd = sigma)
        mu <- mu / 2
      }
      return(apply(steps, 2, cumsum))
    }

mc_parallel.R
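In the spirit of the vectorization advice, the same walk can be drawn without an explicit loop. This is a sketch, not the code from mc_parallel.R: the name `rw2d_vec` is hypothetical, and because the random numbers are drawn in a different order, its output matches `rw2d` in distribution but not value for value under the same seed.

```r
# Vectorized sketch of the biased random walk. The bias at step i is
# mu / 2^(i - 1), matching the halving of mu inside the loop version.
rw2d_vec <- function(n, mu, sigma) {
  bias <- mu / 2^(0:(n - 1))
  steps <- cbind(
    rnorm(n, mean = bias, sd = sigma),  # x steps
    rnorm(n, mean = bias, sd = sigma)   # y steps
  )
  apply(steps, 2, cumsum)
}

set.seed(1)
walk <- rw2d_vec(1000, 3, 1)
dim(walk)
```

Setting `sigma = 0` removes the noise and leaves only the decaying bias, which makes the step logic easy to check by hand.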

A bigger task…

    library(foreach)

    # Generate random walks of lengths between 1000 and 5000
    # foreach loop
    system.time(
      results <- foreach(l = 1000:5000) %do% rw2d(l, 3, 1)
    )
    #   user  system elapsed
    # 85.872   0.145  86.242

    # Apply
    system.time(
      results <- lapply(1000:5000, rw2d, 3, 1)
    )
    #   user  system elapsed
    # 81.175   0.114  81.511

mc_parallel.R

Start a cluster (of R slave workers on a single machine)
Single node, multiple cores running multiple R slaves

    # Parallel, single node
    library(parallel)
    library(doParallel)

    # Create a cluster of workers using all cores
    cl <- makeCluster(detectCores())
    # Tell foreach with %dopar% to use this cluster
    registerDoParallel(cl)

    …

    stopCluster(cl)

mc_parallel.R
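A complete start, use, and stop cycle might look like the sketch below. It uses only the base `parallel` package so it runs without doParallel, and it assumes 2 workers to stay light. Wrapping the work in `tryCatch(..., finally = ...)` is one way to make sure the workers are released even if the computation fails.

```r
library(parallel)

# Assumption: 2 workers keeps the sketch light; on a cluster node you
# would typically use detectCores() as on the slide.
cl <- makeCluster(2)

squares <- tryCatch(
  parLapply(cl, 1:8, function(i) i^2),
  finally = stopCluster(cl)  # always release the workers
)

unlist(squares)
```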

Explicit Parallelization in R
Our optimized R automatically parallelizes linear algebra on a single machine, which is enough in a lot of cases!
Always prefer using vector/matrix form over for loops and apply functions to get the most out of these optimizations.
If you need more options you can control the parallelization:

    library(parallel)    # Single-node and cluster parallelization:
                         # apply functions and explicit execution
    library(doParallel)  # Simple parallel foreach loops

Can run parallel code on a single node (multicore) or across nodes (MPI)

R parallel vs MKL conflict
Intel MKL tries to use all cores for every linear algebra operation.
R is running multiple iterations of a loop in parallel, also using all cores.
If used together, too many threads/processes are launched, far more than there are cores!

    export OMP_NUM_THREADS=1           # on terminal before running R
    Sys.setenv(OMP_NUM_THREADS="1")    # within R

~ 5% improvement by disabling MKL multi-threading
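To make the oversubscription concrete, here is a small sketch of the arithmetic. The helper `total_threads` is hypothetical, and the figure of 32 cores is taken from the per-node core counts quoted on the MPI slides.

```r
# Hypothetical helper: threads in flight when each R worker also runs
# a multi-threaded linear algebra library.
total_threads <- function(workers, threads_per_worker) {
  workers * threads_per_worker
}

cores <- 32  # cores per node on most partitions, per the MPI slides

oversubscribed <- total_threads(cores, cores)  # 32 workers x 32 threads
pinned         <- total_threads(cores, 1)      # 32 workers x 1 thread

# Set the variable from within R, as on the slide. Worker processes
# launched afterwards on the same machine inherit it.
Sys.setenv(OMP_NUM_THREADS = "1")
Sys.getenv("OMP_NUM_THREADS")
```

With one thread per worker, the number of busy threads matches the number of cores instead of exceeding it many times over.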

This time in parallel!

    cl <- makeCluster(detectCores())
    registerDoParallel(cl)
    Sys.setenv(OMP_NUM_THREADS = "1")

    # Generate random walks of lengths between 1000 and 5000
    # Parallel foreach loop
    system.time(
      results <- foreach(l = 1000:5000) %dopar% rw2d(l, 3, 1)
    )
    #  user  system elapsed
    # 2.928   0.441  17.374    (5x speedup)

    # Parallel apply
    system.time(
      results <- parLapply(cl, 1000:5000, rw2d, 3, 1)
    )
    #  user  system elapsed
    # 0.339   0.171   8.460    (9x speedup)

    stopCluster(cl)

mc_parallel.sh

MPI parallelization – for really big jobs
MPI is available on R/3.1.2-intel only
We will continue to use the simple parallel and doParallel packages
Lots online about ‘snow’ – this is now behind the scenes in new versions of R
Please join us for coffee to discuss MPI projects using R.
Optimization is a work in progress, with your help.

MPI parallelization – easy!
cl <- makeCluster( 128, type="MPI" )
Number of MPI tasks = cores per node * nodes (or fewer if RAM limited)
• 48 cores per node for the 256GB partition
• 32 cores per node for other partitions
mpi_parallel.R
mpi.exit()
Add to bottom of your R code to ensure tidy exit
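The task count passed to `makeCluster` can be worked out from the node and core counts above. This sketch uses a hypothetical helper, `mpi_tasks`; the 16 GB per task in the second call is an invented figure, used only to show the RAM-limited case.

```r
# Hypothetical helper: MPI task count = nodes * tasks per node, where
# tasks per node is capped by cores and, optionally, by memory.
mpi_tasks <- function(nodes, cores_per_node,
                      node_ram_gb = NULL, ram_per_task_gb = NULL) {
  per_node <- cores_per_node
  if (!is.null(node_ram_gb) && !is.null(ram_per_task_gb)) {
    per_node <- min(per_node, floor(node_ram_gb / ram_per_task_gb))
  }
  nodes * per_node
}

# 4 nodes x 32 cores: the 128 tasks used on the slide
mpi_tasks(4, 32)

# 4 nodes on the 256GB partition, assuming each task needs 16 GB:
# RAM allows only 16 tasks per node, so 64 tasks in total
mpi_tasks(4, 48, node_ram_gb = 256, ram_per_task_gb = 16)
```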

MPI parallelization – submitting the job

    #!/bin/bash

    #SBATCH --job-name R_MPI_TEST

    # Number of nodes required to run this job
    #SBATCH -N 4
    # Distribute n tasks per node
    #SBATCH --ntasks-per-node=32

    #SBATCH -t 0-2:0:0
    #SBATCH -o job_%j.out
    #SBATCH -e job_%j.err
    #SBATCH --mail-type ALL
    #SBATCH --mail-user [email protected]

    module load R/3.2.1-intel

    ulimit -l unlimited
    # No mpirun! Start R directly.
    R --vanilla < mpi_parallel.R

    # END OF SCRIPT

mpi_parallel.sh

MPI Performance

    # Sequential (with MKL multi-threading)
    system.time(
      results <- lapply(1000:10000, rw2d, 3, 1)
    )
    #    user  system elapsed
    # 329.173   0.610 330.607

    # Parallel apply, 4 nodes, 128 MPI tasks
    system.time(
      results <- parLapply(cl, 1000:10000, rw2d, 3, 1)
    )
    #   user  system elapsed
    # 18.815   0.951  19.848

16x Speedup
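The quoted speedup follows directly from the two elapsed times. The sketch below also computes a rough parallel efficiency, which is an addition to the slide. Treat it as indicative only, since the sequential baseline already benefits from MKL multi-threading.

```r
# Elapsed times in seconds, copied from the system.time() output above
sequential_elapsed <- 330.607
parallel_elapsed   <- 19.848
n_tasks            <- 128     # 4 nodes x 32 MPI tasks

speedup    <- sequential_elapsed / parallel_elapsed
efficiency <- speedup / n_tasks

round(speedup, 1)           # about 16.7, quoted as 16x on the slide
round(100 * efficiency, 1)  # about 13 percent of ideal linear scaling
```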

R Markdown / knitr
Write R code inside markdown documents
Create attractive HTML, PDF, Word output that includes the code and output

BioconductoR
A comprehensive set of Bioinformatics related packages for R
Software and datasets

Bioconductor
Base packages installed, plus some commonly used extras
Install additional packages to home directory:

    source("http://bioconductor.org/biocLite.R")
    biocLite('limma')

Ask [email protected] for packages that fail to compile
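On newer R releases, Bioconductor distributes packages through the BiocManager package rather than the biocLite.R script. A minimal sketch of that route follows. The wrapper name `install_bioc` is hypothetical, and which route applies depends on the R module you have loaded.

```r
# Hypothetical wrapper around the BiocManager route; nothing is installed
# until the function is called.
install_bioc <- function(pkgs) {
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Usage (requires network access):
# install_bioc("limma")
```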

BioconductoR
Bioconductor workflows are fantastic tutorials
http://www.bioconductor.org/help/workflows/

BioconductoR Example
DEMO: RNA-Seq Analysis & UCSC Genome Browser
See bioconductor.Rmd

Dallas R Users Group
http://www.meetup.com/Dallas-R-Users-Group/
University of Dallas, Irving, Saturdays