Parallel Programming in R - Revolutions ParallelR.pdf

Parallel Computing in R, BioC 2009, Seattle, July 2009

Who are REvolution Computing?

REvolution Computing:
–  Is a commercial open-source company, founded in 2007
–  Provides services and products based on R
   •  The “Red Hat”® for R
–  Produces free and subscription-based high-performance, enhanced distributions of R
–  Offers support, training, validation and other services around R
–  Has expertise in high-performance and distributed computing
–  Is a financial and technical contributor to the R community
–  Has operations in New Haven, Seattle, and San Francisco

The People of REvolution

•  Martin Schultz, Chief Scientific Officer (Arthur K Watson Professor of Computer Science, Yale University; founder of Scientific Computing Associates; research in algorithm design, parallel programming environments and architectures)
•  David Smith, Director of Community & R blogger (co-author of An Introduction to R, ESS)
•  Bryan Lewis, Ambassador of Cool (aka Director of Systems Engineering; applied math interests in numerical analysis of inverse problems; former CEO of Rocketcalc)
•  Danese Cooper, Open Source Diva (board of directors, Open Source Initiative; member, Apache Software Foundation; advisory board, Mozilla.org; previously senior director, open source strategies at Intel and Sun)
•  Steve Weston, Senior Research Scientist, Director of Engineering (REvolution and Scientific Computing Associates; development of NetWorkSpaces – parallel programming with R, Python, Ruby, and Matlab – Network Linda, Paradise, and Piranha)
•  Jay Emerson, Dept Statistics, Yale University (author of bigmemory package)

What is REvolution R?

•  REvolution R is the free distribution of R
   –  Optimized for speed
   –  Uses multiple CPUs/cores for performance
   –  For Windows and MacOS (soon: Ubuntu)
   –  Support via community forums
•  REvolution R Enterprise is our enhanced, subscription-only distribution of R
   –  Telephone/email support from real R experts
   –  Suitable for use in regulated/validated environments
   –  Includes proprietary ParallelR packages for reliable distributed computing with R
      •  on clusters or in the cloud
   –  Supported on 64-bit Windows, Linux

Supporting the R Community

We are an open source company supporting the R community:

•  Benefactor of R Foundation
•  Financial supporter of R conferences and user groups
•  New functionality developed in core R to be contributed under GPL
   –  64-bit Windows support
   –  Step-debugging support
•  REvangelism
   –  “Revolutions” Blog: blog.revolution-computing.com
   –  Daily news about R, statistics, and open-source

In today’s lab:

•  Introduction to Parallel Processing
•  Multi-Threaded Processing
   –  Computing on the GPU
•  Iterators
•  The foreach loop
•  Using multiple cores: SMP
•  Cluster Computing
•  Multi-Stratum parallelism
•  Q&A / Exercises

Getting Started

•  R 2.10.x, and 2.9.x on Windows:
   install.packages("foreach", type="source")
   install.packages("iterators", type="source")
•  R 2.9.x, Mac/Linux only:
   install.packages("doMC")
   require(doMC)
•  Windows/Mac:
   –  Install REvolution R Enterprise 2.0 (R 2.7.2)
   require(doNWS)

Introduction to Parallel Processing

With an aside to High-Performance Computing

What is High-Performance Computing (HPC)?

HPC often means efficiently exploiting specialized hardware

Images copyright Cray, Xilinx, NVIDIA from upper-left, clockwise.

What is High-Performance Computing (HPC)?

•  These days, HPC is frequently associated with COTS* cluster computing and with SIMD vectorization and pipelining (GPUs)
•  New: cloud computing

* Commodity, off the shelf

Image from ncbr.sdsc.edu

What is High-Performance Computing (HPC)?

•  HPC is often concerned with multi-processing (parallel processing), the coördination of multiple, simultaneously running (sub)programs
   –  Threads
   –  Processes
   –  Clusters

Image copyright Lawrence Livermore National Lab

What is High-Performance Computing (HPC)?

HPC often involves effectively managing huge data sets
   –  Parallel file systems (GPFS, PVFS2, Lustre, GFS2, S3…)
   –  Parallel data operations (map-reduce)
   –  Working with high-performance databases
   –  bigmemory package in R

Image copyright HP (a multi-petabyte storage system)

A Taxonomy of Parallel Processing

•  Multi-node / cluster / cloud computing (heavyweight processes)
   –  Memory distributed across network
   –  Examples: foreach, SNOW, Rmpi, batch processing
•  Multi-core / multi-processor computing (heavyweight processes)
   –  SMP: Symmetric Multi-Processing
   –  Independent memory in shared space
   –  Naturally scales to multi-node processing
   –  Examples: multicore (Linux/MacOS), foreach
•  Multi-threaded processing (lightweight processes)
   –  Usually shared memory
   –  Harder to scale out across networks
   –  Examples: threaded linear-algebra libraries for R (ATLAS, MKL); GPU processors (CUDA/NVIDIA; Ct/Intel)

Multi-Threaded Processing

What is threaded programming?

•  A thread is a kind of process that shares address space with its parent process
•  Created, destroyed, managed and synchronized in C code
   –  POSIX threads
   –  OpenMP threads
•  Fast, but difficult to program
   –  Easy to overwrite variables
   –  Need to worry about synchronization

Exploiting threads with R

•  R links to BLAS (Basic Linear Algebra Subprograms) libraries for efficient vector/matrix operations
•  Linux: Need to compile and link with threaded BLAS (ATLAS)
•  Windows/Mac: REvolution R linked to Intel MKL libraries, uses as many threads as cores
   –  Many higher-level operations optimized as well
•  MacOS: CRAN binary uses vecLib BLAS
   –  threaded, pretty good performance
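One way to see what the linked BLAS contributes is to time a dense matrix operation. The sketch below uses only base R; the matrix size is an arbitrary choice for illustration, and the elapsed time you see depends entirely on which BLAS your R build is linked against.

```r
# Time a dense cross-product; R hands this operation to the BLAS it is linked against
set.seed(1)
n <- 500                                   # arbitrary size, for illustration only
A <- matrix(rnorm(n * n), n, n)

elapsed <- system.time(B <- crossprod(A))["elapsed"]   # computes t(A) %*% A
cat("crossprod on a", n, "x", n, "matrix took", elapsed, "seconds\n")

# Sanity check: the result is an n x n matrix
stopifnot(all(dim(B) == c(n, n)))
```

Running the same script under a reference BLAS and under a threaded BLAS gives a rough feel for the speed-up on your own hardware.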

REvolution R SVD Performance

•  Example data matrix: 150,000 x 500, fast.svd
•  Quad-core Intel Core2 CPU, Windows Vista 64-bit workstation

[Chart: REvolution R performance with multi-threaded processing]

GPU Programming

What is a GPU?

•  Processing chip (or card) dedicated to fast floating-point operations
   –  Originally for 3-D graphics calculations
•  Highly parallel: 100s of processors on a single chip, capable of running 1000s of threads
•  Usually includes dedicated high-speed RAM, accessible only by GPU
   –  Need to transfer data in/out
•  Programmed directly using custom C dialect/compilers
•  >90% of new desktops/laptops have an integrated GPU

GeForce 8800 GT

•  Launched Oct 29, 2007
•  512 MB of 256-bit memory
•  128 processors
•  512 simultaneous threads
•  <$200
•  Download NVIDIA CUDA Tools:
   –  http://www.nvidia.com/object/cuda_home.html
•  Tutorial:
   –  http://www.ddj.com/architect/207200659

Performance Comparison

•  Convolve 2 vectors of length 2^22 (60 MB of data)
•  Quad dual-core processor / GPU

Method                      Time (seconds)
Base R convolve function    9.89
AMD ACML                    6.29
FFTW (8 threads)            3.75
CUDA on GeForce 8800 GT     1.88 (single precision)

Introducing Iterators

Thought Experiment: Drawing Cards

•  You are the teacher of 10 grade-school pupils.
•  Class project: draw each of the 52 playing cards as a poster.
•  Each child has supplies of poster paper and crayons, but requires a reference card to copy.
•  How to organize the pupils?

Iterators

> require(iterators)

•  Generalized loop variable
•  Value need not be atomic
   –  Row of a matrix
   –  Random data set
   –  Chunk of a data file
   –  Record from a database
•  Create with: iter
•  Get values with: nextElem
•  Used as indexing argument with foreach

Iterators are memory friendly

•  Allow data to be split into manageable pieces on the fly
•  Help alleviate problems with processing large data structures
•  Pieces can be processed in parallel
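A minimal sketch of splitting data into pieces on the fly, assuming the foreach and iterators packages are installed. The small matrix M is only a stand-in for a large data structure, and the chunk size of two rows is an arbitrary choice; %do% is used so the example runs without a parallel backend.

```r
library(foreach)
library(iterators)

M <- matrix(1:20, nrow = 10)          # stand-in for a large data structure

# Hand out the matrix a few rows at a time; each piece is itself a small matrix
chunks <- iter(M, by = "row", chunksize = 2)

# Process each piece independently, then add up the partial sums
total <- foreach(piece = chunks, .combine = "+") %do% sum(piece)

stopifnot(total == sum(M))
```

Only one piece needs to be in flight per task, which is what makes the same pattern usable when the full data set is too large to copy to every worker.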

Iterators act as adaptors

•  Allow your data to be processed by foreach without being converted
•  Can iterate over matrices and data frames by row or by column:

it <- iter(Boston, by="row")
nextElem(it)

Numeric Iterator

> i <- iter(1:3)
> nextElem(i)
[1] 1
> nextElem(i)
[1] 2
> nextElem(i)
[1] 3
> nextElem(i)
Error: StopIteration
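When an iterator is consumed by hand rather than by foreach, the StopIteration error has to be caught explicitly. The loop below is one possible way to do that, assuming the iterators package is installed; the variable names are illustrative.

```r
library(iterators)

it <- iter(1:3)
values <- c()

repeat {
  # nextElem() signals an error with the message "StopIteration" once exhausted;
  # any other error is re-raised unchanged
  x <- tryCatch(nextElem(it), error = function(e) {
    if (identical(conditionMessage(e), "StopIteration")) NULL else stop(e)
  })
  if (is.null(x)) break
  values <- c(values, x)
}

print(values)
```

Using NULL as the end marker is a simplification that only works when the iterator never yields NULL itself.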

Long sequences

> i <- icount(1e9)
> nextElem(i)
[1] 1
> nextElem(i)
[1] 2
> nextElem(i)
[1] 3
> nextElem(i)
[1] 4
> nextElem(i)
[1] 5

Matrix dimensions

> M <- matrix(1:25, ncol=5)
> r <- iter(M, by="row")
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    6   11   16   21
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    2    7   12   17   22
> nextElem(r)
     [,1] [,2] [,3] [,4] [,5]
[1,]    3    8   13   18   23

Data File

> rec <- iread.table("MSFT.csv", sep=",", header=T, row.names=NULL)
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.91     30.25     29.4      29.86    76935100         28.73
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1      29.7     29.97    29.44      29.81    45774500         28.68
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.63     29.75    29.45      29.64    44607200         28.52
> nextElem(rec)
  MSFT.Open MSFT.High MSFT.Low MSFT.Close MSFT.Volume MSFT.Adjusted
1     29.65      30.1    29.53      29.93    50220200          28.8

Database

> library(RSQLite)
> m <- dbDriver('SQLite')
> con <- dbConnect(m, dbname="arrests")
> it <- iquery(con, 'select * from USArrests', n=10)
> nextElem(it)
            Murder Assault UrbanPop Rape
Alabama       13.2     236       58 21.2
Alaska        10.0     263       48 44.5
Arizona        8.1     294       80 31.0
Arkansas       8.8     190       50 19.5
California     9.0     276       91 40.6
Colorado       7.9     204       78 38.7
Connecticut    3.3     110       77 11.1
Delaware       5.9     238       72 15.8
Florida       15.4     335       80 31.9
Georgia       17.4     211       60 25.8

Infinite & Irregular sequences

iprime <- function() {
  lastPrime <- 1
  nextEl <- function() {
    lastPrime <<- as.numeric(nextprime(lastPrime))
    lastPrime
  }
  it <- list(nextElem=nextEl)
  class(it) <- c('abstractiter', 'iter')
  it
}

> require(gmp)
> p <- iprime()
> nextElem(p)
[1] 2
> nextElem(p)
[1] 3

Looping with foreach

Looping with foreach

foreach (var=iterator) %dopar% { statements }

•  Evaluate statements until iterator terminates
•  statements will reference variable var
•  Values of { … } block collected into a list
•  Runs sequentially (by default) (or force with %do%)

> foreach (j=1:4) %dopar% sqrt (j)
[[1]]
[1] 1

[[2]]
[1] 1.414214

[[3]]
[1] 1.732051

[[4]]
[1] 2

Warning message:
executing %dopar% sequentially: no parallel backend registered

Combining Results

> foreach(j=1:4, .combine=c) %dopar% sqrt(j)
[1] 1.000000 1.414214 1.732051 2.000000

> foreach(j=1:4, .combine='+', .inorder=FALSE) %dopar% sqrt(j)
[1] 6.146264

•  When order of evaluation is unimportant, use .inorder=FALSE
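The .combine argument is not limited to c and '+'. As a sketch, assuming the foreach package is installed, rbind can stack vector results into a matrix, and any two-argument function can act as the combiner. The anonymous combiner below is purely illustrative, and %do% is used so no backend is needed.

```r
library(foreach)

# Each iteration returns a length-2 vector; rbind stacks them into a 4 x 2 matrix
tab <- foreach(j = 1:4, .combine = rbind) %do% c(j = j, root = sqrt(j))

# A user-defined combiner: keep the larger of two partial results
largest <- foreach(j = 1:4, .combine = function(a, b) max(a, b)) %do% sqrt(j)

print(dim(tab))
print(largest)
```

A combiner used with .inorder=FALSE should give the same answer regardless of the order in which results arrive, as max and '+' do.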

Referencing global variables

> z <- 2
> f <- function (x) sqrt (x + z)
> foreach (j=1:4, .combine='+') %dopar% f(j)
[1] 8.417609

•  foreach automatically inspects code and ensures unbound objects are propagated to the evaluation environment

Nested foreach execution

•  foreach operations can be nested using the %:% operator
•  Allows parallel execution across multiple loop levels, “unrolling” the inner loops

foreach(i=1:3, .combine=cbind) %:%
  foreach(j=1:3, .combine=c) %dopar% (i + j)
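To make the shape of the nested result concrete, the sketch below runs the same loop with %do% (so no backend is required) and compares it with base R's outer(). It assumes the foreach package is installed; the comparison itself is an illustration, not part of the original slides.

```r
library(foreach)

# The inner loop combines with c (one vector per i); the outer loop binds
# those vectors together as columns
m <- foreach(i = 1:3, .combine = cbind) %:%
       foreach(j = 1:3, .combine = c) %do% (i + j)

# Column i holds the inner-loop results for that i, so m[j, i] equals i + j
stopifnot(all(unname(m) == outer(1:3, 1:3, "+")))
```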

Speeding up code with foreach

SMP Processing

Quick review of parallel and distributed computing in R

•  NetWorkSpaces (package nws; SMP, distributed)
   –  GPL, also commercially supported by REvolution Computing
   –  Very cross-platform, distributed shared-memory paradigm
   –  Fault-tolerant
•  MultiCore (package multicore; SMP only)
   –  Linux/MacOS (requires POSIX)
   –  Uses fork to create new R processes
•  Rmpi (package Rmpi; SMP, distributed)
   –  Fine-grained control allows very high-performance calculations
   –  Can be tricky to configure
   –  Limited Windows and heterogeneous cluster support
•  SNOW (package snow; SMP, distributed*)
   –  Limited Windows support (*single machine only)
   –  Meta-package: supports MPI, sockets, NWS, PVM

Parallel backends for foreach

•  %dopar% behaviour depends on current “registered” parallel backend
•  Modular parallel backends
   –  registerDoSEQ (default)
   –  registerDoNWS (NetWorkSpaces)
   –  registerDoMC (multicore, MacOS/Linux)
      •  From Terminal/ESS only! (R.app GUI will crash.)
   –  registerDoSNOW
   –  registerDoRMPI
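The currently registered backend can be inspected from R. A small sketch, assuming the foreach package is installed, that registers the sequential backend explicitly and then queries it with the package's getDoParName() and getDoParWorkers() helpers:

```r
library(foreach)

registerDoSEQ()                  # explicitly select the sequential backend

backend <- getDoParName()        # name of the registered backend
workers <- getDoParWorkers()     # number of workers it will use
cat("backend:", backend, "workers:", workers, "\n")

# The loop itself is unchanged whichever backend is registered
x <- foreach(j = 1:3, .combine = c) %dopar% j^2
```

Swapping registerDoSEQ() for another backend's register function changes where the loop runs without touching the loop body.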

Getting Started: Multi-core Processing

•  R 2.10.x – wait until official release
•  R 2.9.x
   require(doMC)
   registerDoMC(cores=2)
•  REvolution R Enterprise
   require(doNWS)
   s <- sleigh(workerCount=2)
   registerDoNWS()

A simple simulation:

birthday <- function(n) {
  ntests <- 1000
  pop <- 1:365
  anydup <- function(i)
    any(duplicated(sample(pop, n, replace=TRUE)))
  sum(sapply(seq(ntests), anydup)) / ntests
}

x <- foreach (j=1:100) %dopar% birthday (j)
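The simulated estimates can be checked against the closed-form answer. A minimal base-R sketch, assuming 365 equally likely birthdays (the same model the simulation samples from); the function name birthday_exact is hypothetical and not part of the original slides.

```r
# Exact probability that at least two of n people share a birthday,
# assuming 365 equally likely days
birthday_exact <- function(n) {
  if (n > 365) return(1)
  # P(all distinct) = (365/365) * (364/365) * ... * ((365 - n + 1)/365)
  1 - prod((365 - seq_len(n) + 1) / 365)
}

birthday_exact(23)   # about 0.507
```

With ntests set to 1000, each simulated value should land within a few percentage points of this curve.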

Birthday Example - timings

Dual-core 2.4 GHz Intel MacBook:

system.time({ x <- foreach (j=1:100) %dopar% birthday (j) })   # Elapsed

Backend                         Time (s)
registerDoSEQ()                 41
registerDoMC()   # 2 cores      28
registerDoNWS()  # 2 workers    26 (*)

Birthday Simulation: Multicore/NWS

> x <- foreach (j=1:100) %dopar% birthday (j)
> plot(1:100, unlist(x), type="l")

Using clusters with NetWorkSpaces

Setting Up a Cluster

1.  Identify machines to form nodes on cluster
    –  Easiest with Linux/MacOS
    –  Possible with Windows
2.  Select a server machine
    –  OK for this one to be on Windows
3.  Make sure passwordless ssh is enabled on each worker node
    –  ssh nodename Revo --version should work

Setting up a cluster, part 2

4.  Log in to server, start REvolution R
5.  Create a sleigh

    require(doNWS)
    s <- sleigh(nodeList=c(rep("localhost", 2), rep("thor", 8), rep("loki", 4)),
                launch=sshcmd)
    registerDoNWS(s)

6.  Use foreach as before
7.  (optional) use joinSleigh to add new nodes

Parallel Random Forest

# a simple parallel random forest
library(randomForest)
x <- matrix(runif(500), 100)
y <- gl(2, 50)
wc <- 2
n <- ceiling(1000 / wc)
registerDoNWS(s)
foreach(ntree=rep(n, wc), .combine=combine,
        .packages='randomForest') %dopar%
  randomForest(x, y, ntree=ntree)

•  Easier: randomShrubberyNWS()

Converting existing code

•  Convert these loops to foreach:
   –  for: make the body return the iteration's value, and collect results with .combine
   –  apply: use iter(X, by="row") and .combine
      •  Or iapply(X, 1)
   –  lapply: use iter(mylist)
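The three conversions can be sketched side by side. This assumes the foreach and iterators packages are installed; the toy inputs X and mylist are made up for illustration, and %do% is used so the example runs without a parallel backend (switch to %dopar% once one is registered).

```r
library(foreach)
library(iterators)

X <- matrix(1:6, nrow = 2)
mylist <- list(a = 1:3, b = 4:6)

# for loop: results are accumulated by hand
out_for <- numeric(0)
for (i in 1:3) out_for <- c(out_for, i^2)
# foreach: the body's value is the result; .combine does the accumulating
out_foreach <- foreach(i = 1:3, .combine = c) %do% i^2

# apply over rows becomes iteration over rows with iter()
rows_apply   <- apply(X, 1, sum)
rows_foreach <- foreach(r = iter(X, by = "row"), .combine = c) %do% sum(r)

# lapply becomes iteration over the list with iter()
l_apply   <- lapply(mylist, mean)
l_foreach <- foreach(el = iter(mylist)) %do% mean(el)
```

In each pair the foreach version returns the same values as the original construct.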

ppiStats Example

•  Sequential:
   bpMats1 <- lapply(bpList, function(x) {
     bpMatrix(x, symMat = TRUE, homodimer = FALSE, baitAsPrey = FALSE,
              unWeighted = FALSE, onlyRecip = FALSE, baitsOnly = FALSE)
   })

•  Parallel:
   bpMats1 <- foreach(x=iter(bpList), .packages = "ppiStats") %dopar% {
     bpMatrix(x, symMat = TRUE, homodimer = FALSE, baitAsPrey = FALSE,
              unWeighted = FALSE, onlyRecip = FALSE, baitsOnly = FALSE)
   }

ppiStats Example

•  Sequential:
   bpGraphs <- lapply(bpMats1, function(x) {
     genBPGraph(x, directed = TRUE, bp = FALSE)
   })

•  Parallel:
   bpGraphs <- foreach(x=iter(bpMats1), .packages = "ppiStats") %dopar% {
     genBPGraph(x, directed = TRUE, bp = FALSE)
   }

Exercise

•  Find other “embarrassingly parallel” Bioconductor examples, and convert them to parallel with foreach.

Multi-Stratum Parallelism

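An example of explicit multi-stratum parallelism

foreach (iterator) %dopar% { tasks }

[Diagram: an outer foreach spreads work across the CLUSTER; on each node an inner foreach spreads tasks across the local cores (SMP)]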

Multi-stratum template: NWS/Multicore

require("doNWS")
require("foreach")
require("doMC")

s <- sleigh(nodeList=c(rep("localhost", 2), rep("bladeserver", 8)))
registerDoNWS(s)

foreach (iterator_i, .packages=c("foreach", "doMC")) %dopar% {
  registerDoMC()
  foreach (iterator_j) %dopar% {
    tasks…
  }
}

Pitfalls to avoid

•  Sequential vs Parallel Programming
•  Random Number Generation
   –  library(sprngNWS)
   –  sleigh(workerCount=8, rngType='sprngLFG')
•  Node failure
•  Cosmic Rays
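The random-number pitfall is that workers seeded carelessly can produce overlapping or irreproducible streams. The slides address this with sprngNWS; the sketch below swaps in a different technique, the L'Ecuyer-CMRG streams from base R's parallel package, only to illustrate the idea of one independent stream per task. The names task, seeds and n_tasks are illustrative.

```r
library(parallel)

RNGkind("L'Ecuyer-CMRG")         # generator that supports independent streams
set.seed(42)

# Derive one stream seed per task, each a fixed jump ahead of the previous one
n_tasks <- 3
seeds <- vector("list", n_tasks)
seeds[[1]] <- .Random.seed
for (k in seq_len(n_tasks)[-1]) seeds[[k]] <- nextRNGStream(seeds[[k - 1]])

# A task installs its own stream before drawing random numbers
task <- function(seed) {
  assign(".Random.seed", seed, envir = globalenv())
  runif(2)
}

# Same seeds give the same draws, whatever order or worker runs the tasks
run1 <- lapply(seeds, task)
run2 <- lapply(seeds, task)
stopifnot(identical(run1, run2))
```

Because each task carries its own seed, the results do not depend on how tasks are scheduled across workers.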

Conclusions

•  Parallel computing is easy!
•  Write loops with foreach / %dopar%
   –  Works fine in a single-processor environment
   –  Third-party users can register backends for multiprocessor or cluster processing
   –  Speed benefits without modifying code
•  Easy performance gains on modern laptops/desktops
•  Expand to clusters for meaty jobs

Thank You!

•  David Smith
   –  david@revolution-computing.com, @revodavid
•  REvolution Computing
   –  www.revolution-computing.com
•  Revolutions, the R blog
   –  blog.revolution-computing.com
•  Downloads:
   –  Slides: http://tinyurl.com/R-Bioc-slides
   –  Script: http://tinyurl.com/R-Bioc-script