
Big jobs/simulations

Karl Broman

Biostatistics & Medical Informatics, UW–Madison

kbroman.org
github.com/kbroman
@kwbroman

Course web: kbroman.org/AdvData


But first…

Suppose I’ve just written an R function and it seems to work, and suppose I noticed a simple way to speed it up.

What should I do first?

▶ Commit it to a git repository
▶ Make it an R package
▶ Write a test or two (see the sketch below)
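For that last point, a minimal sketch with the testthat package; my_func() and the expected value are hypothetical stand-ins:

```{r first_test, eval=FALSE}
library(testthat)

# pin down current behavior before optimizing, so a speed-up that
# silently changes the results gets caught
test_that("my_func matches a known case", {
  expect_equal(my_func(1:10), 55)
})
```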

So what’s the big deal?

▶ You don’t want knitr running for a year.
▶ You don’t want to re-run things if you don’t have to.

Unix basics

nice +19 R CMD BATCH input.R output.txt &   (run an R batch job in the background, at low priority)

fg, ctrl-z, bg   (bring a job to the foreground, suspend it, resume it in the background)

ps ux, top   (monitor running processes)

kill, kill -9, pkill   (terminate processes)


Disk thrashing

In computer science, thrashing occurs when a computer’s virtual memory subsystem is in a constant state of paging, rapidly exchanging data in memory for data on disk, to the exclusion of most application-level processing.

– Wikipedia

Biggish jobs in knitr

▶ Manual caching
▶ Built-in cache=TRUE
▶ Split the work and write a Makefile

Manual caching

```{r a_code_chunk}
file <- "cache/myfile.RData"

if(file.exists(file)) {
  load(file)
} else {
  # ... long computation producing object1, object2, object3 ...

  save(object1, object2, object3, file=file)
}
```

Manual caching

```{r a_code_chunk}
file <- "cache/myfile.rds"

if(file.exists(file)) {
  object <- readRDS(file)
} else {
  # ... long computation producing object ...

  saveRDS(object, file)
}
```

Chunk references

```{r not_shown, eval=FALSE}
code_here <- 0
```

```{r a_code_chunk, echo=FALSE}
file <- "cache/myfile.RData"

if(file.exists(file)) {
  load(file)
} else {
  <<not_shown>>

  save(code_here, file=file)
}
```

A cache gone bad


Knitr’s cache system

```{r chunk_name, cache=TRUE}
load("a_big_file.RData")
med <- apply(object, 2, median, na.rm=TRUE)
```

▶ Chunk is re-run if edited.
▶ Otherwise, objects from the previous run are loaded.
▶ Don’t cache chunks with side effects, e.g., options(), par().

Cache dependencies

Manual dependencies

```{r chunkA, cache=TRUE}
Sys.sleep(2)
x <- 5
```

```{r chunkB, cache=TRUE, dependson="chunkA"}
Sys.sleep(2)
y <- x + 1
```

```{r chunkC, cache=TRUE, dependson="chunkB"}
Sys.sleep(2)
z <- y + 1
```

Cache dependencies

Automatic dependencies

```{r setup, include=FALSE}
opts_chunk$set(autodep = TRUE)
dep_auto()
```

Parallel computing

If your computer has multiple processors, use library(parallel) to make use of them, as in the sketch below.

▶ detectCores()
▶ RNGkind("L'Ecuyer-CMRG") and mclapply() (Unix/Mac)
▶ makeCluster(), clusterSetRNGStream(), clusterApply(), and stopCluster() (Windows)
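A minimal sketch of both routes; sim_one(), n_jobs, and the seed are hypothetical placeholders:

```{r parallel_sketch, eval=FALSE}
library(parallel)

sim_one <- function(i) mean(rnorm(1000))   # placeholder: one simulation replicate
n_jobs <- 100

# Unix/Mac: forked workers with reproducible parallel RNG streams
RNGkind("L'Ecuyer-CMRG")
set.seed(20200604)
res <- mclapply(seq_len(n_jobs), sim_one, mc.cores=detectCores())

# Windows (or any OS): a socket cluster
cl <- makeCluster(detectCores())
clusterSetRNGStream(cl, 20200604)
res <- clusterApply(cl, seq_len(n_jobs), sim_one)
stopCluster(cl)
```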

Systems for distributed computing

▶ HTCondor and the UW-Madison CHTC
▶ Other Condor-like systems
▶ “By hand”, e.g., a perl script plus a template R script (see the sketch below)
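The slide suggests perl for the “by hand” route; here is the same idea sketched in R instead (sim_template.R, its JOBID placeholder, and the jobs/ directory are hypothetical):

```{r stamp_out_jobs, eval=FALSE}
# read a template R script containing the literal placeholder JOBID,
# then write one concrete script per job
template <- readLines("sim_template.R")

for(i in 1:100) {
  writeLines(gsub("JOBID", as.character(i), template),
             sprintf("jobs/sim%03d.R", i))
}
```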

Simulations

▶ Computer simulations require RNG seeds (.Random.seed in R).
▶ Multiple parallel jobs need different seeds.
▶ Don’t rely on the current seed, or on having it generated from the clock.
▶ Use something like set.seed(91820205 + i), as in the sketch below.
▶ An alternative is to create a big batch of simulated data sets in advance.
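A minimal sketch of per-job seeding in a batch script; the file names and the --args mechanism are just one way to set this up:

```{r seed_per_job, eval=FALSE}
# run as: R CMD BATCH "--args 7" sim.R sim7.txt
i <- as.numeric(commandArgs(TRUE)[1])   # job index from the command line

seed <- 91820205 + i
set.seed(seed)                          # distinct, reproducible seed per job

result <- rnorm(1000)                   # placeholder for the real simulation

# save the seed alongside the raw results
saveRDS(list(seed=seed, result=result),
        file=sprintf("results/sim%03d.rds", i))
```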

Save everything

▶ RNG seeds
▶ input
▶ output
▶ version numbers, with sessionInfo()
▶ raw results
▶ script to combine results (see the sketch below)
▶ combined results
▶ ReadMe describing the point
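For the combining step, a minimal sketch, assuming per-job files like the results/sim*.rds written above:

```{r combine_results, eval=FALSE}
# gather the raw per-job results and combine them
files <- list.files("results", pattern="^sim.*\\.rds$", full.names=TRUE)
runs <- lapply(files, readRDS)

combined <- do.call(rbind, lapply(runs, "[[", "result"))
saveRDS(combined, file="results/combined.rds")
```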

One Makefile to rule them all

▶ Separate directory for each batch of big computations.
▶ Makefile that controls the combination of the results (and everything else).
▶ knitr-based documents for the analysis/use of those results.

Potential problems

▶ Forgetting save() in your distributed jobs
▶ A bug in the save() command
▶ make clobbers some important results

Scripts should refuse to overwrite output files, as in the sketch below.
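A minimal guard for that last point; outfile and combined are hypothetical names:

```{r refuse_overwrite, eval=FALSE}
outfile <- "results/combined.rds"

# refuse to clobber existing results; delete the file deliberately to re-run
if(file.exists(outfile))
  stop("Refusing to overwrite ", outfile)

saveRDS(combined, file=outfile)
```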

Summary

▶ Careful organization and modularization.
▶ Save everything.
▶ Document everything.
▶ Learn the basic skills for distributed computing.