Why is R slow? How to run R programs faster?
Tomas Kalibera
My Background
Virtual machines, runtimes for programming languages
Real-time Java
Automatic memory management
Evaluating software performance
R User
Benchmarks
Using statistical methods
Currently working on: FastR
A new, experimental virtual machine for (a subset of) the R language. Discovering optimizations that can speed up R.
Core team
Jan Vitek, Tomas Kalibera, Petr Maj, Floreal Morandat
Wider team
Community: Dynamic Languages for Scalable Data Analytics
Use one dynamic, high level language for data analytics tasks running on platforms from a tablet to the cloud.
R, Matlab, Python, Julia
Large software companies interested in R
NSF Funded Workshop at SPLASH 2013
Software Infrastructure for Sustained Innovation
Virtual Machines, R & FastR
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc != 2) {
        /* exactly one argument (n) is expected */
        fprintf(stderr, "tm n\n");
        return 1;
    }
    int n = atoi(argv[1]);
    printf("n = %d\n", n);
    return 0;
}
Source code
[Figure: parsing turns the source code into a parse tree with nodes main, if, decl, call, !=, argc, 2, call, ret.]
Parse tree executed directly by an (AST) interpreter
class If {
    Node Condition, TrueBranch, FalseBranch;

    Result execute() {
        if (Condition.execute() == TRUE) {
            TrueBranch.execute();
        } else {
            FalseBranch.execute();
        }
        return NULL;
    }
}
GNU R works like this.
Interpreter
Easy to develop, maintain.
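As a rough illustration of executing a parse tree directly, the following R sketch walks a quoted expression and evaluates it node by node. The name evalNode and the restriction to numeric constants, variables and ordinary calls are assumptions made for this example; GNU R's real evaluator is written in C and handles far more cases.

```r
# Hypothetical toy AST interpreter: handles constants, variables and calls only.
evalNode <- function(node, env) {
  if (is.numeric(node)) {
    node                                    # constant node
  } else if (is.name(node)) {
    get(as.character(node), envir = env)    # variable node: look-up by name
  } else if (is.call(node)) {
    # call node: look up the function by name, evaluate the arguments, apply
    f <- get(as.character(node[[1]]), envir = env, mode = "function")
    args <- lapply(as.list(node)[-1], evalNode, env = env)
    do.call(f, args)
  } else {
    stop("unsupported node type")
  }
}

env <- new.env()
assign("j", 2, envir = env)
assign("k", 5, envir = env)
result <- evalNode(quote(abs(j - k) + 1), env)
print(result)  # abs(2 - 5) + 1 = 4
```

Every operator (`+`, `-`, `abs`) goes through the same by-name function look-up, which is the cost the later slides measure.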
Compiler
Compilation, linking
Ahead of time: C/C++/Fortran
Just-in-time: Java/C#
0000000000400580 <main>:
  400580: 41 54                   push   %r12
  400582: 83 ff 02                cmp    $0x2,%edi
  400585: 55                      push   %rbp
  400586: 53                      push   %rbx
  400587: 74 25                   je     4005ae <main+0x2e>
  400589: 48 8b 0d c8 0a 20 00    mov    0x200ac8(%rip),%rcx
  400590: ba 05 00 00 00          mov    $0x5,%edx
  400595: be 01 00 00 00          mov    $0x1,%esi
  40059a: bf 04 08 40 00          mov    $0x400804,%edi
  40059f: e8 cc ff ff ff          callq  400570 <fwrite@plt>
  4005a4: b8 01 00 00 00          mov    $0x1,%eax
  4005a9: 5b                      pop    %rbx
  4005aa: 5d                      pop    %rbp
  4005ab: 41 5c                   pop    %r12
  4005ad: c3                      retq
Machine code
[Figure: the parse tree of main (if, decl, call, !=, argc, 2, call, ret) is compiled and linked into machine code.]
Fast.
FastR
● Self-optimizing AST interpreter
– Aims to be still easy to develop, maintain
– But fast
● The AST (tree) rewrites as the program executes
– Speculative rewrites, recovery
● Runs on a JVM
– High-performance garbage collector
– Just-in-Time compilation improves speed
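The idea of a tree node that rewrites itself can be sketched in a few lines of R. This is only a conceptual toy, not FastR code: makeAddNode, the state labels and the integer speculation are all assumptions chosen for illustration.

```r
# Hypothetical self-rewriting "add" node (conceptual sketch, not FastR code).
makeAddNode <- function() {
  node <- new.env()

  generic <- function(a, b) a + b           # handles any operand types

  integerOnly <- function(a, b) {
    if (is.integer(a) && is.integer(b)) {
      a + b                                 # speculation holds: specialized path
    } else {
      node$state <- "generic"               # speculation failed: recover by
      node$execute <- generic               # rewriting to the generic node
      generic(a, b)
    }
  }

  uninitialized <- function(a, b) {
    if (is.integer(a) && is.integer(b)) {
      node$state <- "integer"
      node$execute <- integerOnly           # speculate on integer operands
    } else {
      node$state <- "generic"
      node$execute <- generic
    }
    node$execute(a, b)
  }

  node$state <- "uninitialized"
  node$execute <- uninitialized
  node
}

node <- makeAddNode()
states <- node$state
r1 <- node$execute(1L, 2L);  states <- c(states, node$state)
r2 <- node$execute(3L, 4L);  states <- c(states, node$state)
r3 <- node$execute(1.5, 2L); states <- c(states, node$state)
print(states)  # "uninitialized" "integer" "integer" "generic"
```

The node specializes after the first execution and falls back to the generic version once an operand breaks the speculation.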
Understanding why GNU-R is slow
Speeding-up R programs
Toeplitz Matrix
In AT&T R Benchmarks 2.5 (Simon Urbanek)
Initializing a square matrix
$a_{i,j} = |i - j| + 1$
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
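The definition can be checked directly in R. The sketch below builds the 5 x 5 matrix from the formula using the index matrices row() and col(), and compares it with toeplitz() from the stats package; the variable names are chosen for this example only.

```r
n <- 5
m <- matrix(0L, nrow = n, ncol = n)

# row(m) and col(m) hold the row index i and column index j of every element.
a <- abs(row(m) - col(m)) + 1L
print(a)

# stats::toeplitz() builds a symmetric Toeplitz matrix from its first row.
b <- toeplitz(1:n)
print(all(a == b))  # TRUE
```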
TM using For Loop (as included in AT&T R Benchmarks 2.5)
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
$a_{i,j} = |i - j| + 1$
N = 500 650 ms
N = 1000 2610 ms
N = 1500 5910 ms
This is very slow!
TM in C
int *b = (int *)malloc(n * n * sizeof(int));
/* column-major layout; 1-based loop indices as in the R version */
for (j = 1; j <= n; j++) {
    for (k = 1; k <= n; k++) {
        b[(k - 1) + (j - 1) * n] = abs(j - k) + 1;
    }
}
            In R       In C
N = 500     650 ms     0.2 ms
N = 1000    2610 ms    0.9 ms
N = 1500    5910 ms    2.1 ms
R is slower than C by a factor of roughly 3000 here (about 2800 to 3250 across these sizes).
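The slowdown can be computed directly from the two sets of timings. The vector names below are made up for this example; the numbers are the ones reported on the slides.

```r
# Timings in milliseconds for N = 500, 1000, 1500 (from the slides).
r_ms <- c(`500` = 650, `1000` = 2610, `1500` = 5910)   # GNU R, for loop
c_ms <- c(`500` = 0.2, `1000` = 0.9,  `1500` = 2.1)    # C

slowdown <- r_ms / c_ms
print(round(slowdown))  # 3250 2900 2814
```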
Toeplitz Matrix
Understanding why GNU-R is slow
TM: Checking with a profiler
> Rprof()
> dummy <- tmFor(5000)
> Rprof(NULL)
> summaryRprof()
$by.self
         self.time self.pct total.time total.pct
"tmFor"      51.42    86.36      59.54    100.00
"abs"         2.80     4.70       2.80      4.70
"-"           2.76     4.64       2.76      4.64
"+"           2.42     4.06       2.42      4.06
"matrix"      0.12     0.20       0.12      0.20
":"           0.02     0.03       0.02      0.03

$by.total
         total.time total.pct self.time self.pct
"tmFor"       59.54    100.00     51.42    86.36
"abs"          2.80      4.70      2.80     4.70
"-"            2.76      4.64      2.76     4.64
"+"            2.42      4.06      2.42     4.06
"matrix"       0.12      0.20      0.12     0.20
":"            0.02      0.03      0.02     0.03
TM: R profiler does not help
tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k,j] <- abs(j - k) + 1
    }
  }
  b
}
Performance-critical part.
TM: Checking with a system profiler
env CFLAGS=-g ./configure --with-blas --with-lapack \
    --enable-R-static-lib --disable-BLAS-shlib
make

source("tm.r")
dummy <- tmFor(5000)

perf record -g -- ~/work/R/R-3.0.2/R-3.0.2-dbg/bin/R --slave < runtm.r
perf report -g
+ 9.91% R R [.] Rf_eval
+ 9.53% R R [.] Rf_cons
- 6.67% R R [.] Rf_findVarInFrame3
   - Rf_findVarInFrame3
      + 29.17% Rf_findVar
      + 7.84% EnsureLocal
      + 2.21% Rf_eval
...
+ 1.08% R R [.] real_binary
+ 0.75% R R [.] integer_binary
+ 0.74% R R [.] do_abs
Variable look-up: Rf_findVarInFrame3 (6.67%), called from Rf_findVar, EnsureLocal and Rf_eval.
R built-in functions can be changed
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
abs is a built-in function
abs can be changed at any time
> abs <- function(x) { x * x }
> abs(-10)
[1] 100

> for(i in 11:13) { if (i==12) { abs <- sqrt } ; print(abs(i)) }
[1] 11
[1] 3.464102
[1] 3.605551
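A function that calls abs is affected by such a redefinition without being modified itself, because the name is resolved on each call. A minimal sketch (the function name f is made up for this example):

```r
f <- function(x) abs(x)        # abs is looked up by name when f runs

before <- f(-10)               # base abs: 10
abs <- function(x) x * x       # shadow abs in the current environment
during <- f(-10)               # now 100, although f itself did not change
rm(abs)                        # remove the shadowing definition
after <- f(-10)                # base abs again: 10

print(c(before, during, after))  # 10 100 10
```

Removing the shadowing definition with rm() restores the built-in behaviour, which again shows that nothing was cached inside f.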
Variable look-up
R built-in functions can be changed
[Figure: environments searched when abs is looked up inside tmFor: the tmFor frame (n, b, j, k), then GlobalEnv (tmFor), then BaseNamespaceEnv, where abs is bound to .Primitive("abs").]
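The chain of environments can be walked explicitly. The sketch below uses a hypothetical helper, lookupPath, that records the name of every environment visited until the symbol is found; the number of steps depends on which packages are attached, so no exact output is shown.

```r
# Hypothetical helper: names of the environments visited while searching for `name`.
lookupPath <- function(name, env) {
  path <- character()
  repeat {
    path <- c(path, environmentName(env))    # "" for a function's own frame
    if (exists(name, envir = env, inherits = FALSE)) return(path)
    if (identical(env, emptyenv())) stop("not found: ", name)
    env <- parent.env(env)
  }
}

tmFrame <- function() {
  n <- 1                                     # local variable, as in tmFor
  lookupPath("abs", environment())           # start from this function's frame
}

path <- tmFrame()
print(path)   # starts at the function frame and ends at "base"
```

In a typical interactive session the path passes through the global environment and every attached package before reaching base, and the interpreter repeats this search each time abs is called in the loop.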
Variable look-up
R built-in functions can be changed
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
abs is a built-in function
+ - ( [ { <- for : are all built-in functions
> `:` <- sum
> 1:10
[1] 11
> `<-` <- function(x,val) { eval.parent( assign(deparse(substitute(x)), 100)) }
> z <- 10
[1] 100
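That operators and control-flow keywords are ordinary functions can be seen by calling them in prefix form. A small sketch, without redefining anything (the variable names are chosen for this example):

```r
sum_call   <- `+`(1, 2)                # same as 1 + 2
index_call <- `[`(c(10, 20, 30), 2)    # same as c(10, 20, 30)[2]
seq_call   <- `:`(1, 3)                # same as 1:3
paren_call <- `(`(5)                   # same as (5)

# Each of these names is bound to a function object that can be looked up.
builtins <- c("+", "-", "(", "[", "{", "<-", "for", ":")
is_fun <- sapply(builtins, function(nm) is.function(get(nm)))
print(is_fun)
```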
Variable look-up
Variables can be deleted
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
> x <- 10
> rm(x)
> x
Error: object 'x' not found

> for(i in 1:3) { if (i==2) { rm(i) } else print(i) }
[1] 1
[1] 3

> for(i in 1:3) { if (i==2) { rm(i) } ; print(i) }
[1] 1
Error in print(i) : object 'i' not found
Even the loop control variable can be deleted, so variable look-up is needed.
TM: Checking with a system profiler
+ 9.91% R R [.] Rf_eval
- 9.53% R R [.] Rf_cons                (linked-list allocation and use)
   - Rf_cons
      + 29.87% Rf_allocList
      + 24.96% Rf_evalList
      + 14.35% Rf_evalListKeepMissing
      + 6.04% Rf_lcons
      + 5.90% Rf_DispatchOrEval
      + 5.29% Rf_list2
      + 3.85% evalseq
      + 3.26% Rf_defineVar
      + 3.04% Rf_list1
      + 1.18% Rf_eval
      + 0.75% replaceCall
      + 0.52% evalArgs
+ 6.67% R R [.] Rf_findVarInFrame3     (variable look-up)
Linked-list allocation and use
Arguments passed as linked-list
for (j in 1:n) {
  for (k in 1:n) {
    b[k,j] <- abs(j - k) + 1
  }
}
Converted to a general replacement call of the form F(X) <- Y
The replacement call is then transformed:
F(X) <- Y      becomes   TMP <- X;  X <- "F<-"(TMP, value = Y)
b[k,j] <- Y    becomes   TMP <- b;  b <- "[<-"(TMP, k, j, value = Y)
Replacement call is expensive
Linked-list allocation and use
b[k,j] <- Y
TMP <- b
b <- "[<-"(TMP, k, j, value = Y)
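[Figure: the call b <- "[<-"(TMP, k, j, value = Y) drawn as linked lists: one list holding <-, b and the inner call, and one list holding [<-, TMP, k, j, Y.]
This linked list is allocated in each iteration.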
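The equivalence between the replacement syntax and the explicit call to the replacement function can be checked in R itself. This sketch only shows that both forms give the same result; it does not reproduce the interpreter's internal temporary variable or its list allocation.

```r
b1 <- matrix(0, nrow = 2, ncol = 2)
b2 <- b1
Y <- 7; k <- 1; j <- 2

b1[k, j] <- Y                        # ordinary replacement syntax
b2 <- `[<-`(b2, k, j, value = Y)     # explicit call to the replacement function

print(identical(b1, b2))  # TRUE
```

In the loop of tmFor this call, together with its argument list, is built once per matrix element.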
Toeplitz Matrix
Speeding-up R programs
R Byte-code compiler
env R_ENABLE_JIT=3 R
            AST        Bytecode
N = 500     650 ms     130 ms
N = 1000    2610 ms    530 ms
N = 1500    5910 ms    1150 ms
Always use byte-code compiler!
> require(compiler)
Loading required package: compiler
> help(cmpfun)
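A function can also be compiled explicitly with cmpfun() from the compiler package. The sketch below compiles tmFor and checks that the compiled copy returns the same matrix; timings are printed without expected values because they depend on the machine and R version. Since R 3.4.0 the JIT is enabled by default, so the example switches it off temporarily with enableJIT(0) to keep the plain version uncompiled.

```r
library(compiler)

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) {
    for (k in 1:n) {
      b[k, j] <- abs(j - k) + 1
    }
  }
  b
}

old <- enableJIT(0)              # turn the JIT off so tmFor stays uncompiled
tmForC <- cmpfun(tmFor)          # explicitly byte-compiled copy

same <- identical(tmFor(100), tmForC(100))
print(same)                      # TRUE: compilation does not change the result

print(system.time(tmFor(300)))   # AST interpreter
print(system.time(tmForC(300)))  # byte-code
invisible(enableJIT(old))        # restore the previous JIT level
```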
TM: Sapply
tmSapply <- function(n) {
  sapply(1:n, function(j) {
    sapply(1:n, function(k) {
      abs(j - k) + 1
    })
  })
}
            For        Sapply
N = 500     130 ms     320 ms
N = 1000    530 ms     1300 ms
N = 1500    1150 ms    2960 ms
Using sapply instead of for sometimes helps. Not in this case.
TM: Rows Algo
tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1,] <- 1:n
  if (n >= 2) {
    for(r in 2:n) {
      b[r,] <- c(r, b[r-1,-n])
    }
  }
  b
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            For        Rows
N = 500     130 ms     13 ms
N = 1000    530 ms     59 ms
N = 1500    1150 ms    169 ms
Much faster. Reduced calls, lookups.
TM: Cols Algo
tmCols <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[,1] <- 1:n
  if (n >= 2) {
    for(col in 2:n) {
      b[,col] <- c(col, b[-n, col-1])
    }
  }
  b
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
TM: Cols2 Algo
tmByCols <- function(n) {
  if (n >= 2) {
    sapply(1:n, function(col) {
      if (col < n) {
        c( col:1, 2:(n-col+1) )
      } else {
        n:1
      }
    })
  } else {
    1
  }
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            Rows       Cols2
N = 500     13 ms      5 ms
N = 1000    59 ms      39 ms
N = 1500    169 ms     58 ms
Much faster. Reduced calls, lookups.
TM: Outer Algo
tmOuter <- function(n) {
  outer(X = 1:n, Y = 1:n, FUN = function(j,k) {
    abs(j - k) + 1
  })
}
1 2 3 4 5
2 1 2 3 4
3 2 1 2 3
4 3 2 1 2
5 4 3 2 1
            Cols2      Outer      C
N = 500     5 ms       2 ms       0.2 ms
N = 1000    39 ms      27 ms      0.9 ms
N = 1500    58 ms      47 ms      2.1 ms
Yet faster. Vectorized. Also easy to read.
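Before comparing speed it is worth confirming that all variants compute the same matrix. The sketch below re-declares the functions from the slides in compact form and compares each result with toeplitz(1:n) from the stats package; the names ref, variants and agree are chosen for this example.

```r
n <- 6
ref <- toeplitz(1:n)     # reference result

tmFor <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  for (j in 1:n) for (k in 1:n) b[k, j] <- abs(j - k) + 1
  b
}
tmSapply <- function(n) {
  sapply(1:n, function(j) sapply(1:n, function(k) abs(j - k) + 1))
}
tmRows <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[1, ] <- 1:n
  if (n >= 2) for (r in 2:n) b[r, ] <- c(r, b[r - 1, -n])
  b
}
tmCols <- function(n) {
  b <- matrix(nrow = n, ncol = n)
  b[, 1] <- 1:n
  if (n >= 2) for (col in 2:n) b[, col] <- c(col, b[-n, col - 1])
  b
}
tmByCols <- function(n) {
  if (n >= 2) {
    sapply(1:n, function(col) if (col < n) c(col:1, 2:(n - col + 1)) else n:1)
  } else {
    1
  }
}
tmOuter <- function(n) outer(1:n, 1:n, function(j, k) abs(j - k) + 1)

variants <- list(For = tmFor, Sapply = tmSapply, Rows = tmRows,
                 Cols = tmCols, Cols2 = tmByCols, Outer = tmOuter)

# Element-wise comparison, since some variants return integer and some double.
agree <- sapply(variants, function(f) all(f(n) == ref))
print(agree)
```

The comparison uses == rather than identical() because the storage type (integer or double) differs between variants even though the values match.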
TM: Summary
            For        Outer      C          For-FastR
N = 500     130 ms     2 ms       0.2 ms     13 ms
N = 1000    530 ms     27 ms      0.9 ms     47 ms
N = 1500    1150 ms    47 ms      2.1 ms     101 ms
Summary
● Use byte-code compiler
● Vectorize
● Use built-ins (sum, prod, cumsum, outer)
● Use simplest data structure possible
– Matrix instead of data.frame
– Avoid data.frame indexing
● Save and re-use intermediate results
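Two of these recommendations can be illustrated with a small sketch: a built-in (cumsum) replacing an explicit loop, and the same element read from a matrix and from a data.frame. The data and variable names are made up for this example.

```r
# Built-in cumsum() versus an explicit loop computing the same running total.
x <- c(3, 1, 4, 1, 5, 9, 2, 6)
running <- numeric(length(x))
total <- 0
for (i in seq_along(x)) {
  total <- total + x[i]
  running[i] <- total
}
same_cumsum <- identical(running, cumsum(x))

# The same data as a matrix and as a data.frame: element access returns the
# same value, but matrix indexing does not go through the data.frame method.
m <- matrix(as.numeric(1:6), nrow = 2)
df <- as.data.frame(m)
same_value <- m[2, 3] == df[2, 3]

print(c(same_cumsum, same_value))  # TRUE TRUE
```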
Please consider donating your code/data in the form of benchmarks.