Computational Techniques for the Statistical Analysis of Big Data in R
A Case Study of the rlme Package
Herb Susmann, Yusuf Bilgic
April 12, 2014

- Workflow: Identify, Rewrite, Benchmark, Test
- Case Study (rlme): Identify, Wilcoxon Tau Estimator, Pairup, Covariance Estimator
- Summary
- Keeping Ahead

Motivation
- Case study: rlme package
  - Rank-based regression and estimation of two- and three-level nested effects models
- Goals: faster, less memory, more data
  - Before: 5,000 rows of data
  - After: 50,000 rows of data
Section 1
Workflow

Workflow
- Identify
- Rewrite
- Benchmark
- Test

Identify
- Know your big O! (O(n^2) memory usage? Probably not so good for big data.)
- Look for error messages
- Profiling with Rprof
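To make the O(n^2) warning concrete, here is a rough back-of-the-envelope sketch in R. It assumes a dense n × n matrix of doubles at 8 bytes per entry and ignores R's small per-object overhead; `dense_matrix_gb` is a hypothetical helper name, not part of rlme.

```r
# Approximate memory (in GB, with 1 GB = 1e9 bytes) of a dense
# n x n matrix of doubles, at 8 bytes per entry.
dense_matrix_gb <- function(n) {
  8 * n^2 / 1e9
}

dense_matrix_gb(5000)   # 0.2 GB
dense_matrix_gb(50000)  # 20 GB
```

Ten times the rows costs one hundred times the memory, which is why a quadratic structure that is harmless at 5,000 rows can be out of reach at 50,000.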

Rewrite
High-level design
- Algorithm design
- Statistical techniques: bootstrapping

Rewrite
Microbenchmarking
- Know what R is good at
- Avoid loops in favor of vectorization
- Preallocation
- Arguments are passed by value, not by reference
- Embrace C++
Be careful!

Vectorizing
    ## Bad: explicit loop over every element
    vec = 1:100
    for (i in seq_along(vec)) {
      vec[i] = vec[i]^2
    }
    ## Better: apply a function to each element
    sapply(vec, function(x) x^2)
    ## Best: vectorized arithmetic
    vec^2

Preallocation
    ## Bad: the vector is reallocated and copied on every iteration
    vec = c()
    for (i in 1:100) {
      vec = c(vec, i)
    }
    ## Better: allocate once, then fill in place
    vec = numeric(100)
    for (i in 1:100) {
      vec[i] = i
    }

Pass by value
    square <- function(x) {
      x <- x^2   # changes the function's own copy of x
      return(x)
    }
    x <- 1:100
    square(x)    # returns the squares; the caller's x is unchanged
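A small check of what this slide implies: the function works on its own copy, so the caller's vector is untouched. The names `square` and `x` follow the slide; `y` and the inspection lines are added for illustration only.

```r
square <- function(x) {
  x <- x^2   # rebinds the function's local x; the caller's x is not touched
  x
}

x <- 1:5
y <- square(x)

x  # still 1 2 3 4 5
y  # 1 4 9 16 25
```

Because a function cannot modify the caller's data in place, results have to be returned and reassigned, and each modification of a large argument allocates a new vector. That allocation is the cost to watch with big inputs.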

Benchmark
- Write several versions of a slow function
- Test them against each other
- Package: microbenchmark
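As a sketch of how these steps might look in practice, the following writes three versions of one computation, checks that they agree, and then times them. The function names are hypothetical, and the timing step only runs if the microbenchmark package happens to be installed.

```r
# Three versions of the same computation (squaring a numeric vector).
square_loop <- function(v) {
  out <- numeric(length(v))
  for (i in seq_along(v)) out[i] <- v[i]^2
  out
}
square_sapply <- function(v) sapply(v, function(x) x^2)
square_vec <- function(v) v^2

v <- as.numeric(1:10000)

# Check that all versions agree before comparing their speed.
stopifnot(identical(square_loop(v), square_vec(v)),
          identical(square_sapply(v), square_vec(v)))

# Time each version 20 times, if microbenchmark is available.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(
    loop = square_loop(v),
    sapply = square_sapply(v),
    vectorized = square_vec(v),
    times = 20
  ))
}
```

Checking agreement first matters: a faster version that returns different numbers is not an optimization.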

Test
- Regressions
- Unit Testing
- Package: testthat
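A minimal sketch of a regression test, assuming a rewritten function should reproduce the output of the version it replaces. `square_old` and `square_new` are hypothetical stand-ins, not rlme functions, and the test only runs if testthat is installed.

```r
# Hypothetical old and new versions of the same function.
square_old <- function(v) sapply(v, function(x) x^2)
square_new <- function(v) v^2

if (requireNamespace("testthat", quietly = TRUE)) {
  testthat::test_that("square_new matches square_old", {
    input <- c(-2, 0, 1.5, 10)
    # expect_equal compares with a small numerical tolerance
    testthat::expect_equal(square_new(input), square_old(input))
  })
}
```

Keeping the old implementation around as a reference makes it cheap to confirm that each optimization leaves the results unchanged.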
Section 2
Case Study: rlme

Identify
Over to R!
    Rprof("profile")          # start profiling; samples go to the file "profile"
    fit.rlme = rlme(...)      # the call being profiled
    Rprof(NULL)               # stop profiling
    summaryRprof("profile")   # summarize time spent per function
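The same pattern in a self-contained form, for trying it without rlme. `slow_pairs` is a hypothetical stand-in workload (a quadratic pairwise-difference computation), and the temporary file name is an arbitrary choice.

```r
# Stand-in workload (hypothetical): sorted absolute pairwise differences.
slow_pairs <- function(x) {
  d <- outer(x, x, "-")
  sort(abs(d[upper.tri(d)]))
}

prof_file <- tempfile()
Rprof(prof_file)                 # start sampling
res <- slow_pairs(rnorm(1500))
Rprof(NULL)                      # stop sampling

# summaryRprof() can fail if the run was too short to record any samples,
# so the call is guarded here.
prof <- tryCatch(summaryRprof(prof_file), error = function(e) NULL)
if (!is.null(prof)) print(head(prof$by.self))
```

The `by.self` table lists functions by the time spent in their own code, which is usually the quickest way to spot the bottleneck.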

Wilcoxon Tau Estimator
- Rank-based scale estimator of residuals
- Uses pairup (so already O(n^2))

Wilcoxon Tau Estimator
Original:
    dresd <- sort(abs(temp[, 1] - temp[, 2]))
    dresd = dresd[(p + 1):choose(n, 2)]
What’s wrong? Bad algorithm (the sort is at least O(n log n)), and the variable gets copied multiple times.
Updated with C++:
    dresd = remove.k.smallest(dresd)
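The C++ `remove.k.smallest` itself is not shown on the slides. As a rough sketch of the same idea in plain R, a partial sort avoids fully ordering the vector: `sort(x, partial = k)` only guarantees that position `k` holds the value a full sort would put there, with no larger values before it. The helper name `drop_k_smallest` and its explicit `k` argument are assumptions for illustration, and unlike the original two-line version the result is not fully sorted.

```r
# Drop the k smallest values of x without fully sorting it.
# Hypothetical helper, not the rlme implementation.
drop_k_smallest <- function(x, k) {
  if (k <= 0) return(x)
  if (k >= length(x)) return(x[0])
  # After a partial sort, the first k positions hold the k smallest values.
  sort(x, partial = k)[-seq_len(k)]
}

x <- c(5, 1, 4, 2, 3, 9, 7)
drop_k_smallest(x, 3)  # the values 4, 5, 7, 9 in some order
```

Whether the remaining values need to be in sorted order depends on what the estimator does with them next, which the slides do not say.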
Wilcoxon Tau Estimator
Test with 2,000 residuals: better!

Wilcoxon Tau
- But what about really huge inputs?
- Bootstrapping: when over 5,000 rows, repeat the estimate on 1,000 sampled points 100 times
- Not about speed, but about memory
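A sketch of the scheme described above, under several assumptions the slides do not confirm: points are sampled with replacement, the 100 estimates are combined by averaging, and `mad` stands in for the Wilcoxon tau estimator. The function and parameter names are hypothetical.

```r
# Estimate on the full data when it is small enough; otherwise average
# the estimator over repeated random subsamples.
subsample_estimate <- function(x, estimator, threshold = 5000,
                               sample_size = 1000, reps = 100) {
  if (length(x) <= threshold) {
    return(estimator(x))
  }
  estimates <- vapply(seq_len(reps), function(r) {
    # Assumption: sampling with replacement, as in a bootstrap.
    estimator(sample(x, sample_size, replace = TRUE))
  }, numeric(1))
  # Assumption: the subsample estimates are combined by their mean.
  mean(estimates)
}

set.seed(1)
x <- rnorm(50000)
subsample_estimate(x, mad)  # close to 1 for standard normal data
```

The memory saving comes from the pairwise step: 1,000 points produce 499,500 pairs per repetition, while 50,000 points would produce about 1.25 billion pairs at once.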

Pairup
- Pairup function: generates every possible pair from the input vector
- Some rank-based estimators require pairwise operations
- O(n^2) complexity
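To illustrate what "every possible pair" means and how fast the count grows, here is a small sketch using base R's `combn`. The wrapper name `pairup_combn` and the one-pair-per-row layout are assumptions, not the rlme function itself.

```r
# All unordered pairs of a vector, one pair per row.
# Note: combn() treats a single positive integer x as seq_len(x).
pairup_combn <- function(x) t(combn(x, 2))

pairup_combn(c(10, 20, 30))
# rows: (10, 20), (10, 30), (20, 30)

# The number of pairs is choose(n, 2) = n * (n - 1) / 2.
choose(5000, 2)   # 12,497,500 pairs
choose(50000, 2)  # 1,249,975,000 pairs
```

Going from 5,000 to 50,000 rows multiplies the number of pairs by roughly one hundred.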

Pairup
- Original version: vectorized (14 LOC)
- Loop version (12 LOC)
- "combn" version (core R function, 1 LOC)
- C++ version (12 LOC)
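The slides do not include the source of these versions. As a hedged sketch of what a loop version could look like, the following preallocates the result matrix and fills it pair by pair, assuming numeric input; `pairup_loop` is a hypothetical name, and the result is checked against the one-line `combn` approach.

```r
# Loop version: preallocate a choose(n, 2) x 2 matrix, then fill it.
# Hypothetical sketch, assuming a numeric input vector.
pairup_loop <- function(x) {
  n <- length(x)
  if (n < 2) return(matrix(NA_real_, nrow = 0, ncol = 2))
  pairs <- matrix(NA_real_, nrow = choose(n, 2), ncol = 2)
  row <- 1
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      pairs[row, 1] <- x[i]
      pairs[row, 2] <- x[j]
      row <- row + 1
    }
  }
  pairs
}

x <- c(3, 1, 4, 1)
# Same pairs, in the same order, as the combn one-liner.
stopifnot(all(pairup_loop(x) == t(combn(x, 2))))
```

Versions like these are natural candidates for a head-to-head timing, since they produce identical output.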
Over to R!

Covariance Estimator
- n × n covariance matrix
- Change to preallocation
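The slides do not show the estimator's code, so the following is only a generic sketch of the change named above: growing an n × n matrix row by row versus allocating it once and filling it in. `row_values` is a hypothetical stand-in for whatever computes one row of the covariance matrix.

```r
# Hypothetical stand-in for the per-row computation.
row_values <- function(i, n) 0.5^abs(i - seq_len(n))

# Growing: every rbind() allocates a larger matrix and copies the old one.
build_growing <- function(n) {
  m <- NULL
  for (i in seq_len(n)) m <- rbind(m, row_values(i, n))
  m
}

# Preallocated: one allocation up front, then rows are assigned by index.
build_preallocated <- function(n) {
  m <- matrix(0, nrow = n, ncol = n)
  for (i in seq_len(n)) m[i, ] <- row_values(i, n)
  m
}
```

Both functions produce the same values, but the growing version copies the partial matrix on every iteration, which becomes expensive as n increases.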

Summary
- Identify
- Rewrite
- Benchmark
- Test

Keeping Ahead
- Parallelism
  - Cluster: Rmpi, snow
  - GPU: rpud
  - Probably not Hadoop, maybe Apache Spark?
- Julia Language
- Hadley Wickham (plyr, ggplot2, testthat, ...)
  - “Advanced R Programming”
Questions?