sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.5.2 magrittr_1.5 tools_3.5.2 htmltools_0.3.6 yaml_2.2.0 Rcpp_1.0.0 stringi_1.2.4
## [8] rmarkdown_1.11 knitr_1.21 stringr_1.3.1 xfun_0.4 digest_0.6.18 evaluate_0.12
- Scientific planning: What experiments would verify or invalidate our hypotheses? What parameter settings should we consider?
- Code planning: What does the code need to do? How will the pieces fit together? What functions will be used? What are their inputs and outputs?
- Implementation:
  - Prototype functions, classes, etc.; write partial documentation.
  - Write unit tests (a sketch follows this list).
  - Implement code, run unit tests, debug.
  - Broader testing, more debugging.
  - Profile code, identify bottlenecks.
  - Optimize code.
  - Conduct experiments.
  - Write full documentation.
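To make the unit-testing step concrete, here is a minimal sketch using the testthat package; the square function is a hypothetical stand-in, not part of these notes:
library(testthat)
# hypothetical function under test
square <- function(x) x^2
test_that("square works on scalars and vectors", {
  expect_equal(square(3), 9)
  expect_equal(square(c(-1, 2)), c(1, 4))
})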
After profiling, what can we do to improve performance?
- Ask: are there obvious speedups? Are things being computed unnecessarily? Are you using a data.frame where you should be using a matrix (see the sketch after this list)?
- Look up your problem (e.g., search for “lapply slow” or “speeding up lapply”).
- Try the just-in-time (JIT) compiler.
- Consider re-writing some or all of the code in a compiled language (e.g., C/C++).
- Try parallelization.
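As a minimal sketch of the data.frame-versus-matrix point (assuming the microbenchmark package is available): element access on a matrix is noticeably cheaper than on a data.frame.
library(microbenchmark)
m  <- matrix(rnorm(1e4), nrow = 100)   # 100 x 100 numeric matrix
df <- as.data.frame(m)                 # same data stored as a data.frame
microbenchmark(m[50, 50], df[50, 50])  # matrix indexing avoids data.frame overhead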
R’s typical execution model: R code is parsed and then evaluated by the interpreter, statement by statement.
The compiler package by Luke Tierney is distributed with base R; it compiles an R function into bytecode.
A brute-force for loop for summing a vector:
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
Run the code on a vector of about 10^6 elements:
library(microbenchmark)
library(ggplot2)
x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 38.79935 39.2748 40.12053 39.73677 40.27623 61.49744 100
Let’s compile the function into a bytecode version sum_rc and benchmark again:
library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7feb39f6e448>
Benchmark again:
microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 38.81222 39.23851 39.70211 39.63766 40.01531 42.25711 100
## sum_rc(x) 38.75978 39.28320 40.03386 39.66122 40.24942 57.97186 100
Surprise! Surprise! Compiling into bytecode does not help at all. The following code shows that the function sum_r has already been compiled into bytecode before execution (note the <bytecode: ...> line):
sum_r
## function(x) {
## sumx <- 0.0
## for (i in 1:length(x)) {
## sumx <- sumx + x[i]
## }
## return(sumx)
## }
## <bytecode: 0x7feb3a4fef20>
Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:
enableJIT(0) # set JIT level to 0
## [1] 3
sum_r <- function(x) {
sumx <- 0.0
for (i in 1:length(x)) {
sumx <- sumx + x[i]
}
return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
## expr min lq mean median uq max neval
## sum_r(x) 291.0221 293.598 299.5057 295.8417 298.8963 348.293 100
Now we witness the slowness of the uncompiled sum_r.
Documentation of enableJIT:
> enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.
Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.
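Per the documentation quoted above, a negative argument queries the current JIT level without changing it. A minimal sketch (the jit_level name is just a placeholder):
library(compiler)
jit_level <- enableJIT(-1) # query the current JIT level without changing it
jit_level                  # 3 by default since R 3.4.0
enableJIT(jit_level)       # (re)set the level explicitly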
If you create a package, you can have it byte-compiled automatically on installation by adding ByteCompile: true to the DESCRIPTION file.
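For example, a hypothetical DESCRIPTION file might include:
Package: mypkg
Title: A Hypothetical Example Package
Version: 0.1.0
ByteCompile: true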
Note that Matlab has employed JIT technology since 2002, and Julia is designed around JIT from the ground up. R is finally in the same boat.
Learning sources:
- Advanced R: http://adv-r.had.co.nz/Rcpp.html
The compiler package compiles R code into bytecode, which is translated to machine code by the interpreter during execution. Low-level languages such as C, C++, and Fortran are compiled into machine code directly, yielding maximum efficiency.
cppFunction
The Rcpp package provides a convenient way to embed C++ code in R code.
library(Rcpp)
cppFunction('double sum_c(NumericVector x) {
int n = x.size();
double total = 0;
for(int i = 0; i < n; ++i) {
total += x[i];
}
return total;
}')
sum_c
## function (x)
## .Call(<pointer: 0x1098ad680>, x)
Benchmark (1) the compiled C++ function sum_c together with (2) the R function sum_r, (3) the compiled R function sum_rc, and (4) the sum function in base R:
mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: microseconds
## expr min lq mean median uq max neval
## sum_r(x) 286115.953 293170.2815 297998.1220 295627.684 298737.981 347555.538 100
## sum_rc(x) 38797.355 39296.7095 39838.9873 39665.329 40078.643 47973.272 100
## sum_c(x) 1153.977 1173.6640 1199.9589 1176.004 1216.265 1677.500 100
## sum(x) 873.037 895.6295 935.8121 902.612 935.079 1246.107 100
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
Remember we turned off JIT by enableJIT(0) earlier, so sum_r runs uncompiled here.
sourceCpp
In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example:
cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
##
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## // http://www.rcpp.org/
## // http://adv-r.had.co.nz/Rcpp.html
## // http://gallery.rcpp.org/
## //
##
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
## int n = x.size();
## double total = 0;
## for(int i = 0; i < n; ++i) {
## total += x[i];
## }
## return total;
## }
##
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically
## // run after the compilation.
## //
##
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
##
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x)
## .Call(<pointer: 0x109ba3680>, x)
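A quick sanity check we might run, using the x defined earlier: the sourced C++ function should agree with base R's sum up to floating-point tolerance.
all.equal(sum_c(x), sum(x)) # compare the C++ result against base R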
As a Julia fan, I want to see how Julia performs on this example. sum_j is the Julia equivalent of the sum_r function:
# Julia code in this chunk
using BenchmarkTools
function sum_j(x)
total = zero(eltype(x))
for xi in x
total += xi
end
total
end
## sum_j (generic function with 1 method)
sum_j(1:10)
## 55
x = collect(0:0.0001:100);
@benchmark sum_j(x) samples=100
## BenchmarkTools.Trial:
## memory estimate: 16 bytes
## allocs estimate: 1
## --------------
## minimum time: 1.125 ms (0.00% GC)
## median time: 1.204 ms (0.00% GC)
## mean time: 1.264 ms (0.00% GC)
## maximum time: 1.666 ms (0.00% GC)
## --------------
## samples: 100
## evals/sample: 1
Julia retains the clarity of a high-level language while achieving the efficiency of a low-level language!
My suggestion: avoid all the hassles of the two-language problem and just use Julia 😄
Fact: base R is single-threaded. Even if you request a fancy instance with 96 vCPUs, running plain R code uses only 1/96th of its power.
To perform multi-core computation in R:
- Option 1: Manually run multiple R sessions.
- Option 2: Make multiple system("Rscript") calls, typically automated by a scripting language (Python, Perl, shell script) or from within R (HW1 Q3); a sketch follows this list.
- Option 3: Use the parallel package.
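A hypothetical sketch of Option 2; the script name runSim.R and its argument handling are assumptions, not shown here:
# launch several independent Rscript processes without blocking the current session
for (seed in 1:4) {
  system(paste("Rscript runSim.R", seed), wait = FALSE)
}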
The parallel package in R:
- Authors: Brian Ripley, Luke Tierney, Simon Urbanek.
- Included in base R since 2.14.0 (2011).
- Based on the snow (Luke Tierney) and multicore (Simon Urbanek) packages.
To find the number of cores:
library(parallel)
detectCores()
## [1] 8
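Note that detectCores() counts logical (hyperthreaded) cores by default; to count only physical cores we can pass logical = FALSE:
detectCores(logical = FALSE) # physical cores only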
Let’s revisit the simulation example considered in an earlier lecture and HW1 Q4. We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.
## check if a given integer is prime
## (note: the n <= 3 shortcut below also declares 1 prime, which we tolerate here)
isPrime = function(n) {
if (n <= 3) {
return (TRUE)
}
if (any((n %% 2:floor(sqrt(n))) == 0)) {
return (FALSE)
}
return (TRUE)
}
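A quick spot-check of isPrime on small integers (the expected result in the comment follows from the function's logic):
sapply(1:10, isPrime) # TRUE TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE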
## estimate the mean using only the observations with prime indices
estMeanPrimes = function(x) {
n <- length(x)
ind <- sapply(1:n, isPrime)
return (mean(x[ind]))
}
We want to compare our method to the traditional sample average estimator by simulation studies.
## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
# set seed according to command argument `seed`
set.seed(seed)
# initialize accumulators for the MSEs of the two estimators
msePrimeAvg <- 0.0
mseSamplAvg <- 0.0
# loop over simulation replicates
for (r in 1:reps) {
# simulate data according to arguments `n` and `dist`
if (dist == "gaussian") {
x = rnorm(n)
} else if (dist == "t1") {
x = rcauchy(n)
} else if (dist == "t5") {
x = rt(n, 5)
} else {
stop(paste("unrecognized dist: ", dist))
}
# prime indexed mean estimator and classical sample average estimator
msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2
mseSamplAvg <- mseSamplAvg + mean(x)^2
}
mseSamplAvg <- mseSamplAvg / reps
msePrimeAvg <- msePrimeAvg / reps
return(c(mseSamplAvg, msePrimeAvg))
}
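A single invocation returns the two MSEs; since all three generative models are symmetric about 0, averaging the squared estimates over replicates estimates the MSE of each estimator:
# returns c(MSE of sample average, MSE of prime-indexed average)
compare_methods(dist = "gaussian", n = 100, reps = 100, seed = 123)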
We need to loop over 3 generative models (distTypes) and 19 sample sizes (nVals). That is 57 embarrassingly parallel tasks.
seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")
This is the serial code that double-loops over the combinations of distTypes and nVals:
## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
for (dist in distTypes) {
for (n in nVals) {
#print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
i <- i + 2
}
}
)
## user system elapsed
## 29.484 0.105 29.667
simres1
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
mcmapply
Run the same task using the mcmapply function (the parallel analog of mapply) in the parallel package:
## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
simres2 <- mcmapply(compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
mc.cores = 4)
})
## user system elapsed
## 23.927 0.465 8.682
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
We see a roughly 3x-4x speedup with mc.cores = 4.
mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g., Linux, macOS) and are not available on Windows.
parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions create a cluster of workers based on either sockets (the default) or forking. Sockets are available on all platforms: Linux, macOS, and Windows.
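A minimal sketch of the two cluster types (the worker count 4 is arbitrary):
library(parallel)
cl <- makeCluster(4, type = "PSOCK")  # socket cluster: all platforms; workers start with empty workspaces
stopCluster(cl)
# cl <- makeCluster(4, type = "FORK") # fork cluster: POSIX only; workers inherit the master's workspace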
clusterMap
The same simulation example using the clusterMap function:
cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
simres3 <- clusterMap(cl, compare_methods,
rep(distTypes, each = length(nVals), times = 1),
rep(nVals, each = 1, times = length(distTypes)),
reps,
seed,
.scheduling = "dynamic")
})
## user system elapsed
## 0.022 0.005 6.666
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
## [,1] [,2] [,3]
## [1,] 0.0103989436 0.017070603 312.4001
## [2,] 0.0410217819 0.066503177 200.3237
## [3,] 0.0065484669 0.011260420 173.9631
## [4,] 0.0297390639 0.047465397 199.5330
## [5,] 0.0056445593 0.007754855 68026.7023
## [6,] 0.0215380206 0.040039145 1230343.1341
## [7,] 0.0040803523 0.006871930 43609.7815
## [8,] 0.0165144049 0.032353077 931755.3182
## [9,] 0.0032566766 0.005417194 30283.1286
## [10,] 0.0161191554 0.026330133 684726.8788
## [11,] 0.0027565672 0.004444172 22306.1369
## [12,] 0.0145039253 0.022820075 539105.3392
## [13,] 0.0024915830 0.003798500 17119.0807
## [14,] 0.0122801335 0.022788299 435528.4778
## [15,] 0.0023706676 0.003360507 13531.4104
## [16,] 0.0112703627 0.016674640 111.4673
## [17,] 0.0020190283 0.003147367 10973.3489
## [18,] 0.0106157492 0.016027485 278.9663
## [19,] 0.0017567901 0.002863640 9069.6647
## [20,] 0.0096185720 0.016444671 261373.7646
## [21,] 0.0016441481 0.002637964 7629.9867
## [22,] 0.0081784426 0.013710539 296.6235
## [23,] 0.0015075246 0.002498450 6498.2362
## [24,] 0.0088018140 0.012909942 191986.6388
## [25,] 0.0014372130 0.002308089 5603.9395
## [26,] 0.0077292632 0.012789483 171280.5448
## [27,] 0.0012924543 0.002216936 4889.8739
## [28,] 0.0069012154 0.011052562 170.3975
## [29,] 0.0011994654 0.001987311 4299.6838
## [30,] 0.0067559611 0.011291788 178.7454
## [31,] 0.0011642413 0.001888637 3806.8472
## [32,] 0.0070131993 0.010048327 140.2389
## [33,] 0.0011566121 0.001873365 3401.9766
## [34,] 0.0065066558 0.009020973 34.1644
## [35,] 0.0010506067 0.001595430 3049.0432
## [36,] 0.0060026682 0.010338424 103578.0598
## [37,] 0.0009770234 0.001618095 2768.4517
## [38,] 0.0054705674 0.009229294 143.5544
Again we see a roughly 3x-4x speedup using 4 cores.
Unlike forked workers, socket workers start with empty workspaces, so clusterExport is needed to copy the named objects from the master to each worker.
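A minimal sketch of preparing socket workers (myConst is a hypothetical object):
library(parallel)
cl <- makeCluster(2)
myConst <- 42
clusterExport(cl, "myConst")       # copy the named object from the master to each worker
clusterEvalQ(cl, library(stats))   # evaluate an expression (e.g., load a package) on each worker
parSapply(cl, 1:4, function(i) i * myConst)
stopCluster(cl)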
Learning resources:
- Book _R Packages_ by Hadley Wickham
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio