sessionInfo()
## R version 3.5.2 (2018-12-20)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.2
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.5.2  magrittr_1.5    tools_3.5.2     htmltools_0.3.6 yaml_2.2.0      Rcpp_1.0.0      stringi_1.2.4  
##  [8] rmarkdown_1.11  knitr_1.21      stringr_1.3.1   xfun_0.4        digest_0.6.18   evaluate_0.12

Typical development cycle for computational statistics

  1. Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?

  2. Code planning: What does the code need to do? How will the code fit together? What functions will be used? What are their inputs/outputs etc.

  3. Implementation:

    1. Prototype functions, classes, etc., partial documentation

    2. Write unit tests

    3. Implement code, run unit tests, debug

    4. Broader testing, more debugging

    5. Profile code, identify bottlenecks

    6. Optimize code

  4. Conduct experiments.

  5. Full documentation.

Bytecode compilation

Example: summing a vector

Brute-force for loop for summing a vector:

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }

Run the code on 1e6 elements:

library(microbenchmark)
library(ggplot2)

x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min      lq     mean   median       uq      max neval
##  sum_r(x) 38.79935 39.2748 40.12053 39.73677 40.27623 61.49744   100

Let’s compile the function into bytecode sum_rc and benchmark again:

library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7feb39f6e448>

Benchmark again:

microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
##       expr      min       lq     mean   median       uq      max neval
##   sum_r(x) 38.81222 39.23851 39.70211 39.63766 40.01531 42.25711   100
##  sum_rc(x) 38.75978 39.28320 40.03386 39.66122 40.24942 57.97186   100

Surprise! Surprise! Compiling into bytecode does not help at all. Following code shows that the function sum_r is already compiled into bytecode before execution.

sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7feb3a4fef20>

Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:

enableJIT(0) # set JIT leval to 0
## [1] 3
sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min      lq     mean   median       uq     max neval
##  sum_r(x) 291.0221 293.598 299.5057 295.8417 298.8963 348.293   100

Now we witness the slowness of the un-compiled sum_r.

Documentation of enableJIT:

enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.

Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.

If you create a package, then you automatically compile the package on installation by adding

ByteCompile: true

to the DESCRIPTION file.

Note Matlab has employed JIT technology since 2002 and Julia is designed totally based on JIT. R finally is on the same boat.

Rcpp

Learning sources:
- Advanced R: http://adv-r.had.co.nz/Rcpp.html

compiler package compiles R code into bytecode, which is translated to machine code by interpreter during execution. A low-level language such as C, C++, and Fortran is compiled into machine code directly, yielding the maximum efficiency.

Use cppFunction

Rcpp package provides a convenient way to embed C++ code in R code.

library(Rcpp)

cppFunction('double sum_c(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
sum_c
## function (x) 
## .Call(<pointer: 0x1098ad680>, x)

Benchmark (1) compiled C++ function sum_c together with (2) R function sum_r, (3) compiled R function sum_rc, and (4) the sum function in base R:

mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: microseconds
##       expr        min          lq        mean     median         uq        max neval
##   sum_r(x) 286115.953 293170.2815 297998.1220 295627.684 298737.981 347555.538   100
##  sum_rc(x)  38797.355  39296.7095  39838.9873  39665.329  40078.643  47973.272   100
##   sum_c(x)   1153.977   1173.6640   1199.9589   1176.004   1216.265   1677.500   100
##     sum(x)    873.037    895.6295    935.8121    902.612    935.079   1246.107   100
autoplot(mbm)
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.

Remember we turned off JIT by enableGIT(0) earlier.

Use sourceCpp

In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example

cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
## 
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp 
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## //   http://www.rcpp.org/
## //   http://adv-r.had.co.nz/Rcpp.html
## //   http://gallery.rcpp.org/
## //
## 
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
##   int n = x.size();
##   double total = 0;
##   for(int i = 0; i < n; ++i) {
##     total += x[i];
##   }
##   return total;
## }
## 
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically 
## // run after the compilation.
## //
## 
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
## 
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x) 
## .Call(<pointer: 0x109ba3680>, x)

Julia

As a Julia fan, I want to see how Julia works on this example. sum_j is Julia equivalent of the sum_r function:

# Julia code in this chunk
using BenchmarkTools

function sum_j(x)
    total = zero(eltype(x))
    for xi in x
        total += xi
    end
    total
end
## sum_j (generic function with 1 method)
sum_j(1:10)
## 55

x = collect(0:0.0001:100);
@benchmark sum_j(x) samples=100
## BenchmarkTools.Trial: 
##   memory estimate:  16 bytes
##   allocs estimate:  1
##   --------------
##   minimum time:     1.125 ms (0.00% GC)
##   median time:      1.204 ms (0.00% GC)
##   mean time:        1.264 ms (0.00% GC)
##   maximum time:     1.666 ms (0.00% GC)
##   --------------
##   samples:          100
##   evals/sample:     1

Julia retains the claritiy of a high-level language, while achieving efficiency of a low-level language!

My suggestion: avoid all these hassles of two language problem, just use Julia 😄

Parallel computing

Simulation example

Let’s re-visit the simulation example considered in earlier lecture and HW1 Q4.

We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.

## check if a given integer is prime
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate mean only using observation with prime indices
estMeanPrimes = function(x) {
  n <- length(x)
  ind <- sapply(1:n, isPrime)
  return (mean(x[ind]))
}

We want to compare our method to the traditional sample average estimator by simulation studies.

## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
  # set seed according to command argument `seed`
  set.seed(seed)
  
  # preallocate space to store estimators
  msePrimeAvg <- 0.0
  mseSamplAvg <- 0.0
  # loop over simulation replicates
  for (r in 1:reps) {
    # simulate data according to command arguments `n` and `distr`
    if (dist == "gaussian") {
      x = rnorm(n)
    } else if (dist == "t1") {
      x = rcauchy(n)
    } else if (dist == "t5") {
      x = rt(n, 5)
    } else {
      stop(paste("unrecognized dist: ", dist))
    }
    # prime indexed mean estimator and classical sample average estimator
    msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2 
    mseSamplAvg <- mseSamplAvg + mean(x)^2
  }
  mseSamplAvg <- mseSamplAvg / reps
  msePrimeAvg <- msePrimeAvg / reps
  return(c(mseSamplAvg, msePrimeAvg))
}

Serial code

We need to loop over 3 generative models (distTypes) and 20 samples sizes (nVals). That are 60 embarssingly parallel tasks.

seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")

This is the serial code that double-loop over combinations of distTypes and nVals:

## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # entry index
system.time(
  for (dist in distTypes) {
    for (n in nVals) {
      #print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
      simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
      i <- i + 2
    }
  }
)
##    user  system elapsed 
##  29.484   0.105  29.667
simres1
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

Using mcmapply

Run the same task using mcmapply function (parallel analog of mapply) in the parallel package:

## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
  simres2 <- mcmapply(compare_methods, 
                      rep(distTypes, each = length(nVals), times = 1),
                      rep(nVals, each = 1, times = length(distTypes)),
                      reps, 
                      seed,
                      mc.cores = 4)
})
##    user  system elapsed 
##  23.927   0.465   8.682
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544
  • We see roughly 3x-4x speedup with mc.cores=4.

  • mcmapply, mclapply and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, MacOS) and is not available in Windows.

  • parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions create a cluster of workers based on either socket (default) or forking. Socket is available on all platforms: Linux, MacOS, and Windows.

Using clusterMap

The same simulation example using clusterMap function:

cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
  simres3 <- clusterMap(cl, compare_methods,
                        rep(distTypes, each = length(nVals), times = 1),
                        rep(nVals, each = 1, times = length(distTypes)),
                        reps,
                        seed,
                        .scheduling = "dynamic")
})
##    user  system elapsed 
##   0.022   0.005   6.666
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544
  • Again we see roughly 3x-4x speedup by using 4 cores.

  • clusterExport copies environment of master to slaves.

Package development

Learning resources:
- Book _R Packages_by Hadley Wickham
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio