sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2 tools_3.4.3     htmltools_0.3.6 yaml_2.1.18    
##  [8] Rcpp_0.12.15    stringi_1.1.6   rmarkdown_1.9   knitr_1.20      stringr_1.3.0   digest_0.6.15   evaluate_0.10.1

Typical development cycle for computational statistics

  1. Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?

  2. Code planning: What does the code need to do? How will the pieces fit together? What functions will be needed? What are their inputs and outputs?

  3. Implementation:

    1. Prototype functions, classes, etc., partial documentation

    2. Write unit tests (a minimal testthat sketch follows this list)

    3. Implement code, run unit tests, debug

    4. Broader testing, more debugging

    5. Profile code, identify bottlenecks

    6. Optimize code

  4. Conduct experiments.

  5. Full documentation.
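
To make step 3.2 concrete, here is a minimal unit-test sketch, assuming the testthat package; sum_r is the brute-force summing function introduced in the next section.

library(testthat)

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}

test_that("sum_r agrees with base R sum()", {
  expect_equal(sum_r(1:10), 55)
  expect_equal(sum_r(c(0.5, 0.25)), sum(c(0.5, 0.25)))
})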

Bytecode compilation

Example: summing a vector

Brute-force for loop for summing a vector:

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }

Run the code on a vector of about 1e6 elements:

library(microbenchmark)
library(ggplot2)

x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min       lq     mean   median      uq    max neval
##  sum_r(x) 29.82172 30.97363 32.46076 32.13255 33.8475 40.895   100

Let’s compile the function into bytecode with compiler::cmpfun() and benchmark again:

library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fd085863c30>

Benchmark again:

microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
##       expr      min       lq     mean   median       uq      max neval
##   sum_r(x) 29.97965 30.74283 32.22137 31.61499 33.52942 37.89567   100
##  sum_rc(x) 29.90078 31.30309 32.77624 32.45946 34.06413 38.40853   100

Surprise! Surprise! Compiling into bytecode does not help at all. The following output shows that sum_r had already been compiled into bytecode before execution (note the <bytecode: ...> line):

sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fd0869c5318>

Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:

enableJIT(0) # set JIT level to 0; returns the previous level (3)
## [1] 3
sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  sum_r(x) 267.5341 287.3711 312.8531 296.7157 339.8439 475.8979   100

Now we witness the slowness of the uncompiled sum_r.

Documentation of enableJIT:

enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.

Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.
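
Per the documentation above, a negative argument queries the current JIT level, and enableJIT returns the previous level, so the setting can be saved and restored:

library(compiler)
enableJIT(-1)             # negative argument: query the current JIT level (default 3)
oldlevel <- enableJIT(0)  # disable JIT; the previous level is returned
enableJIT(oldlevel)       # restore the saved level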

If you create a package, you can have it byte-compiled automatically on installation by adding

ByteCompile: true

to the DESCRIPTION file.

Matlab has employed JIT technology since 2002, and Julia is designed around JIT from the ground up. R is finally in the same boat.

Rcpp

Learning resources:
- Advanced R: http://adv-r.had.co.nz/Rcpp.html

The compiler package compiles R code into bytecode, which is still executed by an interpreter (R's bytecode virtual machine) at run time. A low-level language such as C, C++, or Fortran is compiled directly into machine code, yielding maximum efficiency.
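
We can peek at the bytecode itself: the compiler package can compile a quoted expression and disassemble it (a small illustration; the exact instruction listing varies across R versions):

library(compiler)
e <- compile(quote(x + 1))  # compile an R expression into a bytecode object
disassemble(e)              # list the bytecode instructions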

Use cppFunction

The Rcpp package provides a convenient way to embed C++ code in R code.

library(Rcpp)

cppFunction('double sum_c(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
sum_c
## function (x) 
## .Primitive(".Call")(<pointer: 0x10d94d610>, x)

Benchmark (1) the compiled C++ function sum_c together with (2) the uncompiled R function sum_r, (3) the bytecode-compiled R function sum_rc, and (4) the sum function in base R:

mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: microseconds
##       expr        min         lq       mean     median         uq        max neval
##   sum_r(x) 270604.051 300864.541 333554.617 328140.384 361100.377 466015.998   100
##  sum_rc(x)  30361.578  32239.901  34584.683  33613.658  35300.762  49891.463   100
##   sum_c(x)   1140.445   1226.534   1337.845   1260.515   1436.327   2284.460   100
##     sum(x)    869.578    955.242   1079.188   1001.925   1142.850   1917.645   100
autoplot(mbm)

Remember that we turned off JIT with enableJIT(0) earlier, so sum_r here runs without bytecode compilation.

Use sourceCpp

In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example:

cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
## 
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp 
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## //   http://www.rcpp.org/
## //   http://adv-r.had.co.nz/Rcpp.html
## //   http://gallery.rcpp.org/
## //
## 
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
##   int n = x.size();
##   double total = 0;
##   for(int i = 0; i < n; ++i) {
##     total += x[i];
##   }
##   return total;
## }
## 
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically 
## // run after the compilation.
## //
## 
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
## 
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x) 
## .Primitive(".Call")(<pointer: 0x10dc86610>, x)

Julia

As a Julia fan, I want to see how Julia does on this example. sum_j is the Julia equivalent of the sum_r function:

# Julia code in this chunk
using BenchmarkTools

function sum_j(x)
    total = zero(eltype(x))
    for xi in x
        total += xi
    end
    total
end
## sum_j (generic function with 1 method)
sum_j(1:10)
## 55

x = collect(0:0.0001:100);
@benchmark sum_j(x) samples=100
## BenchmarkTools.Trial: 
##   memory estimate:  16 bytes
##   allocs estimate:  1
##   --------------
##   minimum time:     1.152 ms (0.00% GC)
##   median time:      1.162 ms (0.00% GC)
##   mean time:        1.253 ms (0.00% GC)
##   maximum time:     2.120 ms (0.00% GC)
##   --------------
##   samples:          100
##   evals/sample:     1

Julia retains the clarity of a high-level language while achieving the efficiency of a low-level language!

My suggestion: avoid all the hassles of the two-language problem and just use Julia 😄

Parallel computing

Simulation example

Let’s revisit the simulation example considered in an earlier lecture and HW1 Q3.

We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.

## check if a given integer is prime
## (note: this simplified check also flags n = 1 as prime)
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  # composite iff divisible by some integer in 2:floor(sqrt(n))
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate the mean using only observations with prime indices
estMeanPrimes = function(x) {
  n <- length(x)
  ind <- sapply(1:n, isPrime)
  return (mean(x[ind]))
}
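
A quick sanity check of these helpers (note that index 1 is also flagged by the simplified base case):

which(sapply(1:20, isPrime))  # 1, 2, 3, 5, 7, 11, 13, 17, 19
estMeanPrimes(1:20)           # mean of the flagged entries: mean(c(1, 2, 3, 5, 7, 11, 13, 17, 19))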

We want to compare our method to the traditional sample average estimator by simulation studies.

## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
  # set seed according to the function argument `seed`
  set.seed(seed)
  
  # preallocate space to store estimators
  msePrimeAvg <- 0.0
  mseSamplAvg <- 0.0
  # loop over simulation replicates
  for (r in 1:reps) {
    # simulate data according to the arguments `n` and `dist`
    if (dist == "gaussian") {
      x = rnorm(n)
    } else if (dist == "t1") {
      x = rcauchy(n)
    } else if (dist == "t5") {
      x = rt(n, 5)
    } else {
      stop(paste("unrecognized dist: ", dist))
    }
    # prime indexed mean estimator and classical sample average estimator
    msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2 
    mseSamplAvg <- mseSamplAvg + mean(x)^2
  }
  mseSamplAvg <- mseSamplAvg / reps
  msePrimeAvg <- msePrimeAvg / reps
  return(c(mseSamplAvg, msePrimeAvg))
}
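
A single call returns the two MSEs as a length-2 vector, sample average first and prime-indexed average second. For example (a hypothetical setting):

## one simulation setting: Gaussian data, n = 200, 50 replicates
compare_methods(dist = "gaussian", n = 200, reps = 50, seed = 280)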

Serial code

We need to loop over 3 generative models (distTypes) and 19 sample sizes (nVals). That is 57 embarrassingly parallel tasks.

seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")

This is the serial code that double-loops over combinations of distTypes and nVals:

## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # linear (column-major) entry index: each distribution fills one column
system.time(
  for (dist in distTypes) {
    for (n in nVals) {
      #print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
      simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
      i <- i + 2
    }
  }
)
##    user  system elapsed 
##  28.054   0.097  28.240
simres1
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

Using mcmapply

Run the same task using the mcmapply function (the parallel analog of mapply) in the parallel package:

## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
  simres2 <- mcmapply(compare_methods, 
                      rep(distTypes, each = length(nVals), times = 1),
                      rep(nVals, each = 1, times = length(distTypes)),
                      reps, 
                      seed,
                      mc.cores = 4)
})
##    user  system elapsed 
##  21.363   0.374   7.760
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

  • We see a roughly 3.6x speedup (28.2s → 7.8s elapsed) with mc.cores = 4.

  • mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, macOS) and are not available on Windows.

  • parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions run on a cluster of workers created by makeCluster, based on either sockets (the default) or forking; socket clusters are available on all platforms: Linux, macOS, and Windows (see the sketch below).
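
A minimal sketch of the two cluster types (FORK is POSIX-only; detectCores() reports the number of available cores):

library(parallel)
detectCores()                              # cores available on this machine
cl_sock <- makeCluster(2, type = "PSOCK")  # socket cluster: works on all platforms
stopCluster(cl_sock)
cl_fork <- makeCluster(2, type = "FORK")   # fork cluster: Linux/macOS only;
                                           # workers inherit the master session
stopCluster(cl_fork)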

Using clusterMap

The same simulation example using the clusterMap function:

cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
  simres3 <- clusterMap(cl, compare_methods,
                        rep(distTypes, each = length(nVals), times = 1),
                        rep(nVals, each = 1, times = length(distTypes)),
                        reps,
                        seed,
                        .scheduling = "dynamic")
})
##    user  system elapsed 
##   0.022   0.005   5.775
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

  • Again we see a substantial speedup, roughly 5x (28.2s → 5.8s elapsed), using 4 cores.

  • clusterExport copies the specified objects from the master’s environment to each worker; socket workers start with empty environments, unlike forked workers (see the sketch below).
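
A minimal sketch: objects must be shipped to socket workers explicitly, and clusterEvalQ is the analogous tool for running setup code (such as loading packages) on each worker:

cl <- makeCluster(2)
clusterExport(cl, "isPrime")      # copy the function to each worker
clusterEvalQ(cl, library(stats))  # evaluate an expression on each worker
parSapply(cl, 1:10, isPrime)      # workers can now call isPrime
stopCluster(cl)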

Package development

Learning resources:
- Book _R Packages_ by Hadley Wickham
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio