sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
##  [1] compiler_3.4.3  backports_1.1.2 magrittr_1.5    rprojroot_1.3-2 tools_3.4.3     htmltools_0.3.6 yaml_2.1.18    
##  [8] Rcpp_0.12.15    stringi_1.1.6   rmarkdown_1.9   knitr_1.20      stringr_1.3.0   digest_0.6.15   evaluate_0.10.1

Typical development cycle for computational statistics

  1. Scientific planning: What experiments would verify/invalidate our hypotheses? What parameter settings should we consider?

  2. Code planning: What does the code need to do? How will the pieces fit together? What functions will be needed? What are their inputs and outputs?

  3. Implementation:

    1. Prototype functions, classes, etc., partial documentation

    2. Write unit tests (a minimal testthat sketch follows this list)

    3. Implement code, run unit tests, debug

    4. Broader testing, more debugging

    5. Profile code, identify bottlenecks

    6. Optimize code

  4. Conduct experiments.

  5. Full documentation.
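
To make step 3.2 concrete, here is a minimal unit-test sketch, assuming the testthat package; sum_r is the brute-force summing function introduced in the next section.

library(testthat)

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}

test_that("sum_r agrees with base R sum()", {
  expect_equal(sum_r(1:10), 55)
  expect_equal(sum_r(c(0.5, 0.25)), sum(c(0.5, 0.25)))
})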

Bytecode compilation

Example: summing a vector

Brute-force for loop for summing a vector:

sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }

Run the code on a vector of about 1e6 elements:

library(microbenchmark)
library(ggplot2)

x = seq(from = 0, to = 100, by = 0.0001)
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min       lq     mean   median      uq    max neval
##  sum_r(x) 29.82172 30.97363 32.46076 32.13255 33.8475 40.895   100

Let’s compile the function into bytecode with compiler::cmpfun() and benchmark again:

library(compiler)
sum_rc <- cmpfun(sum_r)
sum_rc
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fd085863c30>

Benchmark again:

microbenchmark(sum_r(x), sum_rc(x))
## Unit: milliseconds
##       expr      min       lq     mean   median       uq      max neval
##   sum_r(x) 29.97965 30.74283 32.22137 31.61499 33.52942 37.89567   100
##  sum_rc(x) 29.90078 31.30309 32.77624 32.45946 34.06413 38.40853   100

Surprise! Surprise! Compiling into bytecode does not help at all. The following output shows that sum_r had already been compiled into bytecode before execution (note the <bytecode: ...> line):

sum_r
## function(x) {
##   sumx <- 0.0
##   for (i in 1:length(x)) {
##     sumx <- sumx + x[i]
##   }
##   return(sumx)
## }
## <bytecode: 0x7fd0869c5318>

Let’s turn off JIT (just-in-time compilation), re-define the (same) sum_r function, and benchmark again:

enableJIT(0) # set JIT level to 0; returns the previous level (3)
## [1] 3
sum_r <- function(x) {
  sumx <- 0.0
  for (i in 1:length(x)) {
    sumx <- sumx + x[i]
  }
  return(sumx)
}
microbenchmark(sum_r(x))
## Unit: milliseconds
##      expr      min       lq     mean   median       uq      max neval
##  sum_r(x) 267.5341 287.3711 312.8531 296.7157 339.8439 475.8979   100

Now we witness the slowness of the uncompiled sum_r.

Documentation of enableJIT:

enableJIT enables or disables just-in-time (JIT) compilation. JIT is disabled if the argument is 0. If level is 1 then larger closures are compiled before their first use. If level is 2, then some small closures are also compiled before their second use. If level is 3 then in addition all top level loops are compiled before they are executed. JIT level 3 requires the compiler option optimize to be 2 or 3. The JIT level can also be selected by starting R with the environment variable R_ENABLE_JIT set to one of these values. Calling enableJIT with a negative argument returns the current JIT level. The default JIT level is 3.

Since R 3.4.0 (Apr 2017), the JIT (‘Just In Time’) bytecode compiler is enabled by default at its level 3.
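
Per the documentation above, a negative argument queries the current JIT level, and enableJIT returns the previous level, so the setting can be saved and restored:

library(compiler)
enableJIT(-1)             # negative argument: query the current JIT level (default 3)
oldlevel <- enableJIT(0)  # disable JIT; the previous level is returned
enableJIT(oldlevel)       # restore the saved level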

If you create a package, you can have it byte-compiled automatically on installation by adding

ByteCompile: true

to the DESCRIPTION file.

Matlab has employed JIT technology since 2002, and Julia is designed around JIT from the ground up. R is finally in the same boat.

Rcpp

Learning resources:
- Advanced R: http://adv-r.had.co.nz/Rcpp.html

The compiler package compiles R code into bytecode, which is still executed by an interpreter (R's bytecode virtual machine) at run time. A low-level language such as C, C++, or Fortran is compiled directly into machine code, yielding maximum efficiency.
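
We can peek at the bytecode itself: the compiler package can compile a quoted expression and disassemble it (a small illustration; the exact instruction listing varies across R versions):

library(compiler)
e <- compile(quote(x + 1))  # compile an R expression into a bytecode object
disassemble(e)              # list the bytecode instructions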

Use cppFunction

The Rcpp package provides a convenient way to embed C++ code in R code.

library(Rcpp)

cppFunction('double sum_c(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
sum_c
## function (x) 
## .Primitive(".Call")(<pointer: 0x10d94d610>, x)

Benchmark (1) the compiled C++ function sum_c together with (2) the uncompiled R function sum_r, (3) the bytecode-compiled R function sum_rc, and (4) the sum function in base R:

mbm <- microbenchmark(sum_r(x), sum_rc(x), sum_c(x), sum(x))
mbm
## Unit: microseconds
##       expr        min         lq       mean     median         uq        max neval
##   sum_r(x) 270604.051 300864.541 333554.617 328140.384 361100.377 466015.998   100
##  sum_rc(x)  30361.578  32239.901  34584.683  33613.658  35300.762  49891.463   100
##   sum_c(x)   1140.445   1226.534   1337.845   1260.515   1436.327   2284.460   100
##     sum(x)    869.578    955.242   1079.188   1001.925   1142.850   1917.645   100
autoplot(mbm)

Remember that we turned off JIT with enableJIT(0) earlier, so sum_r here runs without bytecode compilation.

Use sourceCpp

In realistic projects, we write standalone C++ files and then source them into R using sourceCpp(). For example:

cat sum.cpp
## #include <Rcpp.h>
## using namespace Rcpp;
## 
## // This is a simple example of exporting a C++ function to R. You can
## // source this function into an R session using the Rcpp::sourceCpp 
## // function (or via the Source button on the editor toolbar). Learn
## // more about Rcpp at:
## //
## //   http://www.rcpp.org/
## //   http://adv-r.had.co.nz/Rcpp.html
## //   http://gallery.rcpp.org/
## //
## 
## // [[Rcpp::export]]
## double sum_c(NumericVector x) {
##   int n = x.size();
##   double total = 0;
##   for(int i = 0; i < n; ++i) {
##     total += x[i];
##   }
##   return total;
## }
## 
## // You can include R code blocks in C++ files processed with sourceCpp
## // (useful for testing and development). The R code will be automatically 
## // run after the compilation.
## //
## 
## /*** R
## sum_c(as.double(1:10))
## */
sourceCpp("sum.cpp")
## 
## > sum_c(as.double(1:10))
## [1] 55
sum_c
## function (x) 
## .Primitive(".Call")(<pointer: 0x10dc86610>, x)

Julia

As a Julia fan, I want to see how Julia does on this example. sum_j is the Julia equivalent of the sum_r function:

# Julia code in this chunk
using BenchmarkTools

function sum_j(x)
    total = zero(eltype(x))
    for xi in x
        total += xi
    end
    total
end
## sum_j (generic function with 1 method)
sum_j(1:10)
## 55

x = collect(0:0.0001:100);
@benchmark sum_j(x) samples=100
## BenchmarkTools.Trial: 
##   memory estimate:  16 bytes
##   allocs estimate:  1
##   --------------
##   minimum time:     1.152 ms (0.00% GC)
##   median time:      1.162 ms (0.00% GC)
##   mean time:        1.253 ms (0.00% GC)
##   maximum time:     2.120 ms (0.00% GC)
##   --------------
##   samples:          100
##   evals/sample:     1

Julia retains the clarity of a high-level language while achieving the efficiency of a low-level language!

My suggestion: avoid all the hassles of the two-language problem and just use Julia 😄

Parallel computing

Simulation example

Let’s revisit the simulation example considered in an earlier lecture and HW1 Q3.

We have a “new” method that estimates the population mean by averaging the observations indexed by prime numbers.

## check if a given integer is prime
## (note: this simplified check also flags n = 1 as prime)
isPrime = function(n) {
  if (n <= 3) {
    return (TRUE)
  }
  # composite iff divisible by some integer in 2:floor(sqrt(n))
  if (any((n %% 2:floor(sqrt(n))) == 0)) {
    return (FALSE)
  }
  return (TRUE)
}

## estimate the mean using only observations with prime indices
estMeanPrimes = function(x) {
  n <- length(x)
  ind <- sapply(1:n, isPrime)
  return (mean(x[ind]))
}
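
A quick sanity check of these helpers (note that index 1 is also flagged by the simplified base case):

which(sapply(1:20, isPrime))  # 1, 2, 3, 5, 7, 11, 13, 17, 19
estMeanPrimes(1:20)           # mean of the flagged entries: mean(c(1, 2, 3, 5, 7, 11, 13, 17, 19))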

We want to compare our method to the traditional sample average estimator by simulation studies.

## compare methods: sample avg and prime-indexed avg
compare_methods <- function(dist = "gaussian", n = 100, reps = 100, seed = 123) {
  # set seed according to the function argument `seed`
  set.seed(seed)
  
  # preallocate space to store estimators
  msePrimeAvg <- 0.0
  mseSamplAvg <- 0.0
  # loop over simulation replicates
  for (r in 1:reps) {
    # simulate data according to the arguments `n` and `dist`
    if (dist == "gaussian") {
      x = rnorm(n)
    } else if (dist == "t1") {
      x = rcauchy(n)
    } else if (dist == "t5") {
      x = rt(n, 5)
    } else {
      stop(paste("unrecognized dist: ", dist))
    }
    # prime indexed mean estimator and classical sample average estimator
    msePrimeAvg <- msePrimeAvg + estMeanPrimes(x)^2 
    mseSamplAvg <- mseSamplAvg + mean(x)^2
  }
  mseSamplAvg <- mseSamplAvg / reps
  msePrimeAvg <- msePrimeAvg / reps
  return(c(mseSamplAvg, msePrimeAvg))
}
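
A single call returns the two MSEs as a length-2 vector, sample average first and prime-indexed average second. For example (a hypothetical setting):

## one simulation setting: Gaussian data, n = 200, 50 replicates
compare_methods(dist = "gaussian", n = 200, reps = 50, seed = 280)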

Serial code

We need to loop over 3 generative models (distTypes) and 19 sample sizes (nVals). That is 57 embarrassingly parallel tasks.

seed = 280
reps = 500
nVals = seq(100, 1000, by = 50)
distTypes = c("gaussian", "t5", "t1")

This is the serial code that double-loops over combinations of distTypes and nVals:

## simulation study with combination of generative model `dist` and
## sample size `n` (serial code)
simres1 = matrix(0.0, nrow = 2 * length(nVals), ncol = length(distTypes))
i = 1 # linear (column-major) entry index: each distribution fills one column
system.time(
  for (dist in distTypes) {
    for (n in nVals) {
      #print(paste("n=", n, " dist=", dist, " seed=", seed, " reps=", reps, sep=""))
      simres1[i:(i + 1)] = compare_methods(dist, n, reps, seed)
      i <- i + 2
    }
  }
)
##    user  system elapsed 
##  28.054   0.097  28.240
simres1
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

Using mcmapply

Run the same task using the mcmapply function (the parallel analog of mapply) in the parallel package:

## simulation study with combination of generative model `dist` and
## sample size `n` (parallel code using mcmapply)
library(parallel)
system.time({
  simres2 <- mcmapply(compare_methods, 
                      rep(distTypes, each = length(nVals), times = 1),
                      rep(nVals, each = 1, times = length(distTypes)),
                      reps, 
                      seed,
                      mc.cores = 4)
})
##    user  system elapsed 
##  21.363   0.374   7.760
simres2 <- matrix(unlist(simres2), ncol = length(distTypes))
simres2
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

  • We see a roughly 3.6x speedup (28.2s → 7.8s elapsed) with mc.cores = 4.

  • mcmapply, mclapply, and related functions rely on the forking capability of POSIX operating systems (e.g. Linux, macOS) and are not available on Windows.

  • parLapply, parApply, parCapply, parRapply, clusterApply, clusterMap, and related functions run on a cluster of workers created by makeCluster, based on either sockets (the default) or forking; socket clusters are available on all platforms: Linux, macOS, and Windows (see the sketch below).
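
A minimal sketch of the two cluster types (FORK is POSIX-only; detectCores() reports the number of available cores):

library(parallel)
detectCores()                              # cores available on this machine
cl_sock <- makeCluster(2, type = "PSOCK")  # socket cluster: works on all platforms
stopCluster(cl_sock)
cl_fork <- makeCluster(2, type = "FORK")   # fork cluster: Linux/macOS only;
                                           # workers inherit the master session
stopCluster(cl_fork)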

Using clusterMap

The same simulation example using the clusterMap function:

cl <- makeCluster(getOption("cl.cores", 4))
clusterExport(cl, c("isPrime", "estMeanPrimes", "compare_methods"))
system.time({
  simres3 <- clusterMap(cl, compare_methods,
                        rep(distTypes, each = length(nVals), times = 1),
                        rep(nVals, each = 1, times = length(distTypes)),
                        reps,
                        seed,
                        .scheduling = "dynamic")
})
##    user  system elapsed 
##   0.022   0.005   5.775
simres3 <- matrix(unlist(simres3), ncol = length(distTypes))
stopCluster(cl)
simres3
##               [,1]        [,2]         [,3]
##  [1,] 0.0103989436 0.017070603     312.4001
##  [2,] 0.0410217819 0.066503177     200.3237
##  [3,] 0.0065484669 0.011260420     173.9631
##  [4,] 0.0297390639 0.047465397     199.5330
##  [5,] 0.0056445593 0.007754855   68026.7023
##  [6,] 0.0215380206 0.040039145 1230343.1341
##  [7,] 0.0040803523 0.006871930   43609.7815
##  [8,] 0.0165144049 0.032353077  931755.3182
##  [9,] 0.0032566766 0.005417194   30283.1286
## [10,] 0.0161191554 0.026330133  684726.8788
## [11,] 0.0027565672 0.004444172   22306.1369
## [12,] 0.0145039253 0.022820075  539105.3392
## [13,] 0.0024915830 0.003798500   17119.0807
## [14,] 0.0122801335 0.022788299  435528.4778
## [15,] 0.0023706676 0.003360507   13531.4104
## [16,] 0.0112703627 0.016674640     111.4673
## [17,] 0.0020190283 0.003147367   10973.3489
## [18,] 0.0106157492 0.016027485     278.9663
## [19,] 0.0017567901 0.002863640    9069.6647
## [20,] 0.0096185720 0.016444671  261373.7646
## [21,] 0.0016441481 0.002637964    7629.9867
## [22,] 0.0081784426 0.013710539     296.6235
## [23,] 0.0015075246 0.002498450    6498.2362
## [24,] 0.0088018140 0.012909942  191986.6388
## [25,] 0.0014372130 0.002308089    5603.9395
## [26,] 0.0077292632 0.012789483  171280.5448
## [27,] 0.0012924543 0.002216936    4889.8739
## [28,] 0.0069012154 0.011052562     170.3975
## [29,] 0.0011994654 0.001987311    4299.6838
## [30,] 0.0067559611 0.011291788     178.7454
## [31,] 0.0011642413 0.001888637    3806.8472
## [32,] 0.0070131993 0.010048327     140.2389
## [33,] 0.0011566121 0.001873365    3401.9766
## [34,] 0.0065066558 0.009020973      34.1644
## [35,] 0.0010506067 0.001595430    3049.0432
## [36,] 0.0060026682 0.010338424  103578.0598
## [37,] 0.0009770234 0.001618095    2768.4517
## [38,] 0.0054705674 0.009229294     143.5544

  • Again we see a substantial speedup, roughly 5x (28.2s → 5.8s elapsed), using 4 cores.

  • clusterExport copies the specified objects from the master’s environment to each worker; socket workers start with empty environments, unlike forked workers (see the sketch below).
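
A minimal sketch: objects must be shipped to socket workers explicitly, and clusterEvalQ is the analogous tool for running setup code (such as loading packages) on each worker:

cl <- makeCluster(2)
clusterExport(cl, "isPrime")      # copy the function to each worker
clusterEvalQ(cl, library(stats))  # evaluate an expression on each worker
parSapply(cl, 1:10, isPrime)      # workers can now call isPrime
stopCluster(cl)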

Package development

Learning resources:
- Book _R Packages_ by Hadley Wickham
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/200486488-Developing-Packages-with-RStudio