Session information for reproducibility:
sessionInfo()
## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS High Sierra 10.13.3
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## loaded via a namespace (and not attached):
## [1] compiler_3.4.3 backports_1.1.2 magrittr_1.5 rprojroot_1.3-2 tools_3.4.3 htmltools_0.3.6 yaml_2.1.16
## [8] Rcpp_0.12.15 stringi_1.1.6 rmarkdown_1.8 knitr_1.19 stringr_1.2.0 digest_0.6.15 evaluate_0.10.1
To gain a deep understanding of how R works, the book Advanced R by Hadley Wickham is a must-read. Reading it now will save you numerous hours down the road.
We cover select topics on coding style, benchmarking, profiling, debugging, parallel computing, byte code compiling, Rcpp, and package development.
Most important of all is to be consistent in your coding style.
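A consistent style is easier to read, review, and maintain. As a small illustration (the conventions here loosely follow the tidyverse style guide; the specific rules matter less than sticking to one set):
# Inconsistent: mixed assignment operators, spacing, and naming
myVar=mean( x,na.rm=TRUE ); other.var<-sd(x)
# Consistent: one convention throughout
my_var    <- mean(x, na.rm = TRUE)
other_var <- sd(x)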
Sources:
- Advanced R: http://adv-r.had.co.nz/Performance.html
- Blog: http://www.alexejgossmann.com/benchmarking_r/
To identify performance issues, we need to measure runtime accurately.
system.time
set.seed(280)
x <- runif(1e6)
system.time({sqrt(x)})
## user system elapsed
## 0.005 0.003 0.008
system.time({x ^ 0.5})
## user system elapsed
## 0.032 0.000 0.032
system.time({exp(log(x) / 2)})
## user system elapsed
## 0.02 0.00 0.02
From William Dunlap:
“User CPU time” gives the CPU time spent by the current process (i.e., the current R session) and “system CPU time” gives the CPU time spent by the kernel (the operating system) on behalf of the current process. The operating system is used for things like opening files, doing input or output, starting other processes, and looking at the system clock: operations that involve resources that many processes must share. Different operating systems will have different things done by the operating system.
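As a quick illustration (timings are machine-dependent), an expression that merely waits accrues elapsed time but almost no CPU time, while pure computation shows up as user CPU time:
system.time(Sys.sleep(2))   # ~2 s elapsed, but near-zero user and system CPU time
system.time(rnorm(1e7))     # mostly user CPU time: work done inside the R process itself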
rbenchmark
library("rbenchmark")
benchmark(
  "sqrt(x)" = {sqrt(x)},
  "x^0.5" = {x ^ 0.5},
  "exp(log(x)/2)" = {exp(log(x) / 2)},
  replications = 100,
  order = "elapsed"
)
## test replications elapsed relative user.self sys.self user.child sys.child
## 1 sqrt(x) 100 0.421 1.000 0.363 0.057 0 0
## 3 exp(log(x)/2) 100 1.716 4.076 1.635 0.063 0 0
## 2 x^0.5 100 2.521 5.988 2.442 0.069 0 0
relative is the ratio of each expression's elapsed time to that of the fastest one.
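Equivalently (using a hypothetical object bm to hold the benchmark() result above), the relative column is just elapsed divided by its minimum:
bm <- benchmark("sqrt(x)" = sqrt(x), "x^0.5" = x ^ 0.5, replications = 100)
round(bm$elapsed / min(bm$elapsed), 3)   # reproduces the relative column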
microbenchmark
library("microbenchmark")
library("ggplot2")
mbm <- microbenchmark(
  sqrt(x),
  x ^ 0.5,
  exp(log(x) / 2)
)
mbm
## Unit: milliseconds
## expr min lq mean median uq max neval
## sqrt(x) 2.417661 2.622038 3.451668 2.750339 3.35440 7.696437 100
## x^0.5 21.241232 22.213121 23.877836 23.224167 25.49159 30.653866 100
## exp(log(x)/2) 14.238262 15.026485 16.718186 15.772166 17.74524 56.245006 100
Results from microbenchmark can be plotted nicely in base R or with ggplot2.
boxplot(mbm)
autoplot(mbm)
Premature optimization is the root of all evil (or at least most of it) in programming.
-Don Knuth
Sources:
- http://adv-r.had.co.nz/Profiling.html
- https://rstudio.github.io/profvis/
- https://support.rstudio.com/hc/en-us/articles/218221837-Profiling-with-RStudio
library(profvis)
profvis({
  data(diamonds, package = "ggplot2")
  plot(price ~ carat, data = diamonds)
  m <- lm(price ~ carat, data = diamonds)
  abline(m, col = "red")
})
First generate test data:
times <- 4e5
cols <- 150
data <- as.data.frame(x = matrix(rnorm(times * cols, mean = 5), ncol = cols))
data <- cbind(id = paste0("g", seq_len(times)), data)
Original code for centering columns of a dataframe:
profvis({
  # Store in another variable for this run
  data1 <- data
  # Get column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  # Subtract mean from each column
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})
Profile apply vs colMeans vs lapply vs vapply:
profvis({
  data1 <- data
  # Four different ways of getting column means
  means <- apply(data1[, names(data1) != "id"], 2, mean)
  means <- colMeans(data1[, names(data1) != "id"])
  means <- lapply(data1[, names(data1) != "id"], mean)
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
})
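One reason to prefer vapply beyond speed: lapply returns a list, which would need an unlist() before the subtraction loop, whereas vapply (like colMeans) returns a named numeric vector with its type checked. A quick sanity check, using the data object generated above:
means_l <- lapply(data[, names(data) != "id"], mean)
means_v <- vapply(data[, names(data) != "id"], mean, numeric(1))
is.list(means_l)      # TRUE: a list, one element per column
is.numeric(means_v)   # TRUE: a named numeric vector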
We decide to use vapply:
profvis({
  data1 <- data
  means <- vapply(data1[, names(data1) != "id"], mean, numeric(1))
  for (i in seq_along(means)) {
    data1[, names(data1) != "id"][, i] <-
      data1[, names(data1) != "id"][, i] - means[i]
  }
})
Calculate mean and center in one pass:
profvis({
  data1 <- data
  # Given a column, normalize values and return them
  col_norm <- function(col) {
    col - mean(col)
  }
  # Apply the normalizer function over all columns except id
  data1[, names(data1) != "id"] <-
    lapply(data1[, names(data1) != "id"], col_norm)
})
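For completeness (not part of the profiled runs above), base R's scale() with scale = FALSE centers columns in a single call and is worth benchmarking against the lapply version:
data1 <- data
data1[, names(data1) != "id"] <-
  scale(data1[, names(data1) != "id"], center = TRUE, scale = FALSE)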
Original code for cumulative sums:
profvis({
  data <- data.frame(value = runif(1e5))
  data$sum[1] <- data$value[1]
  for (i in seq(2, nrow(data))) {
    data$sum[i] <- data$sum[i - 1] + data$value[i]
  }
})
Write a function to avoid expensive indexing by $:
profvis({
  csum <- function(x) {
    if (length(x) < 2) return(x)
    sum <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i - 1] + x[i]
    }
    sum
  }
  data$sum <- csum(data$value)
})
Pre-allocate vector:
profvis({
  csum2 <- function(x) {
    if (length(x) < 2) return(x)
    sum <- numeric(length(x)) # Preallocate the result vector
    sum[1] <- x[1]
    for (i in seq(2, length(x))) {
      sum[i] <- sum[i - 1] + x[i]
    }
    sum
  }
  data$sum <- csum2(data$value)
})
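For this particular task, base R already provides the vectorized cumsum(); the hand-written loops above are only meant to illustrate the profiling workflow. Assuming csum2 from the previous chunk is available in the workspace:
all.equal(csum2(data$value), cumsum(data$value))     # TRUE, up to floating-point error
microbenchmark(csum2(data$value), cumsum(data$value))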
Modularize big projects into small functions. Profile functions as early and as frequently as possible.
Learning sources:
- Video: https://vimeo.com/99375765
- Advanced R: http://adv-r.had.co.nz/Exceptions-Debugging.html
- RStudio tutorial: https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio
Demo code: parlindrome.R, crazy-talk.R.
- breakpoint
- step in/through function
- browser()
- traceback()
- options(error = browser), options(error = NULL)
- Debug -> On Error -> Break in Code
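A minimal sketch of how these tools fit together (hypothetical functions, not from the demo files):
f <- function(x) g(x)
g <- function(x) log(x)    # fails for non-numeric input
f("a")                     # Error in log(x): non-numeric argument to mathematical function
traceback()                # shows the call stack: g(x) called from f(x)
options(error = browser)   # enter the interactive browser automatically at the next error
f("a")                     # now drops into the browser; inspect x, ls(), etc.
options(error = NULL)      # restore the default error handling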