R Programming

Yiwen Zhang
Aug 28 2014

Outline

  • Functions
  • Vectorization
  • Compiled R code
  • Parallel computing
  • Some useful packages

About functions

  • Input variables can be any R object: vector, matrix, data frame, list, function
  • Output variable:
    • named vector: names
    • named matrix: colnames, rownames
    • data frame: data.frame(a=, b=, )
    • named list
  • Save function output
  • Namespace
  • Saving R functions: .R file or .RData file
  • Function scope
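
A minimal sketch of the points above: a function that returns a named list, a scope check, and a save/load round trip through an .RData file. The names summarize.sample, bump and counter are made up for illustration.

```r
# Hypothetical example: summarize a numeric vector and return a named list.
summarize.sample <- function(x, trim = 0) {
  # 'centre' is local: it exists only while the function runs.
  centre <- mean(x, trim = trim)
  list(mean = centre, sd = sd(x), range = range(x))
}

out <- summarize.sample(c(2, 4, 6, 8))
out$mean      # access an element by name
names(out)

# Function scope: assignment inside a function creates a local variable
# and leaves the caller's variable untouched.
counter <- 0
bump <- function() {
  counter <- counter + 1
  counter
}
bump.value <- bump()
counter       # still 0

# Save the function to an .RData file and restore it.
rdata.file <- file.path(tempdir(), "summarize.RData")
save(summarize.sample, file = rdata.file)
rm(summarize.sample)
load(rdata.file)
```

Returning a named list keeps several outputs of different types together, and each one can be pulled out with `$`.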

Debugging

  • options(error=browser)
  • Editor breakpoints
  • browser()
  • debug()
  • debugSource()
  • traceback()
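
A small sketch of how some of these tools fit together, assuming a made-up function standardize. The browser() call is left commented out so the snippet also runs non-interactively; try() is used here only to keep the script going after the error.

```r
# Hypothetical function used to illustrate the debugging tools.
standardize <- function(x) {
  # Uncomment the next line to pause here and inspect 'x' interactively.
  # browser()
  (x - mean(x)) / sd(x)
}

# Flag the function so that the next call steps through it line by line.
debug(standardize)
flagged <- isdebugged(standardize)
undebug(standardize)   # remove the flag when done

# Character input triggers an error inside standardize().
# In an interactive session, calling traceback() right after an uncaught
# error prints the call stack that led to it.
res <- try(standardize("a"), silent = TRUE)
inherits(res, "try-error")
```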

Vectorize your code

  • Many built-in R functions are already vectorized: gamma, beta, …
  • More general: apply, sapply, lapply, tapply, mapply

    • pay attention to the type and shape of the output!
    • they might not give you any speed-up.
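
A short sketch of both points, using only base R. The anonymous functions are toy examples; the thing to watch is how the return type of sapply changes with the shape of the results.

```r
x <- seq(0.5, 5, by = 0.5)

# Explicit loop over the elements.
loop.result <- numeric(length(x))
for (i in seq_along(x)) loop.result[i] <- gamma(x[i])

# Vectorized call: gamma() accepts the whole vector at once.
vec.result <- gamma(x)
all.equal(loop.result, vec.result)

# sapply simplifies when it can, so the output type depends on the results.
sq       <- sapply(1:3, function(i) i^2)         # numeric vector
pair.mat <- sapply(1:3, function(i) c(i, i^2))   # matrix, one column per input
seqs     <- sapply(1:3, function(i) seq_len(i))  # list, because lengths differ

# lapply always returns a list; mapply walks several vectors in parallel;
# tapply applies a function within groups.
sq.list <- lapply(1:3, function(i) i^2)
sums    <- mapply(function(a, b) a + b, 1:3, 4:6)
by.grp  <- tapply(c(1, 2, 3, 4), c("a", "a", "b", "b"), sum)
```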

The Pursuit of Speed

Profile your code first! Rprof

  • Compiled R code: Package compiler
    • cmpfun
    • cmpfile, loadcmp
  • Parallel computing
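
A minimal sketch of profiling and then byte-compiling a function. slow.sumsq is a made-up example, and the timings depend on the machine. Since R 3.4.0 the JIT compiler byte-compiles functions by default, so cmpfun may show little or no gain on recent versions.

```r
library(compiler)

# Hypothetical slow function: sum of squares with an explicit loop.
slow.sumsq <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i^2
  total
}

# Profile first: Rprof samples the call stack while the code runs.
prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.01)
invisible(slow.sumsq(1e7))
Rprof(NULL)                                # stop profiling
head(summaryRprof(prof.file)$by.self)      # where the time was spent

# Byte-compile the function and compare timings.
fast.sumsq <- cmpfun(slow.sumsq)
system.time(slow.sumsq(1e6))
system.time(fast.sumsq(1e6))
```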

Parallel Computing

  • Different kinds of parallelism
  • R: embarrassingly parallel
  • R package parallel
compare.tests <- function (n.pattern, sigma2.ratio, level = 0.05, null.size = 200000, mc.size = 10000)
  • Serial code, two for loops
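
The body of compare.tests is not shown, so the sketch below uses a made-up stand-in, toy.compare, and assumed parameter values. It shows the serial pattern with two nested for loops, and how the same grid can be flattened into two argument vectors with rep so that a single mapply-style call covers it.

```r
# Hypothetical stand-in for compare.tests(); the real body is not shown.
toy.compare <- function(n.pattern, sigma2.ratio, mc.size = 10) {
  n.pattern * sigma2.ratio + mc.size
}

n.pattern.list    <- c(1, 2, 3)     # assumed values
sigma2.ratio.list <- c(0.1, 0.5)    # assumed values

# Serial version: two nested for loops over the parameter grid.
result.serial <- numeric(0)
for (n.pattern in n.pattern.list) {
  for (sigma2.ratio in sigma2.ratio.list) {
    result.serial <- c(result.serial, toy.compare(n.pattern, sigma2.ratio))
  }
}

# Same grid, flattened: the outer-loop variable repeats with 'each',
# the inner-loop variable repeats with 'times'.
n.arg <- rep(n.pattern.list, each = length(sigma2.ratio.list))
s.arg <- rep(sigma2.ratio.list, times = length(n.pattern.list))
result.mapply <- mapply(toy.compare, n.arg, s.arg,
                        MoreArgs = list(mc.size = 10))
all.equal(result.serial, result.mapply)
```

Each grid point is independent of the others, which is what makes the problem embarrassingly parallel.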

Parallel Computing

  • Different levels of parallelism
  • R: embarrassingly parallel
  • R package parallel
    • forking
library(parallel)
## forking: run compare.tests over the flattened parameter grid
result.mcmapply <- mcmapply (
  compare.tests,
  rep (n.pattern.list, each = length (sigma2.ratio.list), times = 1),
  rep (sigma2.ratio.list, each = 1, times = length (n.pattern.list)),
  MoreArgs = list (mc.size = 10000), mc.cores = 12)

Parallel Computing

  • Different kinds of parallelism
  • R: embarrassingly parallel
  • R package parallel
    • forking and socket
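## build cluster
cl <- makeCluster (getOption ("cl.cores", 12))
## reproducible random-number streams on the workers
clusterSetRNGStream (cl, 123)
## copy the functions that compare.tests calls to every worker
clusterExport (cl, c("generate.design", "generate.response", "lme", "pdIdent",
                     "simulate.null.samples", "LRTSim", "RLRTSim"))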

Parallel Computing

  • Different kinds of parallelism
  • R: embarrassingly parallel
  • R package parallel
    • forking and socket
## Running the code
result.clusterMap <- clusterMap (
  cl, compare.tests,
  rep (n.pattern.list, each = length (sigma2.ratio.list), times = 1),
  rep (sigma2.ratio.list, each = 1, times = length (n.pattern.list)),
  MoreArgs = list (mc.size = 10000), .scheduling = "static")

## Close down the cluster
stopCluster (cl)
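
A self-contained sketch of the same build, run, close-down sequence on a small scale. toy.sim and draw.sample are made-up stand-ins for the simulation functions, and two workers are assumed to be available. The helper has to be exported because the workers start as fresh R sessions with an empty workspace.

```r
library(parallel)

# Hypothetical helper and worker function (stand-ins for the real simulation).
draw.sample <- function(size, centre, spread) {
  rnorm(size, mean = centre, sd = spread)
}
toy.sim <- function(n, ratio, mc.size = 100) {
  mean(draw.sample(mc.size, n, sqrt(ratio)))
}

n.list     <- c(1, 2)          # assumed values
ratio.list <- c(0.5, 1, 2)     # assumed values

cl <- makeCluster(2)                  # build a socket cluster with 2 workers
clusterSetRNGStream(cl, 123)          # reproducible random-number streams
clusterExport(cl, "draw.sample")      # helpers must exist on every worker

res <- clusterMap(cl, toy.sim,
                  rep(n.list, each = length(ratio.list)),
                  rep(ratio.list, times = length(n.list)),
                  MoreArgs = list(mc.size = 1000),
                  .scheduling = "static")

stopCluster(cl)                       # always close down the cluster
length(res)                           # one result per grid point
```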

Parallel Computing

  • Different kinds of parallelism
  • R: embarrassingly parallel
  • R package parallel
    • forking
    • socket
  • Other contributed packages: foreach, pbdR
  • Use a script to run on a multi-node cluster
R CMD BATCH --vanilla --args -1 practice.R Rout
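
Inside the script, the value passed after --args can be read with commandArgs. The sketch below shows one way practice.R might start; parse.task.id is a made-up helper and the default of 1 is an assumption.

```r
# Sketch of the top of a batch script: read the value passed after --args.
# parse.task.id is a hypothetical helper, not part of base R.
parse.task.id <- function(args, default = 1L) {
  if (length(args) == 0) return(default)   # no argument supplied
  as.integer(args[1])
}

# trailingOnly = TRUE keeps only the arguments that follow --args.
task.id <- parse.task.id(commandArgs(trailingOnly = TRUE))
cat("Running task", task.id, "\n")
```

A job scheduler can then launch the same script many times with different values, and each run handles its own slice of the work.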

Other Tools

  • Regular expressions: gsub, grep, regexec

  • Visualization: ggplot2, ggvis

  • Data manipulation: dplyr, reshape2, data.table

  • Data frames for large data sets: H2O

  • Use C++ in R: Rcpp, Rcpp11

  • Use CUDA in R: gputools

    Again, profile your code first! Rprof
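
For the regular-expression functions listed above, a short base-R sketch. The file names are made up for illustration.

```r
# Hypothetical file names.
files <- c("sim_n10_ratio0.5.csv", "sim_n20_ratio2.csv", "notes.txt")

# grep: which elements match the pattern?
csv.index <- grep("\\.csv$", files)                 # positions
csv.files <- grep("\\.csv$", files, value = TRUE)   # the elements themselves

# gsub: replace every match in each string.
stems <- gsub("\\.csv$", "", files)

# regexec + regmatches: extract the captured groups from one string.
m     <- regexec("n([0-9]+)_ratio([0-9.]+)\\.csv", files[1])
parts <- regmatches(files[1], m)[[1]]
n.pattern    <- as.numeric(parts[2])
sigma2.ratio <- as.numeric(parts[3])
```

The first element of parts is the full match; the captured groups follow in order.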