Head-to-head comparison of R, Python, Julia, and C/C++

System information (for reproducibility)

In [1]:
versioninfo()
Julia Version 1.7.3
Commit 742b9abb4d (2022-05-06 12:58 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin21.4.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 4

Types of computer languages

  • Compiled languages (low-level languages): C/C++, Fortran, ...

    • Directly compiled to machine code that is executed by CPU
    • Pros: fast, memory efficient
    • Cons: longer development time, hard to debug
  • Interpreted language (high-level languages): R, Matlab, Python, SAS IML, JavaScript, ...

    • Interpreted by interpreter
    • Pros: fast prototyping
    • Cons: excruciatingly slow for loops
  • Mixed (dynamic) languages: Matlab (JIT), R (compiler package), Julia, Cython, JAVA, ...

    • Pros and cons: between the compiled and interpreted languages
  • Script languages: Linux shell scripts, Perl, ...

    • Extremely useful for some data preprocessing and manipulation
  • Database languages: SQL, Hive (Hadoop).

    • Data analysis never happens if we do not know how to retrieve data from databases

How high-level languages work?

  • Typical execution of a high-level language such as R, Python, and Matlab.

  • To improve efficiency of interpreted languages such as R, Matlab and Python, conventional wisdom is to avoid loops as much as possible. Aka, vectorize code

    The only loop you are allowed to have is that for an iterative algorithm.

  • When looping is unavoidable, need to code in C, C++, or Fortran. This creates the notorious two language problem
    Success stories: the popular glmnet package in R is coded in Fortran; tidyverse and data.table packages use a lot RCpp/C++.

  • High-level languages have made many efforts to bring themselves closer to the performance of low-level languages such as C, C++, or Fortran, with a variety levels of success.

    • Matlab has employed JIT (just-in-time compilation) technology since 2002.
    • Since R 3.4.0 (Apr 2017), the JIT bytecode compiler is enabled by default at its level 3.
    • Cython is a compiler system based on Python.
  • Modern languages such as Julia tries to solve the two language problem. That is to achieve efficiency without vectorizing code.

  • Julia execution.

Gibbs sampler example by Doug Bates

  • Doug Bates (member of R-Core, author of popular R packages Matrix, lme4, RcppEigen, etc)

    As some of you may know, I have had a (rather late) mid-life crisis and run off with another language called Julia.

    -- Doug Bates (on the `knitr` Google Group)

  • An example from Dr. Doug Bates's slides Julia for R Programmers.

  • The task is to create a Gibbs sampler for the density
    $$ f(x, y) = k x^2 exp(- x y^2 - y^2 + 2y - 4x), x > 0 $$ using the conditional distributions $$ \begin{eqnarray*} X | Y &\sim& \Gamma \left( 3, \frac{1}{y^2 + 4} \right) \\ Y | X &\sim& N \left(\frac{1}{1+x}, \frac{1}{2(1+x)} \right). \end{eqnarray*} $$

  • R solution. The RCall.jl package allows us to execute R code without leaving the Julia environment. We first define an R function Rgibbs().
In [2]:
using RCall

# show R information
R"""
sessionInfo()
"""
┌ Info: Precompiling RCall [6f49c342-dc21-5d91-9882-a32aef131414]
└ @ Base loading.jl:1423
Out[2]:
RObject{VecSxp}
R version 4.1.2 (2021-11-01)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 10.16

Matrix products: default
BLAS:   /Applications/Julia-1.7.app/Contents/Resources/julia/lib/julia/libblastrampoline.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.1.2
In [3]:
# define a function for Gibbs sampling

R"""
library(Matrix)

Rgibbs <- function(N, thin) {
  mat <- matrix(0, nrow=N, ncol=2)
  x <- y <- 0
  for (i in 1:N) {
    for (j in 1:thin) {
      x <- rgamma(1, 3, y * y + 4) # 3rd arg is rate
      y <- rnorm(1, 1 / (x + 1), 1 / sqrt(2 * (x + 1)))
    }
    mat[i,] <- c(x, y)
  }
  mat
}
"""
Out[3]:
RObject{ClosSxp}
function (N, thin) 
{
    mat <- matrix(0, nrow = N, ncol = 2)
    x <- y <- 0
    for (i in 1:N) {
        for (j in 1:thin) {
            x <- rgamma(1, 3, y * y + 4)
            y <- rnorm(1, 1/(x + 1), 1/sqrt(2 * (x + 1)))
        }
        mat[i, ] <- c(x, y)
    }
    mat
}

To generate a sample of size 10,000 with a thinning of 500. How long does it take?

In [4]:
R"""
system.time(Rgibbs(10000, 500))
"""
Out[4]:
RObject{RealSxp}
   user  system elapsed 
 23.901   1.814  25.867 
  • This is a Julia function for the same Gibbs sampler:
In [5]:
using Distributions

function jgibbs(N, thin)
    mat = zeros(N, 2)
    x = y = 0.0
    for i in 1:N
        for j in 1:thin
            x = rand(Gamma(3, 1 / (y * y + 4)))
            y = rand(Normal(1 / (x + 1), 1 / sqrt(2(x + 1))))
        end
        mat[i, 1] = x
        mat[i, 2] = y
    end
    mat
end
┌ Info: Precompiling Distributions [31c24e10-a181-5473-b8eb-7969acd0382f]
└ @ Base loading.jl:1423
Out[5]:
jgibbs (generic function with 1 method)

Generate the same number of samples. How long does it take?

In [6]:
jgibbs(100, 5); # warm-up
@elapsed jgibbs(10000, 500)
Out[6]:
0.281129432

We see 40-80 fold speed up of Julia over R on this example, with similar coding effort!

Comparing C, R, Python, and Julia

To better understand how these languages work, we consider a simple task: summing a vector.

Let's first generate data: 1 million double precision numbers from uniform [0, 1].

In [7]:
using Random # standard library
Random.seed!(123) # seed
x = rand(1_000_000) # 1 million random numbers in [0, 1)
sum(x)
Out[7]:
500220.6882968192

In this course, we extensively use the package BenchmarkTools.jl for robust benchmarking. It's analog of microbenchmark or bench package in R.

In [8]:
using BenchmarkTools

C

We would use the low-level C code as the baseline for copmarison. In Julia, we can easily run compiled C code using the ccall function.

In [9]:
using Libdl

C_code = """
#include <stddef.h>
double c_sum(size_t n, double *X) {
    double s = 0.0;
    for (size_t i = 0; i < n; ++i) {
        s += X[i];
    }
    return s;
}
"""

const Clib = tempname()   # make a temporary file

# compile to a shared library by piping C_code to gcc
# (works only if you have gcc installed):

open(`gcc -std=c99 -fPIC -O3 -msse3 -xc -shared -o $(Clib * "." * Libdl.dlext) -`, "w") do f
    print(f, C_code) 
end

# define a Julia function that calls the C function:
c_sum(X::Array{Float64}) = ccall(("c_sum", Clib), Float64, (Csize_t, Ptr{Float64}), length(X), X)
Out[9]:
c_sum (generic function with 1 method)
In [10]:
# make sure it gives same answer
c_sum(x)
Out[10]:
500220.6882968189
In [11]:
bm = @benchmark c_sum($x)
Out[11]:
BenchmarkTools.Trial: 3926 samples with 1 evaluation.
 Range (minmax):  1.073 ms  2.215 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.226 ms                GC (median):    0.00%
 Time  (mean ± σ):   1.266 ms ± 140.309 μs   GC (mean ± σ):  0.00% ± 0.00%

         █▁   ▁                                               
  ▂▁▁▁▂▄▂██▄▆▇█▄▅▃▃▂▂▃▂▂▂▂▁▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  1.07 ms         Histogram: frequency by time        1.77 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [12]:
using Statistics # standard library
benchmark_result = Dict() # a dictionary to store median runtime (in milliseconds)

# store median runtime (in milliseconds)
benchmark_result["C"] = median(bm.times) / 1e6
Out[12]:
1.226119

R, buildin sum

Next we compare to the build in sum function in R, which is implemented using C.

In [13]:
using RCall

R"""
library(bench)
y <- $x
rbm_builtin <- bench::mark(sum(y))
"""
Out[13]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 sum(y)        2.4ms 2.58ms      381.        0B        0   191     0      502ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [14]:
# store median runtime (in milliseconds)
@rget rbm_builtin # dataframe
benchmark_result["R builtin"] = median(rbm_builtin[!, :median]) * 1000
Out[14]:
2.5748809999974753

R, handwritten loop

Handwritten loop in R is much slower.

In [15]:
using RCall

R"""
sum_r <- function(x) {
  s <- 0
  for (xi in x) {
    s <- s + xi
  }
  s
}
library(bench)
y <- $x
rbm_loop <- bench::mark(sum_r(y))
"""
Out[15]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr> <bch:tm> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 sum_r(y)     12.7ms 13.4ms      74.0    12.7KB        0    38     0      514ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [16]:
# store median runtime (in milliseconds)
@rget rbm_loop # dataframe
benchmark_result["R loop"] = median(rbm_loop[!, :median]) * 1000
Out[16]:
13.424493500000523

R, Rcpp

Rcpp package is the easiest way to incorporate C++ code in R.

In [17]:
R"""
library(Rcpp)

cppFunction('double rcpp_sum(NumericVector x) {
  int n = x.size();
  double total = 0;
  for(int i = 0; i < n; ++i) {
    total += x[i];
  }
  return total;
}')
rcpp_sum
"""
Out[17]:
RObject{ClosSxp}
function (x) 
.Call(<pointer: 0x115b73850>, x)
In [18]:
R"""
rbm_rcpp <- bench::mark(rcpp_sum(y))
"""
Out[18]:
RObject{VecSxp}
# A tibble: 1 × 13
  expression      min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
  <bch:expr>  <bch:t> <bch:>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
1 rcpp_sum(y)  1.21ms 1.37ms      697.    2.49KB        0   349     0      500ms
# … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>
In [19]:
# store median runtime (in milliseconds)
@rget rbm_rcpp # dataframe
benchmark_result["R Rcpp"] = median(rbm_rcpp[!, :median]) * 1000
Out[19]:
1.3716060000028563

Python, builtin sum

Built in function sum in Python.

In [20]:
using PyCall
PyCall.pyversion
┌ Info: Precompiling PyCall [438e738f-606a-5dbb-bf0a-cddfbfd45ab0]
└ @ Base loading.jl:1423
Out[20]:
v"3.7.6"
In [21]:
# get the Python built-in "sum" function:
pysum = pybuiltin("sum")
bm = @benchmark $pysum($x)
Out[21]:
BenchmarkTools.Trial: 66 samples with 1 evaluation.
 Range (minmax):  73.881 ms83.834 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     76.129 ms               GC (median):    0.00%
 Time  (mean ± σ):   76.274 ms ±  1.430 ms   GC (mean ± σ):  0.00% ± 0.00%

                 ▁  ▆ ▁▆   █         ▁                        
  ▄▁▄▁▄▁▄▁▇▄▄▁▁▇▇█▄▁█▄██▇▄█▇▇▄▄▇▄▁▄▇█▄▁▁▄▁▄▁▁▁▁▁▁▄▁▁▁▁▁▁▁▁▇ ▁
  73.9 ms         Histogram: frequency by time        79.4 ms <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [22]:
# store median runtime (in miliseconds)
benchmark_result["Python builtin"] = median(bm.times) / 1e6
Out[22]:
76.1290385

Python, handwritten loop

In [23]:
using PyCall

py"""
def py_sum(A):
    s = 0.0
    for a in A:
        s += a
    return s
"""

sum_py = py"py_sum"

bm = @benchmark $sum_py($x)
Out[23]:
BenchmarkTools.Trial: 51 samples with 1 evaluation.
 Range (minmax):  91.627 ms113.009 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     98.220 ms                GC (median):    0.00%
 Time  (mean ± σ):   99.584 ms ±   4.576 ms   GC (mean ± σ):  0.00% ± 0.00%

                 █ ▃  ▁                                    
  ▄▁▁▁▁▄▁▄▁▁▄▄▇▇▇█▇█▄█▄▄▄█▇▄▁▄▁▄▄▁▁▁▇▁▁▁▁▁▁▁▁▁▁▄▄▁▁▁▁▁▁▁▁▁▄▄ ▁
  91.6 ms         Histogram: frequency by time          113 ms <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [24]:
# store median runtime (in miliseconds)
benchmark_result["Python loop"] = median(bm.times) / 1e6
Out[24]:
98.219855

Python, numpy

Numpy is the high-performance scientific computing library for Python.

In [25]:
# bring in sum function from Numpy 
numpy_sum = pyimport("numpy")."sum"
Out[25]:
PyObject <function sum at 0x7fb2addd0b90>
In [26]:
bm = @benchmark $numpy_sum($x)
Out[26]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):  391.407 μs 1.392 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     459.204 μs               GC (median):    0.00%
 Time  (mean ± σ):   474.591 μs ± 62.441 μs   GC (mean ± σ):  0.00% ± 0.00%

      ▁ ▁▁▅██▄▃▁                                               
  ▁▃▃▅██████████▇▇▇▅▅▄▄▃▃▃▂▂▃▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▃
  391 μs          Histogram: frequency by time          731 μs <

 Memory estimate: 240 bytes, allocs estimate: 6.
In [27]:
# store median runtime (in miliseconds)
benchmark_result["Python numpy"] = median(bm.times) / 1e6
Out[27]:
0.4592035

Numpy performance is on a par with Julia build-in sum function. Both are about 3 times faster than C, possibly because of usage of SIMD.

Julia, builtin sum

@time, @elapsed, @allocated macros in Julia report run times and memory allocation.

In [28]:
@time sum(x) # no compilation time after first run
  0.000612 seconds (1 allocation: 16 bytes)
Out[28]:
500220.6882968192

For more robust benchmarking, we use BenchmarkTools.jl package.

In [29]:
bm = @benchmark sum($x)
Out[29]:
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (minmax):  220.515 μs  4.065 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     319.913 μs                GC (median):    0.00%
 Time  (mean ± σ):   341.570 μs ± 107.341 μs   GC (mean ± σ):  0.00% ± 0.00%

           ▄█▅                                                  
  ▂▅▇▅▅▃▄▅█████▇▆▄▄▃▃▃▂▂▂▂▂▂▂▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁ ▂
  221 μs           Histogram: frequency by time          742 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [30]:
benchmark_result["Julia builtin"] = median(bm.times) / 1e6
Out[30]:
0.3199135

Julia, handwritten loop

Let's also write a loop and benchmark.

In [31]:
function jl_sum(A)
    s = zero(eltype(A))
    for a in A
        s += a
    end
    s
end

bm = @benchmark jl_sum($x)
Out[31]:
BenchmarkTools.Trial: 3965 samples with 1 evaluation.
 Range (minmax):  1.137 ms 1.827 ms   GC (min … max): 0.00% … 0.00%
 Time  (median):     1.241 ms               GC (median):    0.00%
 Time  (mean ± σ):   1.255 ms ± 67.139 μs   GC (mean ± σ):  0.00% ± 0.00%

      ▃▄▂▁▃▅▅▆█▇▅▄▄▃▂▂▁▁▁▁                                ▂
  ▆▄▆▆██████████████████████▇██▇▇▇▇▇▆▇▆▅▅▆▆▅▅▄▆▆▅▅▆▃▅▅▄▅▃ █
  1.14 ms      Histogram: log(frequency) by time     1.57 ms <

 Memory estimate: 0 bytes, allocs estimate: 0.
In [32]:
benchmark_result["Julia loop"] = median(bm.times) / 1e6
Out[32]:
1.240817

Exercise: annotate the loop by @simd and benchmark again.

Summary

In [33]:
sort(collect(benchmark_result), by = x -> x[2])
Out[33]:
9-element Vector{Pair{Any, Any}}:
  "Julia builtin" => 0.3199135
   "Python numpy" => 0.4592035
              "C" => 1.226119
     "Julia loop" => 1.240817
         "R Rcpp" => 1.3716060000028563
      "R builtin" => 2.5748809999974753
         "R loop" => 13.424493500000523
 "Python builtin" => 76.1290385
    "Python loop" => 98.219855
  • C, R builtin are the baseline C performance (gold standard).

  • Python builtin and Python loop are 80-100 fold slower than C because the loop is interpreted.

  • R loop is about 30 folder slower than C and indicates the performance of JIT bytecode generated by its compiler package (turned on by default since R v3.4.0 (Apr 2017)).

  • Julia loop is close to C performance, because Julia code is JIT compiled.

  • Julia builtin and Python numpy are 3-4 folder faster than C because of SIMD.

Take home message for computational scientists

  • High-level language (R, Python, Matlab) programmers should be familiar with existing high-performance packages. Don't reinvent wheels.

    • R: RcppEigen, tidyverse, ...
    • Python: numpy, scipy, ...
  • In most research projects, looping is unavoidable. Then we need to use a low-level language.

    • R: Rcpp, ...
    • Python: Cython, ...
  • In this course, we use Julia, which seems to circumvent the two language problem. So we can spend more time on algorithms, not on juggling $\ge 2$ languages.