In [1]:
versioninfo()
Julia Version 1.1.0
Commit 80516ca202 (2019-01-21 21:24 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i7-6920HQ CPU @ 2.90GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = code

Numerical linear algebra: introduction

  • Topics in numerical algebra:

    • BLAS
    • solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    • regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$
    • eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$
    • generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$
    • singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$
    • iterative methods for numerical linear algebra
  • Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • Our major goals (or learning objectives) are to

    1. know the complexity (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)
    3. not re-invent the wheel by implementing these dense linear algebra subroutines yourself
    4. understand the need for iterative methods
    5. apply appropriate numerical algebra tools to various statistical problems
  • All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See documentation.

BLAS

  • BLAS stands for basic linear algebra subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation.
    • Matlab uses Intel's MKL (Math Kernel Library). The MKL implementation is the gold standard on the market. It is not open source, but the compiled library is freely available for Linux and macOS.
    • Julia uses OpenBLAS, widely regarded as the best open-source implementation.
  • There are 3 levels of BLAS functions.

Level Example Operation Name Dimension Flops
1 $\alpha \gets \mathbf{x}^T \mathbf{y}$ dot product $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $2n$
1 $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ axpy $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $2n$
2 $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ gaxpy $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ $2mn$
2 $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ rank one update $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ $2mn$
3 $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ matrix multiplication $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ $2mnp$
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
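Julia exposes many of these subroutines directly through LinearAlgebra.BLAS. As a minimal sketch (the vectors and matrices below are made up purely for illustration), one routine from each level:

    using LinearAlgebra

    n = 1000
    x, y = randn(n), randn(n)
    A, B = randn(n, n), randn(n, n)
    C = zeros(n, n)

    # level 1: dot product xᵀy (~2n flops) and axpy y ← y + αx (~2n flops)
    LinearAlgebra.BLAS.dot(n, x, 1, y, 1)
    LinearAlgebra.BLAS.axpy!(2.0, x, y)                  # overwrites y

    # level 2: gemv y ← αAx + βy (~2n² flops)
    LinearAlgebra.BLAS.gemv!('N', 1.0, A, x, 0.0, y)     # overwrites y

    # level 3: gemm C ← αAB + βC (~2n³ flops)
    LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C) # overwrites C

The same wrappers dispatch to the single precision (S) and complex (C/Z) routines according to the element type of the arguments.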

Examples

The form of a mathematical expression and the way the expression should be evaluated in actual practice may be quite different.

Some operations appear to be level-3 but are in fact level-2.

Example 1. A common operation in statistics is column scaling or row scaling $$ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)}, \end{eqnarray*} $$ where $\mathbf{D}$ is diagonal. For example, in generalized linear models (GLMs), the Fisher information matrix takes the form
$$ \mathbf{X}^T \mathbf{W} \mathbf{X}, $$ where $\mathbf{W}$ is a diagonal matrix with observation weights on diagonal.

Column and row scalings are essentially level-2 operations!
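For instance, here is a rough sketch of the information matrix computation (the design matrix X and weights w below are hypothetical): keeping $\mathbf{W}$ as a Diagonal object makes $\mathbf{X}^T \mathbf{W} \mathbf{X}$ a row scaling followed by a p-by-n times n-by-p product, whereas materializing $\mathbf{W}$ as a dense n-by-n matrix forces a full $O(n^2 p)$ matrix multiplication.

    using BenchmarkTools, LinearAlgebra, Random

    Random.seed!(123)
    n, p = 2000, 100
    X = randn(n, p)
    w = rand(n)               # observation weights (hypothetical)

    W     = Diagonal(w)       # stores only the n diagonal entries
    Wfull = diagm(0 => w)     # dense n×n matrix

    @btime transpose($X) * $W * $X;      # row scaling, then a p×n by n×p product
    @btime transpose($X) * $Wfull * $X;  # full n×n matrix multiplication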

In [2]:
using BenchmarkTools, LinearAlgebra, Random

Random.seed!(123) # seed
n = 2000
A = rand(n, n) # n-by-n matrix
d = rand(n)  # n vector
D = Diagonal(d) # diagonal matrix with d as diagonal
Out[2]:
2000×2000 Diagonal{Float64,Array{Float64,1}}:
 0.140972   ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      
  ⋅        0.143596   ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅        0.612494   ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅        0.0480573      ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
 ⋮                                        ⋱                              
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅         …   ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅            0.882415   ⋅         ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅        0.450904   ⋅      
  ⋅         ⋅         ⋅         ⋅             ⋅         ⋅        0.814614
In [3]:
Dfull = convert(Matrix, D) # convert to full matrix
Out[3]:
2000×2000 Array{Float64,2}:
 0.140972  0.0       0.0       0.0        …  0.0       0.0       0.0     
 0.0       0.143596  0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.612494  0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0480573     0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 ⋮                                        ⋱                              
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0        …  0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.0     
 0.0       0.0       0.0       0.0           0.882415  0.0       0.0     
 0.0       0.0       0.0       0.0           0.0       0.450904  0.0     
 0.0       0.0       0.0       0.0           0.0       0.0       0.814614
In [4]:
# this is calling BLAS routine for matrix multiplication: O(n^3) flops
# this is SLOW!
@benchmark $A * $Dfull
Out[4]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     100.961 ms (2.61% GC)
  median time:      102.280 ms (2.60% GC)
  mean time:        103.678 ms (3.29% GC)
  maximum time:     138.528 ms (26.71% GC)
  --------------
  samples:          49
  evals/sample:     1
In [5]:
# dispatch to special method for diagonal matrix multiplication.
# columnwise scaling: O(n^2) flops
@benchmark $A * $D
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     9.051 ms (3.33% GC)
  median time:      11.632 ms (24.18% GC)
  mean time:        11.911 ms (24.67% GC)
  maximum time:     53.310 ms (83.51% GC)
  --------------
  samples:          420
  evals/sample:     1
In [6]:
# in-place: avoid allocating space for the result
# rmul!: compute the matrix-matrix product A*D in place, overwriting A, and return the result.
@benchmark rmul!(A, D)
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     5.276 ms (0.00% GC)
  median time:      10.364 ms (0.00% GC)
  mean time:        11.183 ms (0.00% GC)
  maximum time:     19.343 ms (0.00% GC)
  --------------
  samples:          447
  evals/sample:     1

Note: In R or Matlab, diag(d) creates a full matrix. Be cautious when using the diag function: do we really need a full diagonal matrix? (Julia's diagm is contrasted with Diagonal after the Matlab example below.)

In [7]:
using RCall

R"""
d <- runif(5)
diag(d)
"""
┌ Info: Recompiling stale cache file /Users/huazhou/.julia/compiled/v1.1/RCall/8GFyb.ji for RCall [6f49c342-dc21-5d91-9882-a32aef131414]
└ @ Base loading.jl:1184
Out[7]:
RObject{RealSxp}
          [,1]       [,2]      [,3]      [,4]     [,5]
[1,] 0.1639009 0.00000000 0.0000000 0.0000000 0.000000
[2,] 0.0000000 0.03372306 0.0000000 0.0000000 0.000000
[3,] 0.0000000 0.00000000 0.7544716 0.0000000 0.000000
[4,] 0.0000000 0.00000000 0.0000000 0.3005411 0.000000
[5,] 0.0000000 0.00000000 0.0000000 0.0000000 0.144251
In [8]:
using MATLAB

mat"""
d = rand(5, 1)
diag(d)
"""
>> >> >> 
d =

    0.8147
    0.9058
    0.1270
    0.9134
    0.6324

Out[8]:
5×5 Array{Float64,2}:
 0.814724  0.0       0.0       0.0       0.0     
 0.0       0.905792  0.0       0.0       0.0     
 0.0       0.0       0.126987  0.0       0.0     
 0.0       0.0       0.0       0.913376  0.0     
 0.0       0.0       0.0       0.0       0.632359
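Back in Julia, diagm behaves like R/Matlab's diag: it materializes a dense matrix, while Diagonal stores only the diagonal entries. A quick check (a sketch):

    using LinearAlgebra

    d = rand(5)
    Ddense = diagm(0 => d)    # 5×5 Array{Float64,2}: 25 stored entries
    Ddiag  = Diagonal(d)      # Diagonal: stores only the 5 diagonal entries
    Base.summarysize(Ddense), Base.summarysize(Ddiag)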

Example 2. The inner product between two matrices $\mathbf{A}, \mathbf{B} \in \mathbb{R}^{m \times n}$ is often written as $$ \text{trace}(\mathbf{A}^T \mathbf{B}), \text{trace}(\mathbf{B} \mathbf{A}^T), \text{trace}(\mathbf{A} \mathbf{B}^T), \text{ or } \text{trace}(\mathbf{B}^T \mathbf{A}). $$ These appear to be level-3 operations (matrix multiplication with $O(m^2n)$ or $O(mn^2)$ flops).

In [9]:
Random.seed!(123)
n = 2000
A, B = randn(n, n), randn(n, n)

# slow way: form the full n×n product, then take its trace
@benchmark tr(transpose($A) * $B)
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     104.447 ms (0.43% GC)
  median time:      110.943 ms (2.73% GC)
  mean time:        110.926 ms (2.70% GC)
  maximum time:     117.164 ms (2.65% GC)
  --------------
  samples:          46
  evals/sample:     1

But $\text{trace}(\mathbf{A}^T \mathbf{B}) = \langle \text{vec}(\mathbf{A}), \text{vec}(\mathbf{B}) \rangle$. The latter is a level-1 operation (a dot product of length $mn$) with $O(mn)$ flops.

In [10]:
@benchmark dot($A, $B)
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.390 ms (0.00% GC)
  median time:      3.613 ms (0.00% GC)
  mean time:        3.618 ms (0.00% GC)
  maximum time:     5.299 ms (0.00% GC)
  --------------
  samples:          1375
  evals/sample:     1

Example 3. Similarly $\text{diag}(\mathbf{A}^T \mathbf{B})$ can be calculated in $O(mn)$ flops.

In [11]:
# slow way: form the full matrix product, then extract its diagonal
@benchmark diag(transpose($A) * $B)
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  3
  --------------
  minimum time:     105.656 ms (2.72% GC)
  median time:      110.821 ms (2.75% GC)
  mean time:        111.252 ms (2.69% GC)
  maximum time:     119.092 ms (2.59% GC)
  --------------
  samples:          45
  evals/sample:     1
In [12]:
# smarter: columnwise sums of the elementwise product give diag(transpose(A) * B) in O(mn) flops
@benchmark Diagonal(vec(sum($A .* $B, dims=1)))
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  6
  --------------
  minimum time:     9.309 ms (5.34% GC)
  median time:      11.412 ms (22.53% GC)
  mean time:        11.468 ms (22.75% GC)
  maximum time:     13.552 ms (34.13% GC)
  --------------
  samples:          436
  evals/sample:     1

To avoid allocating the intermediate array altogether, we can write a double loop or use the dot function on matching columns.

In [13]:
function diag_matmul(A, B)
    m, n = size(A)
    @assert size(B) == (m, n) "A and B should have same size"
    @views d = [dot(A[:, j], B[:, j]) for j in 1:n]
#    d = zeros(eltype(A), n)
#    for j in 1:n, i in 1:m
#        d[j] += A[i, j] * B[i, j]
#    end
    Diagonal(d)
end

@benchmark diag_matmul($A, $B)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  203.33 KiB
  allocs estimate:  4004
  --------------
  minimum time:     3.277 ms (0.00% GC)
  median time:      3.503 ms (0.00% GC)
  mean time:        3.598 ms (0.39% GC)
  maximum time:     6.018 ms (29.67% GC)
  --------------
  samples:          1386
  evals/sample:     1

Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

  • Numbers everyone should know
Operation Time
L1 cache reference 0.5 ns
L2 cache reference 7 ns
Main memory reference 100 ns
Read 1 MB sequentially from memory 250,000 ns
Read 1 MB sequentially from SSD 1,000,000 ns
Read 1 MB sequentially from disk 20,000,000 ns

Source: https://gist.github.com/jboner/2841832

  • For example, a Xeon X5650 CPU has a theoretical peak of 128 GFLOPS in double precision but a maximum memory bandwidth of only 32 GB/s.

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

BLAS Dimension Mem. Refs. Flops Ratio
Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ $3n$ $2n$ 3:2
Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ $n^2$ $2n^2$ 1:2
Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ $\mathbf{A}, \mathbf{B}, \mathbf{C} \in\mathbb{R}^{n \times n}$ $4n^2$ $2n^3$ 2:n
  • Higher-level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy: the surface-to-volume effect (see the gemm-versus-gemv sketch at the end of this section).
    See the Dongarra slides.

  • A distinction between LAPACK and LINPACK (older versions of R use LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.

  • To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:

Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
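To see the surface-to-volume effect concretely, here is a rough sketch (the helper function name is made up; absolute timings depend on the machine and the number of BLAS threads) comparing one level-3 gemm call against the same computation done as a sequence of level-2 gemv calls:

    using BenchmarkTools, LinearAlgebra, Random

    Random.seed!(123)
    n = 1000
    A, B = randn(n, n), randn(n, n)
    C = zeros(n, n)

    # level 3: a single matrix-matrix multiplication, 2n^3 flops over ~4n^2 numbers
    @btime mul!($C, $A, $B);

    # level 2: the same 2n^3 flops as n matrix-vector products; A is re-read n times
    function matmul_by_gemv!(C, A, B)
        for j in 1:size(B, 2)
            @views mul!(C[:, j], A, B[:, j])
        end
        C
    end
    @btime matmul_by_gemv!($C, $A, $B);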

Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data in memory than to retrieve or write scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the minimum amount of data that can be transferred between the cache and main memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • Accessing a column-major matrix by rows ($ij$ looping) causes lots of cache misses (a simple traversal sketch follows this list).

  • Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:
      # inner most loop
        for i = 1:m
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ikj or kij looping:
      # inner most loop        
        for j = 1:n
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
    • ijk or jik looping:
      # inner most loop        
        for k = 1:p
            C[i, j] = C[i, j] + A[i, k] * B[k, j]
        end
      
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
Variant A Stride B Stride C Stride
$jki$ or $kji$ Unit 0 Unit
$ikj$ or $kij$ 0 Non-Unit Non-Unit
$ijk$ or $jik$ Non-Unit Unit 0

Evidently, the variants $jki$ or $kji$ are preferred.
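Before the matrix-multiplication experiments that follow, here is a minimal sketch of the same effect on a simple reduction (the function names are made up): summing a column-major matrix down its columns gives unit-stride access, while summing across rows strides by the number of rows and thrashes the cache.

    using BenchmarkTools, Random

    Random.seed!(123)
    A = randn(5000, 5000)

    # inner loop walks down a column: unit stride, cache friendly
    function sum_by_columns(A)
        s = zero(eltype(A))
        for j in axes(A, 2), i in axes(A, 1)
            s += A[i, j]
        end
        s
    end

    # inner loop jumps across a row: stride = size(A, 1), many cache misses
    function sum_by_rows(A)
        s = zero(eltype(A))
        for i in axes(A, 1), j in axes(A, 2)
            s += A[i, j]
        end
        s
    end

    @btime sum_by_columns($A);
    @btime sum_by_rows($A);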

In [14]:
"""
    matmul_by_loop!(A, B, C, order)

Overwrite `C` by `A * B`. `order` indicates the looping order for triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)
    
    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

using Random

Random.seed!(123)
m, p, n = 2000, 100, 2000
A = rand(m, p)
B = rand(p, n)
C = zeros(m, n);
  • $jki$ and $kji$ looping:
In [15]:
using BenchmarkTools

@benchmark matmul_by_loop!($A, $B, $C, "jki")
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     276.404 ms (0.00% GC)
  median time:      282.157 ms (0.00% GC)
  mean time:        283.423 ms (0.00% GC)
  maximum time:     291.927 ms (0.00% GC)
  --------------
  samples:          18
  evals/sample:     1
In [16]:
@benchmark matmul_by_loop!($A, $B, $C, "kji")
Out[16]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     335.573 ms (0.00% GC)
  median time:      355.841 ms (0.00% GC)
  mean time:        362.742 ms (0.00% GC)
  maximum time:     409.550 ms (0.00% GC)
  --------------
  samples:          14
  evals/sample:     1
  • $ikj$ and $kij$ looping:
In [17]:
@benchmark matmul_by_loop!($A, $B, $C, "ikj")
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.405 s (0.00% GC)
  median time:      2.406 s (0.00% GC)
  mean time:        2.431 s (0.00% GC)
  maximum time:     2.480 s (0.00% GC)
  --------------
  samples:          3
  evals/sample:     1
In [18]:
@benchmark matmul_by_loop!($A, $B, $C, "kij")
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.714 s (0.00% GC)
  median time:      2.726 s (0.00% GC)
  mean time:        2.726 s (0.00% GC)
  maximum time:     2.738 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  • $ijk$ and $jik$ looping:
In [19]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     960.266 ms (0.00% GC)
  median time:      975.456 ms (0.00% GC)
  mean time:        979.488 ms (0.00% GC)
  maximum time:     1.001 s (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1
In [20]:
@benchmark matmul_by_loop!($A, $B, $C, "ijk")
Out[20]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.017 s (0.00% GC)
  median time:      1.046 s (0.00% GC)
  mean time:        1.040 s (0.00% GC)
  maximum time:     1.058 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
  • Julia wraps a BLAS library for matrix multiplication. We see that the BLAS routine wins hands down (multi-threading, cache-aware blocking, and a higher level-3 fraction via block outer products).
In [21]:
@benchmark mul!($C, $A, $B)
Out[21]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.078 ms (0.00% GC)
  median time:      7.229 ms (0.00% GC)
  mean time:        7.805 ms (0.00% GC)
  maximum time:     10.858 ms (0.00% GC)
  --------------
  samples:          640
  evals/sample:     1
In [22]:
# direct call of BLAS wrapper function
@benchmark LinearAlgebra.BLAS.gemm!('N', 'N', 1.0, $A, $B, 0.0, $C)
Out[22]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     7.209 ms (0.00% GC)
  median time:      8.820 ms (0.00% GC)
  mean time:        8.798 ms (0.00% GC)
  maximum time:     11.884 ms (0.00% GC)
  --------------
  samples:          568
  evals/sample:     1

BLAS in R

  • Tip for R users: the standard R distribution from CRAN uses a very outdated BLAS/LAPACK library.
In [23]:
using RCall

R"""
library("microbenchmark")
microbenchmark($A %*% $B)
"""
Out[23]:
RObject{VecSxp}
Unit: milliseconds
                expr      min       lq     mean   median       uq      max
 `#JL`$A %*% `#JL`$B 226.5638 234.8593 242.5344 242.8393 250.3439 259.4928
 neval
   100
  • Re-building R from source against OpenBLAS or MKL will immediately boost linear algebra performance in R. Google "build R using MKL" to get started. Similarly, Julia can be built with MKL. (A quick way to check which BLAS/LAPACK R is linked against is sketched after the Matlab timing below.)

  • Matlab uses MKL. Usually it's very hard to beat Matlab in terms of linear algebra.

In [24]:
using MATLAB

mat"""
f = @() $A * $B;
timeit(f)
"""
Out[24]:
0.0099658959695
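In recent versions of R, sessionInfo() reports the BLAS and LAPACK libraries in use, so a quick check from Julia via RCall (a sketch) is:

    using RCall

    R"""
    sessionInfo()
    """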

Avoid memory allocation: some examples

  1. Transposing a matrix is an expensive memory operation.
    • In R, the command
      t(A) %*% x
      
      will first transpose A and then perform the matrix multiplication, causing an unnecessary memory allocation.
    • Julia is smart enough to avoid explicitly transposing the matrix whenever possible.
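As an aside, the usual remedy in R is crossprod(A, x), which computes $\mathbf{A}^T \mathbf{x}$ without forming the transpose explicitly. A quick comparison via RCall (a sketch; timings depend on the BLAS that R is linked against):

    using RCall, Random

    Random.seed!(123)
    A = rand(1000, 1000)
    x = rand(1000)

    R"""
    library(microbenchmark)
    microbenchmark(t($A) %*% $x, crossprod($A, $x))
    """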
In [25]:
using Random, LinearAlgebra, BenchmarkTools
Random.seed!(123)

n = 1000
A = rand(n, n)
x = rand(n);
In [26]:
typeof(transpose(A))
Out[26]:
Transpose{Float64,Array{Float64,2}}
In [27]:
fieldnames(typeof(transpose(A)))
Out[27]:
(:parent,)
In [28]:
# same data in transpose(A) and the original matrix A
pointer(transpose(A).parent), pointer(A)
Out[28]:
(Ptr{Float64} @0x0000000142a9c000, Ptr{Float64} @0x0000000142a9c000)
In [29]:
# dispatch to BLAS
# does *not* actually transpose the matrix
@benchmark transpose($A) * $x
Out[29]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     98.703 μs (0.00% GC)
  median time:      141.025 μs (0.00% GC)
  mean time:        156.860 μs (0.00% GC)
  maximum time:     369.885 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [30]:
# pre-allocate result
out = zeros(size(A, 2))
@benchmark mul!($out, transpose($A), $x)
Out[30]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     103.759 μs (0.00% GC)
  median time:      139.496 μs (0.00% GC)
  mean time:        157.336 μs (0.00% GC)
  maximum time:     399.354 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [31]:
# or call BLAS wrapper directly
@benchmark LinearAlgebra.BLAS.gemv!('T', 1.0, $A, $x, 0.0, $out)
Out[31]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     106.017 μs (0.00% GC)
  median time:      137.537 μs (0.00% GC)
  mean time:        146.369 μs (0.00% GC)
  maximum time:     736.091 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
  2. Broadcasting in Julia achieves vectorized code without creating intermediate arrays.

    Suppose we want to calculate the elementwise maximum of the absolute values of two large arrays. In R or Matlab, the command

    max(abs(X), abs(Y))
    

    will create two intermediate arrays and then one result array.

In [32]:
using RCall
Random.seed!(123)
X, Y = rand(1000, 1000), rand(1000, 1000)

R"""
library(microbenchmark)
microbenchmark(max(abs($X), abs($Y)))
"""
Out[32]:
RObject{VecSxp}
Unit: milliseconds
                            expr      min       lq    mean   median      uq
 max(abs(`#JL`$X), abs(`#JL`$Y)) 5.752931 6.047952 7.97932 6.821782 8.24881
      max neval
 26.40112   100

In Julia, dot operations are fused so no intermediate arrays are created.

In [33]:
# no intermediate arrays created, only result array created
@benchmark max.(abs.($X), abs.($Y))
Out[33]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  2
  --------------
  minimum time:     1.675 ms (0.00% GC)
  median time:      1.903 ms (0.00% GC)
  mean time:        2.529 ms (27.30% GC)
  maximum time:     5.326 ms (54.28% GC)
  --------------
  samples:          1972
  evals/sample:     1

Pre-allocating the result array gets rid of memory allocation altogether.

In [34]:
# no memory allocation at all!
Z = zeros(size(X)) # zero matrix of same size as X
@benchmark $Z .= max.(abs.($X), abs.($Y)) # .= (vs =) is important!
Out[34]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.135 ms (0.00% GC)
  median time:      1.197 ms (0.00% GC)
  mean time:        1.282 ms (0.00% GC)
  maximum time:     1.985 ms (0.00% GC)
  --------------
  samples:          3878
  evals/sample:     1
  3. view avoids creating an extra copy of matrix data.
In [35]:
Random.seed!(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum($A[1:2:500, 1:2:500])
Out[35]:
BenchmarkTools.Trial: 
  memory estimate:  488.39 KiB
  allocs estimate:  2
  --------------
  minimum time:     70.431 μs (0.00% GC)
  median time:      304.043 μs (0.00% GC)
  mean time:        287.967 μs (14.85% GC)
  maximum time:     2.843 ms (84.42% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [36]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view $A[1:2:500, 1:2:500])
Out[36]:
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     118.440 μs (0.00% GC)
  median time:      128.581 μs (0.00% GC)
  mean time:        128.805 μs (0.00% GC)
  maximum time:     484.048 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

The @views macro converts the slicing operations in an expression into views, which can be convenient when several slices are involved.

In [37]:
@benchmark @views sum($A[1:2:500, 1:2:500])
Out[37]:
BenchmarkTools.Trial: 
  memory estimate:  208 bytes
  allocs estimate:  7
  --------------
  minimum time:     118.367 μs (0.00% GC)
  median time:      120.747 μs (0.00% GC)
  mean time:        124.625 μs (0.00% GC)
  maximum time:     342.429 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1