Numerical linear algebra: introduction

  • Topics in numerical linear algebra:

    • BLAS
    • solve linear equations $\mathbf{A} \mathbf{x} = \mathbf{b}$
    • regression computations $\mathbf{X}^T \mathbf{X} \beta = \mathbf{X}^T \mathbf{y}$
    • eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{x}$
    • generalized eigen-problems $\mathbf{A} \mathbf{x} = \lambda \mathbf{B} \mathbf{x}$
    • singular value decompositions $\mathbf{A} = \mathbf{U} \Sigma \mathbf{V}^T$
    • iterative methods
  • Except for the iterative methods, these numerical linear algebra tasks are all implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).

  • Our major learning objectives are to

    1. know the computational cost (flop count) of each task
    2. be familiar with the BLAS and LAPACK functions (what they do)
    3. avoid re-inventing the wheel by implementing these dense linear algebra subroutines yourself
    4. understand the need for iterative methods
    5. apply appropriate numerical algebra tools to various statistical problems
  • All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.

    • Julia offers more flexibility by exposing interfaces to many BLAS/LAPACK subroutines directly. See the documentation and the sketch below.
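
As a preview, here is a minimal sketch (assuming Julia 0.5-era syntax, consistent with the benchmark cells below) of high-level Julia calls that dispatch to LAPACK for these tasks:

      srand(123)
      A = rand(5, 5); b = rand(5)
      x = A \ b             # solve A * x = b (LU decomposition with partial pivoting)

      X = rand(10, 5); y = rand(10)
      β = X \ y             # least squares X * β ≈ y (QR decomposition)

      S = A + A'            # a symmetric matrix
      λ, V = eig(S)         # eigenvalues and eigenvectors
      U, σ, V = svd(A)      # singular value decomposition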

BLAS

  • BLAS stands for basic linear algebra subprograms.

  • See netlib for a complete list of standardized BLAS functions.

  • There are many implementations of BLAS.

    • Netlib provides a reference implementation
    • Matlab uses Intel's MKL (Math Kernel Library)
    • Julia uses OpenBLAS
  • There are 3 levels of BLAS functions.

    • Level 1: vector-vector operation
    • Level 2: matrix-vector operation
    • Level 3: matrix-matrix operation
| Level | Example operation | Name | Dimension | Flops |
|-------|-------------------|------|-----------|-------|
| 1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
| 2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
| 3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
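
These routines are exposed in Julia through the Base.LinAlg.BLAS wrappers (the Julia 0.5-era module path). A minimal sketch touching each level:

      import Base.LinAlg.BLAS

      n = 5
      x, y = rand(n), rand(n)
      A, B = rand(n, n), rand(n, n)
      C = zeros(n, n)

      α = dot(x, y)                            # level 1: dot product, 2n flops
      BLAS.axpy!(2.0, x, y)                    # level 1: y ← y + 2x, 2n flops
      BLAS.gemv!('N', 1.0, A, x, 1.0, y)       # level 2: y ← y + Ax, 2mn flops
      BLAS.ger!(1.0, y, x, A)                  # level 2: A ← A + yxᵀ, 2mn flops
      BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)  # level 3: C ← C + AB, 2mnp flops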
  • Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z) arithmetic; for example, dgemm is the double-precision matrix-matrix multiply.

  • Some operations appear to be level-3 but are in fact level-2.

    • A common operation in statistics is column scaling or row scaling $$ \begin{eqnarray*} \mathbf{A} &=& \mathbf{A} \mathbf{D} \quad \text{(column scaling)} \\ \mathbf{A} &=& \mathbf{D} \mathbf{A} \quad \text{(row scaling)} \end{eqnarray*} $$ where $\mathbf{D}$ is diagonal.
    • These are essentially level-2 operations, as the following benchmarks demonstrate!
In [1]:
using BenchmarkTools

n = 2000

A = rand(n, n)
d = rand(n)  # d vector
D = diagm(d) # diagonal matrix with d as diagonal

# this is calling BLAS routine for matrix multiplication
@benchmark A * D
Out[1]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  2
  --------------
  minimum time:     101.964 ms (3.48% GC)
  median time:      103.480 ms (3.55% GC)
  mean time:        105.687 ms (4.21% GC)
  maximum time:     143.148 ms (28.83% GC)
  --------------
  samples:          48
  evals/sample:     1
In [2]:
# columnwise scaling
@benchmark scale(A, d)
WARNING: scale(A::AbstractMatrix,x::AbstractVector) is deprecated, use A * Diagonal(x) instead.
Out[2]:
BenchmarkTools.Trial: 
  memory estimate:  30.53 MiB
  allocs estimate:  68
  --------------
  minimum time:     8.908 ms (2.80% GC)
  median time:      11.488 ms (31.38% GC)
  mean time:        11.722 ms (31.10% GC)
  maximum time:     57.167 ms (84.93% GC)
  --------------
  samples:          426
  evals/sample:     1
In [3]:
# another way for columnwise scaling
@benchmark A * Diagonal(d)
Out[3]:
BenchmarkTools.Trial: 
  memory estimate:  30.52 MiB
  allocs estimate:  3
  --------------
  minimum time:     8.430 ms (2.81% GC)
  median time:      10.937 ms (31.55% GC)
  mean time:        11.139 ms (31.97% GC)
  maximum time:     56.490 ms (85.42% GC)
  --------------
  samples:          448
  evals/sample:     1

Memory hierarchy and level-3 fraction

The key to high performance is effective use of the memory hierarchy. This is true on all architectures.

  • Flop count is not the sole determinant of algorithm efficiency. Another important factor is data movement through the memory hierarchy.

  • For example, the Xeon X5650 CPU has a theoretical peak throughput of 128 DP (double-precision) GFLOPS but a maximum memory bandwidth of only 32 GB/s.

  • Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
    Answer: use high-level BLAS as much as possible.

| BLAS | Dimension | Mem. Refs. | Flops | Ratio |
|------|-----------|------------|-------|-------|
| Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
| Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
| Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
  • Higher-level BLAS (level 3 or 2) makes more effective use of the arithmetic logic units (ALUs) by keeping them busy: the surface-to-volume effect.
    See the Dongarra slides.

  • A distinction between LAPACK and LINPACK (older versions of R used LINPACK) is that LAPACK makes use of higher-level BLAS as much as possible (usually by smart partitioning) to increase the so-called level-3 fraction.
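
As a back-of-envelope check of the ratios in the table above (plain arithmetic, for $n = 1000$):

      n = 1000
      2n / (3 * n)       # level 1 (axpy): 2/3 flop per memory reference
      2 * n^2 / n^2      # level 2 (gaxpy): 2 flops per memory reference
      2 * n^3 / (4n^2)   # level 3 (gemm): n/2 = 500 flops per memory reference

The level-3 ratio grows with $n$: large matrix-matrix multiplications can keep the ALUs saturated, while level-1 and level-2 operations are memory-bound.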

Effect of data layout

  • Data layout in memory affects algorithmic efficiency too. It is much faster to move chunks of data in memory than retrieving/writing scattered data.

  • Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).

  • A cache line is the smallest unit of data that can be loaded into cache and stored back to memory.

    • x86 CPUs: 64 bytes
    • ARM CPUs: 32 bytes

  • Accessing a column-major matrix by rows causes lots of cache misses, as the sketch below illustrates.
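
    A minimal sketch (hypothetical helper functions, not used elsewhere in these notes) contrasting the two traversal orders:

      # inner loop walks down a column: unit stride, cache friendly
      function sum_by_col(A::Matrix{Float64})
          s = 0.0
          for j = 1:size(A, 2), i = 1:size(A, 1)
              s += A[i, j]
          end
          s
      end

      # inner loop walks along a row: stride = size(A, 1), frequent cache misses
      function sum_by_row(A::Matrix{Float64})
          s = 0.0
          for i = 1:size(A, 1), j = 1:size(A, 2)
              s += A[i, j]
          end
          s
      end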

  • Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, such as in Julia. There are 6 variants of the algorithms according to the order in the triple loops.

    • jki or kji looping:

      # inner-most loop
      for i = 1:m
          C[i, j] = C[i, j] + A[i, k] * B[k, j]
      end

    • ikj or kij looping:

      # inner-most loop
      for j = 1:n
          C[i, j] = C[i, j] + A[i, k] * B[k, j]
      end

    • ijk or jik looping:

      # inner-most loop
      for k = 1:p
          C[i, j] = C[i, j] + A[i, k] * B[k, j]
      end
  • We pay attention to the innermost loop, where the vector calculation occurs. The associated stride when accessing the three matrices in memory (assuming column-major storage) is
| Variant | A Stride | B Stride | C Stride |
|---------|----------|----------|----------|
| $jki$ or $kji$ | Unit | 0 | Unit |
| $ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
| $ijk$ or $jik$ | Non-Unit | Unit | 0 |

Clearly the $jki$ and $kji$ variants, whose strides are all unit or zero, are preferred.

In [4]:
# triple-loop C += A * B, with the loop ordering given by `order`
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    
    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
    
end

srand(123)
m, n, p = 2000, 100, 2000
A = rand(m, n)
B = rand(n, p)
C = zeros(m, p);
  • $jki$ and $kji$ looping:
In [5]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jki")
Out[5]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     429.100 ms (0.00% GC)
  median time:      442.814 ms (0.00% GC)
  mean time:        444.246 ms (0.00% GC)
  maximum time:     459.781 ms (0.00% GC)
  --------------
  samples:          12
  evals/sample:     1
In [6]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kji")
Out[6]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     522.318 ms (0.00% GC)
  median time:      548.960 ms (0.00% GC)
  mean time:        543.765 ms (0.00% GC)
  maximum time:     556.943 ms (0.00% GC)
  --------------
  samples:          10
  evals/sample:     1
  • $ikj$ and $kij$ looping:
In [7]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ikj")
Out[7]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.658 s (0.00% GC)
  median time:      2.769 s (0.00% GC)
  mean time:        2.769 s (0.00% GC)
  maximum time:     2.880 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
In [8]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kij")
Out[8]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     2.431 s (0.00% GC)
  median time:      2.744 s (0.00% GC)
  mean time:        2.744 s (0.00% GC)
  maximum time:     3.056 s (0.00% GC)
  --------------
  samples:          2
  evals/sample:     1
  • $ijk$ and $jik$ looping:
In [9]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ijk")
Out[9]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     996.499 ms (0.00% GC)
  median time:      1.006 s (0.00% GC)
  mean time:        1.012 s (0.00% GC)
  maximum time:     1.031 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
In [10]:
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jik")
Out[10]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     1.004 s (0.00% GC)
  median time:      1.033 s (0.00% GC)
  mean time:        1.091 s (0.00% GC)
  maximum time:     1.248 s (0.00% GC)
  --------------
  samples:          5
  evals/sample:     1
  • Julia's wrapper of the BLAS gemm! function. We see the BLAS library wins hands down (multi-threading, cache-aware blocking, and a higher level-3 fraction via block outer products).
In [11]:
fill!(C, 0.0)
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)
Out[11]:
BenchmarkTools.Trial: 
  memory estimate:  0 bytes
  allocs estimate:  0
  --------------
  minimum time:     4.844 ms (0.00% GC)
  median time:      5.224 ms (0.00% GC)
  mean time:        5.436 ms (0.00% GC)
  maximum time:     10.025 ms (0.00% GC)
  --------------
  samples:          914
  evals/sample:     1

To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is

Make friends with (good implementations of) BLAS/LAPACK and use them as much as possible.

Avoid memory allocation: some examples

  • Transposing a matrix is an expensive memory operation.
    • In R, the command

      t(A) %*% x

      will first transpose A and then perform the matrix-vector multiplication, causing an unnecessary memory allocation.
    • Julia is smart enough to avoid transposing the matrix whenever possible.
In [12]:
srand(123)

n = 1000
A = rand(n, n)
x = rand(n)

# dispatch to At_mul_B (and then to BLAS)
# does *not* actually transpose the matrix
@benchmark A' * x
Out[12]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     92.976 μs (0.00% GC)
  median time:      143.767 μs (0.00% GC)
  mean time:        229.011 μs (0.00% GC)
  maximum time:     1.740 ms (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [13]:
# dispatch to BLAS
@benchmark At_mul_B(A, x)
Out[13]:
BenchmarkTools.Trial: 
  memory estimate:  7.94 KiB
  allocs estimate:  1
  --------------
  minimum time:     94.267 μs (0.00% GC)
  median time:      129.143 μs (0.00% GC)
  mean time:        150.209 μs (0.00% GC)
  maximum time:     889.597 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [14]:
# let's force transpose
@benchmark transpose(A) * x
Out[14]:
BenchmarkTools.Trial: 
  memory estimate:  7.64 MiB
  allocs estimate:  3
  --------------
  minimum time:     2.995 ms (0.00% GC)
  median time:      3.702 ms (0.00% GC)
  mean time:        4.038 ms (21.29% GC)
  maximum time:     9.451 ms (39.54% GC)
  --------------
  samples:          1225
  evals/sample:     1
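
To avoid even the small allocation for the result vector, one can preallocate the output and use the in-place variant (a sketch using the Julia 0.5-era At_mul_B!; later Julia versions write mul!(y, A', x)):

      y = zeros(n)
      At_mul_B!(y, A, x)  # y ← Aᵀ x computed in place; no allocation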
  • Broadcasting in Julia achieves vectorized code without creating intermediate arrays.
In [15]:
srand(123)
X, Y = rand(1000,1000), rand(1000,1000)

# two temporary arrays are created
@benchmark max(abs(X), abs(Y))
Out[15]:
BenchmarkTools.Trial: 
  memory estimate:  22.89 MiB
  allocs estimate:  10
  --------------
  minimum time:     16.330 ms (9.89% GC)
  median time:      19.290 ms (10.55% GC)
  mean time:        19.408 ms (10.62% GC)
  maximum time:     30.123 ms (12.41% GC)
  --------------
  samples:          258
  evals/sample:     1
In [16]:
# fused broadcast: no temporary arrays, only the result is allocated
@benchmark max.(abs.(X), abs.(Y))
Out[16]:
BenchmarkTools.Trial: 
  memory estimate:  7.63 MiB
  allocs estimate:  37
  --------------
  minimum time:     2.766 ms (0.00% GC)
  median time:      3.613 ms (0.00% GC)
  mean time:        4.419 ms (22.56% GC)
  maximum time:     10.487 ms (38.26% GC)
  --------------
  samples:          1125
  evals/sample:     1
In [17]:
# no new arrays allocated: Z is updated in place
Z = zeros(X)
@benchmark Z .= max.(abs.(X), abs.(Y))
Out[17]:
BenchmarkTools.Trial: 
  memory estimate:  96 bytes
  allocs estimate:  3
  --------------
  minimum time:     1.668 ms (0.00% GC)
  median time:      1.879 ms (0.00% GC)
  mean time:        1.995 ms (0.00% GC)
  maximum time:     5.185 ms (0.00% GC)
  --------------
  samples:          2473
  evals/sample:     1
  • A view avoids creating an extra copy of the matrix data.
In [18]:
srand(123)
A = randn(1000, 1000)

# sum entries in a sub-matrix
@benchmark sum(A[1:2:500, 1:2:500])
Out[18]:
BenchmarkTools.Trial: 
  memory estimate:  488.47 KiB
  allocs estimate:  5
  --------------
  minimum time:     62.532 μs (0.00% GC)
  median time:      237.506 μs (0.00% GC)
  mean time:        255.955 μs (14.62% GC)
  maximum time:     4.642 ms (82.11% GC)
  --------------
  samples:          10000
  evals/sample:     1
In [19]:
# view avoids creating a separate sub-matrix
@benchmark sum(@view A[1:2:500, 1:2:500])
Out[19]:
BenchmarkTools.Trial: 
  memory estimate:  256 bytes
  allocs estimate:  7
  --------------
  minimum time:     77.294 μs (0.00% GC)
  median time:      84.070 μs (0.00% GC)
  mean time:        89.548 μs (0.00% GC)
  maximum time:     353.716 μs (0.00% GC)
  --------------
  samples:          10000
  evals/sample:     1

Julia 0.6 adds the @views macro, which can be useful in some operations.
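
A minimal sketch (assuming Julia ≥ 0.6): the macro converts every slicing operation in the annotated expression into a view, so no sub-matrix copies are made.

      # both slices become views; no copies of the sub-matrices are created
      @views s = sum(A[1:2:500, 1:2:500]) + sum(A[:, 1])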