Topics in numerical algebra:
Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).
Our major goal (or learning objective) is to understand these building blocks well enough to use them efficiently in statistical computing.
All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.
BLAS stands for basic linear algebra subprograms.
See netlib for a complete list of standardized BLAS functions.
There are many implementations of BLAS, e.g., the reference implementation from Netlib, OpenBLAS, Intel MKL, and Apple's Accelerate framework.
There are 3 levels of BLAS functions.
Level | Example Operation | Name | Dimension | Flops |
---|---|---|---|---|
1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
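As a quick illustration (a sketch, not part of the original notes; it assumes the Julia 0.6-era module path `Base.LinAlg.BLAS` used later in this section, which becomes `LinearAlgebra.BLAS` in Julia ≥ 0.7), one routine from each level can be called directly from Julia:
# one call from each BLAS level via Julia's thin wrappers (axpy!, gemv!, gemm!)
n = 5
x, y = rand(n), rand(n)
A, B, C = rand(n, n), rand(n, n), zeros(n, n)
dot(x, y)                                            # level 1: dot product x'y
Base.LinAlg.BLAS.axpy!(2.0, x, y)                    # level 1: y <- y + 2.0 * x
Base.LinAlg.BLAS.gemv!('N', 1.0, A, x, 1.0, y)       # level 2: y <- A * x + y
Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)  # level 3: C <- A * B + C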
Some operations appear to be level-3 but are in fact level-2. A common example is multiplying a matrix by a diagonal matrix: written as a matrix-matrix product it looks like level-3 (about $2n^3$ flops), but it is just columnwise scaling (about $n^2$ flops).
using BenchmarkTools
n = 2000
A = rand(n, n)
d = rand(n) # d vector
D = diagm(d) # diagonal matrix with d as diagonal
# this is calling BLAS routine for matrix multiplication
@benchmark A * D
# columnwise scaling via broadcasting: same result with only O(n^2) work
@benchmark A .* d'
# another way for columnwise scaling
@benchmark A * Diagonal(d)
The key to high performance is effective use of the memory hierarchy; this is true on all architectures.
For example, the Xeon X5650 CPU has a theoretical peak of 128 DP GFLOPS but a maximum memory bandwidth of only 32 GB/s.
Can we keep CPU cores busy with enough deliveries of matrix data and ship the results to memory fast enough to avoid backlog?
Answer: use high-level BLAS as much as possible.
BLAS | Dimension | Mem. Refs. | Flops | Ratio |
---|---|---|---|---|
Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in \mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
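To see why the level matters, here is a rough back-of-the-envelope calculation (not in the original notes) using the Xeon X5650 numbers above: at 32 GB/s the memory system delivers about $32/8 = 4 \times 10^9$ doubles per second, so sustaining 128 GFLOPS requires
$$ \frac{128 \times 10^9 \text{ flops/s}}{4 \times 10^9 \text{ doubles/s}} = 32 \text{ flops per double moved}. $$
Level-1 and level-2 BLAS perform only about 1–2 flops per memory reference and are therefore memory-bound, whereas level-3 BLAS performs roughly $n/2$ flops per reference and can keep the cores busy once the blocks are large enough.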
Data layout in memory affects algorithmic efficiency too. It is much faster to move contiguous chunks of data than to retrieve or write scattered data.
Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).
A cache line is the smallest unit of data that can be transferred between cache and main memory.
Accessing a column-major stored matrix by rows therefore causes lots of cache misses.
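A quick way to see this effect (a sketch, not from the original notes; `sum_by_cols` and `sum_by_rows` are hypothetical helper names): sum the entries of a large column-major matrix with the inner loop running down columns (unit stride) versus along rows (stride equal to the column length).
# same arithmetic, different traversal order
function sum_by_cols(A::Matrix)
    s = 0.0
    for j = 1:size(A, 2), i = 1:size(A, 1)  # inner loop over i: unit stride
        s += A[i, j]
    end
    s
end

function sum_by_rows(A::Matrix)
    s = 0.0
    for i = 1:size(A, 1), j = 1:size(A, 2)  # inner loop over j: stride size(A, 1)
        s += A[i, j]
    end
    s
end

A = rand(5000, 5000)
@benchmark sum_by_cols(A)  # sequential access, few cache misses
@benchmark sum_by_rows(A)  # scattered access, many cache misses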
Take matrix multiplication as an example $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, as in Julia. There are six variants of the algorithm, depending on the ordering of the triple loop.
Variant | Inner-most loop |
---|---|
$jki$ or $kji$ | `for i = 1:m; C[i, j] = C[i, j] + A[i, k] * B[k, j]; end` |
$ikj$ or $kij$ | `for j = 1:n; C[i, j] = C[i, j] + A[i, k] * B[k, j]; end` |
$ijk$ or $jik$ | `for k = 1:p; C[i, j] = C[i, j] + A[i, k] * B[k, j]; end` |
Variant | A Stride | B Stride | C Stride |
---|---|---|---|
$jki$ or $kji$ | Unit | 0 | Unit |
$ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
$ijk$ or $jik$ | Non-Unit | Unit | 0 |
Clearly the $jki$ or $kji$ variants are preferred: their inner loop accesses $\mathbf{A}$ and $\mathbf{C}$ with unit stride (down a column), which matches the column-major storage.
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    elseif order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    elseif order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    elseif order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    elseif order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    elseif order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
end
srand(123)
m, n, p = 2000, 100, 2000
A = rand(m, n)
B = rand(n, p)
C = zeros(m, p);
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jki")
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kji")
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ikj")
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "kij")
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "ijk")
fill!(C, 0.0)
@benchmark matmul_by_loop!(A, B, C, "jik")
fill!(C, 0.0)
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)
To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:
Make friends with (good implementations of) BLAS/LAPACK and use them as much as possible.
Avoid unnecessary transposing. In R, `t(A) %*% x` first forms the transpose of `A` and then performs the matrix-vector multiplication, causing unnecessary memory allocation. In Julia, `A' * x` avoids this, as the benchmarks below show.
srand(123)
n = 1000
A = rand(n, n)
x = rand(n)
# dispatch to At_mul_B (and then to BLAS)
# does *not* actually transpose the matrix
@benchmark A' * x
# dispatch to BLAS
@benchmark At_mul_B(A, x)
# let's force transpose
@benchmark transpose(A) * x
srand(123)
X, Y = rand(1000,1000), rand(1000,1000)
# two temporary arrays are created
@benchmark max(abs(X), abs(Y))
# fused broadcast: no intermediate temporaries (only the result array is allocated)
@benchmark max.(abs.(X), abs.(Y))
Z = zeros(X)
# in-place assignment into preallocated Z: no memory allocation at all!
@benchmark Z .= max.(abs.(X), abs.(Y))
srand(123)
A = randn(1000, 1000)
# sum entries in a sub-matrix
@benchmark sum(A[1:2:500, 1:2:500])
# view avoids creating a separate sub-matrix
@benchmark sum(@view A[1:2:500, 1:2:500])
Julia 0.6 adds the `@views` macro, which can be useful in some operations.
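For instance (a minimal sketch, not from the original notes), prefixing an expression with `@views` converts every slicing operation inside it into a view, so the sub-matrix sum above can be written without an explicit `@view`:
# @views turns all slices in the expression into views (no sub-matrix copy)
@benchmark @views sum(A[1:2:500, 1:2:500])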