Topics in numerical algebra:
Except for the iterative methods, most of these numerical linear algebra tasks are implemented in the BLAS and LAPACK libraries. They form the building blocks of most statistical computing tasks (optimization, MCMC).
Our major goals (or learning objectives) are to
All high-level languages (R, Matlab, Julia) call BLAS and LAPACK for numerical linear algebra.
BLAS stands for Basic Linear Algebra Subprograms.
See netlib for a complete list of standardized BLAS functions.
There are many implementations of BLAS.
There are 3 levels of BLAS functions.
Level | Example Operation | Name | Dimension | Flops |
---|---|---|---|---|
1 | $\alpha \gets \mathbf{x}^T \mathbf{y}$ | dot product | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
1 | $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | axpy | $\alpha \in \mathbb{R}$, $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $2n$ |
2 | $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | gaxpy | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
2 | $\mathbf{A} \gets \mathbf{A} + \mathbf{y} \mathbf{x}^T$ | rank-one update | $\mathbf{A} \in \mathbb{R}^{m \times n}$, $\mathbf{x} \in \mathbb{R}^n$, $\mathbf{y} \in \mathbb{R}^m$ | $2mn$ |
3 | $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | matrix multiplication | $\mathbf{A} \in \mathbb{R}^{m \times p}$, $\mathbf{B} \in \mathbb{R}^{p \times n}$, $\mathbf{C} \in \mathbb{R}^{m \times n}$ | $2mnp$ |
Typical BLAS functions support single precision (S), double precision (D), complex (C), and double complex (Z).
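For orientation, here is a minimal sketch (not part of the original notes) that calls routines from each of the three levels through the `Base.LinAlg.BLAS` wrappers, assuming Julia 0.6; the underlying Fortran names combine the precision prefix with the operation, e.g. `ddot`, `daxpy`, `dgemv`, `dger`, and `dgemm` in double precision.

const BLAS = Base.LinAlg.BLAS   # assuming Julia 0.6

n = 5
x, y = rand(n), rand(n)
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)

dot(x, y)                                   # level 1: x'y (ddot), 2n flops
BLAS.axpy!(2.0, x, y)                       # level 1: y <- y + 2x (daxpy)
BLAS.gemv!('N', 1.0, A, x, 1.0, y)          # level 2: y <- y + Ax (dgemv)
BLAS.ger!(1.0, y, x, A)                     # level 2: A <- A + yx' (dger)
BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)     # level 3: C <- C + AB (dgemm)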
Some operations appear to be level-3 but are actually level-2. For example, multiplying a matrix $\mathbf{A}$ by a diagonal matrix $\mathbf{D}$ amounts to scaling the columns of $\mathbf{A}$, which costs $O(n^2)$ flops rather than the $O(n^3)$ flops of a general matrix multiplication.
using BenchmarkTools
srand(123) # seed
n = 2000
A = rand(n, n)
d = rand(n) # d vector
D = diagm(d) # diagonal matrix with d as diagonal
# this calls the BLAS routine for general matrix multiplication: O(n^3) flops
@benchmark A * D
# Diagonal(d) stores only the diagonal entries, enabling an O(n^2) multiplication method
Diagonal(d)
# columnwise scaling via the Diagonal type: O(n^2) flops
@benchmark A * Diagonal(d)
@which A * Diagonal(d)
# in-place columnwise scaling: avoids allocating space for the result
@benchmark scale!(A, d)
The key to high performance is effective use of the memory hierarchy, and this is true on all architectures.
Operation | Time |
---|---|
L1 cache reference | 0.5 ns |
L2 cache reference | 7 ns |
Main memory reference | 100 ns |
Read 1 MB sequentially from memory | 250,000 ns |
Read 1 MB sequentially from SSD | 1,000,000 ns |
Read 1 MB sequentially from disk | 20,000,000 ns |
Source: https://gist.github.com/jboner/2841832
For example, the Xeon X5650 CPU has a theoretical peak throughput of 128 GFLOPS in double precision but a maximum memory bandwidth of only 32 GB/s.
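A back-of-the-envelope calculation (assuming 8-byte double-precision numbers and no cache reuse) shows the mismatch:
$$ \frac{32 \times 10^9 \text{ bytes/s}}{8 \text{ bytes/double}} = 4 \times 10^9 \text{ doubles/s}. $$
Main memory can feed the cores only about 4 billion numbers per second, while the cores can execute 128 billion flops per second; each number fetched from memory must therefore be reused in roughly $128 / 4 = 32$ flops to keep the cores busy.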
Can we deliver matrix data to the CPU cores, and ship results back to memory, fast enough to avoid a backlog?
Answer: use high-level BLAS as much as possible.
BLAS | Dimension | Memory References | Flops | Ratio (Refs : Flops) |
---|---|---|---|---|
Level 1: $\mathbf{y} \gets \mathbf{y} + \alpha \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$ | $3n$ | $2n$ | 3:2 |
Level 2: $\mathbf{y} \gets \mathbf{y} + \mathbf{A} \mathbf{x}$ | $\mathbf{x}, \mathbf{y} \in \mathbb{R}^n$, $\mathbf{A} \in \mathbb{R}^{n \times n}$ | $n^2$ | $2n^2$ | 1:2 |
Level 3: $\mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}$ | $\mathbf{A}, \mathbf{B}, \mathbf{C} \in\mathbb{R}^{n \times n}$ | $4n^2$ | $2n^3$ | 2:n |
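To see the benefit in practice, here is a minimal sketch (not part of the original notes) that computes the same product $\mathbf{A} \mathbf{B}$ with $n$ level-2 `gemv!` calls versus a single level-3 `gemm!` call; `mul_by_gemv!` is a hypothetical helper, and the exact timings depend on the BLAS library and hardware.

using BenchmarkTools

n = 1000
A, B = rand(n, n), rand(n, n)
C = zeros(n, n)

# compute C = A * B one column at a time: n level-2 (gemv) calls
function mul_by_gemv!(C, A, B)
    for j = 1:size(B, 2)
        # C[:, j] <- 1.0 * A * B[:, j] + 0.0 * C[:, j]
        Base.LinAlg.BLAS.gemv!('N', 1.0, A, view(B, :, j), 0.0, view(C, :, j))
    end
    C
end

@benchmark mul_by_gemv!(C, A, B)                               # n level-2 calls
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 0.0, C) # one level-3 call

Both versions perform the same $2n^3$ flops; the level-3 call is typically much faster because it reuses cached blocks of $\mathbf{A}$ and $\mathbf{B}$.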
Data layout in memory affects efficiency too: it is much faster to move contiguous chunks of data than to read or write scattered locations.
Storage mode: column-major (Fortran, Matlab, R, Julia) vs row-major (C/C++).
A cache line is the smallest unit of data that can be transferred between the cache and main memory.
Accessing a column-major matrix by rows therefore incurs many cache misses.
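The effect is easy to demonstrate. The following sketch (not part of the original notes; the function names are made up for illustration) sums the entries of a matrix with the inner loop running down columns (unit stride) versus across rows (stride $m$); the column-wise traversal is typically several times faster.

using BenchmarkTools

function sum_by_col(A)
    s = zero(eltype(A))
    for j = 1:size(A, 2), i = 1:size(A, 1)   # inner loop walks down a column: unit stride
        s += A[i, j]
    end
    s
end

function sum_by_row(A)
    s = zero(eltype(A))
    for i = 1:size(A, 1), j = 1:size(A, 2)   # inner loop jumps between columns: stride m
        s += A[i, j]
    end
    s
end

A = rand(5000, 5000)
@benchmark sum_by_col(A)
@benchmark sum_by_row(A)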
Take matrix multiplication as an example: $$ \mathbf{C} \gets \mathbf{C} + \mathbf{A} \mathbf{B}, \quad \mathbf{A} \in \mathbb{R}^{m \times p}, \mathbf{B} \in \mathbb{R}^{p \times n}, \mathbf{C} \in \mathbb{R}^{m \times n}. $$ Assume the storage is column-major, as in Julia. There are six variants of the triple-loop algorithm, depending on the nesting order of the loops; the innermost loop of each variant is shown below.
$jki$ or $kji$ looping:

# innermost loop
for i = 1:m
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end

$ikj$ or $kij$ looping:

# innermost loop
for j = 1:n
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end

$ijk$ or $jik$ looping:

# innermost loop
for k = 1:p
    C[i, j] = C[i, j] + A[i, k] * B[k, j]
end
Variant | A Stride | B Stride | C Stride |
---|---|---|---|
$jki$ or $kji$ | Unit | 0 | Unit |
$ikj$ or $kij$ | 0 | Non-Unit | Non-Unit |
$ijk$ or $jik$ | Non-Unit | Unit | 0 |
Clearly the variants $jki$ and $kji$, which access all three matrices with unit (or zero) stride, are preferred.
"""
matmul_by_loop!(A, B, C, order)
Overwrite `C` by `A * B`. `order` indicates the looping order of the triple loop.
"""
function matmul_by_loop!(A::Matrix, B::Matrix, C::Matrix, order::String)
    m = size(A, 1)
    p = size(A, 2)
    n = size(B, 2)
    fill!(C, 0)

    if order == "jki"
        for j = 1:n, k = 1:p, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kji"
        for k = 1:p, j = 1:n, i = 1:m
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "ikj"
        for i = 1:m, k = 1:p, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "kij"
        for k = 1:p, i = 1:m, j = 1:n
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "ijk"
        for i = 1:m, j = 1:n, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end

    if order == "jik"
        for j = 1:n, i = 1:m, k = 1:p
            C[i, j] += A[i, k] * B[k, j]
        end
    end
end
srand(123)
m, n, p = 2000, 100, 2000
A = rand(m, n)
B = rand(n, p)
C = zeros(m, p);
using BenchmarkTools
@benchmark matmul_by_loop!(A, B, C, "jki")
@benchmark matmul_by_loop!(A, B, C, "kji")
@benchmark matmul_by_loop!(A, B, C, "ikj")
@benchmark matmul_by_loop!(A, B, C, "kij")
@benchmark matmul_by_loop!(A, B, C, "ijk")
@benchmark matmul_by_loop!(A, B, C, "ijk")
# Julia's built-in A_mul_B! dispatches to BLAS
@which A_mul_B!(C, A, B)
@benchmark A_mul_B!(C, A, B)
@benchmark Base.LinAlg.BLAS.gemm!('N', 'N', 1.0, A, B, 1.0, C)
To appreciate the effort that goes into an optimized BLAS implementation such as OpenBLAS (which evolved from GotoBLAS), see the Quora question, especially the video. The bottom line is:
Get familiar with (good implementations of) BLAS/LAPACK and use them as much as possible.
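As a quick illustration (a sketch, not part of the original notes, assuming Julia 0.6), Julia's own high-level functions already dispatch to LAPACK, so following this advice usually just means calling them.

n = 500
A = randn(n, n)
b = randn(n)

@which lufact(A)       # LU factorization of a dense matrix; backed by LAPACK's getrf
x1 = lufact(A) \ b     # solve A x = b via the factorization
x2 = A \ b             # the backslash operator; also ends up in LAPACK
norm(x1 - x2)          # the two solutions agree up to rounding error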
Transposing a matrix is an expensive memory operation. In R, the code
t(A) %*% x
first transposes `A` and then performs the matrix-vector multiplication, causing unnecessary memory allocation. Julia avoids the explicit transpose when it can.
srand(123)
n = 1000
A = rand(n, n)
x = rand(n)
# dispatch to At_mul_B (and then to BLAS)
# does *not* actually transpose the matrix
@benchmark A' * x
@which A' * x
# dispatch to BLAS
@benchmark At_mul_B(A, x)
# let's force transpose
@benchmark transpose(A) * x
# pre-allocate result
out = zeros(size(A, 2))
@benchmark At_mul_B!(out, A, x)
Broadcasting with the dot syntax fuses elementwise operations into a single loop and avoids allocating intermediate arrays.
srand(123)
X, Y = rand(1000,1000), rand(1000,1000)
# two temporary arrays are created
@benchmark max(abs(X), abs(Y))
# no temporary arrays created
@benchmark max.(abs.(X), abs.(Y))
# no memory allocation at all!
Z = zeros(X)
@benchmark Z .= max.(abs.(X), abs.(Y))
Slicing an array, as in `A[1:2:500, 1:2:500]`, allocates a copy of the sub-matrix; a view avoids the copy.
srand(123)
A = randn(1000, 1000)
# sum entries in a sub-matrix
@benchmark sum(A[1:2:500, 1:2:500])
# view avoids creating a separate sub-matrix
@benchmark sum(@view A[1:2:500, 1:2:500])
Julia 0.6 adds the `@views` macro, which turns every array slice in an expression into a view.
@benchmark @views sum(A[1:2:500, 1:2:500])
versioninfo()